Re: [ANNOUNCE] Nightly maven and package builds for Spark

2015-07-24 Thread Bharath Ravi Kumar
I noticed the last (1.5) build has a timestamp of 16th July. Have nightly
builds been discontinued since then?

Thanks,
Bharath

On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi All,

 This week I got around to setting up nightly builds for Spark on
 Jenkins. I'd like feedback on these and if it's going well I can merge
 the relevant automation scripts into Spark mainline and document it on
 the website. Right now I'm doing:

 1. SNAPSHOTs of Spark master and release branches published to the ASF
 Maven snapshot repo:


 https://repository.apache.org/content/repositories/snapshots/org/apache/spark/

 These are usable by adding this repository to your build and depending on a
 snapshot version (e.g. 1.3.2-SNAPSHOT); see the sbt sketch below.

 2. Nightly binary package builds and doc builds of master and release
 versions.

 http://people.apache.org/~pwendell/spark-nightly/

 These build 4 times per day and are tagged based on commits.
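
 As a minimal sbt sketch of using the snapshot repo from item 1 (the resolver
 name and the spark-core artifact are just placeholders; pick whichever module
 and snapshot version you need):

   // add the ASF snapshot repository, then depend on a -SNAPSHOT version
   resolvers += "ASF Snapshots" at "https://repository.apache.org/content/repositories/snapshots/"

   libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.2-SNAPSHOT"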

 If anyone has feedback on these please let me know.

 Thanks!
 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




review SPARK-8730

2015-07-24 Thread Eugen Cepoi
Hey,

I've opened a PR to fix a ser/de issue with primitive classes in the Java
serializer. I have already run into this problem in several scenarios, so I am
bringing it up. It would be great if someone could take a look! :)

https://issues.apache.org/jira/browse/SPARK-8730
https://github.com/apache/spark/pull/7122
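
To illustrate the underlying problem, here is a simplified sketch (not the
actual patch; the class name is made up for the example): an ObjectInputStream
that resolves classes through a custom class loader has to special-case
primitive type descriptors itself, because Class.forName("int") throws
ClassNotFoundException.

  import java.io.{InputStream, ObjectInputStream, ObjectStreamClass}

  // Sketch only: map primitive type names explicitly, since Class.forName
  // cannot resolve "int", "long", etc.
  class LoaderAwareObjectInputStream(in: InputStream, loader: ClassLoader)
      extends ObjectInputStream(in) {

    private val primitives: Map[String, Class[_]] = Map(
      "boolean" -> classOf[Boolean], "byte"   -> classOf[Byte],
      "char"    -> classOf[Char],    "short"  -> classOf[Short],
      "int"     -> classOf[Int],     "long"   -> classOf[Long],
      "float"   -> classOf[Float],   "double" -> classOf[Double],
      "void"    -> classOf[Unit])

    override def resolveClass(desc: ObjectStreamClass): Class[_] =
      primitives.getOrElse(desc.getName,
        Class.forName(desc.getName, false, loader))
  }

With a mapping like that in place, round-tripping something like classOf[Int]
works; without it, deserialization fails with ClassNotFoundException: int.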


Thanks,
Eugen


Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Punyashloka Biswal
Would it make sense to isolate the use of deprecated APIs to a subset of
projects? That way we could turn on more stringent checks for the other
ones.
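
Concretely (a hypothetical build.sbt sketch, not what was merged): since the
only blanket switch scalac offers is -Xfatal-warnings, which promotes every
warning to an error, deprecations included, it could only be enabled for the
sub-projects that are already free of deprecated-API usage, e.g.:

  // Hypothetical sketch: strict flags only where no deprecated APIs are used.
  lazy val strictOptions = Seq("-deprecation", "-feature", "-Xfatal-warnings")

  lazy val core      = project.settings(scalacOptions ++= strictOptions)
  lazy val streaming = project  // still exercises deprecated APIs; warnings stay non-fatal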

Punya

On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote:

 Hi all,

 FYI, we just merged a patch that fails the build if there is a Scala
 compiler warning (unless it is a deprecation warning).

 In the past, many compiler warnings were actually caused by legitimate bugs
 that we needed to address. However, if we don't fail the build on warnings,
 people don't pay attention to them at all (it is also tough to pay attention
 because there are a lot of deprecation warnings, due to unit tests exercising
 deprecated APIs and to our reliance on deprecated Hadoop APIs).

 Note that ideally we would mark deprecation warnings as errors as well.
 However, because the Scala compiler offers no way to suppress individual
 warning messages, we cannot do that (we do need to access deprecated APIs
 in Hadoop).





Re: PySpark on PyPi

2015-07-24 Thread Jeremy Freeman
Hey all, great discussion. Just wanted to +1 that I see a lot of value in steps
that make it easier to use PySpark as an ordinary Python library.

You might want to check out findspark (https://github.com/minrk/findspark),
started by Jupyter project devs, which offers one way to facilitate this. I’ve
also cced them here to join the conversation.

Also, @Jey, I can confirm that at least in some scenarios (I’ve done it on an
EC2 cluster in standalone mode) it’s possible to run PySpark jobs just using
`from pyspark import SparkContext; sc = SparkContext(master="X")` so long as
the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly
on *both* the workers and the driver. That said, there’s definitely additional
configuration / functionality that would require going through the proper
submit scripts.

 On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal punya.bis...@gmail.com 
 wrote:
 
 I agree with everything Justin just said. An additional advantage of 
 publishing PySpark's Python code in a standards-compliant way is the fact 
 that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way 
 that pip can use. Contrast this with the current situation, where 
 df.toPandas() exists in the Spark API but doesn't actually work until you 
 install Pandas.
 
 Punya
 On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote:
 // + Davies for his comments
 // + Punya for SA
 
 For development and CI, like Olivier mentioned, I think it would be hugely 
 beneficial to publish pyspark (only code in the python/ dir) on PyPI. If 
 anyone wants to develop against PySpark APIs, they need to download the 
 distribution and do a lot of PYTHONPATH munging for all the tools (pylint, 
 pytest, IDE code completion). Right now that involves adding python/ and 
 python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more 
 dependencies, we would have to manually mirror all the PYTHONPATH munging in 
 the ./pyspark script. With a proper pyspark setup.py which declares its 
 dependencies, and a published distribution, depending on pyspark will just be 
 adding pyspark to my setup.py dependencies.
 
 Of course, if we actually want to run the parts of pyspark that are backed by Py4J
 calls, then we need the full spark distribution with either ./pyspark or 
 ./spark-submit, but for things like linting and development, the PYTHONPATH 
 munging is very annoying.
 
 I don't think the version-mismatch issues are a compelling reason not to go
 ahead with PyPI publishing. At runtime, we should definitely enforce that the
 versions match exactly, which means there is no backcompat nightmare as
 suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
 This would mean that even if the user's pip-installed pyspark somehow got
 loaded before the pyspark shipped with the Spark distribution, the user would
 be alerted immediately.
 
 Davies, if you buy this, should I or someone on my team pick up
 https://issues.apache.org/jira/browse/SPARK-1267 and
 https://github.com/apache/spark/pull/464?
 
 On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot o.girar...@lateral-thoughts.com
 wrote:
 OK, I get it. Now, what can we do to improve the current situation? Right
 now, if I want to set up a CI env for PySpark, I have to:
 1- download a pre-built version of pyspark and unzip it somewhere on every
 agent
 2- define the SPARK_HOME env variable
 3- symlink this distribution's pyspark dir into the Python install's
 site-packages/ directory
 and if I rely on additional packages (like Databricks' spark-csv project), I
 have to (unless I'm mistaken):
 4- compile/assemble spark-csv and deploy the jar to a specific directory on
 every agent
 5- add this jar-filled directory to the Spark distribution's extra classpath
 using the conf/spark-defaults file
 
 Then, finally, we can launch our unit/integration tests.
 Some issues are related to spark-packages, some to the lack of Python-based
 dependency management, and some to the way SparkContexts are launched when
 using pyspark. I think steps 1 and 2 are fair enough. Steps 4 and 5 may
 already have solutions (I didn't check), and given that spark-shell already
 downloads such dependencies automatically, I expect this will be handled
 eventually if it isn't already.
 
 For step 3, maybe just adding a setup.py to the distribution would be enough.
 I'm not exactly advocating distributing the full 300 MB Spark distribution on
 PyPI; maybe there's a better compromise?
 
 Regards, 
 
 Olivier.
 
 On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam j...@cs.berkeley.edu wrote:
 Couldn't we have a pip installable pyspark package that just serves as a 
 shim to an existing Spark 

Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Reynold Xin
You can give it a shot, but we will have to revert it for a project as soon
as that project uses a deprecated API somewhere.


On Fri, Jul 24, 2015 at 7:43 AM, Punyashloka Biswal punya.bis...@gmail.com
wrote:

 Would it make sense to isolate the use of deprecated APIs to a subset of
 projects? That way we could turn on more stringent checks for the other
 ones.

 Punya

 On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote:

 Hi all,

 FYI, we just merged a patch that fails the build if there is a Scala
 compiler warning (unless it is a deprecation warning).

 In the past, many compiler warnings were actually caused by legitimate
 bugs that we needed to address. However, if we don't fail the build on
 warnings, people don't pay attention to them at all (it is also tough to
 pay attention because there are a lot of deprecation warnings, due to unit
 tests exercising deprecated APIs and to our reliance on deprecated Hadoop
 APIs).

 Note that ideally we would mark deprecation warnings as errors as well.
 However, because the Scala compiler offers no way to suppress individual
 warning messages, we cannot do that (we do need to access deprecated APIs
 in Hadoop).





Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Iulian Dragoș
On Thu, Jul 23, 2015 at 6:08 AM, Reynold Xin r...@databricks.com wrote:

Hi all,

 FYI, we just merged a patch that fails the build if there is a Scala
 compiler warning (unless it is a deprecation warning).

I’m a bit confused, since I see quite a lot of warnings in semi-legitimate
code.

For instance, @transient (plenty of instances like this in spark-streaming)
might generate warnings like:

abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
  extends InputDStream[T](ssc_) {

// and the warning is:
// no valid targets for annotation on value ssc_ - it is discarded unused.
// You may specify targets with meta-annotations, e.g. @(transient @param)

At least that’s what happens if I build with Scala 2.11, not sure if this
setting is only for 2.10, or something really weird is happening on my
machine that doesn’t happen on others.

iulian


 In the past, many compiler warnings are actually caused by legitimate bugs
 that we need to address. However, if we don't fail the build with warnings,
 people don't pay attention at all to the warnings (it is also tough to pay
 attention since there are a lot of deprecated warnings due to unit tests
 testing deprecated APIs and reliance on Hadoop on deprecated APIs).

 Note that ideally we should be able to mark deprecation warnings as errors
 as well. However, due to the lack of ability to suppress individual warning
 messages in the Scala compiler, we cannot do that (since we do need to
 access deprecated APIs in Hadoop).


-- 

--
Iulian Dragos

--
Reactive Apps on the JVM
www.typesafe.com


Re: [ANNOUNCE] Nightly maven and package builds for Spark

2015-07-24 Thread Patrick Wendell
Hey Bharath,

There was actually an incompatible change to the build process that
broke several of the Jenkins builds. This should be patched up in the
next day or two and nightly builds will resume.

- Patrick

On Fri, Jul 24, 2015 at 12:51 AM, Bharath Ravi Kumar
reachb...@gmail.com wrote:
 I noticed the last (1.5) build has a timestamp of 16th July. Have nightly
 builds been discontinued since then?

 Thanks,
 Bharath

 On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell pwend...@gmail.com wrote:

 Hi All,

 This week I got around to setting up nightly builds for Spark on
 Jenkins. I'd like feedback on these and if it's going well I can merge
 the relevant automation scripts into Spark mainline and document it on
 the website. Right now I'm doing:

 1. SNAPSHOTs of Spark master and release branches published to the ASF
 Maven snapshot repo:


 https://repository.apache.org/content/repositories/snapshots/org/apache/spark/

 These are usable by adding this repository to your build and depending on a
 snapshot version (e.g. 1.3.2-SNAPSHOT).

 2. Nightly binary package builds and doc builds of master and release
 versions.

 http://people.apache.org/~pwendell/spark-nightly/

 These build 4 times per day and are tagged based on commits.

 If anyone has feedback on these please let me know.

 Thanks!
 - Patrick

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Policy around backporting bug fixes

2015-07-24 Thread Patrick Wendell
Hi All,

A few times I've been asked about backporting: when to backport fix patches
and when not to. Since I have managed this for many of the past releases, I
wanted to lay out the way I have been thinking about it. If we reach some
consensus, I can put it on the wiki.

The trade-off when backporting is that you get to deliver the fix to people
running older versions (great!), but you risk introducing new or even
worse bugs in maintenance releases (bad!). The decision point is when
you have a bug fix and it's not clear whether it is worth backporting.

I think the following facets are important to consider:
(a) Backports are an extremely valuable service to the community and
should be considered for any bug fix.
(b) Introducing a new bug in a maintenance release must be avoided at
all costs. Over time, it would erode confidence in our release process.
(c) Distributions or advanced users can always backport risky patches
on their own, if they see fit.

For me, the consequence of these is that we should backport in the
following situations:
- Both the bug and the fix are well understood and isolated. Code
being modified is well tested.
- The bug being addressed is high priority to the community.
- The backported fix does not vary widely from the master branch fix.

We tend to avoid backports in the converse situations:
- The bug or fix are not well understood. For instance, it relates to
interactions between complex components or third party libraries (e.g.
Hadoop libraries). The code is not well tested outside of the
immediate bug being fixed.
- The bug is not clearly a high priority for the community.
- The backported fix is widely different from the master branch fix.

These are clearly subjective criteria, but ones worth considering. I
am always happy to advise people on specific patches if they want a
sounding board for whether it makes sense to backport.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: non-deprecation compiler warnings are upgraded to build errors now

2015-07-24 Thread Reynold Xin
Jenkins only runs Scala 2.10. I'm actually not sure what the behavior is
with 2.11 for that patch.

Iulian - can you take a look and see whether it is working as expected?


On Fri, Jul 24, 2015 at 10:24 AM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:

 On Thu, Jul 23, 2015 at 6:08 AM, Reynold Xin r...@databricks.com wrote:

 Hi all,

 FYI, we just merged a patch that fails the build if there is a Scala
 compiler warning (unless it is a deprecation warning).

 I’m a bit confused, since I see quite a lot of warnings in semi-legitimate
 code.

 For instance, @transient (plenty of instances like this in
 spark-streaming) might generate warnings like:

 abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
   extends InputDStream[T](ssc_) {

 // and the warning is:
 // no valid targets for annotation on value ssc_ - it is discarded unused.
 // You may specify targets with meta-annotations, e.g. @(transient @param)

 At least that’s what happens if I build with Scala 2.11, not sure if this
 setting is only for 2.10, or something really weird is happening on my
 machine that doesn’t happen on others.

 iulian


 In the past, many compiler warnings were actually caused by legitimate
 bugs that we needed to address. However, if we don't fail the build on
 warnings, people don't pay attention to them at all (it is also tough to
 pay attention because there are a lot of deprecation warnings, due to unit
 tests exercising deprecated APIs and to our reliance on deprecated Hadoop
 APIs).

 Note that ideally we would mark deprecation warnings as errors as well.
 However, because the Scala compiler offers no way to suppress individual
 warning messages, we cannot do that (we do need to access deprecated APIs
 in Hadoop).


 --

 --
 Iulian Dragos

 --
 Reactive Apps on the JVM
 www.typesafe.com




jenkins failing on Kinesis shard limits

2015-07-24 Thread Steve Loughran

Looks like Jenkins is hitting some AWS limits

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38396/testReport/org.apache.spark.streaming.kinesis/KinesisBackedBlockRDDSuite/_It_is_not_a_test_/


Re: review SPARK-8730

2015-07-24 Thread Sean Owen
It looks like you have a number of review comments on the PR that you
have not replied to. The PR does not merge at the moment either.

On Fri, Jul 24, 2015 at 12:03 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:
 Hey,

 I've opened a PR to fix a ser/de issue with primitive classes in the Java
 serializer. I have already run into this problem in several scenarios, so I am
 bringing it up. It would be great if someone could take a look! :)

 https://issues.apache.org/jira/browse/SPARK-8730
 https://github.com/apache/spark/pull/7122


 Thanks,
 Eugen

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: review SPARK-8730

2015-07-24 Thread Eugen Cepoi
I only just saw those comments, and it hasn't merged cleanly for a while
because the code has evolved since I opened the PR.

2015-07-24 14:12 GMT+02:00 Sean Owen so...@cloudera.com:

 It looks like you have a number of review comments on the PR that you
 have not replied to. The PR does not merge at the moment either.

 On Fri, Jul 24, 2015 at 12:03 PM, Eugen Cepoi cepoi.eu...@gmail.com
 wrote:
  Hey,
 
  I've opened a PR to fix a ser/de issue with primitive classes in the Java
  serializer. I have already run into this problem in several scenarios, so I
  am bringing it up. It would be great if someone could take a look! :)
 
  https://issues.apache.org/jira/browse/SPARK-8730
  https://github.com/apache/spark/pull/7122
 
 
  Thanks,
  Eugen