Re: [ANNOUNCE] Nightly maven and package builds for Spark
I noticed the last (1.5) build has a timestamp of 16th July. Have nightly builds been discontinued since then?

Thanks,
Bharath

On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi All,

This week I got around to setting up nightly builds for Spark on Jenkins. I'd like feedback on these, and if it's going well I can merge the relevant automation scripts into Spark mainline and document it on the website. Right now I'm doing:

1. SNAPSHOTs of Spark master and release branches, published to the ASF Maven snapshot repo: https://repository.apache.org/content/repositories/snapshots/org/apache/spark/ These are usable by adding this repository to your build and using a snapshot version (e.g. 1.3.2-SNAPSHOT).

2. Nightly binary package builds and doc builds of master and release versions: http://people.apache.org/~pwendell/spark-nightly/ These are built four times per day and are tagged based on commits.

If anyone has feedback on these, please let me know.

Thanks!
- Patrick
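For anyone wiring these snapshots into a build, a minimal sbt sketch (sbt is an assumption here; a Maven pom would declare the same repository and dependency). The repository URL and the 1.3.2-SNAPSHOT version come from the email above; the choice of the spark-core module is illustrative:

    // build.sbt -- hedged sketch for consuming the nightly snapshots.
    resolvers += "ASF Snapshots" at
      "https://repository.apache.org/content/repositories/snapshots/"

    // Pick whichever Spark modules you need; spark-core is just an example.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.2-SNAPSHOT"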
review SPARK-8730
Hey,

I've opened a PR to fix a ser/de issue with primitive classes in the Java serializer. I've already encountered this problem in several different scenarios, so I'm bringing it up. It would be great if someone could have a look at it! :)

https://issues.apache.org/jira/browse/SPARK-8730
https://github.com/apache/spark/pull/7122

Thanks,
Eugen
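For readers who haven't hit this, a standalone sketch of the failure mode (my reconstruction from the JIRA description, not the actual Spark patch). Java's default ObjectInputStream.resolveClass special-cases primitive class names, but a resolveClass override that delegates straight to Class.forName -- as custom serializers often do to control the class loader -- loses that handling, so deserializing an object graph containing classOf[Int] fails:

    import java.io._

    // Hedged reconstruction of the SPARK-8730 failure mode -- not Spark's code.
    object PrimitiveClassRepro {
      // Class.forName("int") throws ClassNotFoundException, so primitive
      // class names need an explicit mapping.
      private val primitives: Map[String, Class[_]] = Map(
        "boolean" -> classOf[Boolean], "byte" -> classOf[Byte],
        "char" -> classOf[Char], "short" -> classOf[Short],
        "int" -> classOf[Int], "long" -> classOf[Long],
        "float" -> classOf[Float], "double" -> classOf[Double],
        "void" -> classOf[Unit])

      // The buggy shape: custom class-loader lookup, no primitive handling.
      class BuggyStream(in: InputStream) extends ObjectInputStream(in) {
        override def resolveClass(desc: ObjectStreamClass): Class[_] =
          Class.forName(desc.getName, false, getClass.getClassLoader)
      }

      // The fixed shape: consult the primitive table before Class.forName.
      class FixedStream(in: InputStream) extends ObjectInputStream(in) {
        override def resolveClass(desc: ObjectStreamClass): Class[_] =
          primitives.getOrElse(desc.getName,
            Class.forName(desc.getName, false, getClass.getClassLoader))
      }

      def main(args: Array[String]): Unit = {
        val buf = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buf)
        out.writeObject(classOf[Int]) // a primitive Class object
        out.close()
        val in = new FixedStream(new ByteArrayInputStream(buf.toByteArray))
        println(in.readObject()) // prints "int"; BuggyStream would throw CNFE here
      }
    }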
Re: non-deprecation compiler warnings are upgraded to build errors now
Would it make sense to isolate the use of deprecated APIs to a subset of projects? That way we could turn on more stringent checks for the other ones.

Punya

On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote:

Hi all,

FYI, we just merged a patch that fails the build if there is a Scala compiler warning (other than deprecation warnings).

In the past, many compiler warnings have actually been caused by legitimate bugs that we needed to address. However, if we don't fail the build on warnings, people don't pay attention to them at all (it is also tough to pay attention, since there are a lot of deprecation warnings from unit tests exercising deprecated APIs and from reliance on deprecated Hadoop APIs).

Note that ideally we should be able to mark deprecation warnings as errors as well. However, because the Scala compiler cannot suppress individual warning messages, we cannot do that (since we do need to access deprecated APIs in Hadoop).
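For concreteness, a hedged sketch of the isolation Punya suggests, in sbt terms. The module names are invented for illustration, and note that plain -Xfatal-warnings would also make deprecation warnings fatal -- exactly the limitation Reynold mentions -- so the sketch only illustrates the per-project scoping, not a complete solution:

    // Hypothetical sbt sketch: strict checks only where no deprecated APIs
    // are used. Module names are invented, not Spark's real build.
    lazy val strictModule = project
      .settings(scalacOptions ++= Seq("-deprecation", "-Xfatal-warnings"))

    // Modules that must call deprecated (e.g. Hadoop) APIs keep warnings
    // non-fatal.
    lazy val legacyModule = project
      .settings(scalacOptions += "-deprecation")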
Re: PySpark on PyPi
Hey all, great discussion. Just wanted to +1 that I see a lot of value in steps that make it easier to use PySpark as an ordinary Python library. You might want to check out findspark (https://github.com/minrk/findspark), started by Jupyter project devs, which offers one way to facilitate this (a short usage sketch follows at the end of this thread). I've also cc'ed them here to join the conversation.

Also, @Jey, I can confirm that at least in some scenarios (I've done it on an EC2 cluster in standalone mode) it's possible to run PySpark jobs just using `from pyspark import SparkContext; sc = SparkContext(master="X")`, so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on *both* workers and driver. That said, there's definitely additional configuration/functionality that would require going through the proper submit scripts.

On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal punya.bis...@gmail.com wrote:

I agree with everything Justin just said. An additional advantage of publishing PySpark's Python code in a standards-compliant way is that we'll be able to declare transitive dependencies (Pandas, Py4J) in a way that pip can use. Contrast this with the current situation, where df.toPandas() exists in the Spark API but doesn't actually work until you install Pandas.

Punya

On Wed, Jul 22, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote:

// + Davies for his comments
// + Punya for SA

For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only the code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of PYTHONPATH munging for all the tools (pylint, pytest, IDE code completion). Right now that involves adding python/ and python/lib/py4j-0.8.2.1-src.zip. If pyspark ever wants to add more dependencies, we would have to manually mirror all the PYTHONPATH munging in the ./pyspark script. With a proper pyspark setup.py that declares its dependencies, and a published distribution, depending on pyspark would just be a matter of adding pyspark to my setup.py dependencies. Of course, if we actually want to run the parts of pyspark that are backed by Py4J calls, then we need the full Spark distribution with either ./pyspark or ./spark-submit, but for things like linting and development, the PYTHONPATH munging is very annoying.

I don't think the version-mismatch issues are a compelling reason not to go ahead with PyPI publishing. At runtime, we should definitely enforce that the versions match exactly, which means there is no backcompat nightmare as suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267. This would mean that even if the user's pip-installed pyspark somehow got loaded before the pyspark provided by the Spark distribution, the user would be alerted immediately.

Davies, if you buy this, should I or someone on my team pick up https://issues.apache.org/jira/browse/SPARK-1267 and https://github.com/apache/spark/pull/464?

On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot o.girar...@lateral-thoughts.com wrote:

Ok, I get it.
Now what can we do to improve the current situation? Because right now, if I want to set up a CI env for PySpark, I have to:

1. download a pre-built version of pyspark and unzip it somewhere on every agent
2. define the SPARK_HOME env variable
3. symlink this distribution's pyspark dir inside the Python install's site-packages/ directory

and if I rely on additional packages (like Databricks' spark-csv project), I have to (unless I'm mistaken):

4. compile/assemble spark-csv and deploy the jar to a specific directory on every agent
5. add this jar-filled directory to the Spark distribution's additional classpath using the conf/spark-defaults file

Then finally we can launch our unit/integration tests. Some issues are related to spark-packages, some to the lack of a Python-based dependency mechanism, and some to the way SparkContexts are launched when using pyspark.

I think steps 1 and 2 are fair enough. Steps 4 and 5 may already have solutions; I didn't check, and considering spark-shell downloads such dependencies automatically, I think if nothing's done yet, it will be (I guess?). For step 3, maybe just adding a setup.py to the distribution would be enough. I'm not exactly advocating distributing a full 300MB Spark distribution on PyPI; maybe there's a better compromise?

Regards,
Olivier.

On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

Couldn't we have a pip-installable pyspark package that just serves as a shim to an existing Spark ...
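Two hedged sketches for the ideas in this thread. First, the findspark route mentioned earlier (the spark_home path and master URL below are placeholders; findspark can also read SPARK_HOME from the environment):

    # Hedged sketch: run PySpark as an ordinary library via findspark.
    import findspark
    findspark.init("/path/to/spark")  # adds pyspark and py4j to sys.path

    from pyspark import SparkContext

    sc = SparkContext(master="local[2]")  # or a real master URL
    print(sc.parallelize(range(10)).sum())  # 45
    sc.stop()

Second, a strawman setup.py for publishing only the python/ directory, per Olivier's step 3. The version string and dependency pins are illustrative guesses, not an agreed packaging plan; the py4j pin matches the zip currently vendored in python/lib:

    # Hypothetical setup.py covering only the python/ directory.
    # The runtime would still need to enforce an exact match against the
    # JVM-side Spark version (per SPARK-1267).
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.4.1",  # illustrative; must track the Spark release
        packages=find_packages(),
        install_requires=[
            "py4j==0.8.2.1",  # the version currently bundled in python/lib
        ],
        extras_require={
            "pandas": ["pandas"],  # only needed for df.toPandas()
        },
    )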
Re: non-deprecation compiler warnings are upgraded to build errors now
You can give it a shot, but we will have to revert it for a project as soon as that project uses a deprecated API somewhere.

On Fri, Jul 24, 2015 at 7:43 AM, Punyashloka Biswal punya.bis...@gmail.com wrote:

Would it make sense to isolate the use of deprecated APIs to a subset of projects? That way we could turn on more stringent checks for the other ones.

Punya

On Thu, Jul 23, 2015 at 12:08 AM Reynold Xin r...@databricks.com wrote:

Hi all,

FYI, we just merged a patch that fails the build if there is a Scala compiler warning (other than deprecation warnings).

In the past, many compiler warnings have actually been caused by legitimate bugs that we needed to address. However, if we don't fail the build on warnings, people don't pay attention to them at all (it is also tough to pay attention, since there are a lot of deprecation warnings from unit tests exercising deprecated APIs and from reliance on deprecated Hadoop APIs).

Note that ideally we should be able to mark deprecation warnings as errors as well. However, because the Scala compiler cannot suppress individual warning messages, we cannot do that (since we do need to access deprecated APIs in Hadoop).
Re: non-deprecation compiler warnings are upgraded to build errors now
On Thu, Jul 23, 2015 at 6:08 AM, Reynold Xin r...@databricks.com wrote:

Hi all,

FYI, we just merged a patch that fails the build if there is a Scala compiler warning (other than deprecation warnings).

I'm a bit confused, since I see quite a lot of warnings in semi-legitimate code. For instance, @transient (there are plenty of instances like this in spark-streaming) might generate warnings. Take:

    abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
      extends InputDStream[T](ssc_)

The warning is:

    no valid targets for annotation on value ssc_ - it is discarded unused.
    You may specify targets with meta-annotations, e.g. @(transient @param)

At least that's what happens if I build with Scala 2.11. I'm not sure if this setting is only for 2.10, or if something really weird is happening on my machine that doesn't happen on others.

iulian

In the past, many compiler warnings have actually been caused by legitimate bugs that we needed to address. However, if we don't fail the build on warnings, people don't pay attention to them at all (it is also tough to pay attention, since there are a lot of deprecation warnings from unit tests exercising deprecated APIs and from reliance on deprecated Hadoop APIs).

Note that ideally we should be able to mark deprecation warnings as errors as well. However, because the Scala compiler cannot suppress individual warning messages, we cannot do that (since we do need to access deprecated APIs in Hadoop).

--
Iulian Dragos
Reactive Apps on the JVM
www.typesafe.com
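For reference, a sketch of the meta-annotation spelling that the warning itself suggests -- not a claim about what spark-streaming should commit. @(transient @param) pins the annotation to the constructor parameter, which silences the "no valid targets" warning; whether that spelling (or @(transient @field)) gives the intended serialization behavior is exactly the kind of question this thread raises:

    import scala.annotation.meta.param
    import scala.reflect.ClassTag
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.InputDStream

    // Hypothetical class name to avoid clashing with the real
    // ReceiverInputDStream; sketch only, not the spark-streaming source.
    abstract class MyReceiverInputDStream[T: ClassTag](
        @(transient @param) ssc_ : StreamingContext)
      extends InputDStream[T](ssc_)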
Re: [ANNOUNCE] Nightly maven and package builds for Spark
Hey Bharath,

There was actually an incompatible change to the build process that broke several of the Jenkins builds. This should be patched up in the next day or two, and nightly builds will resume.

- Patrick

On Fri, Jul 24, 2015 at 12:51 AM, Bharath Ravi Kumar reachb...@gmail.com wrote:

I noticed the last (1.5) build has a timestamp of 16th July. Have nightly builds been discontinued since then?

Thanks,
Bharath

On Sun, May 24, 2015 at 1:11 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi All,

This week I got around to setting up nightly builds for Spark on Jenkins. I'd like feedback on these, and if it's going well I can merge the relevant automation scripts into Spark mainline and document it on the website. Right now I'm doing:

1. SNAPSHOTs of Spark master and release branches, published to the ASF Maven snapshot repo: https://repository.apache.org/content/repositories/snapshots/org/apache/spark/ These are usable by adding this repository to your build and using a snapshot version (e.g. 1.3.2-SNAPSHOT).

2. Nightly binary package builds and doc builds of master and release versions: http://people.apache.org/~pwendell/spark-nightly/ These are built four times per day and are tagged based on commits.

If anyone has feedback on these, please let me know.

Thanks!
- Patrick
Policy around backporting bug fixes
Hi All,

A few times I've been asked about backporting and when to backport and not backport fix patches. Since I have managed this for many of the past releases, I wanted to spell out the way I have been thinking about it. If we have some consensus, I can put it on the wiki.

The trade-off when backporting is that you get to deliver the fix to people running older versions (great!), but you risk introducing new or even worse bugs in maintenance releases (bad!). The decision point is when you have a bug fix and it's not clear whether it is worth backporting. I think the following facets are important to consider:

(a) Backports are an extremely valuable service to the community and should be considered for any bug fix.
(b) Introducing a new bug in a maintenance release must be avoided at all costs. Over time it would erode confidence in our release process.
(c) Distributions or advanced users can always backport risky patches on their own, if they see fit.

For me, the consequence of these is that we should backport in the following situations:

- Both the bug and the fix are well understood and isolated. The code being modified is well tested.
- The bug being addressed is high priority to the community.
- The backported fix does not vary widely from the master branch fix.

We tend to avoid backports in the converse situations:

- The bug or fix is not well understood. For instance, it relates to interactions between complex components or third-party libraries (e.g. Hadoop libraries). The code is not well tested outside of the immediate bug being fixed.
- The bug is not clearly a high priority for the community.
- The backported fix differs widely from the master branch fix.

These are clearly subjective criteria, but ones worth considering. I am always happy to help advise people on specific patches if they want a sounding board to understand whether it makes sense to backport.

- Patrick
Re: non-deprecation compiler warnings are upgraded to build errors now
Jenkins only runs Scala 2.10. I'm actually not sure what the behavior is with 2.11 for that patch. Iulian - can you take a look into it and see if it is working as expected?

On Fri, Jul 24, 2015 at 10:24 AM, Iulian Dragoș iulian.dra...@typesafe.com wrote:

On Thu, Jul 23, 2015 at 6:08 AM, Reynold Xin r...@databricks.com wrote:

Hi all,

FYI, we just merged a patch that fails the build if there is a Scala compiler warning (other than deprecation warnings).

I'm a bit confused, since I see quite a lot of warnings in semi-legitimate code. For instance, @transient (there are plenty of instances like this in spark-streaming) might generate warnings. Take:

    abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)
      extends InputDStream[T](ssc_)

The warning is:

    no valid targets for annotation on value ssc_ - it is discarded unused.
    You may specify targets with meta-annotations, e.g. @(transient @param)

At least that's what happens if I build with Scala 2.11. I'm not sure if this setting is only for 2.10, or if something really weird is happening on my machine that doesn't happen on others.

iulian

In the past, many compiler warnings have actually been caused by legitimate bugs that we needed to address. However, if we don't fail the build on warnings, people don't pay attention to them at all (it is also tough to pay attention, since there are a lot of deprecation warnings from unit tests exercising deprecated APIs and from reliance on deprecated Hadoop APIs).

Note that ideally we should be able to mark deprecation warnings as errors as well. However, because the Scala compiler cannot suppress individual warning messages, we cannot do that (since we do need to access deprecated APIs in Hadoop).

--
Iulian Dragos
Reactive Apps on the JVM
www.typesafe.com
jenkins failing on Kinesis shard limits
Looks like Jenkins is hitting some AWS limits:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38396/testReport/org.apache.spark.streaming.kinesis/KinesisBackedBlockRDDSuite/_It_is_not_a_test_/
Re: review SPARK-8730
It looks like you have a number of review comments on the PR that you have not replied to. The PR does not merge cleanly at the moment either.

On Fri, Jul 24, 2015 at 12:03 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

Hey,

I've opened a PR to fix a ser/de issue with primitive classes in the Java serializer. I've already encountered this problem in several different scenarios, so I'm bringing it up. It would be great if someone could have a look at it! :)

https://issues.apache.org/jira/browse/SPARK-8730
https://github.com/apache/spark/pull/7122

Thanks,
Eugen
Re: review SPARK-8730
I only just saw those comments, and the PR hasn't merged cleanly for a while; the code has evolved since I opened it.

On Fri, Jul 24, 2015 at 2:12 PM, Sean Owen so...@cloudera.com wrote:

It looks like you have a number of review comments on the PR that you have not replied to. The PR does not merge cleanly at the moment either.

On Fri, Jul 24, 2015 at 12:03 PM, Eugen Cepoi cepoi.eu...@gmail.com wrote:

Hey,

I've opened a PR to fix a ser/de issue with primitive classes in the Java serializer. I've already encountered this problem in several different scenarios, so I'm bringing it up. It would be great if someone could have a look at it! :)

https://issues.apache.org/jira/browse/SPARK-8730
https://github.com/apache/spark/pull/7122

Thanks,
Eugen