Re: [VOTE] Release Apache Spark 1.2.0 (RC1)
+1 (non-binding)

-- Original --
From: Patrick Wendell <pwend...@gmail.com>
Date: Sat, Nov 29, 2014 01:16 PM
To: dev@spark.apache.org
Subject: [VOTE] Release Apache Spark 1.2.0 (RC1)

Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1048/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Tuesday, December 02, at 05:15 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening very late in the QA period compared with previous votes, so -1 votes should only be cast for significant regressions from 1.0.2. Bugs already present in 1.1.X, minor regressions, and bugs related to new features will not block this release.

== What default changes should I be aware of? ==
1. The default value of spark.shuffle.blockTransferService has been changed to netty
-- Old behavior can be restored by switching to nio
2. The default value of spark.shuffle.manager has been changed to sort.
-- Old behavior can be restored by setting spark.shuffle.manager to hash.
== Other notes ==
Because this vote is occurring over a weekend, I will likely extend the vote if this RC survives until the end of the vote period.

- Patrick
Spurious test failures, testing best practices
In the course of trying to make contributions to Spark, I have had a lot of trouble running Spark's tests successfully. The main pain points I've experienced are:

1) frequent, spurious test failures
2) high latency of running tests
3) difficulty running specific tests in an iterative fashion

Here is an example series of failures that I encountered this weekend (along with footnote links to the console output from each and approximately how long each took):

- `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen before.
- `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
- `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite passed, but the Scala compiler crashed on the catalyst project.
- `mvn clean`: some attempts to run earlier commands (that previously didn't crash the compiler) all resulted in the same compiler crash. Previous discussion on this list implies this can only be solved by a `mvn clean` [4].
- `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean, BroadcastSuite can't run because the assembly is not built.
- `./dev/run-tests` again [6]: pyspark tests fail, with some messages about version mismatches and Python 2.6. The machine this ran on has Python 2.7, so I don't know what that's about.
- `./dev/run-tests` again [7]: "too many open files" errors in several tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is not enough, but only some of the time? I increased it to 8192 and tried again.
- `./dev/run-tests` again [8]: same pyspark errors as before. This seems to be the issue from SPARK-3867 [9], which was supposedly fixed on October 14; I'm not sure how I'm seeing it now. In any case, I switched to Python 2.6 and installed unittest2, and python/run-tests seems to be unblocked.
- `./dev/run-tests` again [10]: finally passes!

This was on a Spark checkout at ceb6281 (ToT Friday), with a few trivial changes added on (that I wanted to test before sending out a PR), on a MacBook running OS X Yosemite (10.10.1), Java 1.8, and Maven 3.2.3 [11].

Meanwhile, on a Linux 2.6.32 / CentOS 6.4 machine, I tried similar commands from the same repo state:

- `./dev/run-tests` [12]: YarnClusterSuite failure.
- `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen this one before on this machine and am guessing it actually occurs every time.
- `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more time from ceb6281, and saw the same failure.

This was with Java 1.7 and Maven 3.2.3 [15]. In one final attempt to narrow down the Linux YarnClusterSuite failure, I ran `./dev/run-tests` on my Mac, from ceb6281, with Java 1.7 (instead of 1.8, which the previous runs used), and it passed [16], so the failure seems specific to my Linux machine/arch.

At this point I believe that my changes don't break any tests (the YarnClusterSuite failure on my Linux box presumably not being... real), and I am ready to send out a PR. Whew!

However, reflecting on the 5 or 6 distinct failure modes represented above:

- One of them (too many open files) is something I can (and did, hopefully) fix once and for all. It cost me about an hour this time (the approximate running time of ./dev/run-tests) and a few hours other times when I didn't fully understand/fix it. It doesn't happen deterministically (why?), but it does happen somewhat frequently to people, having been discussed on the user list multiple times [17] and on SO [18]. Maybe a note in the documentation advising people to check their ulimit makes sense?
- One of them (unittest2 must be installed for Python 2.6) was supposedly fixed upstream of the commits I tested here; I don't know why I'm still running into it. This cost me a few hours of running `./dev/run-tests` multiple times to see if it was transient, plus some time researching and working around it.
- The original BroadcastSuite failure cost me a few hours and went away before I'd even run `mvn clean`.
- A new incarnation of the sbt-compiler-crash phenomenon cost me a few hours of running `./dev/run-tests` in different ways before deciding that, as usual, there was no way around it and that I'd need to run `mvn clean` and start running tests from scratch.
- The YarnClusterSuite failures on my Linux box have cost me hours of trying to figure out whether they're my fault. I've seen them many times over the past weeks/months, plus or minus other failures that have come and gone, and I was especially befuddled by them when I was seeing a disjoint set of reproducible failures on my Mac [19] (the triaging of which involved dozens of runs of `./dev/run-tests`).

While I'm interested in digging into each of these issues, I also want to discuss the frequency with which I've run into issues like these. This is unfortunately not the first time in recent months that I've spent days playing spurious-test-failure whack-a-mole with a 60-90 min dev/run-tests iteration time, which is no fun! So I am wondering/thinking:

- Do other people experience
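The "too many open files" failure mode above can at least be checked for up front. A small sketch (the 8192 value is just what worked in the runs described here, not an official recommendation):

```shell
# Check the per-process open-file limit before a full ./dev/run-tests run;
# limits in the low thousands have been seen to trigger "too many open files".
limit=$(ulimit -n)
echo "open-file limit: $limit"
if [ "$limit" != "unlimited" ] && [ "$limit" -lt 8192 ]; then
  # Raising the soft limit only works up to the hard limit; past that,
  # /etc/security/limits.conf (Linux) or launchctl (OS X) must be changed.
  ulimit -n 8192 2>/dev/null || echo "warning: could not raise limit to 8192"
fi
```

Putting a check like this at the top of a test wrapper script would turn an hour-long mystery failure into an immediate, explicit warning.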
Re: Spurious test failures, testing best practices
+1, you aren't alone in this. I certainly would like some clarity on these things as well, but, as has been said on this listserv a few times (and as you noted), most developers use `sbt` for their day-to-day compilations to greatly speed up the iterative testing process. I personally use `sbt` for all builds until I'm ready to submit a PR, and *then* run ./dev/run-tests to ensure all the tests / code I've written still pass (i.e. nothing breaks in the code I've changed or downstream). Sometimes, like you've said, you still get errors with the ./dev/run-tests script, but, for me, it comes down to where the errors originate and whether I'm confident the code I wrote caused them; that's my deciding factor for whether I submit the PR. Again, not a great answer, and I hope others can shed more light, but that's my 2c on the problem.

On 11/30/14, 5:39 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
> In the course of trying to make contributions to Spark, I have had a lot of trouble running Spark's tests successfully. [...]
Re: Spurious test failures, testing best practices
Hi Ryan,

As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster to iterate. To use it, you can do the following:

- Start the SBT interactive console with sbt/sbt
- Build your assembly by running the assembly target in the assembly project: assembly/assembly
- Run all the tests in one module: core/test
- Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)

Running all the tests does take a while, and I usually just rely on Jenkins for that once I've run the tests for the things I believed my patch could break. But this is because some of them are integration tests (e.g. DistributedSuite, which creates multi-process mini-clusters). Many of the individual suites run fast without requiring this, however, so you can pick the ones you want. Perhaps we should find a way to tag them so people can do a quick-test that skips the integration ones.

The assembly builds are annoying, but they only take about a minute for me on a MacBook Pro with SBT warmed up. The assembly is actually only required for some of the integration tests (which launch new processes), but I'd recommend doing it all the time anyway, since it would be very confusing to run those with an old assembly.

The Scala compiler crash issue can also be a problem, but I don't see it very often with SBT. If it happens, I exit SBT and do sbt clean.

Anyway, this is useful feedback and I think we should try to improve some of these suites, but hopefully you can also try the faster SBT process. At the end of the day, if we want integration tests, the whole test process will take an hour, but most of the developers I know leave that to Jenkins and only run individual tests locally before submitting a patch.
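The SBT workflow described above, as it would look in a terminal session from the Spark repo root (an illustrative transcript, not a runnable script outside a Spark checkout):

```
$ sbt/sbt                # start the interactive console
> assembly/assembly      # build the assembly (needed by the integration tests)
> core/test              # run every suite in the core module
> core/test-only org.apache.spark.rdd.RDDSuite    # run a single suite
> core/test-only *RDDSuite                        # test-only also accepts wildcards
```

Because the console stays open between commands, the Scala compiler stays loaded and JITed, which is where the iteration-speed win comes from.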
Matei

On Nov 30, 2014, at 2:39 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
> In the course of trying to make contributions to Spark, I have had a lot of trouble running Spark's tests successfully. [...]
Re: Spurious test failures, testing best practices
thanks for the info, Matei and Brennon. I will try to switch my workflow to using sbt. Other potential action items:

- currently the docs only contain information about building with maven, and even then don't cover many important cases, as I described in my previous email. If SBT is as much better as you've described, then that should be made much more obvious. Wasn't it the case recently that there was only a page about building with SBT, and not one about building with maven? Clearer messaging around this needs to exist in the documentation, not just on the mailing list, imho.
- +1 to better distinguishing between unit and integration tests: having separate scripts for each, improving documentation around common workflows, setting expectations about the brittleness of each kind of test, advising people to just rely on Jenkins for certain kinds of tests so as not to waste too much time, etc. Things like the compiler crash should be discussed in the documentation, not just in the mailing list archives, if new contributors are likely to run into them through no fault of their own.
- What is the algorithm you use to decide which tests you might have broken? Can we codify it in some scripts that other people can use?

On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi Ryan, As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. [...]
Re: Spurious test failures, testing best practices
> - currently the docs only contain information about building with maven, and even then don't cover many important cases

All other points aside, I just want to point out that the docs document both how to use Maven and SBT, and clearly state (https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt) that Maven is the "build of reference" while SBT may be preferable for day-to-day development. I believe the main reason most people miss this documentation is that, though it's up-to-date on GitHub, it hasn't been published yet to the docs site. It should go out with the 1.2 release.

Improvements to the documentation on building Spark belong here: https://github.com/apache/spark/blob/master/docs/building-spark.md. If there are clear recommendations that come out of this thread but are not in that doc, they should be added there. Other, less important details may be better suited for the Contributing to Spark guide: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Nick

On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell <pwend...@gmail.com> wrote:

Hey Ryan,

A few more things here. You should feel free to send patches to Jenkins to test them, since this is the reference environment in which we regularly run tests. This is the normal workflow for most developers, and we spend a lot of effort provisioning/maintaining a very large Jenkins cluster to give developers access to this resource. A common development approach is to locally run the tests that you've added in a patch, then send it to Jenkins for the full run, and then try to debug locally if you see specific unanticipated test failures.

One challenge we have is that, given the proliferation of OS versions, Java versions, Python versions, ulimits, etc., there is a combinatorial number of environments in which tests could be run. It is very hard in some cases to figure out post-hoc why a given test is not working in a specific environment.
I think a good solution here would be to use a standardized Docker container for running Spark tests, and to ask folks to use that locally if they are trying to run all of the hundreds of Spark tests. Another solution would be to mock out every system interaction in Spark's tests, including e.g. filesystem interactions, to try to reduce variance across environments. However, that seems difficult.

As the number of developers of Spark increases, it's definitely a good idea for us to invest in developer infrastructure, including things like snapshot releases, better documentation, etc. Thanks for bringing this up as a pain point.

- Patrick

On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:
> thanks for the info, Matei and Brennon. I will try to switch my workflow to using sbt. [...]
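The standardized-container idea sketched above might look something like the following; note that this is purely illustrative, and the `spark-test-env` image name and `dev/docker/` Dockerfile location are invented here, not anything that exists in the repo:

```
# build a pinned test environment once (hypothetical Dockerfile location)
$ docker build -t spark-test-env dev/docker/

# run the full suite inside it, mounting the local checkout
$ docker run --rm -v "$PWD":/spark -w /spark spark-test-env ./dev/run-tests
```

The appeal is that OS version, Java version, Python version, and ulimits would all be fixed by the image, collapsing the combinatorial environment space to one.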
Re: Spurious test failures, testing best practices
> - Start the SBT interactive console with sbt/sbt
> - Build your assembly by running the assembly target in the assembly project: assembly/assembly
> - Run all the tests in one module: core/test
> - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)

The equivalent using Maven:

- Start zinc
- Build your assembly using the mvn package or install target (install is actually the equivalent of SBT's publishLocal) -- this step is the first step in http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
- Run all the tests in one module: mvn -pl core test
- Run a specific suite: mvn -pl core -DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't strictly necessary if you don't mind waiting for Maven to scan through all the other sub-projects only to do nothing; and, of course, it needs to be something other than core if the test you want to run is in another sub-project.)

You also typically want to carry along in each subsequent step any relevant command-line options you added in the package/install step.

On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Hi Ryan, As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. [...]
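The Maven steps above, collected into one terminal transcript (illustrative only; it assumes a Spark checkout and a zinc install, and skips any extra command-line options you may need to carry along from the package/install step):

```
$ zinc -start                        # keep a warm compiler daemon running
$ mvn -DskipTests install            # build and publish the modules locally
$ mvn -pl core test                  # run every suite in the core module
$ mvn -pl core -DwildcardSuites=org.apache.spark.rdd.RDDSuite test   # one suite
```

Without zinc, each mvn invocation pays the full Scala compiler startup cost, which is much of the latency gap relative to the SBT console workflow.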
Re: Spurious test failures, testing best practices
Hey Ryan,

The existing JIRA also covers publishing nightly docs: https://issues.apache.org/jira/browse/SPARK-1517

- Patrick

On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams <ryan.blake.willi...@gmail.com> wrote:

Thanks Nicholas, glad to hear that some of this info will be pushed to the main site soon, but this brings up yet another point of confusion that I've struggled with, namely whether the documentation on GitHub or that on spark.apache.org should be considered the primary reference for people seeking to learn about best practices for developing Spark.

Trying to read the docs starting from https://github.com/apache/spark/blob/master/docs/index.md right now, I find that all of the links to other parts of the documentation are broken: they point to relative paths that end in .html, which will work when published on the docs site, but which would have to end in .md for a person to be able to navigate them on GitHub. So expecting people to use the up-to-date docs on GitHub (where all internal URLs 404, and the main GitHub README suggests that the latest Spark documentation can be found on the actually-months-old docs site: https://github.com/apache/spark#online-documentation) is not a good solution. On the other hand, consulting months-old docs on the site is also problematic, as this thread and your last email have borne out. The result is that there is no good place on the internet to learn about the most up-to-date best practices for using/developing Spark.

Why not build http://spark.apache.org/docs/latest/ nightly (or on every commit) off of what's in GitHub, rather than having that URL point to the last release's docs (up to ~3 months old)?
This way, casual users who want the docs for the released version they happen to be using (which is already frequently != /latest today, for many Spark users) can (still) find them at http://spark.apache.org/docs/X.Y.Z, and the github README can safely point people to a site (/latest) that actually has up-to-date docs that reflect ToT and whose links work. If there are concerns about existing semantics around /latest URLs being broken, some new URL could be used, like http://spark.apache.org/docs/snapshot/, but given that everything under http://spark.apache.org/docs/latest/ is in a state of planned-backwards-incompatible-changes every ~3mos, that doesn't sound like that serious an issue to me; anyone sending around permanent links to things under /latest is already going to have those links break / not make sense in the near future. On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: - currently the docs only contain information about building with maven, and even then don't cover many important cases All other points aside, I just want to point out that the docs document both how to use Maven and SBT and clearly state https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt that Maven is the build of reference while SBT may be preferable for day-to-day development. I believe the main reason most people miss this documentation is that, though it's up-to-date on GitHub, it hasn't been published yet to the docs site. It should go out with the 1.2 release. Improvements to the documentation on building Spark belong here: https://github.com/apache/spark/blob/master/docs/building-spark.md If there are clear recommendations that come out of this thread but are not in that doc, they should be added in there. Other, less important details may possibly be better suited for the Contributing to Spark https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark guide. 
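To make the broken-relative-link problem concrete: the docs sources link to `foo.html`, which only resolves after the docs are published, not when the raw `.md` files are browsed on github. A throwaway helper like the following (purely illustrative; nothing like it exists in Spark) shows the rewrite that would make the same relative links navigable on github while leaving external links alone:

```python
import re

def rewrite_relative_links(markdown: str) -> str:
    """Rewrite relative .html links in markdown link targets to .md,
    so they resolve when the raw docs sources are browsed on github.
    Absolute http(s) links are left untouched."""
    # (?!https?://) skips external links; group 2 preserves any #anchor.
    return re.sub(r'\((?!https?://)([^)\s]+)\.html([^)]*)\)',
                  r'(\1.md\2)', markdown)

sample = ("[Building Spark](building-spark.html#building-with-sbt) and "
          "[site](http://spark.apache.org/docs/latest/index.html)")
print(rewrite_relative_links(sample))
```

This is only a sketch of the mechanics; the actual fix discussed in the thread (publishing snapshot docs from jenkins) avoids the problem entirely.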
Nick
Re: Spurious test failures, testing best practices
Btw - the documentation on github represents the source code of our docs, which is versioned with each release. Unfortunately github will always try to render .md files, so it could look to a passerby like this is supposed to represent published docs. This is a feature limitation of github; AFAIK we cannot disable it. The official published docs are associated with each release and available on the apache.org website. I think /latest is a common convention for referring to the latest *published release* docs, so probably we can't change that (the audience for /latest is orders of magnitude larger than for snapshot docs). However, we could just add /snapshot and publish docs there. - Patrick
Re: Spurious test failures, testing best practices
Thanks Mark, most of those commands are things I've been using and used in my original post, except for "Start zinc". I now see the section about it on the unpublished building-spark page (https://github.com/apache/spark/blob/master/docs/building-spark.md#speeding-up-compilation-with-zinc) and will try using it. Even so, finding those commands took a nontrivial amount of trial and error. I've not seen them very well documented outside of this list (your and Matei's emails, and previous emails to this list, each have more info about building/testing with Maven and SBT, respectively, than https://github.com/apache/spark/blob/master/docs/building-spark.md#spark-tests-in-maven does); the per-suite invocation still requires an assembly build in some cases (without warning, from my perspective, having not read up on the names of all Spark integration tests); spurious failures still abound; and there's no good way to run only the things that a given change actually could have broken. Anyway, hopefully zinc brings me to the world of ~minute iteration times that have been reported on this thread.
Re: [RESULT] [VOTE] Designating maintainers for some Spark components
An update on this: After adding the initial maintainer list, we got feedback to add more maintainers for some components, so we added four others (Josh Rosen for core API, Mark Hamstra for scheduler, Shivaram Venkataraman for MLlib and Xiangrui Meng for Python). We also decided to lower the timeout for waiting for a maintainer to a week. Hopefully this will provide more options for reviewing in these components. The complete list is available at https://cwiki.apache.org/confluence/display/SPARK/Committers. Matei On Nov 8, 2014, at 7:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Thanks everyone for voting on this. With all of the PMC votes being for, the vote passes, but there were some concerns that I wanted to address for everyone who brought them up, as well as in the wording we will use for this policy. First, like every Apache project, Spark follows the Apache voting process (http://www.apache.org/foundation/voting.html), wherein all code changes are done by consensus. This means that any PMC member can block a code change on technical grounds, and thus that there is consensus when something goes in. It's absolutely true that every PMC member is responsible for the whole codebase, as Greg said (not least due to legal reasons, e.g. making sure it complies with licensing rules), and this idea will not change that. To make this clear, I will include that in the wording on the project page, to make sure new committers and other community members are all aware of it. What the maintainer model does, instead, is to change the review process, by having a required review from some people on some types of code changes (assuming those people respond in time). Projects can have their own diverse review processes (e.g. some do commit-then-review and others do review-then-commit, some point people to specific reviewers, etc). This kind of process seems useful to try (and to refine) as the project grows. We will of course evaluate how it goes and respond to any problems. 
So to summarize:
- Every committer is responsible for, and more than welcome to review and vote on, every code change. In fact all community members are welcome to do this, and lots are doing it.
- Everyone has the same voting rights on these code changes (namely consensus as described at http://www.apache.org/foundation/voting.html).
- Committers will be asked to run patches that are making architectural and API changes by the maintainers before merging.

In practice, none of this matters too much because we are not exactly a hot-well of discord ;), and even in the case of discord, the point of the ASF voting process is to create consensus. The goal is just to have a better structure for reviewing and minimize the chance of errors.

Here is a tally of the votes:

Binding votes (from PMC): 17 +1, no 0 or -1
Matei Zaharia, Michael Armbrust, Reynold Xin, Patrick Wendell, Andrew Or, Prashant Sharma, Mark Hamstra, Xiangrui Meng, Ankur Dave, Imran Rashid, Jason Dai, Tom Graves, Sean McNamara, Nick Pentreath, Josh Rosen, Kay Ousterhout, Tathagata Das

Non-binding votes: 18 +1, one +0, one -1
+1: Nan Zhu, Nicholas Chammas, Denny Lee, Cheng Lian, Timothy Chen, Jeremy Freeman, Cheng Hao, Jackylk Likun, Kousuke Saruta, Reza Zadeh, Xuefeng Wu, Witgo, Manoj Babu, Ravindra Pesala, Liquan Pei, Kushal Datta, Davies Liu, Vaquar Khan
+0: Corey Nolet
-1: Greg Stein

I'll send another email when I have a more detailed writeup of this on the website. Matei - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
Re: Spurious test failures, testing best practices
Thanks Patrick, great to hear that docs-snapshots-via-jenkins is already JIRA'd; you can interpret some of this thread as a gigantic +1 from me on prioritizing that, which it looks like you are doing :) I do understand the limitations of the github vs. official site status quo; I was mostly responding to a perceived implication that I should have been getting building/testing-Spark advice from the github .md files instead of from /latest. I agree that neither one works very well currently, and that docs-snapshots-via-jenkins is the right solution. Per my other email, leaving /latest as-is sounds reasonable, as long as jenkins is putting the latest docs *somewhere*.
Re: Spurious test failures, testing best practices
Hi, Patrick - with regards to testing on Jenkins, is the process for this to submit a pull request for the branch, or is there another interface we can use to submit a build to Jenkins for testing? On 11/30/14, 6:49 PM, Patrick Wendell pwend...@gmail.com wrote: Hey Ryan, A few more things here. You should feel free to send patches to Jenkins to test them, since this is the reference environment in which we regularly run tests. This is the normal workflow for most developers, and we spend a lot of effort provisioning/maintaining a very large jenkins cluster to allow developers access to this resource. A common development approach is to locally run tests that you've added in a patch, then send it to jenkins for the full run, and then try to debug locally if you see specific unanticipated test failures. One challenge we have is that given the proliferation of OS versions, Java versions, Python versions, ulimits, etc. there is a combinatorial number of environments in which tests could be run. It is very hard in some cases to figure out post-hoc why a given test is not working in a specific environment. I think a good solution here would be to use a standardized docker container for running Spark tests and asking folks to use that locally if they are trying to run all of the hundreds of Spark tests. Another solution would be to mock out every system interaction in Spark's tests, including e.g. filesystem interactions, to try and reduce variance across environments. However, that seems difficult. As the number of developers of Spark increases, it's definitely a good idea for us to invest in developer infrastructure including things like snapshot releases, better documentation, etc. Thanks for bringing this up as a pain point. - Patrick On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams ryan.blake.willi...@gmail.com wrote: thanks for the info, Matei and Brennon. I will try to switch my workflow to using sbt. 
Other potential action items:
- currently the docs only contain information about building with maven, and even then don't cover many important cases, as I described in my previous email. If SBT is as much better as you've described then that should be made much more obvious. Wasn't it the case recently that there was only a page about building with SBT, and not one about building with maven? Clearer messaging around this needs to exist in the documentation, not just on the mailing list, imho.
- +1 to better distinguishing between unit and integration tests, having separate scripts for each, improving documentation around common workflows, expectations of brittleness with each kind of test, advisability of just relying on Jenkins for certain kinds of tests to not waste too much time, etc. Things like the compiler crash should be discussed in the documentation, not just in the mailing list archives, if new contributors are likely to run into them through no fault of their own.
- What is the algorithm you use to decide what tests you might have broken? Can we codify it in some scripts that other people can use?

On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia matei.zaha...@gmail.com wrote: Hi Ryan, As a tip (and maybe this isn't documented well), I normally use SBT for development to avoid the slow build process, and use its interactive console to run only specific tests. The nice advantage is that SBT can keep the Scala compiler loaded and JITed across builds, making it faster to iterate. To use it, you can do the following:
- Start the SBT interactive console with sbt/sbt
- Build your assembly by running the assembly target in the assembly project: assembly/assembly
- Run all the tests in one module: core/test
- Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also supports tab completion)

Running all the tests does take a while, and I usually just rely on Jenkins for that once I've run the tests for the things I believed my patch could break. 
But this is because some of them are integration tests (e.g. DistributedSuite, which creates multi-process mini-clusters). Many of the individual suites run fast without requiring this, however, so you can pick the ones you want. Perhaps we should find a way to tag them so people can do a quick-test that skips the integration ones. The assembly builds are annoying but they only take about a minute for me on a MacBook Pro with SBT warmed up. The assembly is actually only required for some of the integration tests (which launch new processes), but I'd recommend doing it all the time anyway since it would be very confusing to run those with an old assembly. The Scala compiler crash issue can also be a problem, but I don't see it very often with SBT. If it happens, I exit SBT and do sbt clean. Anyway, this is useful feedback and I think we should try to improve some of these suites, but hopefully you can also try the faster SBT process. At the end of the day, if we want integration tests,
Re: [mllib] Which is the correct package to add a new algorithm?
Hi Joseph, Thank you for your nice work and for telling us about the draft! "During the next development cycle, new algorithms should be contributed to spark.mllib. Optionally, wrappers for new (and old) algorithms can be contributed to spark.ml." I understand that we should contribute new algorithms to spark.mllib. Thanks, Yu
-- Yu Ishikawa