Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-11-30 Thread GuoQiang Li
+1 (non-binding)




-- Original --
From: Patrick Wendell <pwend...@gmail.com>
Date:  Sat, Nov 29, 2014 01:16 PM
To: dev@spark.apache.org

Subject:  [VOTE] Release Apache Spark 1.2.0 (RC1)



Please vote on releasing the following candidate as Apache Spark version 1.2.0!

The tag to be voted on is v1.2.0-rc1 (commit 1056e9ec1):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=1056e9ec13203d0c51564265e94d77a054498fdb

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
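If you want to check the signatures before voting, a minimal verification looks roughly like this (the tarball name below is illustrative; substitute whichever artifact you downloaded from the release directory):

    curl -O https://people.apache.org/keys/committer/pwendell.asc
    gpg --import pwendell.asc
    gpg --verify spark-1.2.0.tgz.asc spark-1.2.0.tgz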

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1048/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.0-rc1-docs/

Please vote on releasing this package as Apache Spark 1.2.0!

The vote is open until Tuesday, December 02, at 05:15 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

== What justifies a -1 vote for this release? ==
This vote is happening very late into the QA period compared with
previous votes, so -1 votes should only occur for significant
regressions from 1.0.2. Bugs already present in 1.1.X, minor
regressions, or bugs related to new features will not block this
release.

== What default changes should I be aware of? ==
1. The default value of spark.shuffle.blockTransferService has been
changed to netty
-- Old behavior can be restored by switching to nio

2. The default value of spark.shuffle.manager has been changed to sort.
-- Old behavior can be restored by setting spark.shuffle.manager to hash.
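If you want to spot-check the old code paths against this release, a quick way (sketch only; any job or shell session works) is to pass the two properties explicitly, or put them in conf/spark-defaults.conf:

    ./bin/spark-shell \
      --conf spark.shuffle.blockTransferService=nio \
      --conf spark.shuffle.manager=hash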

== Other notes ==
Because this vote is occurring over a weekend, I will likely extend
the vote if this RC survives until the end of the vote period.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
In the course of trying to make contributions to Spark, I have had a lot of
trouble running Spark's tests successfully. The main pain points I've
experienced are:

1) frequent, spurious test failures
2) high latency of running tests
3) difficulty running specific tests in an iterative fashion

Here is an example series of failures that I encountered this weekend
(along with footnote links to the console output from each and
approximately how long each took):

- `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
before.
- `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
- `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
passed, but scala compiler crashed on the catalyst project.
- `mvn clean`: some attempts to run earlier commands (that previously
didn't crash the compiler) all result in the same compiler crash. Previous
discussion on this list implies this can only be solved by a `mvn clean`
[4].
- `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
BroadcastSuite can't run because assembly is not built.
- `./dev/run-tests` again [6]: pyspark tests fail, some messages about
version mismatches and python 2.6. The machine this ran on has python 2.7,
so I don't know what that's about.
- `./dev/run-tests` again [7]: too many open files errors in several
tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
not enough, but only some of the time? I increased it to 8192 and tried
again.
- `./dev/run-tests` again [8]: same pyspark errors as before. This seems to
be the issue from SPARK-3867 [9], which was supposedly fixed on October 14;
not sure how I'm seeing it now. In any case, switched to Python 2.6 and
installed unittest2, and python/run-tests seems to be unblocked.
- `./dev/run-tests` again [10]: finally passes!

This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
changes added on (that I wanted to test before sending out a PR), on a
macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].

Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands
from the same repo state:

- `./dev/run-tests` [12]: YarnClusterSuite failure.
- `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
this one before on this machine and am guessing it actually occurs every
time.
- `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
time from ceb6281, and saw the same failure.

This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow
down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac,
from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used),
and it passed [16], so the failure seems specific to my linux machine/arch.

At this point I believe that my changes don't break any tests (the
YarnClusterSuite failure on my linux presumably not being... real), and I
am ready to send out a PR. Whew!

However, reflecting on the 5 or 6 distinct failure-modes represented above:

- One of them (too many open files) is something I can (and did,
hopefully) fix once and for all. It cost me about an hour this time
(roughly the time of one ./dev/run-tests run) and a few hours other times
when I didn't fully understand/fix it. It doesn't happen deterministically
(why?), but it does happen somewhat frequently to people, having been
discussed on the user list multiple times [17] and on SO [18]. Maybe some
note in the documentation advising people to check their ulimit makes
sense? (A sketch of the relevant shell commands follows this list.)
- One of them (unittest2 must be installed for python 2.6) was supposedly
fixed upstream of the commits I tested here; I don't know why I'm still
running into it. This cost me a few hours of running `./dev/run-tests`
multiple times to see if it was transient, plus some time researching and
working around it.
- The original BroadcastSuite failure cost me a few hours and went away
before I'd even run `mvn clean`.
- A new incarnation of the sbt-compiler-crash phenomenon cost me a few
hours of running `./dev/run-tests` in different ways before deciding that,
as usual, there was no way around it and that I'd need to run `mvn clean`
and start running tests from scratch.
- The YarnClusterSuite failures on my linux box have cost me hours of
trying to figure out whether they're my fault. I've seen them many times
over the past weeks/months, plus or minus other failures that have come and
gone, and was especially befuddled by them when I was seeing a disjoint set
of reproducible failures on my mac [19] (the triaging of which involved
dozens of runs of `./dev/run-tests`).
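
As a sketch of the ulimit check/bump mentioned in the first bullet above
(bash-like shell assumed; making the change permanent is OS-specific):

    # check the current per-process limit on open files
    ulimit -n
    # raise it for this shell session, then re-run the tests
    ulimit -n 8192
    ./dev/run-tests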

While I'm interested in digging into each of these issues, I also want to
discuss the frequency with which I've run into issues like these. This is
unfortunately not the first time in recent months that I've spent days
playing spurious-test-failure whack-a-mole with a 60-90min dev/run-tests
iteration time, which is no fun! So I am wondering/thinking:

- Do other people experience 

Re: Spurious test failures, testing best practices

2014-11-30 Thread York, Brennon
+1, you aren't alone in this. I certainly would like some clarity on these
things as well, but, as it's been said on this listserv a few times (and you
noted), most developers use `sbt` for their day-to-day compilations to
greatly speed up the iterative testing process. I personally use `sbt` for
all builds until I'm ready to submit a PR and *then* run ./dev/run-tests
to ensure all the tests / code I've written still pass (i.e. nothing
breaks in the code I've changed or downstream). Sometimes, like you've
said, you still get errors from the ./dev/run-tests script, but, for me,
it comes down to where the errors originate and whether I'm confident the
code I wrote caused them; that is what decides whether I submit the PR.

Again, not a great answer, and I hope others can shed more light, but that's
my 2c on the problem.

On 11/30/14, 5:39 PM, Ryan Williams ryan.blake.willi...@gmail.com
wrote:

In the course of trying to make contributions to Spark, I have had a lot
of
trouble running Spark's tests successfully. The main pain points I've
experienced are:

1) frequent, spurious test failures
2) high latency of running tests
3) difficulty running specific tests in an iterative fashion

Here is an example series of failures that I encountered this weekend
(along with footnote links to the console output from each and
approximately how long each took):

- `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
before.
- `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
- `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
passed, but scala compiler crashed on the catalyst project.
- `mvn clean`: some attempts to run earlier commands (that previously
didn't crash the compiler) all result in the same compiler crash. Previous
discussion on this list implies this can only be solved by a `mvn clean`
[4].
- `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
BroadcastSuite can't run because assembly is not built.
- `./dev/run-tests` again [6]: pyspark tests fail, some messages about
version mismatches and python 2.6. The machine this ran on has python 2.7,
so I don't know what that's about.
- `./dev/run-tests` again [7]: too many open files errors in several
tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
not enough, but only some of the time? I increased it to 8192 and tried
again.
- `./dev/run-tests` again [8]: same pyspark errors as before. This seems
to
be the issue from SPARK-3867 [9], which was supposedly fixed on October
14;
not sure how I'm seeing it now. In any case, switched to Python 2.6 and
installed unittest2, and python/run-tests seems to be unblocked.
- `./dev/run-tests` again [10]: finally passes!

This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
changes added on (that I wanted to test before sending out a PR), on a
macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].

Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar
commands
from the same repo state:

- `./dev/run-tests` [12]: YarnClusterSuite failure.
- `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
this one before on this machine and am guessing it actually occurs every
time.
- `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
time from ceb6281, and saw the same failure.

This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to
narrow
down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my
mac,
from ceb6281, with java 1.7 (instead of 1.8, which the previous runs
used),
and it passed [16], so the failure seems specific to my linux
machine/arch.

At this point I believe that my changes don't break any tests (the
YarnClusterSuite failure on my linux presumably not being... real), and
I
am ready to send out a PR. Whew!

However, reflecting on the 5 or 6 distinct failure-modes represented
above:

- One of them (too many files open), is something I can (and did,
hopefully) fix once and for all. It cost me an ~hour this time
(approximate
time of running ./dev/run-tests) and a few hours other times when I didn't
fully understand/fix it. It doesn't happen deterministically (why?), but
does happen somewhat frequently to people, having been discussed on the
user list multiple times [17] and on SO [18]. Maybe some note in the
documentation advising people to check their ulimit makes sense?
- One of them (unittest2 must be installed for python 2.6) was supposedly
fixed upstream of the commits I tested here; I don't know why I'm still
running into it. This cost me a few hours of running `./dev/run-tests`
multiple times to see if it was transient, plus some time researching and
working around it.
- The original BroadcastSuite failure cost me a few hours and went away
before I'd even run `mvn clean`.
- A new incarnation of the sbt-compiler-crash phenomenon cost me a few
hours of running `./dev/run-tests` in different ways before deciding 

Re: Spurious test failures, testing best practices

2014-11-30 Thread Matei Zaharia
Hi Ryan,

As a tip (and maybe this isn't documented well), I normally use SBT for 
development to avoid the slow build process, and use its interactive console to 
run only specific tests. The nice advantage is that SBT can keep the Scala 
compiler loaded and JITed across builds, making it faster to iterate. To use 
it, you can do the following:

- Start the SBT interactive console with sbt/sbt
- Build your assembly by running the assembly target in the assembly project: 
assembly/assembly
- Run all the tests in one module: core/test
- Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this also 
supports tab completion)
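
For example, an interactive session might look like the following (the
wildcard and ~ variants are a sketch; double-check them against the sbt
version in the repo):

    sbt/sbt
    > assembly/assembly
    > core/test-only org.apache.spark.rdd.RDDSuite
    > core/test-only *RDDSuite        (wildcards are accepted too)
    > ~core/test-only *RDDSuite       (the ~ prefix re-runs the suite whenever sources change)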

Running all the tests does take a while, and I usually just rely on Jenkins for 
that once I've run the tests for the things I believed my patch could break. 
But this is because some of them are integration tests (e.g. DistributedSuite, 
which creates multi-process mini-clusters). Many of the individual suites run 
fast without requiring this, however, so you can pick the ones you want. 
Perhaps we should find a way to tag them so people can do a quick test that 
skips the integration ones.

The assembly builds are annoying but they only take about a minute for me on a 
MacBook Pro with SBT warmed up. The assembly is actually only required for some 
of the integration tests (which launch new processes), but I'd recommend 
doing it all the time anyway since it would be very confusing to run those with 
an old assembly. The Scala compiler crash issue can also be a problem, but I 
don't see it very often with SBT. If it happens, I exit SBT and do sbt clean.

Anyway, this is useful feedback and I think we should try to improve some of 
these suites, but hopefully you can also try the faster SBT process. At the end 
of the day, if we want integration tests, the whole test process will take an 
hour, but most of the developers I know leave that to Jenkins and only run 
individual tests locally before submitting a patch.

Matei


 On Nov 30, 2014, at 2:39 PM, Ryan Williams ryan.blake.willi...@gmail.com 
 wrote:
 
 In the course of trying to make contributions to Spark, I have had a lot of
 trouble running Spark's tests successfully. The main pain points I've
 experienced are:
 
1) frequent, spurious test failures
2) high latency of running tests
3) difficulty running specific tests in an iterative fashion
 
 Here is an example series of failures that I encountered this weekend
 (along with footnote links to the console output from each and
 approximately how long each took):
 
 - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
 before.
 - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
 - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
 passed, but scala compiler crashed on the catalyst project.
 - `mvn clean`: some attempts to run earlier commands (that previously
 didn't crash the compiler) all result in the same compiler crash. Previous
 discussion on this list implies this can only be solved by a `mvn clean`
 [4].
 - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
 BroadcastSuite can't run because assembly is not built.
 - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
 version mismatches and python 2.6. The machine this ran on has python 2.7,
 so I don't know what that's about.
 - `./dev/run-tests` again [7]: too many open files errors in several
 tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
 not enough, but only some of the time? I increased it to 8192 and tried
 again.
 - `./dev/run-tests` again [8]: same pyspark errors as before. This seems to
 be the issue from SPARK-3867 [9], which was supposedly fixed on October 14;
 not sure how I'm seeing it now. In any case, switched to Python 2.6 and
 installed unittest2, and python/run-tests seems to be unblocked.
 - `./dev/run-tests` again [10]: finally passes!
 
 This was on a spark checkout at ceb6281 (ToT Friday), with a few trivial
 changes added on (that I wanted to test before sending out a PR), on a
 macbook running OSX Yosemite (10.10.1), java 1.8 and mvn 3.2.3 [11].
 
 Meanwhile, on a linux 2.6.32 / CentOS 6.4 machine, I tried similar commands
 from the same repo state:
 
 - `./dev/run-tests` [12]: YarnClusterSuite failure.
 - `./dev/run-tests` [13]: same YarnClusterSuite failure. I know I've seen
 this one before on this machine and am guessing it actually occurs every
 time.
 - `./dev/run-tests` [14]: to be sure, I reverted my changes, ran one more
 time from ceb6281, and saw the same failure.
 
 This was with java 1.7 and maven 3.2.3 [15]. In one final attempt to narrow
 down the linux YarnClusterSuite failure, I ran `./dev/run-tests` on my mac,
 from ceb6281, with java 1.7 (instead of 1.8, which the previous runs used),
 and it passed [16], so the failure seems specific to my linux machine/arch.
 
 At this point I believe that my changes don't break any tests (the
 

Re: Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
thanks for the info, Matei and Brennon. I will try to switch my workflow to
using sbt. Other potential action items:

- currently the docs only contain information about building with maven,
and even then don't cover many important cases, as I described in my
previous email. If SBT is as much better as you've described then that
should be made much more obvious. Wasn't it the case recently that there
was only a page about building with SBT, and not one about building with
maven? Clearer messaging around this needs to exist in the documentation,
not just on the mailing list, imho.

- +1 to better distinguishing between unit and integration tests, having
separate scripts for each, improving documentation around common workflows,
expectations of brittleness with each kind of test, advisability of just
relying on Jenkins for certain kinds of tests to not waste too much time,
etc. Things like the compiler crash should be discussed in the
documentation, not just in the mailing list archives, if new contributors
are likely to run into them through no fault of their own.

- What is the algorithm you use to decide what tests you might have broken?
Can we codify it in some scripts that other people can use?



On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi Ryan,

 As a tip (and maybe this isn't documented well), I normally use SBT for
 development to avoid the slow build process, and use its interactive
 console to run only specific tests. The nice advantage is that SBT can keep
 the Scala compiler loaded and JITed across builds, making it faster to
 iterate. To use it, you can do the following:

 - Start the SBT interactive console with sbt/sbt
 - Build your assembly by running the assembly target in the assembly
 project: assembly/assembly
 - Run all the tests in one module: core/test
 - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
 also supports tab completion)

 Running all the tests does take a while, and I usually just rely on
 Jenkins for that once I've run the tests for the things I believed my patch
 could break. But this is because some of them are integration tests (e.g.
 DistributedSuite, which creates multi-process mini-clusters). Many of the
 individual suites run fast without requiring this, however, so you can pick
 the ones you want. Perhaps we should find a way to tag them so people  can
 do a quick-test that skips the integration ones.

 The assembly builds are annoying but they only take about a minute for me
 on a MacBook Pro with SBT warmed up. The assembly is actually only required
 for some of the integration tests (which launch new processes), but I'd
 recommend doing it all the time anyway since it would be very confusing to
 run those with an old assembly. The Scala compiler crash issue can also be
 a problem, but I don't see it very often with SBT. If it happens, I exit
 SBT and do sbt clean.

 Anyway, this is useful feedback and I think we should try to improve some
 of these suites, but hopefully you can also try the faster SBT process. At
 the end of the day, if we want integration tests, the whole test process
 will take an hour, but most of the developers I know leave that to Jenkins
 and only run individual tests locally before submitting a patch.

 Matei


  On Nov 30, 2014, at 2:39 PM, Ryan Williams 
 ryan.blake.willi...@gmail.com wrote:
 
  In the course of trying to make contributions to Spark, I have had a lot
 of
  trouble running Spark's tests successfully. The main pain points I've
  experienced are:
 
 1) frequent, spurious test failures
 2) high latency of running tests
 3) difficulty running specific tests in an iterative fashion
 
  Here is an example series of failures that I encountered this weekend
  (along with footnote links to the console output from each and
  approximately how long each took):
 
  - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
  before.
  - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
  - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
  passed, but scala compiler crashed on the catalyst project.
  - `mvn clean`: some attempts to run earlier commands (that previously
  didn't crash the compiler) all result in the same compiler crash.
 Previous
  discussion on this list implies this can only be solved by a `mvn clean`
  [4].
  - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
  BroadcastSuite can't run because assembly is not built.
  - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
  version mismatches and python 2.6. The machine this ran on has python
 2.7,
  so I don't know what that's about.
  - `./dev/run-tests` again [7]: too many open files errors in several
  tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
  not enough, but only some of the time? I increased it to 8192 and tried
  again.
  - `./dev/run-tests` again [8]: same pyspark 

Re: Spurious test failures, testing best practices

2014-11-30 Thread Nicholas Chammas
   - currently the docs only contain information about building with maven,
   and even then don’t cover many important cases

 All other points aside, I just want to point out that the docs document
both how to use Maven and SBT and clearly state
https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt
that Maven is the “build of reference” while SBT may be preferable for
day-to-day development.

I believe the main reason most people miss this documentation is that,
though it's up-to-date on GitHub, it hasn't been published yet to the docs
site. It should go out with the 1.2 release.

Improvements to the documentation on building Spark belong here:
https://github.com/apache/spark/blob/master/docs/building-spark.md

If there are clear recommendations that come out of this thread but are not
in that doc, they should be added in there. Other, less important details
may possibly be better suited for the Contributing to Spark
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
guide.

Nick

On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell pwend...@gmail.com wrote:

 Hey Ryan,

 A few more things here. You should feel free to send patches to
 Jenkins to test them, since this is the reference environment in which
 we regularly run tests. This is the normal workflow for most
 developers and we spend a lot of effort provisioning/maintaining a
 very large jenkins cluster to allow developers access to this resource. A
 common development approach is to locally run tests that you've added
 in a patch, then send it to jenkins for the full run, and then try to
 debug locally if you see specific unanticipated test failures.

 One challenge we have is that given the proliferation of OS versions,
 Java versions, Python versions, ulimits, etc. there is a combinatorial
 number of environments in which tests could be run. It is very hard in
 some cases to figure out post-hoc why a given test is not working in a
 specific environment. I think a good solution here would be to use a
 standardized docker container for running Spark tests and asking folks
 to use that locally if they are trying to run all of the hundreds of
 Spark tests.
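
 As a purely illustrative sketch of that idea (the image name is
 hypothetical; no such image is published today), a contributor could then
 run the full suite inside the pinned environment with something like:

     docker run -it --rm -v "$PWD":/opt/spark -w /opt/spark \
       spark-test-env:jdk7 ./dev/run-tests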

 Another solution would be to mock out every system interaction in
 Spark's tests including e.g. filesystem interactions to try and reduce
 variance across environments. However, that seems difficult.

 As the number of developers of Spark increases, it's definitely a good
 idea for us to invest in developer infrastructure including things
 like snapshot releases, better documentation, etc. Thanks for bringing
 this up as a pain point.

 - Patrick


 On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
 ryan.blake.willi...@gmail.com wrote:
  thanks for the info, Matei and Brennon. I will try to switch my workflow
 to
  using sbt. Other potential action items:
 
  - currently the docs only contain information about building with maven,
  and even then don't cover many important cases, as I described in my
  previous email. If SBT is as much better as you've described then that
  should be made much more obvious. Wasn't it the case recently that there
  was only a page about building with SBT, and not one about building with
  maven? Clearer messaging around this needs to exist in the documentation,
  not just on the mailing list, imho.
 
  - +1 to better distinguishing between unit and integration tests, having
  separate scripts for each, improving documentation around common
 workflows,
  expectations of brittleness with each kind of test, advisability of just
  relying on Jenkins for certain kinds of tests to not waste too much time,
  etc. Things like the compiler crash should be discussed in the
  documentation, not just in the mailing list archives, if new contributors
  are likely to run into them through no fault of their own.
 
  - What is the algorithm you use to decide what tests you might have
 broken?
  Can we codify it in some scripts that other people can use?
 
 
 
  On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia matei.zaha...@gmail.com
  wrote:
 
  Hi Ryan,
 
  As a tip (and maybe this isn't documented well), I normally use SBT for
  development to avoid the slow build process, and use its interactive
  console to run only specific tests. The nice advantage is that SBT can
 keep
  the Scala compiler loaded and JITed across builds, making it faster to
  iterate. To use it, you can do the following:
 
  - Start the SBT interactive console with sbt/sbt
  - Build your assembly by running the assembly target in the assembly
  project: assembly/assembly
  - Run all the tests in one module: core/test
  - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
 (this
  also supports tab completion)
 
  Running all the tests does take a while, and I usually just rely on
  Jenkins for that once I've run the tests for the things I believed my
 patch
  could break. But this is because some of them are integration tests
 (e.g.
  

Re: Spurious test failures, testing best practices

2014-11-30 Thread Mark Hamstra

 - Start the SBT interactive console with sbt/sbt
 - Build your assembly by running the assembly target in the assembly
 project: assembly/assembly
 - Run all the tests in one module: core/test
 - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
 also supports tab completion)


The equivalent using Maven:

- Start zinc
- Build your assembly using the mvn package or install target
(install is actually the equivalent of SBT's publishLocal) -- this step
is the first step in
http://spark.apache.org/docs/latest/building-with-maven.html#spark-tests-in-maven
- Run all the tests in one module: mvn -pl core test
- Run a specific suite: mvn -pl core
-DwildcardSuites=org.apache.spark.rdd.RDDSuite test (the -pl option isn't
strictly necessary if you don't mind waiting for Maven to scan through all
the other sub-projects only to do nothing; and, of course, it needs to be
something other than core if the test you want to run is in another
sub-project.)

You also typically want to carry along in each subsequent step any relevant
command line options you added in the package/install step.
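
For example (the profile and Hadoop-version flags are just illustrative;
carry along whatever you actually built with):

    zinc -start
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
    mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -pl core \
        -DwildcardSuites=org.apache.spark.rdd.RDDSuite test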

On Sun, Nov 30, 2014 at 3:06 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Hi Ryan,

 As a tip (and maybe this isn't documented well), I normally use SBT for
 development to avoid the slow build process, and use its interactive
 console to run only specific tests. The nice advantage is that SBT can keep
 the Scala compiler loaded and JITed across builds, making it faster to
 iterate. To use it, you can do the following:

 - Start the SBT interactive console with sbt/sbt
 - Build your assembly by running the assembly target in the assembly
 project: assembly/assembly
 - Run all the tests in one module: core/test
 - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite (this
 also supports tab completion)

 Running all the tests does take a while, and I usually just rely on
 Jenkins for that once I've run the tests for the things I believed my patch
 could break. But this is because some of them are integration tests (e.g.
 DistributedSuite, which creates multi-process mini-clusters). Many of the
 individual suites run fast without requiring this, however, so you can pick
 the ones you want. Perhaps we should find a way to tag them so people  can
 do a quick-test that skips the integration ones.

 The assembly builds are annoying but they only take about a minute for me
 on a MacBook Pro with SBT warmed up. The assembly is actually only required
 for some of the integration tests (which launch new processes), but I'd
 recommend doing it all the time anyway since it would be very confusing to
 run those with an old assembly. The Scala compiler crash issue can also be
 a problem, but I don't see it very often with SBT. If it happens, I exit
 SBT and do sbt clean.

 Anyway, this is useful feedback and I think we should try to improve some
 of these suites, but hopefully you can also try the faster SBT process. At
 the end of the day, if we want integration tests, the whole test process
 will take an hour, but most of the developers I know leave that to Jenkins
 and only run individual tests locally before submitting a patch.

 Matei


  On Nov 30, 2014, at 2:39 PM, Ryan Williams 
 ryan.blake.willi...@gmail.com wrote:
 
  In the course of trying to make contributions to Spark, I have had a lot
 of
  trouble running Spark's tests successfully. The main pain points I've
  experienced are:
 
 1) frequent, spurious test failures
 2) high latency of running tests
 3) difficulty running specific tests in an iterative fashion
 
  Here is an example series of failures that I encountered this weekend
  (along with footnote links to the console output from each and
  approximately how long each took):
 
  - `./dev/run-tests` [1]: failure in BroadcastSuite that I've not seen
  before.
  - `mvn '-Dsuites=*BroadcastSuite*' test` [2]: same failure.
  - `mvn '-Dsuites=*BroadcastSuite* Unpersisting' test` [3]: BroadcastSuite
  passed, but scala compiler crashed on the catalyst project.
  - `mvn clean`: some attempts to run earlier commands (that previously
  didn't crash the compiler) all result in the same compiler crash.
 Previous
  discussion on this list implies this can only be solved by a `mvn clean`
  [4].
  - `mvn '-Dsuites=*BroadcastSuite*' test` [5]: immediately post-clean,
  BroadcastSuite can't run because assembly is not built.
  - `./dev/run-tests` again [6]: pyspark tests fail, some messages about
  version mismatches and python 2.6. The machine this ran on has python
 2.7,
  so I don't know what that's about.
  - `./dev/run-tests` again [7]: too many open files errors in several
  tests. `ulimit -a` shows a maximum of 4864 open files. Apparently this is
  not enough, but only some of the time? I increased it to 8192 and tried
  again.
  - `./dev/run-tests` again [8]: same pyspark errors as before. This seems
 to
  be the issue from SPARK-3867 [9], which was supposedly fixed on October
 

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Hey Ryan,

The existing JIRA also covers publishing nightly docs:
https://issues.apache.org/jira/browse/SPARK-1517

- Patrick

On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
ryan.blake.willi...@gmail.com wrote:
 Thanks Nicholas, glad to hear that some of this info will be pushed to the
 main site soon, but this brings up yet another point of confusion that I've
 struggled with, namely whether the documentation on github or that on
 spark.apache.org should be considered the primary reference for people
 seeking to learn about best practices for developing Spark.

 Trying to read docs starting from
 https://github.com/apache/spark/blob/master/docs/index.md right now, I find
 that all of the links to other parts of the documentation are broken: they
 point to relative paths that end in .html, which will work when published
 on the docs-site, but that would have to end in .md if a person was to be
 able to navigate them on github.

 So expecting people to use the up-to-date docs on github (where all
 internal URLs 404 and the main github README suggests that the latest
 Spark documentation can be found on the actually-months-old docs-site
 https://github.com/apache/spark#online-documentation) is not a good
 solution. On the other hand, consulting months-old docs on the site is also
 problematic, as this thread and your last email have borne out.  The result
 is that there is no good place on the internet to learn about the most
 up-to-date best practices for using/developing Spark.

 Why not build http://spark.apache.org/docs/latest/ nightly (or every
 commit) off of what's in github, rather than having that URL point to the
 last release's docs (up to ~3 months old)? This way, casual users who want
 the docs for the released version they happen to be using (which is already
 frequently != /latest today, for many Spark users) can (still) find them
 at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
 point people to a site (/latest) that actually has up-to-date docs that
 reflect ToT and whose links work.

 If there are concerns about existing semantics around /latest URLs being
 broken, some new URL could be used, like
 http://spark.apache.org/docs/snapshot/, but given that everything under
 http://spark.apache.org/docs/latest/ is in a state of
 planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
 that serious an issue to me; anyone sending around permanent links to
 things under /latest is already going to have those links break / not make
 sense in the near future.


 On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:


- currently the docs only contain information about building with
maven,
and even then don't cover many important cases

  All other points aside, I just want to point out that the docs document
 both how to use Maven and SBT and clearly state
 https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt
 that Maven is the build of reference while SBT may be preferable for
 day-to-day development.

 I believe the main reason most people miss this documentation is that,
 though it's up-to-date on GitHub, it hasn't been published yet to the docs
 site. It should go out with the 1.2 release.

 Improvements to the documentation on building Spark belong here:
 https://github.com/apache/spark/blob/master/docs/building-spark.md

 If there are clear recommendations that come out of this thread but are
 not in that doc, they should be added in there. Other, less important
 details may possibly be better suited for the Contributing to Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 guide.

 Nick


 On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Ryan,

 A few more things here. You should feel free to send patches to
 Jenkins to test them, since this is the reference environment in which
 we regularly run tests. This is the normal workflow for most
 developers and we spend a lot of effort provisioning/maintaining a
 very large jenkins cluster to allow developers access to this resource. A
 common development approach is to locally run tests that you've added
 in a patch, then send it to jenkins for the full run, and then try to
 debug locally if you see specific unanticipated test failures.

 One challenge we have is that given the proliferation of OS versions,
 Java versions, Python versions, ulimits, etc. there is a combinatorial
 number of environments in which tests could be run. It is very hard in
 some cases to figure out post-hoc why a given test is not working in a
 specific environment. I think a good solution here would be to use a
 standardized docker container for running Spark tests and asking folks
 to use that locally if they are trying to run all of the hundreds of
 Spark tests.

 Another solution would be to mock out every system interaction in
 Spark's tests including e.g. filesystem interactions to try and 

Re: Spurious test failures, testing best practices

2014-11-30 Thread Patrick Wendell
Btw - the documentation on github represents the source code of our
docs, which is versioned with each release. Unfortunately github will
always try to render .md files, so to a passerby it could look like
this is supposed to represent the published docs. This is a feature
limitation of github; AFAIK we cannot disable it.

The official published docs are associated with each release and
available on the apache.org website. I think /latest is a common
convention for referring to the latest *published release* docs, so
probably we can't change that (the audience for /latest is orders of
magnitude larger than for snapshot docs). However we could just add
/snapshot and publish docs there.

- Patrick

On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell pwend...@gmail.com wrote:
 Hey Ryan,

 The existing JIRA also covers publishing nightly docs:
 https://issues.apache.org/jira/browse/SPARK-1517

 - Patrick

 On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
 ryan.blake.willi...@gmail.com wrote:
 Thanks Nicholas, glad to hear that some of this info will be pushed to the
 main site soon, but this brings up yet another point of confusion that I've
 struggled with, namely whether the documentation on github or that on
 spark.apache.org should be considered the primary reference for people
 seeking to learn about best practices for developing Spark.

 Trying to read docs starting from
 https://github.com/apache/spark/blob/master/docs/index.md right now, I find
 that all of the links to other parts of the documentation are broken: they
 point to relative paths that end in .html, which will work when published
 on the docs-site, but that would have to end in .md if a person was to be
 able to navigate them on github.

 So expecting people to use the up-to-date docs on github (where all
 internal URLs 404 and the main github README suggests that the latest
 Spark documentation can be found on the actually-months-old docs-site
 https://github.com/apache/spark#online-documentation) is not a good
 solution. On the other hand, consulting months-old docs on the site is also
 problematic, as this thread and your last email have borne out.  The result
 is that there is no good place on the internet to learn about the most
 up-to-date best practices for using/developing Spark.

 Why not build http://spark.apache.org/docs/latest/ nightly (or every
 commit) off of what's in github, rather than having that URL point to the
 last release's docs (up to ~3 months old)? This way, casual users who want
 the docs for the released version they happen to be using (which is already
 frequently != /latest today, for many Spark users) can (still) find them
 at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
 point people to a site (/latest) that actually has up-to-date docs that
 reflect ToT and whose links work.

 If there are concerns about existing semantics around /latest URLs being
 broken, some new URL could be used, like
 http://spark.apache.org/docs/snapshot/, but given that everything under
 http://spark.apache.org/docs/latest/ is in a state of
 planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
 that serious an issue to me; anyone sending around permanent links to
 things under /latest is already going to have those links break / not make
 sense in the near future.


 On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:


- currently the docs only contain information about building with
maven,
and even then don't cover many important cases

  All other points aside, I just want to point out that the docs document
 both how to use Maven and SBT and clearly state
 https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt
 that Maven is the build of reference while SBT may be preferable for
 day-to-day development.

 I believe the main reason most people miss this documentation is that,
 though it's up-to-date on GitHub, it hasn't been published yet to the docs
 site. It should go out with the 1.2 release.

 Improvements to the documentation on building Spark belong here:
 https://github.com/apache/spark/blob/master/docs/building-spark.md

 If there are clear recommendations that come out of this thread but are
 not in that doc, they should be added in there. Other, less important
 details may possibly be better suited for the Contributing to Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 guide.

 Nick


 On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Ryan,

 A few more things here. You should feel free to send patches to
 Jenkins to test them, since this is the reference environment in which
 we regularly run tests. This is the normal workflow for most
 developers and we spend a lot of effort provisioning/maintaining a
 very large jenkins cluster to allow developers access to this resource. A
 common development approach is to locally run tests that you've added
 

Re: Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
Thanks Mark, most of those commands are things I've been using (and used in
my original post), except for "Start zinc". I now see the section about it on
the unpublished building-spark
https://github.com/apache/spark/blob/master/docs/building-spark.md#speeding-up-compilation-with-zinc
page and will try using it.
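
For anyone else setting zinc up, the steps are roughly (a sketch; on OS X
zinc can be installed via Homebrew, adjust for your platform):

    brew install zinc
    zinc -start      # start the compile server once; Spark's Maven build can then use it
    zinc -shutdown   # stop it when you're done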

Even so, finding those commands took a nontrivial amount of trial and
error; I've not seen them documented very well outside of this list (your
and Matei's emails (and previous emails to this list) each have more info
about building/testing with Maven and SBT, respectively, than building-spark
https://github.com/apache/spark/blob/master/docs/building-spark.md#spark-tests-in-maven
does), the per-suite invocation still requires an assembly in some cases
(without warning, from my perspective, not having read up on the names of
all the Spark integration tests), spurious failures still abound, there's
no good way to run only the things that a given change actually could have
broken, etc.

Anyway, hopefully zinc brings me to the world of ~minute iteration times
that have been reported on this thread.


On Sun Nov 30 2014 at 6:53:57 PM Ryan Williams 
ryan.blake.willi...@gmail.com wrote:

 Thanks Nicholas, glad to hear that some of this info will be pushed to the
 main site soon, but this brings up yet another point of confusion that I've
 struggled with, namely whether the documentation on github or that on
 spark.apache.org should be considered the primary reference for people
 seeking to learn about best practices for developing Spark.

 Trying to read docs starting from
 https://github.com/apache/spark/blob/master/docs/index.md right now, I
 find that all of the links to other parts of the documentation are broken:
 they point to relative paths that end in .html, which will work when
 published on the docs-site, but that would have to end in .md if a person
 was to be able to navigate them on github.

 So expecting people to use the up-to-date docs on github (where all
 internal URLs 404 and the main github README suggests that the latest
 Spark documentation can be found on the actually-months-old docs-site
 https://github.com/apache/spark#online-documentation) is not a good
 solution. On the other hand, consulting months-old docs on the site is also
 problematic, as this thread and your last email have borne out.  The result
 is that there is no good place on the internet to learn about the most
 up-to-date best practices for using/developing Spark.

 Why not build http://spark.apache.org/docs/latest/ nightly (or every
 commit) off of what's in github, rather than having that URL point to the
 last release's docs (up to ~3 months old)? This way, casual users who want
 the docs for the released version they happen to be using (which is already
 frequently != /latest today, for many Spark users) can (still) find them
 at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
 point people to a site (/latest) that actually has up-to-date docs that
 reflect ToT and whose links work.

 If there are concerns about existing semantics around /latest URLs being
 broken, some new URL could be used, like
 http://spark.apache.org/docs/snapshot/, but given that everything under
 http://spark.apache.org/docs/latest/ is in a state of
 planned-backwards-incompatible-changes every ~3mos, that doesn't sound like
 that serious an issue to me; anyone sending around permanent links to
 things under /latest is already going to have those links break / not make
 sense in the near future.


 On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:


- currently the docs only contain information about building with
maven,
and even then don’t cover many important cases

  All other points aside, I just want to point out that the docs document
 both how to use Maven and SBT and clearly state
 https://github.com/apache/spark/blob/master/docs/building-spark.md#building-with-sbt
 that Maven is the “build of reference” while SBT may be preferable for
 day-to-day development.

 I believe the main reason most people miss this documentation is that,
 though it's up-to-date on GitHub, it hasn't been published yet to the docs
 site. It should go out with the 1.2 release.

 Improvements to the documentation on building Spark belong here:
 https://github.com/apache/spark/blob/master/docs/building-spark.md

 If there are clear recommendations that come out of this thread but are
 not in that doc, they should be added in there. Other, less important
 details may possibly be better suited for the Contributing to Spark
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 guide.

 Nick

 On Sun Nov 30 2014 at 6:50:55 PM Patrick Wendell pwend...@gmail.com
 wrote:

 Hey Ryan,

 A few more things here. You should feel free to send patches to
 Jenkins to test them, since this is the reference environment in which
 we regularly run tests. This is the normal workflow for 

Re: [RESULT] [VOTE] Designating maintainers for some Spark components

2014-11-30 Thread Matei Zaharia
An update on this: After adding the initial maintainer list, we got feedback to 
add more maintainers for some components, so we added four others (Josh Rosen 
for core API, Mark Hamstra for scheduler, Shivaram Venkataraman for MLlib and 
Xiangrui Meng for Python). We also decided to lower the timeout for waiting 
for a maintainer to a week. Hopefully this will provide more options for 
reviewing in these components.

The complete list is available at 
https://cwiki.apache.org/confluence/display/SPARK/Committers.

Matei

 On Nov 8, 2014, at 7:28 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Thanks everyone for voting on this. With all of the PMC votes being for, the 
 vote passes, but there were some concerns that I wanted to address for 
 everyone who brought them up, as well as in the wording we will use for this 
 policy.
 
 First, like every Apache project, Spark follows the Apache voting process 
 (http://www.apache.org/foundation/voting.html), wherein all code changes are 
 done by consensus. This means that any PMC member can block a code change on 
 technical grounds, and thus that there is consensus when something goes in. 
 It's absolutely true that every PMC member is responsible for the whole 
 codebase, as Greg said (not least due to legal reasons, e.g. making sure it 
 complies to licensing rules), and this idea will not change that. To make 
 this clear, I will include that in the wording on the project page, to make 
 sure new committers and other community members are all aware of it.
 
 What the maintainer model does, instead, is to change the review process, by 
 having a required review from some people on some types of code changes 
 (assuming those people respond in time). Projects can have their own diverse 
 review processes (e.g. some do commit-then-review and others do 
 review-then-commit, some point people to specific reviewers, etc). This kind 
 of process seems useful to try (and to refine) as the project grows. We will 
 of course evaluate how it goes and respond to any problems.
 
 So to summarize,
 
 - Every committer is responsible for, and more than welcome to review and 
 vote on, every code change. In fact all community members are welcome to do 
 this, and lots are doing it.
 - Everyone has the same voting rights on these code changes (namely consensus 
 as described at http://www.apache.org/foundation/voting.html)
 - Committers will be asked to run patches that are making architectural and 
 API changes by the maintainers before merging.
 
 In practice, none of this matters too much because we are not exactly a 
 hotbed of discord ;), and even in the case of discord, the point of the ASF 
 voting process is to create consensus. The goal is just to have a better 
 structure for reviewing and minimize the chance of errors.
 
 Here is a tally of the votes:
 
 Binding votes (from PMC): 17 +1, no 0 or -1
 
 Matei Zaharia
 Michael Armbrust
 Reynold Xin
 Patrick Wendell
 Andrew Or
 Prashant Sharma
 Mark Hamstra
 Xiangrui Meng
 Ankur Dave
 Imran Rashid
 Jason Dai
 Tom Graves
 Sean McNamara
 Nick Pentreath
 Josh Rosen
 Kay Ousterhout
 Tathagata Das
 
 Non-binding votes: 18 +1, one +0, one -1
 
 +1:
 Nan Zhu
 Nicholas Chammas
 Denny Lee
 Cheng Lian
 Timothy Chen
 Jeremy Freeman
 Cheng Hao
 Jackylk Likun
 Kousuke Saruta
 Reza Zadeh
 Xuefeng Wu
 Witgo
 Manoj Babu
 Ravindra Pesala
 Liquan Pei
 Kushal Datta
 Davies Liu
 Vaquar Khan
 
 +0: Corey Nolet
 
 -1: Greg Stein
 
 I'll send another email when I have a more detailed writeup of this on the 
 website.
 
 Matei


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Spurious test failures, testing best practices

2014-11-30 Thread Ryan Williams
Thanks Patrick, great to hear that docs-snapshots-via-jenkins is already
JIRA'd; you can interpret some of this thread as a gigantic +1 from me on
prioritizing that, which it looks like you are doing :)

I do understand the limitations of the github vs. official site status
quo; I was mostly responding to a perceived implication that I should have
been getting building/testing-spark advice from the github .md files
instead of from /latest. I agree that neither one works very well
currently, and that docs-snapshots-via-jenkins is the right solution. Per
my other email, leaving /latest as-is sounds reasonable, as long as jenkins
is putting the latest docs *somewhere*.

On Sun Nov 30 2014 at 7:19:33 PM Patrick Wendell pwend...@gmail.com wrote:

  Btw - the documentation on github represents the source code of our
 docs, which is versioned with each release. Unfortunately github will
 always try to render .md files so it could look to a passerby like
 this is supposed to represent published docs. This is a feature
 limitation of github, AFAIK we cannot disable it.

 The official published docs are associated with each release and
 available on the apache.org website. I think /latest is a common
 convention for referring to the latest *published release* docs, so
 probably we can't change that (the audience for /latest is orders of
 magnitude larger than for snapshot docs). However we could just add
 /snapshot and publish docs there.

 - Patrick

 On Sun, Nov 30, 2014 at 6:15 PM, Patrick Wendell pwend...@gmail.com
 wrote:
  Hey Ryan,
 
  The existing JIRA also covers publishing nightly docs:
  https://issues.apache.org/jira/browse/SPARK-1517
 
  - Patrick
 
  On Sun, Nov 30, 2014 at 5:53 PM, Ryan Williams
  ryan.blake.willi...@gmail.com wrote:
  Thanks Nicholas, glad to hear that some of this info will be pushed to
 the
  main site soon, but this brings up yet another point of confusion that
 I've
  struggled with, namely whether the documentation on github or that on
  spark.apache.org should be considered the primary reference for people
  seeking to learn about best practices for developing Spark.
 
  Trying to read docs starting from
  https://github.com/apache/spark/blob/master/docs/index.md right now, I
 find
  that all of the links to other parts of the documentation are broken:
 they
  point to relative paths that end in .html, which will work when
 published
  on the docs-site, but that would have to end in .md if a person was
 to be
  able to navigate them on github.
 
  So expecting people to use the up-to-date docs on github (where all
  internal URLs 404 and the main github README suggests that the latest
  Spark documentation can be found on the actually-months-old docs-site
  https://github.com/apache/spark#online-documentation) is not a good
  solution. On the other hand, consulting months-old docs on the site is
 also
  problematic, as this thread and your last email have borne out.  The
 result
  is that there is no good place on the internet to learn about the most
  up-to-date best practices for using/developing Spark.
 
  Why not build http://spark.apache.org/docs/latest/ nightly (or every
  commit) off of what's in github, rather than having that URL point to
 the
  last release's docs (up to ~3 months old)? This way, casual users who
 want
  the docs for the released version they happen to be using (which is
 already
  frequently != /latest today, for many Spark users) can (still) find
 them
  at http://spark.apache.org/docs/X.Y.Z, and the github README can safely
  point people to a site (/latest) that actually has up-to-date docs that
  reflect ToT and whose links work.
 
  If there are concerns about existing semantics around /latest URLs
 being
  broken, some new URL could be used, like
  http://spark.apache.org/docs/snapshot/, but given that everything under
  http://spark.apache.org/docs/latest/ is in a state of
  planned-backwards-incompatible-changes every ~3mos, that doesn't sound
 like
  that serious an issue to me; anyone sending around permanent links to
  things under /latest is already going to have those links break / not
 make
  sense in the near future.
 
 
  On Sun Nov 30 2014 at 5:24:33 PM Nicholas Chammas 
  nicholas.cham...@gmail.com wrote:
 
 
 - currently the docs only contain information about building with
 maven,
 and even then don't cover many important cases
 
   All other points aside, I just want to point out that the docs
 document
  both how to use Maven and SBT and clearly state
  https://github.com/apache/spark/blob/master/docs/
 building-spark.md#building-with-sbt
  that Maven is the build of reference while SBT may be preferable for
  day-to-day development.
 
  I believe the main reason most people miss this documentation is that,
   though it's up-to-date on GitHub, it hasn't been published yet to the
 docs
  site. It should go out with the 1.2 release.
 
  Improvements to the documentation on building Spark belong here:
  

Re: Spurious test failures, testing best practices

2014-11-30 Thread Ganelin, Ilya
Hi, Patrick - with regards to testing on Jenkins, is the process for this
to submit a pull request for the branch or is there another interface we
can use to submit a build to Jenkins for testing?

On 11/30/14, 6:49 PM, Patrick Wendell pwend...@gmail.com wrote:

Hey Ryan,

A few more things here. You should feel free to send patches to
Jenkins to test them, since this is the reference environment in which
we regularly run tests. This is the normal workflow for most
developers and we spend a lot of effort provisioning/maintaining a
 very large jenkins cluster to allow developers access to this resource. A
common development approach is to locally run tests that you've added
in a patch, then send it to jenkins for the full run, and then try to
debug locally if you see specific unanticipated test failures.

One challenge we have is that given the proliferation of OS versions,
Java versions, Python versions, ulimits, etc. there is a combinatorial
number of environments in which tests could be run. It is very hard in
some cases to figure out post-hoc why a given test is not working in a
specific environment. I think a good solution here would be to use a
standardized docker container for running Spark tests and asking folks
to use that locally if they are trying to run all of the hundreds of
Spark tests.

Another solution would be to mock out every system interaction in
Spark's tests including e.g. filesystem interactions to try and reduce
variance across environments. However, that seems difficult.

As the number of developers of Spark increases, it's definitely a good
idea for us to invest in developer infrastructure including things
like snapshot releases, better documentation, etc. Thanks for bringing
this up as a pain point.

- Patrick


On Sun, Nov 30, 2014 at 3:35 PM, Ryan Williams
ryan.blake.willi...@gmail.com wrote:
 thanks for the info, Matei and Brennon. I will try to switch my
workflow to
 using sbt. Other potential action items:

 - currently the docs only contain information about building with maven,
 and even then don't cover many important cases, as I described in my
 previous email. If SBT is as much better as you've described then that
 should be made much more obvious. Wasn't it the case recently that there
 was only a page about building with SBT, and not one about building with
 maven? Clearer messaging around this needs to exist in the
documentation,
 not just on the mailing list, imho.

 - +1 to better distinguishing between unit and integration tests, having
 separate scripts for each, improving documentation around common
workflows,
 expectations of brittleness with each kind of test, advisability of just
 relying on Jenkins for certain kinds of tests to not waste too much
time,
 etc. Things like the compiler crash should be discussed in the
 documentation, not just in the mailing list archives, if new
contributors
 are likely to run into them through no fault of their own.

 - What is the algorithm you use to decide what tests you might have
broken?
 Can we codify it in some scripts that other people can use?



 On Sun Nov 30 2014 at 4:06:41 PM Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hi Ryan,

 As a tip (and maybe this isn't documented well), I normally use SBT for
 development to avoid the slow build process, and use its interactive
 console to run only specific tests. The nice advantage is that SBT can
keep
 the Scala compiler loaded and JITed across builds, making it faster to
 iterate. To use it, you can do the following:

 - Start the SBT interactive console with sbt/sbt
 - Build your assembly by running the assembly target in the assembly
 project: assembly/assembly
 - Run all the tests in one module: core/test
 - Run a specific suite: core/test-only org.apache.spark.rdd.RDDSuite
(this
 also supports tab completion)

 Running all the tests does take a while, and I usually just rely on
 Jenkins for that once I've run the tests for the things I believed my
patch
 could break. But this is because some of them are integration tests
(e.g.
 DistributedSuite, which creates multi-process mini-clusters). Many of
the
 individual suites run fast without requiring this, however, so you can
pick
 the ones you want. Perhaps we should find a way to tag them so people
can
 do a quick-test that skips the integration ones.

 The assembly builds are annoying but they only take about a minute for
me
 on a MacBook Pro with SBT warmed up. The assembly is actually only
required
 for some of the integration tests (which launch new processes), but
I'd
 recommend doing it all the time anyway since it would be very
confusing to
 run those with an old assembly. The Scala compiler crash issue can
also be
 a problem, but I don't see it very often with SBT. If it happens, I
exit
 SBT and do sbt clean.

 Anyway, this is useful feedback and I think we should try to improve
some
 of these suites, but hopefully you can also try the faster SBT
process. At
 the end of the day, if we want integration tests, 

Re: [mllib] Which is the correct package to add a new algorithm?

2014-11-30 Thread Yu Ishikawa
Hi Joseph, 

Thank you for your nice work and for sharing the draft!

 During the next development cycle, new algorithms should be contributed to 
 spark.mllib.  Optionally, wrappers for new (and old) algorithms can be 
 contributed to spark.ml. 

I understand that we should contribute new algorithms to spark.mllib.
Thanks,
Yu



-
-- Yu Ishikawa
--

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org