Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/2848#issuecomment-67106447
Disclaimer: for iterative debugging, I use `sbt` to build Spark, not Maven.
Spark 1.2.0 has docs on building with SBT; if possible, I'd switch to that
workflow.
The issue here is probably that you're running a full `mvn clean` and
starting over from scratch after each change. I'd like to help move this PR
along, so what follows is an interactive log of my attempt to see whether I
can quickly iterate on this using Maven.
Let's say that I'm starting from a completely cold build (but with Maven
dependencies already downloaded):
```bash
# Since I have zinc installed, I'll use it:
zinc -start
git checkout /a/branch/with/your/pr/code
# Here, -T 1C says "build in parallel with one thread per core":
time mvn -T 1C clean package -DskipTests
```
This didn't take _super_ long, but it was a few minutes:
```
real 4m19.537s
user 3m14.634s
sys 0m16.882s
```
Let's run just the test suite that we're interested in ([instructions from
here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools)):
```
time mvn -T 1C test -DwildcardSuites=org.apache.spark.FileServerSuite
```
This took a little while because it had to build a bunch of test sources,
but it was only a few seconds before the tests started running (and failing):
```
real 0m33.968s
user 0m36.544s
sys 0m3.032s
FileServerSuite:
- Distributing files locally
- Distributing files locally security On *** FAILED ***
java.io.FileNotFoundException:
/var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
at com.google.common.io.Files.copy(Files.java:436)
at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
at org.apache.spark.FileServerSuite$$anonfun$3.apply$mcV$sp(FileServerSuite.scala:96)
...
- Distributing files locally using URL as input *** FAILED ***
java.io.FileNotFoundException:
/var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
at com.google.common.io.Files.copy(Files.java:436)
at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
at org.apache.spark.FileServerSuite$$anonfun$5.apply$mcV$sp(FileServerSuite.scala:112)
...
- Dynamically adding JARS locally
- Distributing files on a standalone cluster *** FAILED ***
java.io.FileNotFoundException:
/var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
at com.google.common.io.Files.copy(Files.java:436)
at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
at org.apache.spark.FileServerSuite$$anonfun$8.apply$mcV$sp(FileServerSuite.scala:137)
...
- Dynamically adding JARS on a standalone cluster
- Dynamically adding JARS on a standalone cluster using local: URL
```
Let's try adding a print statement to `SparkContext.addFile`, then
re-running the tests. We could do this by re-packaging:
```
mvn -T 1C package -DskipTests
```
```
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:24 min (Wall Clock)
[INFO] Finished at: 2014-12-15T18:47:00-08:00
[INFO] Final Memory: 50M/1535M
```
And to re-run the test:
```
time mvn -T 1C test -DwildcardSuites=org.apache.spark.FileServerSuite
```
```
real 0m40.149s
user 0m35.894s
sys 0m3.163s
```
So this full cycle (re-packaging plus re-running the test) took about two minutes.
Let's do the same thing in SBT. First, let's start off from a completely
clean slate:
```
time sbt/sbt clean package assembly
```
The user time exceeds the wall-clock time here because the build runs in parallel on my multi-core machine:
```
real 3m53.643s
user 8m22.337s
sys 1m15.794s
```
Next, let's run just the suite we're interested in. Here's a naive way to
do this, which involves building every test, so this first run will take longer
than subsequent runs:
```
time sbt/sbt "test-only FileServerSuite"
```
```
[success] Total time: 88 s, completed Dec 15, 2014 6:56:35 PM
real 1m39.013s
user 7m37.323s
sys 0m16.206s
```
Whoops, I made a mistake here! My `test-only` pattern didn't include a
wildcard, so `FileServerSuite` didn't match the fully-qualified name of a test
suite. Let me go ahead and re-run with the right command:
```
time sbt/sbt "test-only *FileServerSuite"
```
This was pretty fast:
```
real 0m29.075s
user 0m50.744s
sys 0m3.512s
```
I could also have run this from the interactive shell to get automatic
rebuilding on source changes.
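For reference (this is standard SBT behavior, not anything Spark-specific): prefixing a command with `~` in the interactive shell enables triggered execution, so the suite re-runs every time a source file changes:
```
sbt/sbt                          # start the interactive SBT shell
> ~test-only *FileServerSuite    # re-runs automatically on each source change
```
This also keeps the JVM and the incremental-compilation state warm between runs, which is where most of the speedup comes from.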
There's an even faster way of running just FileServerSuite, though: I can
tell SBT to only build / run the `core` module. This time, let's do this
interactively, but from a clean slate:
```
[info] Set current project to spark-parent (in build
file:/Users/joshrosen/Documents/spark/)
> clean
[success] Total time: 20 s, completed Dec 15, 2014 7:01:20 PM
> project core
[info] Set current project to spark-core (in build
file:/Users/joshrosen/Documents/spark/)
> package
[...]
[info] Compiling 42 Java sources to
/Users/joshrosen/Documents/spark/network/common/target/scala-2.10/classes...
[info] Compiling 20 Java sources to
/Users/joshrosen/Documents/spark/network/shuffle/target/scala-2.10/classes...
[info] Compiling 397 Scala sources and 33 Java sources to
/Users/joshrosen/Documents/spark/core/target/scala-2.10/classes...
[...]
[info] Packaging
/Users/joshrosen/Documents/spark/core/target/scala-2.10/spark-core_2.10-1.3.0-SNAPSHOT.jar
...
[info] Done packaging.
[success] Total time: 64 s, completed Dec 15, 2014 7:02:36 PM
> test-only *FileServerSuite
[...]
[info] Compiling 124 Scala sources and 4 Java sources to
/Users/joshrosen/Documents/spark/core/target/scala-2.10/test-classes...
[...]
[---- tests run ----]
[---- tests go into infinite loop ----]
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
[... infinite repetitions ...]
```
Hmm, so it looks like the tests that rely on `local-cluster` mode need
access to a Spark assembly JAR in order to run, and that this error condition
is mishandled somewhere (hence the infinite loop). This is pretty annoying, so
I'll build an assembly once, then use `export SPARK_PREPEND_CLASSES=true` so
that I don't have to keep rebuilding it across test runs:
```
export SPARK_PREPEND_CLASSES=true
sbt/sbt assembly/assembly
sbt/sbt
```
Now, from the SBT shell:
```
> project core
[info] Set current project to spark-core (in build
file:/Users/joshrosen/Documents/spark/)
> ~test-only *FileServerSuite
[... tests run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error] org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 18 s, completed Dec 15, 2014 7:10:57 PM
1. Waiting for source changes... (press enter to interrupt)
[... add a println to Utils.addFile ...]
[... tests start up almost instantly and run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error] org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 19 s, completed Dec 15, 2014 7:11:46 PM
```
So, to summarize: I agree that there are a bunch of pain points in the
current build process. Day-to-day, though, it hasn't affected me much, since
I usually run `sbt/sbt clean package assembly` and `export
SPARK_PREPEND_CLASSES=true` once at the beginning of the day, then keep working
in my SBT shell, where incremental recompilation means I can make changes
in my IDE and see the failing test update (almost) instantly.
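Concretely, that daily workflow boils down to something like the following (assuming you're at the root of a Spark checkout; the assembly only needs rebuilding when dependencies or non-core modules change, and the suite name here is just a placeholder):
```
# Once, at the start of the day:
sbt/sbt clean package assembly
export SPARK_PREPEND_CLASSES=true

# Then keep an SBT shell open and iterate:
sbt/sbt
> project core
> ~test-only *SomeSuite    # re-runs automatically on each source change
```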