Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/2848#issuecomment-67106447
Disclaimer: for iterative debugging, I use `sbt` to build Spark, not Maven.
Spark 1.2.0 has docs on building with SBT; if possible, I'd switch to that
workflow.
The issue here is probably that you're running a full `mvn clean` and
starting over from scratch after each change. I'd like to help move this PR
along, so what follows is an interactive log of my attempt to see whether I
can quickly iterate on this using Maven.
Let's say that I'm starting from a completely cold build (but with Maven
dependencies already downloaded):
```bash
# Since I have zinc installed, I'll use it:
zinc -start
git checkout /a/branch/with/your/pr/code
# Here, -T 1C says "build in parallel with one thread per core":
time mvn -T 1C clean package -DskipTests
```
This didn't take _super_ long, but it was a few minutes:
```
real 4m19.537s
user 3m14.634s
sys 0m16.882s
```
Let's run just the test suite that we're interested in ([instructions from
here](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools)):
```
time mvn -T 1C test -DwildcardSuites=org.apache.spark.FileServerSuite
```
This took a little while because it had to build a bunch of test sources,
but it was only a few seconds before the tests started running (and failing):
```
real 0m33.968s
user 0m36.544s
sys 0m3.032s
FileServerSuite:
- Distributing files locally
- Distributing files locally security On *** FAILED ***
java.io.FileNotFoundException:
/var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
at com.google.common.io.Files.copy(Files.java:436)
at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
at org.apache.spark.FileServerSuite$$anonfun$3.apply$mcV$sp(FileServerSuite.scala:96)
...
- Distributing files locally using URL as input *** FAILED ***
java.io.FileNotFoundException:
/var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
at com.google.common.io.Files.copy(Files.java:436)
at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
at org.apache.spark.FileServerSuite$$anonfun$5.apply$mcV$sp(FileServerSuite.scala:112)
...
- Dynamically adding JARS locally
- Distributing files on a standalone cluster *** FAILED ***
java.io.FileNotFoundException:
/var/folders/0k/2qp2p2vs7bv033vljnb8nk1c0000gn/T/spark-7206dcdd-9db6-4d78-ac9e-2de69619dc67/test/FileServerSuite.txt
(No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
at com.google.common.io.Files.copy(Files.java:436)
at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:72)
at org.apache.spark.HttpFileServer.addFile(HttpFileServer.scala:55)
at org.apache.spark.SparkContext.addFile(SparkContext.scala:965)
at org.apache.spark.FileServerSuite$$anonfun$8.apply$mcV$sp(FileServerSuite.scala:137)
...
- Dynamically adding JARS on a standalone cluster
- Dynamically adding JARS on a standalone cluster using local: URL
```
Let's try adding a print statement to `SparkContext.addFile`, then
re-running the tests. We could do this by re-packaging:
```
mvn -T 1C package -DskipTests
```
```
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:24 min (Wall Clock)
[INFO] Finished at: 2014-12-15T18:47:00-08:00
[INFO] Final Memory: 50M/1535M
```
And to re-run the test:
```
time mvn -T 1C test -DwildcardSuites=org.apache.spark.FileServerSuite
```
```
real 0m40.149s
user 0m35.894s
sys 0m3.163s
```
So this full cycle (re-packaging plus re-running the test) took about two minutes.
Let's do the same thing in SBT. First, let's start off from a completely
clean slate:
```
time sbt/sbt clean package assembly
```
The user time exceeds the wall-clock time here because the build runs in parallel on my multi-core machine:
```
real 3m53.643s
user 8m22.337s
sys 1m15.794s
```
Next, let's run just the suite we're interested in. Here's a naive way to
do this, which involves building every test, so this first run will take longer
than subsequent runs:
```
time sbt/sbt "test-only FileServerSuite"
```
```
[success] Total time: 88 s, completed Dec 15, 2014 6:56:35 PM
real 1m39.013s
user 7m37.323s
sys 0m16.206s
```
Whoops, I made a mistake here! My `test-only` pattern didn't include a
wildcard, so `FileServerSuite` didn't match the fully-qualified name of a test
suite. Let me go ahead and re-run with the right command:
```
time sbt/sbt "test-only *FileServerSuite"
```
This was pretty fast:
```
real 0m29.075s
user 0m50.744s
sys 0m3.512s
```
I could also have run this from the interactive shell to get automatic
rebuilding on source changes.
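For reference (this is standard SBT behavior, not anything Spark-specific): prefixing a command with `~` in the interactive shell enables triggered execution, so the suite re-runs every time a source file changes:
```
sbt/sbt                          # start the interactive SBT shell
> ~test-only *FileServerSuite    # re-runs automatically on each source change
```
This also keeps the JVM and the incremental-compilation state warm between runs, which is where most of the speedup comes from.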
There's an even faster way of running just FileServerSuite, though: I can
tell SBT to only build / run the `core` module. This time, let's do this
interactively, but from a clean slate:
```
[info] Set current project to spark-parent (in build
file:/Users/joshrosen/Documents/spark/)
> clean
[success] Total time: 20 s, completed Dec 15, 2014 7:01:20 PM
> project core
[info] Set current project to spark-core (in build
file:/Users/joshrosen/Documents/spark/)
> package
[...]
[info] Compiling 42 Java sources to
/Users/joshrosen/Documents/spark/network/common/target/scala-2.10/classes...
[info] Compiling 20 Java sources to
/Users/joshrosen/Documents/spark/network/shuffle/target/scala-2.10/classes...
[info] Compiling 397 Scala sources and 33 Java sources to
/Users/joshrosen/Documents/spark/core/target/scala-2.10/classes...
[...]
[info] Packaging
/Users/joshrosen/Documents/spark/core/target/scala-2.10/spark-core_2.10-1.3.0-SNAPSHOT.jar
...
[info] Done packaging.
[success] Total time: 64 s, completed Dec 15, 2014 7:02:36 PM
> test-only *FileServerSuite
[...]
[info] Compiling 124 Scala sources and 4 Java sources to
/Users/joshrosen/Documents/spark/core/target/scala-2.10/test-classes...
[...]
[---- tests run ----]
[---- tests go into infinite loop ----]
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
ls: /Users/joshrosen/Documents/spark/assembly/target/scala-2.10: No such
file or directory
[... infinite repetitions ...]
```
Hmm, so it looks like the tests that rely on `local-cluster` mode need
access to a Spark assembly JAR in order to run, and that this error condition
is mishandled somewhere (hence the infinite loop). This is pretty annoying, so
I'll build an assembly once, then use `export SPARK_PREPEND_CLASSES=true` so
that I don't have to keep rebuilding it across test runs:
```
export SPARK_PREPEND_CLASSES=true
sbt/sbt assembly/assembly
sbt/sbt
```
Now, from the SBT shell:
```
> project core
[info] Set current project to spark-core (in build
file:/Users/joshrosen/Documents/spark/)
> ~test-only *FileServerSuite
[... tests run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error] org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 18 s, completed Dec 15, 2014 7:10:57 PM
1. Waiting for source changes... (press enter to interrupt)
[... add a println to Utils.addFile ...]
[... tests start up almost instantly and run ...]
[info] *** 3 TESTS FAILED ***
[error] Failed: Total 7, Failed 3, Errors 0, Passed 4
[error] Failed tests:
[error] org.apache.spark.FileServerSuite
[error] (core/test:testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 19 s, completed Dec 15, 2014 7:11:46 PM
```
So, to summarize: I agree that there are a bunch of pain points in the
current build process. Day-to-day, though, it hasn't affected me much, since
I usually run `sbt/sbt clean package assembly` and `export
SPARK_PREPEND_CLASSES=true` once at the beginning of the day, then keep working
in my SBT shell, where incremental recompilation means I can make changes
in my IDE and see the failing test update (almost) instantly.
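Concretely, that daily workflow boils down to something like the following (assuming you're at the root of a Spark checkout; the assembly only needs rebuilding when dependencies or non-core modules change, and the suite name here is just a placeholder):
```
# Once, at the start of the day:
sbt/sbt clean package assembly
export SPARK_PREPEND_CLASSES=true

# Then keep an SBT shell open and iterate:
sbt/sbt
> project core
> ~test-only *SomeSuite    # re-runs automatically on each source change
```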