Re: Error with --files
That fixed it! Thank you!

--Ben

On Thu, Apr 14, 2016 at 5:53 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
>> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
>> /home/ubuntu/localtest.txt#appSees.txt
>
> --files should come before the path to your python script. Otherwise
> it's just passed as arguments to your script when it's run.
>
> --
> Marcelo
Error with --files
Hi All,

I'm trying to use the --files option with yarn:

    spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files /home/ubuntu/localtest.txt#appSees.txt

I never see the file in HDFS or in the yarn containers. Am I doing something incorrect? I'm running Spark 1.6.0.

Thanks,
--Ben
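Marcelo's point from the reply above, as a command sketch: everything after the application file is treated as the script's own arguments, so spark-submit options such as --files must come before it (paths are the ones from this thread):

```shell
# Wrong: --files lands in sys.argv of test_spark.py, not in spark-submit's options
spark-submit --master yarn-cluster /home/ubuntu/test_spark.py \
  --files /home/ubuntu/localtest.txt#appSees.txt

# Right: all spark-submit options first, application file last
spark-submit --master yarn-cluster \
  --files /home/ubuntu/localtest.txt#appSees.txt \
  /home/ubuntu/test_spark.py
```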
Re: 1.5 Build Errors
Hi All,

Sean patiently worked with me in solving this issue. The problem was entirely my fault: a MAVEN_OPTS env variable was set in my settings and was overriding everything.

--Ben

On Tue, Sep 8, 2015 at 1:37 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
> Yes, just reran with the following
>
>> (spark_build)root@ip-10-45-130-206:~/spark# export MAVEN_OPTS="-Xmx4096mb
>> -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"
>> (spark_build)root@ip-10-45-130-206:~/spark# build/mvn -Pyarn
>> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>
> and grepping for java
>
>> root 641 9.9 0.3 4411732 49040 pts/4 Sl+ 17:35 0:01
>> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
>> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
>> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
>> com.typesafe.zinc.Nailgun 3030 0
>> root 687 226 2.0 1803664 312876 pts/4 Sl+ 17:36 0:22
>> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
>> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
>> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
>> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
>> -Dmaven.multiModuleProjectDirectory=/root/spark
>> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 -Pyarn
>> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>
> On Tue, Sep 8, 2015 at 1:14 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> MAVEN_OPTS shouldn't affect zinc as it's an unrelated application. You
>> can run "zinc -J-Xmx4g..." in general, but in the provided script,
>> ZINC_OPTS seems to be the equivalent, yes. It kind of looks like your
>> mvn process isn't getting any special memory args there. Is MAVEN_OPTS
>> really exported?
>>
>> FWIW I use my own local mvn and zinc and it works fine.
>>
>> On Tue, Sep 8, 2015 at 6:05 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
>> > I'm running zinc while compiling. It seems that MAVEN_OPTS doesn't
>> > really change much? Or perhaps I'm misunderstanding something --
>> > grepping for java I see
>> >
>> >> root 24355 102 8.8 4687376 1350724 pts/4 Sl 16:51 11:08
>> >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
>> >> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>> >> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
>> >> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
>> >> com.typesafe.zinc.Nailgun 3030 0
>> >> root 25151 22.0 3.2 2269092 495276 pts/4 Sl+ 16:53 1:56
>> >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
>> >> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
>> >> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
>> >> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
>> >> -Dmaven.multiModuleProjectDirectory=/root/spark
>> >> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 clean
>> >> package -DskipTests -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4
>> >> -Dhadoop.version=2.4.0
>> >
>> > So the heap size is still 2g even with MAVEN_OPTS set with 4g. I
>> > noticed that within build/mvn _COMPILE_JVM_OPTS is set to 2g and this
>> > is what ZINC_OPTS is set to.
>> >
>> > --Ben
>> >
>> > On Tue, Sep 8, 2015 at 11:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >>
>> >> Do you run Zinc while compiling ?
>> >>
>> >> Ch
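Pulling the thread's resolution together as a sketch (flag values follow the messages above; note that `-Xmx4096mb` from the transcript is not a valid JVM size — the suffix is `m` or `g`, e.g. `-Xmx4g`):

```shell
# Export (not just set) MAVEN_OPTS so that child processes such as
# build/mvn actually inherit it; a stale exported value overrides everything.
export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"

# build/mvn launches zinc with ZINC_OPTS (defaulting to 2g via
# _COMPILE_JVM_OPTS), so bump it separately if zinc also needs more heap.
export ZINC_OPTS="$MAVEN_OPTS"

# Sanity-check that both variables really are in the environment:
env | grep -E '^(MAVEN|ZINC)_OPTS'

# Then build, e.g.:
# build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
```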
1.5 Build Errors
Hi All,

I'm trying to build a distribution off of the latest in master and I keep getting errors on MQTT and the build fails. I'm running the build on a m1.large which has 7.5 GB of RAM and no other major processes are running.

> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> ./make-distribution.sh --name continuum-custom-spark-1.5 --tgz -Pyarn
> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0

> [INFO] Spark Project GraphX ...................... SUCCESS [ 33.345 s]
> [INFO] Spark Project Streaming ................... SUCCESS [01:08 min]
> [INFO] Spark Project Catalyst .................... SUCCESS [01:39 min]
> [INFO] Spark Project SQL ......................... SUCCESS [02:06 min]
> [INFO] Spark Project ML Library .................. SUCCESS [02:16 min]
> [INFO] Spark Project Tools ....................... SUCCESS [  4.087 s]
> [INFO] Spark Project Hive ........................ SUCCESS [01:28 min]
> [INFO] Spark Project REPL ........................ SUCCESS [ 16.291 s]
> [INFO] Spark Project YARN Shuffle Service ........ SUCCESS [ 13.671 s]
> [INFO] Spark Project YARN ........................ SUCCESS [ 20.554 s]
> [INFO] Spark Project Hive Thrift Server .......... SUCCESS [ 14.332 s]
> [INFO] Spark Project Assembly .................... SUCCESS [03:33 min]
> [INFO] Spark Project External Twitter ............ SUCCESS [ 14.208 s]
> [INFO] Spark Project External Flume Sink ......... SUCCESS [ 11.535 s]
> [INFO] Spark Project External Flume .............. SUCCESS [ 19.010 s]
> [INFO] Spark Project External Flume Assembly ..... SUCCESS [  5.210 s]
> [INFO] Spark Project External MQTT ............... FAILURE [01:10 min]
> [INFO] Spark Project External MQTT Assembly ...... SKIPPED
> [INFO] Spark Project External ZeroMQ ............. SKIPPED
> [INFO] Spark Project External Kafka .............. SKIPPED
> [INFO] Spark Project Examples .................... SKIPPED
> [INFO] Spark Project External Kafka Assembly ..... SKIPPED
> [INFO]
> [INFO] BUILD FAILURE
> [INFO]
> [INFO] Total time: 22:55 min
> [INFO] Finished at: 2015-09-07T22:42:57+00:00
> [INFO] Final Memory: 240M/455M
> [INFO]
> [ERROR] GC overhead limit exceeded -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
> + return 1
> + exit 1

Any thoughts would be extremely helpful.

--Ben
Re: 1.5 Build Errors
Ah, right. Should've caught that. The docs seem to recommend 2gb. Should that be increased as well? --Ben On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote: > It shows you there that Maven is out of memory. Give it more heap. I use > 3gb. > > On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen <quasi...@gmail.com> > wrote: > > Hi All, > > > > I'm trying to build a distribution off of the latest in master and I keep > > getting errors on MQTT and the build fails. I'm running the build on a > > m1.large which has 7.5 GB of RAM and no other major processes are > running. > > > >> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m" > >> ./make-distribution.sh --name continuum-custom-spark-1.5 --tgz -Pyarn > >> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0 > > > > > > > >> INFO] Spark Project GraphX ... SUCCESS [ > >> 33.345 s] > >> [INFO] Spark Project Streaming SUCCESS > [01:08 > >> min] > >> [INFO] Spark Project Catalyst . SUCCESS > [01:39 > >> min] > >> [INFO] Spark Project SQL .. SUCCESS > [02:06 > >> min] > >> [INFO] Spark Project ML Library ... SUCCESS > [02:16 > >> min] > >> [INFO] Spark Project Tools SUCCESS [ > >> 4.087 s] > >> [INFO] Spark Project Hive . SUCCESS > [01:28 > >> min] > >> [INFO] Spark Project REPL . SUCCESS [ > >> 16.291 s] > >> [INFO] Spark Project YARN Shuffle Service . SUCCESS [ > >> 13.671 s] > >> [INFO] Spark Project YARN . SUCCESS [ > >> 20.554 s] > >> [INFO] Spark Project Hive Thrift Server ... SUCCESS [ > >> 14.332 s] > >> [INFO] Spark Project Assembly . SUCCESS > [03:33 > >> min] > >> [INFO] Spark Project External Twitter . SUCCESS [ > >> 14.208 s] > >> [INFO] Spark Project External Flume Sink .. SUCCESS [ > >> 11.535 s] > >> [INFO] Spark Project External Flume ... SUCCESS [ > >> 19.010 s] > >> [INFO] Spark Project External Flume Assembly .. SUCCESS [ > >> 5.210 s] > >> [INFO] Spark Project External MQTT FAILURE > [01:10 > >> min] > >> [INFO] Spark Project External MQTT Assembly ... 
SKIPPED > >> [INFO] Spark Project External ZeroMQ .. SKIPPED > >> [INFO] Spark Project External Kafka ... SKIPPED > >> [INFO] Spark Project Examples . SKIPPED > >> [INFO] Spark Project External Kafka Assembly .. SKIPPED > >> [INFO] > >> > >> [INFO] BUILD FAILURE > >> [INFO] > >> > >> [INFO] Total time: 22:55 min > >> [INFO] Finished at: 2015-09-07T22:42:57+00:00 > >> [INFO] Final Memory: 240M/455M > >> [INFO] > >> > >> [ERROR] GC overhead limit exceeded -> [Help 1] > >> [ERROR] > >> [ERROR] To see the full stack trace of the errors, re-run Maven with the > >> -e switch. > >> [ERROR] Re-run Maven using the -X switch to enable full debug logging. > >> [ERROR] > >> [ERROR] For more information about the errors and possible solutions, > >> please read the following articles: > >> [ERROR] [Help 1] > >> http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError > >> + return 1 > >> + exit 1 > > > > > > Any thoughts would be extremely helpful. > > > > --Ben >
Re: 1.5 Build Errors
I'm still getting errors with 3g. I've increased to 4g and I'll report back.

To be clear:

export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"

> [ERROR] GC overhead limit exceeded -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
> + return 1
> + exit 1

On Tue, Sep 8, 2015 at 10:03 AM, Sean Owen <so...@cloudera.com> wrote:
> It might need more memory in certain situations / running certain
> tests. If 3gb works for your relatively full build, yes you can open a
> PR to change any occurrences of lower recommendations to 3gb.
>
> On Tue, Sep 8, 2015 at 3:02 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
> > Ah, right. Should've caught that.
> >
> > The docs seem to recommend 2gb. Should that be increased as well?
> >
> > --Ben
> >
> > On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> It shows you there that Maven is out of memory. Give it more heap. I use
> >> 3gb.
> >>
> >> On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen <quasi...@gmail.com>
> >> wrote:
> >> > Hi All,
> >> >
> >> > I'm trying to build a distribution off of the latest in master and I
> >> > keep getting errors on MQTT and the build fails. I'm running the build
> >> > on a m1.large which has 7.5 GB of RAM and no other major processes are
> >> > running.
> >> >
> >> >> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> >> >> ./make-distribution.sh --name continuum-custom-spark-1.5 --tgz -Pyarn
> >> >> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0
> >> >
> >> >> INFO] Spark Project GraphX ...
SUCCESS [ > >> >> 33.345 s] > >> >> [INFO] Spark Project Streaming SUCCESS > >> >> [01:08 > >> >> min] > >> >> [INFO] Spark Project Catalyst . SUCCESS > >> >> [01:39 > >> >> min] > >> >> [INFO] Spark Project SQL .. SUCCESS > >> >> [02:06 > >> >> min] > >> >> [INFO] Spark Project ML Library ... SUCCESS > >> >> [02:16 > >> >> min] > >> >> [INFO] Spark Project Tools SUCCESS [ > >> >> 4.087 s] > >> >> [INFO] Spark Project Hive . SUCCESS > >> >> [01:28 > >> >> min] > >> >> [INFO] Spark Project REPL . SUCCESS [ > >> >> 16.291 s] > >> >> [INFO] Spark Project YARN Shuffle Service . SUCCESS [ > >> >> 13.671 s] > >> >> [INFO] Spark Project YARN . SUCCESS [ > >> >> 20.554 s] > >> >> [INFO] Spark Project Hive Thrift Server ... SUCCESS [ > >> >> 14.332 s] > >> >> [INFO] Spark Project Assembly . SUCCESS > >> >> [03:33 > >> >> min] > >> >> [INFO] Spark Project External Twitter . SUCCESS [ > >> >> 14.208 s] > >> >> [INFO] Spark Project External Flume Sink .. SUCCESS [ > >> >> 11.535 s] > >> >> [INFO] Spark Project External Flume ... SUCCESS [ > >> >> 19.010 s] > >> >> [INFO] Spark Project External Flume Assembly .. SUCCESS [ > >> >> 5.210 s] > >> >> [INFO] Spark Project External MQTT FAILURE > >> >> [01:10 > >> >> min] > >> >> [INFO] Spark Project External MQTT Assembly ... SKIPPED > >> >> [INFO] Spark Project External ZeroMQ .. SKIPPED > >> >> [INFO] Spark Project External Kafka ... SKIPPED > >> >> [INFO] Spark Project Examples . SKIPPED > >> >> [INFO]
Re: 1.5 Build Errors
Yes, just reran with the following (spark_build)root@ip-10-45-130-206:~/spark# export MAVEN_OPTS="-Xmx4096mb > -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m" > (spark_build)root@ip-10-45-130-206:~/spark# build/mvn -Pyarn -Phadoop-2.4 > -Dhadoop.version=2.4.0 -DskipTests clean package and grepping for java root 641 9.9 0.3 4411732 49040 pts/4 Sl+ 17:35 0:01 > /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g > -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m > -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath > /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar > com.typesafe.zinc.Nailgun 3030 0 > root 687 226 2.0 1803664 312876 pts/4 Sl+ 17:36 0:22 > /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath > /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar > -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf > -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3 > -Dmaven.multiModuleProjectDirectory=/root/spark > org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 -Pyarn > -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package On Tue, Sep 8, 2015 at 1:14 PM, Sean Owen <so...@cloudera.com> wrote: > MAVEN_OPTS shouldn't affect zinc as it's an unrelated application. You > can run "zinc -J-Xmx4g..." in general, but in the provided script, > ZINC_OPTS seems to be the equivalent, yes. It kind of looks like your > mvn process isn't getting any special memory args there. Is MAVEN_OPTS > really exported? 
> > FWIW I use my own local mvn and zinc and it works fine. > > On Tue, Sep 8, 2015 at 6:05 PM, Benjamin Zaitlen <quasi...@gmail.com> > wrote: > > I'm running zinv while compiling. It seems that MAVEN_OPTS doesn't > really > > change much? Or perhaps I'm misunderstanding something -- grepping for > java > > i see > > > >> root 24355 102 8.8 4687376 1350724 pts/4 Sl 16:51 11:08 > >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g > >> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m > >> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath > >> > /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar > >> com.typesafe.zinc.Nailgun 3030 0 > >> root 25151 22.0 3.2 2269092 495276 pts/4 Sl+ 16:53 1:56 > >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath > >> > /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar > >> > -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf > >> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3 > >> -Dmaven.multiModuleProjectDirectory=/root/spark > >> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 clean > >> package -DskipTests -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4 > >> -Dhadoop.version=2.4.0 > > > > > > So the heap size is still 2g even with MAVEN_OPTS set with 4g. I noticed > > that within build/mvn _COMPILE_JVM_OPTS is set to 2g and this is what > > ZINC_OPTS is set to. > > > > --Ben > > > > > > On Tue, Sep 8, 2015 at 11:06 AM, Ted Yu <yuzhih...@gmail.com> wrote: > >> > >> Do you run Zinc while compiling ? 
> >> > >> Cheers > >> > >> On Tue, Sep 8, 2015 at 7:56 AM, Benjamin Zaitlen <quasi...@gmail.com> > >> wrote: > >>> > >>> I'm still getting errors with 3g. I've increase to 4g and I'll report > >>> back > >>> > >>> To be clear: > >>> > >>> export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M > >>> -XX:ReservedCodeCacheSize=1024m" > >>> > >>>> [ERROR] GC overhead limit exceeded -> [Help 1] > >>>> [ERROR] > >>
Re: 1.5 Build Errors
I'm running zinc while compiling. It seems that MAVEN_OPTS doesn't really change much? Or perhaps I'm misunderstanding something -- grepping for java I see

> root 24355 102 8.8 4687376 1350724 pts/4 Sl 16:51 11:08
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
> com.typesafe.zinc.Nailgun 3030 0
> root 25151 22.0 3.2 2269092 495276 pts/4 Sl+ 16:53 1:56
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
> -Dmaven.multiModuleProjectDirectory=/root/spark
> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 clean
> package -DskipTests -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4
> -Dhadoop.version=2.4.0

So the heap size is still 2g even with MAVEN_OPTS set with 4g. I noticed that within build/mvn _COMPILE_JVM_OPTS is set to 2g and this is what ZINC_OPTS is set to.

--Ben

On Tue, Sep 8, 2015 at 11:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> Do you run Zinc while compiling ?
>
> Cheers
>
> On Tue, Sep 8, 2015 at 7:56 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
>
>> I'm still getting errors with 3g.
I've increase to 4g and I'll report >> back >> >> To be clear: >> >> export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M >> -XX:ReservedCodeCacheSize=1024m" >> >> [ERROR] GC overhead limit exceeded -> [Help 1] >>> [ERROR] >>> [ERROR] To see the full stack trace of the errors, re-run Maven with the >>> -e switch. >>> [ERROR] Re-run Maven using the -X switch to enable full debug logging. >>> [ERROR] >>> [ERROR] For more information about the errors and possible solutions, >>> please read the following articles: >>> [ERROR] [Help 1] >>> http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError >>> + return 1 >>> + exit 1 >> >> >> On Tue, Sep 8, 2015 at 10:03 AM, Sean Owen <so...@cloudera.com> wrote: >> >>> It might need more memory in certain situations / running certain >>> tests. If 3gb works for your relatively full build, yes you can open a >>> PR to change any occurrences of lower recommendations to 3gb. >>> >>> On Tue, Sep 8, 2015 at 3:02 PM, Benjamin Zaitlen <quasi...@gmail.com> >>> wrote: >>> > Ah, right. Should've caught that. >>> > >>> > The docs seem to recommend 2gb. Should that be increased as well? >>> > >>> > --Ben >>> > >>> > On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote: >>> >> >>> >> It shows you there that Maven is out of memory. Give it more heap. I >>> use >>> >> 3gb. >>> >> >>> >> On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen <quasi...@gmail.com> >>> >> wrote: >>> >> > Hi All, >>> >> > >>> >> > I'm trying to build a distribution off of the latest in master and I >>> >> > keep >>> >> > getting errors on MQTT and the build fails. I'm running the build >>> on a >>> >> > m1.large which has 7.5 GB of RAM and no other major processes are >>> >> > running. 
>>> >> > >>> >> >> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M >>> -XX:ReservedCodeCacheSize=512m" >>> >> >> ./make-distribution.sh --name continuum-custom-spark-1.5 --tgz >>> -Pyarn >>> >> >> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0 >>> >> > >>> >> > >>> >> > >>> >> >> INFO] Spark Project GraphX ... SUCCESS >>> [ >>> >> >> 33.345 s] >>> >> >> [INFO] Spark Project Streaming SUCCESS >>>
Submitting Python Applications from Remote to Master
Hi All,

I'm not quite clear on whether submitting a python application to spark standalone on ec2 is possible. Am I reading this correctly:

"A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. Master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the client spark-submit process, with the input and output of the application attached to the console. Thus, this mode is especially suitable for applications that involve the REPL (e.g. Spark shell). Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors. Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or python applications."

So I shouldn't be able to do something like:

./bin/spark-submit --master spark://x.compute-1.amazonaws.com:7077 examples/src/main/python/pi.py

from a laptop connecting to a previously launched spark cluster using the default spark-ec2 script, correct?

If I am not mistaken about this, then the docs are slightly confusing -- the above example is more or less the example here: https://spark.apache.org/docs/1.1.0/submitting-applications.html

If I am mistaken, apologies, can you help me figure out where I went wrong? I've also taken to opening port 7077 to 0.0.0.0/0

--Ben
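Per the quoted passage, cluster mode wasn't available for python applications (or standalone clusters) in 1.1, so submission has to happen in client mode from a machine near the cluster — most simply the master node that spark-ec2 launched. A hedged sketch (key file and host are the placeholders from this thread; the `spark/` install path on the master is an assumption based on the default spark-ec2 layout):

```shell
# Copy the script to the master, then submit from there in client mode:
scp -i ~/.ssh/mykey.rsa examples/src/main/python/pi.py \
    root@x.compute-1.amazonaws.com:
ssh -i ~/.ssh/mykey.rsa root@x.compute-1.amazonaws.com \
    'spark/bin/spark-submit --master spark://x.compute-1.amazonaws.com:7077 pi.py'
```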
Re: iPython notebook ec2 cluster matlabplot not found?
Hi Andy,

I built an anaconda/spark AMI a few months ago. I'm still iterating on it, so if things break please report them. If you want to give it a whirl:

./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a ami-3ecd0c56

The nice thing about anaconda is that it comes pre-baked with ipython-notebook, matplotlib, the scipy stack, and many other libraries.

--Ben

On Mon, Sep 29, 2014 at 12:45 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

Hi Nicholas,

Yes, out of the box PySpark works. My problem is I am using iPython notebook and matplotlib is not found. It seems that out of the box the cluster has an old version of python and iPython notebook. It was suggested I upgrade iPython because the new version includes matplotlib. This upgrade requires going to python 2.7. The python upgrade and iPython upgrade seemed to work, however I am still getting my original problem:

ERROR: Line magic function `%matplotlib` not found

I also posted to the iPython-dev mail list. So far I have not found a solution. Maybe I'll have to switch to a different graphing package.

Thanks,

Andy

From: Nicholas Chammas <nicholas.cham...@gmail.com>
Date: Saturday, September 27, 2014 at 4:49 PM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: user@spark.apache.org
Subject: Re: iPython notebook ec2 cluster matlabplot not found?

Can you first confirm that the regular PySpark shell works on your cluster? Without upgrading to 2.7. That is, you log on to your master using spark-ec2 login and run bin/pyspark successfully without any special flags.

And as far as I can tell, you should be able to use IPython at 2.6, so I'd next confirm that that is working before throwing the 2.7 upgrade into the mix.

Also, when upgrading or installing things, try doing so for all the nodes in your cluster using pssh. If you install stuff just on the master without somehow transferring it to the slaves, that will be problematic.
Finally, there is an open pull request https://github.com/apache/spark/pull/2554 related to IPython that may be relevant, though I haven't looked at it too closely.

Nick

On Sat, Sep 27, 2014 at 7:33 PM, Andy Davidson <a...@santacruzintegration.com> wrote:

Hi,

I am having a heck of time trying to get python to work correctly on my cluster created using the spark-ec2 script. The following link was really helpful: https://issues.apache.org/jira/browse/SPARK-922

I am still running into problems with matplotlib. (It works fine on my mac.) I can not figure out how to get the libagg, freetype, or Qhull dependencies installed. Has anyone else run into this problem?

Thanks,

Andy

sudo yum install freetype-devel
sudo yum install libpng-devel
sudo pip2.7 install six
sudo pip2.7 install python-dateutil
sudo pip2.7 install pyparsing
sudo pip2.7 install pycxx
sudo pip2.7 install matplotlib

[ec2-user@ip-172-31-15-87 ~]$ sudo pip2.7 install matplotlib
Downloading/unpacking matplotlib
  Downloading matplotlib-1.4.0.tar.gz (51.2MB): 51.2MB downloaded
  Running setup.py (path:/tmp/pip_build_root/matplotlib/setup.py) egg_info for package matplotlib
    Edit setup.cfg to change the build options

    BUILDING MATPLOTLIB
        matplotlib: yes [1.4.0]
            python: yes [2.7.5 (default, Sep 15 2014, 17:30:20) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]]
          platform: yes [linux2]

    REQUIRED DEPENDENCIES AND EXTENSIONS
             numpy: yes [version 1.9.0]
               six: yes [using six version 1.8.0]
          dateutil: yes [using dateutil version 2.2]
           tornado: yes [using tornado version 4.0.2]
         pyparsing: yes [using pyparsing version 2.0.2]
             pycxx: yes [Couldn't import. Using local copy.]
            libagg: yes [pkg-config information for 'libagg' could not be found. Using local copy.]
          freetype: no [Requires freetype2 2.4 or later. Found 2.3.11.]
               png: yes [version 1.2.49]
             qhull: yes [pkg-config information for 'qhull' could not be found. Using local copy.]

    OPTIONAL SUBPACKAGES
       sample_data: yes [installing]
          toolkits: yes [installing]
             tests: yes [using nose version 1.3.4 / mock is required to run the matplotlib test suite. pip/easy_install may attempt to install it after matplotlib.]
    toolkits_tests: yes [using nose version 1.3.4 / mock is required to run the matplotlib test suite. pip/easy_install may attempt to install it after
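Since the pip build above stops on the freetype 2.4 requirement, one hedged workaround is to skip compiling entirely and install a pre-built matplotlib with conda (the Anaconda route discussed elsewhere on this list); the /opt/anaconda path and the pssh hosts file are assumptions:

```shell
# Binary matplotlib package: no local freetype/libagg/qhull build needed
/opt/anaconda/bin/conda install -y matplotlib

# Repeat on every slave as well, e.g. with pssh and a hosts file:
# pssh -h /root/spark-ec2/slaves '/opt/anaconda/bin/conda install -y matplotlib'
```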
TimeStamp selection with SparkSQL
I may have missed this, but is it possible to select on datetime in a SparkSQL query?

jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'")

Additionally, is there a guide as to what SQL is valid? The guide says, "Note that Spark SQL currently uses a very basic SQL parser". It would be great to post what is currently supported.

--Ben
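For what it's worth, with a basic SQL parser and dates stored as strings, comparisons on ISO-8601 formatted columns behave like date comparisons because the strings sort lexicographically in chronological order. A cluster-free stand-in sketch using stdlib sqlite3 (the table and data are made up for illustration):

```python
import sqlite3

# In-memory table standing in for the "Stocks" table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (datetime TEXT, price REAL)")
conn.executemany(
    "INSERT INTO stocks VALUES (?, ?)",
    [("2013-12-31", 10.0), ("2014-01-01", 11.0), ("2014-01-02", 12.0)],
)

# ISO-8601 date strings compare lexicographically in date order, so a
# plain string comparison selects the intended rows.
rows = conn.execute(
    "SELECT datetime, price FROM stocks"
    " WHERE datetime >= '2014-01-01' ORDER BY datetime"
).fetchall()
print(rows)  # [('2014-01-01', 11.0), ('2014-01-02', 12.0)]
```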
Re: Anaconda Spark AMI
Hi All,

Thanks to Jey's help, I have a release AMI candidate for spark-1.0/anaconda-2.0 integration. It's currently limited to availability in US-EAST: ami-3ecd0c56. Give it a try if you have some time. This should *just work* with spark 1.0:

./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a ami-3ecd0c56

If you have suggestions or run into trouble please email,

--Ben

PS: I found that writing a noop map function is a decent way to install pkgs on worker nodes (though most scientific pkgs are pre-installed with anaconda):

def subprocess_noop(x):
    import os
    os.system("/opt/anaconda/bin/conda install h5py")
    return 1

install_noop = rdd.map(subprocess_noop)
install_noop.count()

On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:

Hi Ben,

Has the PYSPARK_PYTHON environment variable been set in spark/conf/spark-env.sh to the path of the new python binary?

FYI, there's a /root/copy-dirs script that can be handy when updating files on an already-running cluster. You'll want to restart the spark cluster for the changes to take effect, as described at https://spark.apache.org/docs/latest/ec2-scripts.html

Hope that helps,
-Jey

On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:

Hi All,

I'm a dev at Continuum and we are developing a fair amount of tooling around Spark. A few days ago someone expressed interest in numpy+pyspark and Anaconda came up as a reasonable solution.

I spent a number of hours yesterday trying to rework the base Spark AMI on EC2 but sadly was defeated by a number of errors.
Aggregations seemed to choke -- where as small takes executed as aspected (errors are linked to the gist): sc.appName u'PySparkShell' sc._conf.getAll() [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'), (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''), (u'spark.app.name', u' PySparkShell'), (u'spark.executor.extraClassPath', u'/root/ephemeral-hdfs/conf'), (u'spark.master', u'spark://.compute-1.amazonaws.com:7077')] file = sc.textFile(hdfs:///user/root/chekhov.txt) file.take(2) [uProject Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov, u''] lines = file.filter(lambda x: len(x) 0) lines.count() VARIOUS ERROS DISCUSSED BELOW My first thought was that I could simply get away with including anaconda on the base AMI, point the path at /dir/anaconda/bin, and bake a new one. Doing so resulted in some strange py4j errors like the following: Py4JError: An error occurred while calling o17.partitions. Trace: py4j.Py4JException: Method partitions([]) does not exist At some point I also saw: SystemError: Objects/cellobject.c:24: bad argument to internal function which is really strange, possibly the result of a version mismatch? I had another thought of building spark from master on the AMI, leaving the spark directory in place, and removing the spark call from the modules list in spark-ec2 launch script. Unfortunately, this resulted in the following errors: https://gist.github.com/quasiben/da0f4778fbc87d02c088 If a spark dev was willing to make some time in the near future, I'm sure she/he and I could sort out these issues and give the Spark community a python distro ready to go for numerical computing. For instance, I'm not sure how pyspark calls out to launching a python session on a slave? Is this done as root or as the hadoop user? (i believe i changed /etc/bashrc to point to my anaconda bin directory so it shouldn't really matter. 
Is there something special about the py4j zip include in spark dir compared with the py4j in pypi? Thoughts? --Ben
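The noop-map trick from the PS above generalizes: any side-effecting function mapped over an RDD and forced with count() gets executed on the workers holding the partitions. A cluster-free sketch of the shape of the pattern (no Spark required; plain map/len stand in for rdd.map/count, and the install command is only shown in a comment):

```python
# Record of which "tasks" ran; on a real cluster the function body would
# instead shell out, e.g. os.system("/opt/anaconda/bin/conda install h5py").
calls = []

def subprocess_noop(x):
    calls.append(x)  # stand-in for the side-effecting install command
    return 1

# rdd.map(subprocess_noop).count() forces the map on every element;
# list(map(...)) plus len() plays that role in this local sketch.
results = list(map(subprocess_noop, range(4)))
count = len(results)
print(count)  # 4
```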
Anaconda Spark AMI
Hi All,

I'm a dev at Continuum and we are developing a fair amount of tooling around Spark. A few days ago someone expressed interest in numpy+pyspark and Anaconda came up as a reasonable solution.

I spent a number of hours yesterday trying to rework the base Spark AMI on EC2 but sadly was defeated by a number of errors. Aggregations seemed to choke, whereas small takes executed as expected (errors are linked to the gist):

>>> sc.appName
u'PySparkShell'
>>> sc._conf.getAll()
[(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
(u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
(u'spark.app.name', u'PySparkShell'),
(u'spark.executor.extraClassPath', u'/root/ephemeral-hdfs/conf'),
(u'spark.master', u'spark://.compute-1.amazonaws.com:7077')]
>>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
>>> file.take(2)
[u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov", u'']
>>> lines = file.filter(lambda x: len(x) > 0)
>>> lines.count()
VARIOUS ERRORS DISCUSSED BELOW

My first thought was that I could simply get away with including anaconda on the base AMI, point the path at /dir/anaconda/bin, and bake a new one. Doing so resulted in some strange py4j errors like the following:

Py4JError: An error occurred while calling o17.partitions. Trace:
py4j.Py4JException: Method partitions([]) does not exist

At some point I also saw:

SystemError: Objects/cellobject.c:24: bad argument to internal function

which is really strange, possibly the result of a version mismatch?

I had another thought of building spark from master on the AMI, leaving the spark directory in place, and removing the spark call from the modules list in the spark-ec2 launch script. Unfortunately, this resulted in the following errors: https://gist.github.com/quasiben/da0f4778fbc87d02c088

If a spark dev was willing to make some time in the near future, I'm sure she/he and I could sort out these issues and give the Spark community a python distro ready to go for numerical computing.
For instance, I'm not sure how pyspark calls out to launching a python session on a slave? Is this done as root or as the hadoop user? (I believe I changed /etc/bashrc to point to my anaconda bin directory, so it shouldn't really matter.)

Is there something special about the py4j zip included in the spark dir compared with the py4j in pypi?

Thoughts?

--Ben
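Jey's answer upthread boils down to one setting; as a config sketch for spark/conf/spark-env.sh (the interpreter path is the Anaconda location used in this thread):

```shell
# spark/conf/spark-env.sh -- point pyspark (driver and workers) at the
# Anaconda interpreter instead of the system python:
export PYSPARK_PYTHON=/opt/anaconda/bin/python

# Push the change to the slaves with the /root/copy-dirs script Jey
# mentions, then restart the cluster so executors pick it up.
```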