Re: Error with --files

2016-04-14 Thread Benjamin Zaitlen
That fixed it!

Thank you!

--Ben

On Thu, Apr 14, 2016 at 5:53 PM, Marcelo Vanzin <van...@cloudera.com> wrote:

> On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen <quasi...@gmail.com>
> wrote:
> >> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
> >> /home/ubuntu/localtest.txt#appSees.txt
>
> --files should come before the path to your python script. Otherwise
> it's just passed as arguments to your script when it's run.
>
> --
> Marcelo
>
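
For anyone who lands on this thread later, a minimal sketch of the corrected
invocation and of reading the distributed file from the job (paths are the ones
from this thread; per the Spark-on-YARN docs, the #appSees.txt suffix is the
name the file gets inside each YARN container's working directory):

spark-submit --master yarn-cluster \
  --files /home/ubuntu/localtest.txt#appSees.txt \
  /home/ubuntu/test_spark.py

# inside test_spark.py, the localized copy can be opened by its alias:
with open("appSees.txt") as f:
    data = f.read()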


Error with --files

2016-04-14 Thread Benjamin Zaitlen
Hi All,

I'm trying to use the --files option with yarn:

spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
> /home/ubuntu/localtest.txt#appSees.txt


I never see the file in HDFS or in the YARN containers.  Am I doing
something incorrect?

I'm running Spark 1.6.0.


Thanks,
--Ben


Re: 1.5 Build Errors

2015-10-06 Thread Benjamin Zaitlen
Hi All,

Sean patiently worked with me in solving this issue.  The problem was
entirely my fault: in my settings, the MAVEN_OPTS env variable was set and
was overriding everything.
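
For anyone who hits the same symptom, it is worth checking what the build
shell actually sees before blaming Maven itself.  A minimal sketch, using the
heap size Sean suggested earlier in the thread:

# look for a stale value exported from a profile or wrapper script
env | grep MAVEN_OPTS
# then set a known-good value in the same shell that runs the build
export MAVEN_OPTS="-Xmx3g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=512m"
./make-distribution.sh --name continuum-custom-spark-1.5 --tgz -Pyarn \
  -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0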

--Ben

On Tue, Sep 8, 2015 at 1:37 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:

> Yes, just reran with the following
>
> (spark_build)root@ip-10-45-130-206:~/spark# export MAVEN_OPTS="-Xmx4096mb
>> -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"
>> (spark_build)root@ip-10-45-130-206:~/spark# build/mvn -Pyarn
>> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>
>
> and grepping for java
>
>
> root   641  9.9  0.3 4411732 49040 pts/4   Sl+  17:35   0:01
>> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
>> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
>> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
>> com.typesafe.zinc.Nailgun 3030 0
>> root   687  226  2.0 1803664 312876 pts/4  Sl+  17:36   0:22
>> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
>> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
>> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
>> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
>> -Dmaven.multiModuleProjectDirectory=/root/spark
>> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 -Pyarn
>> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
>
>
> On Tue, Sep 8, 2015 at 1:14 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> MAVEN_OPTS shouldn't affect zinc as it's an unrelated application. You
>> can run "zinc -J-Xmx4g..." in general, but in the provided script,
>> ZINC_OPTS seems to be the equivalent, yes. It kind of looks like your
>> mvn process isn't getting any special memory args there. Is MAVEN_OPTS
>> really exported?
>>
>> FWIW I use my own local mvn and zinc and it works fine.
>>
>> On Tue, Sep 8, 2015 at 6:05 PM, Benjamin Zaitlen <quasi...@gmail.com>
>> wrote:
>> > I'm running zinv while compiling.  It seems that MAVEN_OPTS doesn't
>> really
>> > change much?  Or perhaps I'm misunderstanding something -- grepping for
>> java
>> > i see
>> >
>> >> root 24355  102  8.8 4687376 1350724 pts/4 Sl   16:51  11:08
>> >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
>> >> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
>> >> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
>> >>
>> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
>> >> com.typesafe.zinc.Nailgun 3030 0
>> >> root 25151 22.0  3.2 2269092 495276 pts/4  Sl+  16:53   1:56
>> >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
>> >>
>> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
>> >>
>> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
>> >> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
>> >> -Dmaven.multiModuleProjectDirectory=/root/spark
>> >> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 clean
>> >> package -DskipTests -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4
>> >> -Dhadoop.version=2.4.0
>> >
>> >
>> > So the heap size is still 2g even with MAVEN_OPTS set with 4g.  I
>> noticed
>> > that within build/mvn _COMPILE_JVM_OPTS is set to 2g and this is what
>> > ZINC_OPTS is set to.
>> >
>> > --Ben
>> >
>> >
>> > On Tue, Sep 8, 2015 at 11:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >>
>> >> Do you run Zinc while compiling ?
>> >>
>> >> Ch

1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
Hi All,

I'm trying to build a distribution off of the latest in master and I keep
getting errors on MQTT and the build fails.  I'm running the build on an
m1.large, which has 7.5 GB of RAM, and no other major processes are running.

MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> ./make-distribution.sh  --name continuum-custom-spark-1.5 --tgz -Pyarn
> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0



> [INFO] Spark Project GraphX ........................... SUCCESS [ 33.345 s]
> [INFO] Spark Project Streaming ........................ SUCCESS [01:08 min]
> [INFO] Spark Project Catalyst ......................... SUCCESS [01:39 min]
> [INFO] Spark Project SQL .............................. SUCCESS [02:06 min]
> [INFO] Spark Project ML Library ....................... SUCCESS [02:16 min]
> [INFO] Spark Project Tools ............................ SUCCESS [  4.087 s]
> [INFO] Spark Project Hive ............................. SUCCESS [01:28 min]
> [INFO] Spark Project REPL ............................. SUCCESS [ 16.291 s]
> [INFO] Spark Project YARN Shuffle Service ............. SUCCESS [ 13.671 s]
> [INFO] Spark Project YARN ............................. SUCCESS [ 20.554 s]
> [INFO] Spark Project Hive Thrift Server ............... SUCCESS [ 14.332 s]
> [INFO] Spark Project Assembly ......................... SUCCESS [03:33 min]
> [INFO] Spark Project External Twitter ................. SUCCESS [ 14.208 s]
> [INFO] Spark Project External Flume Sink .............. SUCCESS [ 11.535 s]
> [INFO] Spark Project External Flume ................... SUCCESS [ 19.010 s]
> [INFO] Spark Project External Flume Assembly .......... SUCCESS [  5.210 s]
> [INFO] Spark Project External MQTT .................... FAILURE [01:10 min]
> [INFO] Spark Project External MQTT Assembly ........... SKIPPED
> [INFO] Spark Project External ZeroMQ .................. SKIPPED
> [INFO] Spark Project External Kafka ................... SKIPPED
> [INFO] Spark Project Examples ......................... SKIPPED
> [INFO] Spark Project External Kafka Assembly .......... SKIPPED
> [INFO] ------------------------------------------------------------------------
> [INFO] BUILD FAILURE
> [INFO] ------------------------------------------------------------------------
> [INFO] Total time: 22:55 min
> [INFO] Finished at: 2015-09-07T22:42:57+00:00
> [INFO] Final Memory: 240M/455M
> [INFO] ------------------------------------------------------------------------
> [ERROR] GC overhead limit exceeded -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
> + return 1
> + exit 1

Any thoughts would be extremely helpful.

--Ben


Re: 1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
Ah, right.  Should've caught that.

The docs seem to recommend 2gb.  Should that be increased as well?

--Ben

On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote:

> It shows you there that Maven is out of memory. Give it more heap. I use
> 3gb.
>
> On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen <quasi...@gmail.com>
> wrote:
> > Hi All,
> >
> > I'm trying to build a distribution off of the latest in master and I keep
> > getting errors on MQTT and the build fails.   I'm running the build on a
> > m1.large which has 7.5 GB of RAM and no other major processes are
> running.
> >
> >> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
> >> ./make-distribution.sh  --name continuum-custom-spark-1.5 --tgz -Pyarn
> >> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0
> >
> >
> >
> >> INFO] Spark Project GraphX ... SUCCESS [
> >> 33.345 s]
> >> [INFO] Spark Project Streaming  SUCCESS
> [01:08
> >> min]
> >> [INFO] Spark Project Catalyst . SUCCESS
> [01:39
> >> min]
> >> [INFO] Spark Project SQL .. SUCCESS
> [02:06
> >> min]
> >> [INFO] Spark Project ML Library ... SUCCESS
> [02:16
> >> min]
> >> [INFO] Spark Project Tools  SUCCESS [
> >> 4.087 s]
> >> [INFO] Spark Project Hive . SUCCESS
> [01:28
> >> min]
> >> [INFO] Spark Project REPL . SUCCESS [
> >> 16.291 s]
> >> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
> >> 13.671 s]
> >> [INFO] Spark Project YARN . SUCCESS [
> >> 20.554 s]
> >> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
> >> 14.332 s]
> >> [INFO] Spark Project Assembly . SUCCESS
> [03:33
> >> min]
> >> [INFO] Spark Project External Twitter . SUCCESS [
> >> 14.208 s]
> >> [INFO] Spark Project External Flume Sink .. SUCCESS [
> >> 11.535 s]
> >> [INFO] Spark Project External Flume ... SUCCESS [
> >> 19.010 s]
> >> [INFO] Spark Project External Flume Assembly .. SUCCESS [
> >> 5.210 s]
> >> [INFO] Spark Project External MQTT  FAILURE
> [01:10
> >> min]
> >> [INFO] Spark Project External MQTT Assembly ... SKIPPED
> >> [INFO] Spark Project External ZeroMQ .. SKIPPED
> >> [INFO] Spark Project External Kafka ... SKIPPED
> >> [INFO] Spark Project Examples . SKIPPED
> >> [INFO] Spark Project External Kafka Assembly .. SKIPPED
> >> [INFO]
> >> 
> >> [INFO] BUILD FAILURE
> >> [INFO]
> >> 
> >> [INFO] Total time: 22:55 min
> >> [INFO] Finished at: 2015-09-07T22:42:57+00:00
> >> [INFO] Final Memory: 240M/455M
> >> [INFO]
> >> 
> >> [ERROR] GC overhead limit exceeded -> [Help 1]
> >> [ERROR]
> >> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> >> -e switch.
> >> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> >> [ERROR]
> >> [ERROR] For more information about the errors and possible solutions,
> >> please read the following articles:
> >> [ERROR] [Help 1]
> >> http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
> >> + return 1
> >> + exit 1
> >
> >
> > Any thoughts would be extremely helpful.
> >
> > --Ben
>


Re: 1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
I'm still getting errors with 3g.  I've increased to 4g and I'll report back.

To be clear:

export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M
-XX:ReservedCodeCacheSize=1024m"

[ERROR] GC overhead limit exceeded -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the
> -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
> + return 1
> + exit 1


On Tue, Sep 8, 2015 at 10:03 AM, Sean Owen <so...@cloudera.com> wrote:

> It might need more memory in certain situations / running certain
> tests. If 3gb works for your relatively full build, yes you can open a
> PR to change any occurrences of lower recommendations to 3gb.
>
> On Tue, Sep 8, 2015 at 3:02 PM, Benjamin Zaitlen <quasi...@gmail.com>
> wrote:
> > Ah, right.  Should've caught that.
> >
> > The docs seem to recommend 2gb.  Should that be increased as well?
> >
> > --Ben
> >
> > On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> It shows you there that Maven is out of memory. Give it more heap. I use
> >> 3gb.
> >>
> >> On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen <quasi...@gmail.com>
> >> wrote:
> >> > Hi All,
> >> >
> >> > I'm trying to build a distribution off of the latest in master and I
> >> > keep
> >> > getting errors on MQTT and the build fails.   I'm running the build
> on a
> >> > m1.large which has 7.5 GB of RAM and no other major processes are
> >> > running.
> >> >
> >> >> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
> -XX:ReservedCodeCacheSize=512m"
> >> >> ./make-distribution.sh  --name continuum-custom-spark-1.5 --tgz
> -Pyarn
> >> >> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0
> >> >
> >> >
> >> >
> >> >> INFO] Spark Project GraphX ... SUCCESS [
> >> >> 33.345 s]
> >> >> [INFO] Spark Project Streaming  SUCCESS
> >> >> [01:08
> >> >> min]
> >> >> [INFO] Spark Project Catalyst . SUCCESS
> >> >> [01:39
> >> >> min]
> >> >> [INFO] Spark Project SQL .. SUCCESS
> >> >> [02:06
> >> >> min]
> >> >> [INFO] Spark Project ML Library ... SUCCESS
> >> >> [02:16
> >> >> min]
> >> >> [INFO] Spark Project Tools  SUCCESS [
> >> >> 4.087 s]
> >> >> [INFO] Spark Project Hive . SUCCESS
> >> >> [01:28
> >> >> min]
> >> >> [INFO] Spark Project REPL . SUCCESS [
> >> >> 16.291 s]
> >> >> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
> >> >> 13.671 s]
> >> >> [INFO] Spark Project YARN . SUCCESS [
> >> >> 20.554 s]
> >> >> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
> >> >> 14.332 s]
> >> >> [INFO] Spark Project Assembly . SUCCESS
> >> >> [03:33
> >> >> min]
> >> >> [INFO] Spark Project External Twitter . SUCCESS [
> >> >> 14.208 s]
> >> >> [INFO] Spark Project External Flume Sink .. SUCCESS [
> >> >> 11.535 s]
> >> >> [INFO] Spark Project External Flume ... SUCCESS [
> >> >> 19.010 s]
> >> >> [INFO] Spark Project External Flume Assembly .. SUCCESS [
> >> >> 5.210 s]
> >> >> [INFO] Spark Project External MQTT  FAILURE
> >> >> [01:10
> >> >> min]
> >> >> [INFO] Spark Project External MQTT Assembly ... SKIPPED
> >> >> [INFO] Spark Project External ZeroMQ .. SKIPPED
> >> >> [INFO] Spark Project External Kafka ... SKIPPED
> >> >> [INFO] Spark Project Examples . SKIPPED
> >> >> [INFO] 

Re: 1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
Yes, just reran with the following

(spark_build)root@ip-10-45-130-206:~/spark# export MAVEN_OPTS="-Xmx4096mb
> -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"
> (spark_build)root@ip-10-45-130-206:~/spark# build/mvn -Pyarn -Phadoop-2.4
> -Dhadoop.version=2.4.0 -DskipTests clean package


and grepping for java


root   641  9.9  0.3 4411732 49040 pts/4   Sl+  17:35   0:01
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
> com.typesafe.zinc.Nailgun 3030 0
> root   687  226  2.0 1803664 312876 pts/4  Sl+  17:36   0:22
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
> -Dmaven.multiModuleProjectDirectory=/root/spark
> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 -Pyarn
> -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package


On Tue, Sep 8, 2015 at 1:14 PM, Sean Owen <so...@cloudera.com> wrote:

> MAVEN_OPTS shouldn't affect zinc as it's an unrelated application. You
> can run "zinc -J-Xmx4g..." in general, but in the provided script,
> ZINC_OPTS seems to be the equivalent, yes. It kind of looks like your
> mvn process isn't getting any special memory args there. Is MAVEN_OPTS
> really exported?
>
> FWIW I use my own local mvn and zinc and it works fine.
>
> On Tue, Sep 8, 2015 at 6:05 PM, Benjamin Zaitlen <quasi...@gmail.com>
> wrote:
> > I'm running zinv while compiling.  It seems that MAVEN_OPTS doesn't
> really
> > change much?  Or perhaps I'm misunderstanding something -- grepping for
> java
> > i see
> >
> >> root 24355  102  8.8 4687376 1350724 pts/4 Sl   16:51  11:08
> >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
> >> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
> >> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
> >>
> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
> >> com.typesafe.zinc.Nailgun 3030 0
> >> root 25151 22.0  3.2 2269092 495276 pts/4  Sl+  16:53   1:56
> >> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
> >>
> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
> >>
> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
> >> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
> >> -Dmaven.multiModuleProjectDirectory=/root/spark
> >> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 clean
> >> package -DskipTests -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4
> >> -Dhadoop.version=2.4.0
> >
> >
> > So the heap size is still 2g even with MAVEN_OPTS set with 4g.  I noticed
> > that within build/mvn _COMPILE_JVM_OPTS is set to 2g and this is what
> > ZINC_OPTS is set to.
> >
> > --Ben
> >
> >
> > On Tue, Sep 8, 2015 at 11:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >> Do you run Zinc while compiling ?
> >>
> >> Cheers
> >>
> >> On Tue, Sep 8, 2015 at 7:56 AM, Benjamin Zaitlen <quasi...@gmail.com>
> >> wrote:
> >>>
> >>> I'm still getting errors with 3g.  I've increase to 4g and I'll report
> >>> back
> >>>
> >>> To be clear:
> >>>
> >>> export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M
> >>> -XX:ReservedCodeCacheSize=1024m"
> >>>
> >>>> [ERROR] GC overhead limit exceeded -> [Help 1]
> >>>> [ERROR]
> >>

Re: 1.5 Build Errors

2015-09-08 Thread Benjamin Zaitlen
I'm running zinc while compiling.  It seems that MAVEN_OPTS doesn't really
change much?  Or perhaps I'm misunderstanding something -- grepping for
java I see

root 24355  102  8.8 4687376 1350724 pts/4 Sl   16:51  11:08
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -Xmx2g
> -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m
> -Dzinc.home=/root/spark/build/zinc-0.3.5.3 -classpath
> /root/spark/build/zinc-0.3.5.3/lib/compiler-interface-sources.jar:/root/spark/build/zinc-0.3.5.3/lib/incremental-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/nailgun-server.jar:/root/spark/build/zinc-0.3.5.3/lib/sbt-interface.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-compiler.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-library.jar:/root/spark/build/zinc-0.3.5.3/lib/scala-reflect.jar:/root/spark/build/zinc-0.3.5.3/lib/zinc.jar
> com.typesafe.zinc.Nailgun 3030 0
> root 25151 22.0  3.2 2269092 495276 pts/4  Sl+  16:53   1:56
> /usr/lib/jvm/java-7-openjdk-amd64/bin/java -Xms256m -Xmx512m -classpath
> /opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/boot/plexus-classworlds-2.5.2.jar
> -Dclassworlds.conf=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3/bin/m2.conf
> -Dmaven.home=/opt/anaconda/envs/spark_build/share/apache-maven-3.3.3
> -Dmaven.multiModuleProjectDirectory=/root/spark
> org.codehaus.plexus.classworlds.launcher.Launcher -DzincPort=3030 clean
> package -DskipTests -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4
> -Dhadoop.version=2.4.0


So the heap size is still 2g even with MAVEN_OPTS set to 4g.  I noticed
that within build/mvn, _COMPILE_JVM_OPTS is set to 2g, and that is what
ZINC_OPTS is set to.
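
Based on Sean's note earlier in the thread that ZINC_OPTS is the build/mvn
equivalent of MAVEN_OPTS, a minimal sketch of exporting both before the build
(whether an exported ZINC_OPTS overrides _COMPILE_JVM_OPTS is an assumption
worth verifying against the build/mvn script in your checkout):

export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"
export ZINC_OPTS="-Xmx4g -XX:MaxPermSize=1024M -XX:ReservedCodeCacheSize=1024m"
build/mvn -Pyarn -Phive -Phive-thriftserver -Phadoop-2.4 \
  -Dhadoop.version=2.4.0 -DskipTests clean package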

--Ben


On Tue, Sep 8, 2015 at 11:06 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Do you run Zinc while compiling ?
>
> Cheers
>
> On Tue, Sep 8, 2015 at 7:56 AM, Benjamin Zaitlen <quasi...@gmail.com>
> wrote:
>
>> I'm still getting errors with 3g.  I've increase to 4g and I'll report
>> back
>>
>> To be clear:
>>
>> export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1024M
>> -XX:ReservedCodeCacheSize=1024m"
>>
>> [ERROR] GC overhead limit exceeded -> [Help 1]
>>> [ERROR]
>>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>>> -e switch.
>>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>>> [ERROR]
>>> [ERROR] For more information about the errors and possible solutions,
>>> please read the following articles:
>>> [ERROR] [Help 1]
>>> http://cwiki.apache.org/confluence/display/MAVEN/OutOfMemoryError
>>> + return 1
>>> + exit 1
>>
>>
>> On Tue, Sep 8, 2015 at 10:03 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> It might need more memory in certain situations / running certain
>>> tests. If 3gb works for your relatively full build, yes you can open a
>>> PR to change any occurrences of lower recommendations to 3gb.
>>>
>>> On Tue, Sep 8, 2015 at 3:02 PM, Benjamin Zaitlen <quasi...@gmail.com>
>>> wrote:
>>> > Ah, right.  Should've caught that.
>>> >
>>> > The docs seem to recommend 2gb.  Should that be increased as well?
>>> >
>>> > --Ben
>>> >
>>> > On Tue, Sep 8, 2015 at 9:33 AM, Sean Owen <so...@cloudera.com> wrote:
>>> >>
>>> >> It shows you there that Maven is out of memory. Give it more heap. I
>>> use
>>> >> 3gb.
>>> >>
>>> >> On Tue, Sep 8, 2015 at 1:53 PM, Benjamin Zaitlen <quasi...@gmail.com>
>>> >> wrote:
>>> >> > Hi All,
>>> >> >
>>> >> > I'm trying to build a distribution off of the latest in master and I
>>> >> > keep
>>> >> > getting errors on MQTT and the build fails.   I'm running the build
>>> on a
>>> >> > m1.large which has 7.5 GB of RAM and no other major processes are
>>> >> > running.
>>> >> >
>>> >> >> MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M
>>> -XX:ReservedCodeCacheSize=512m"
>>> >> >> ./make-distribution.sh  --name continuum-custom-spark-1.5 --tgz
>>> -Pyarn
>>> >> >> -Phive -Phive-thriftserver -Phadoop-2.4 -Dhadoop.version=2.4.0
>>> >> >
>>> >> >
>>> >> >
>>> >> >> INFO] Spark Project GraphX ... SUCCESS
>>> [
>>> >> >> 33.345 s]
>>> >> >> [INFO] Spark Project Streaming  SUCCESS
>>> 

Submitting Python Applications from Remote to Master

2014-11-14 Thread Benjamin Zaitlen
Hi All,

I'm not quite clear on whether submitting a Python application to Spark
standalone on EC2 is possible.

Am I reading this correctly:

A common deployment strategy is to submit your application from a gateway
machine that is physically co-located with your worker machines (e.g.
Master node in a standalone EC2 cluster). In this setup, client mode is
appropriate. In client mode, the driver is launched directly within the
client spark-submit process, with the input and output of the application
attached to the console. Thus, this mode is especially suitable for
applications that involve the REPL (e.g. Spark shell).

Alternatively, if your application is submitted from a machine far from the
worker machines (e.g. locally on your laptop), it is common to use cluster mode
to minimize network latency between the drivers and the executors. Note
that cluster mode is currently not supported for standalone clusters, Mesos
clusters, or python applications.


So I shouldn't be able to do something like:

./bin/spark-submit  --master spark://x.compute-1.amazonaws.com:7077
 examples/src/main/python/pi.py


From a laptop connecting to a previously launched spark cluster using the
default spark-ec2 script, correct?


If I am not mistaken about this, then the docs are slightly confusing -- the
above example is more or less the example here:
https://spark.apache.org/docs/1.1.0/submitting-applications.html


If I am mistaken, apologies, can you help me figure out where I went wrong?

I've also taken to opening port 7077 to 0.0.0.0/0
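
If I'm reading the quoted passage right, the supported path for a Python app
on a standalone cluster is client mode, either from the EC2 master itself (the
"gateway machine" case) or from the laptop, which pays the driver-to-executor
latency.  A sketch of the first option, with the master hostname left as a
placeholder:

# run this on the EC2 master node; in client mode the driver runs wherever
# spark-submit runs
./bin/spark-submit --master spark://<master-hostname>:7077 \
  examples/src/main/python/pi.py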

--Ben


Re: iPython notebook ec2 cluster matlabplot not found?

2014-09-29 Thread Benjamin Zaitlen
Hi Andy,

I built an Anaconda/Spark AMI a few months ago.  I'm still iterating on it,
so if things break please report them.  If you want to give it a whirl:
./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa  -a ami-3ecd0c56

The nice thing about Anaconda is that it comes pre-baked with IPython
Notebook, matplotlib, the SciPy stack, and many other libraries.

--Ben

On Mon, Sep 29, 2014 at 12:45 PM, Andy Davidson 
a...@santacruzintegration.com wrote:

 Hi Nicholas

 Yes out of the box PySpark works. My problem is I am using iPython note
 book and matlabplot is not found. It seems that out of the box the cluster
 has an old version of python and iPython notebook. It was suggested I
 upgrade iPython because the new version include matlabplot. This upgrade
 requires going to python 2.7. The python upgrade and iPython upgrade seemed
 to work how ever I am still getting my original problem

 ERROR: Line magic function `%matplotlib` not found

 I also posted to the iPython-dev mail list. So far I have not found a
 solution. Maybe I’ll have to switch to a different graphing package

 Thanks

 Andy

 From: Nicholas Chammas nicholas.cham...@gmail.com
 Date: Saturday, September 27, 2014 at 4:49 PM
 To: Andrew Davidson a...@santacruzintegration.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: iPython notebook ec2 cluster matlabplot not found?

 Can you first confirm that the regular PySpark shell works on your
 cluster? Without upgrading to 2.7. That is, you log on to your master using 
 spark-ec2
 login and run bin/pyspark successfully without any special flags.

 And as far as I can tell, you should be able to use IPython at 2.6, so I’d
 next confirm that that is working before throwing the 2.7 upgrade into the
 mix.

 Also, when upgrading or installing things, try doing so for all the nodes
 in your cluster using pssh. If you install stuff just on the master
 without somehow transferring it to the slaves, that will be problematic.

 Finally, there is an open pull request
 https://github.com/apache/spark/pull/2554 related to IPython that may
 be relevant, though I haven’t looked at it too closely.

 Nick

 On Sat, Sep 27, 2014 at 7:33 PM, Andy Davidson 
 a...@santacruzintegration.com wrote:

 Hi

 I am having a heck of time trying to get python to work correctly on my
 cluster created using  the spark-ec2 script

 The following link was really helpful
 https://issues.apache.org/jira/browse/SPARK-922


 I am still running into problem with matplotlib. (it works fine on my
 mac). I can not figure out how to get libagg, freetype, or Qhull
 dependencies installed.

 Has anyone else run into this problem?

 Thanks

 Andy

 sudo yum install freetype-devel

 sudo yum install libpng-devel

 sudo pip2.7 install six

 sudo pip2.7 install python-dateutil

 sudo pip2.7 install pyparsing

 sudo pip2.7 install pycxx


 sudo pip2.7 install matplotlib

 [ec2-user@ip-172-31-15-87 ~]$ sudo pip2.7 install matplotlib
 Downloading/unpacking matplotlib
   Downloading matplotlib-1.4.0.tar.gz (51.2MB): 51.2MB downloaded
   Running setup.py (path:/tmp/pip_build_root/matplotlib/setup.py) egg_info for package matplotlib

 Edit setup.cfg to change the build options

 BUILDING MATPLOTLIB
             matplotlib: yes [1.4.0]
                 python: yes [2.7.5 (default, Sep 15 2014, 17:30:20) [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]]
               platform: yes [linux2]

 REQUIRED DEPENDENCIES AND EXTENSIONS
                  numpy: yes [version 1.9.0]
                    six: yes [using six version 1.8.0]
               dateutil: yes [using dateutil version 2.2]
                tornado: yes [using tornado version 4.0.2]
              pyparsing: yes [using pyparsing version 2.0.2]
                  pycxx: yes [Couldn't import.  Using local copy.]
                 libagg: yes [pkg-config information for 'libagg' could not be found. Using local copy.]
               freetype: no  [Requires freetype2 2.4 or later.  Found 2.3.11.]
                    png: yes [version 1.2.49]
                  qhull: yes [pkg-config information for 'qhull' could not be found. Using local copy.]

 OPTIONAL SUBPACKAGES
            sample_data: yes [installing]
               toolkits: yes [installing]
                  tests: yes [using nose version 1.3.4 / mock is required to run the matplotlib test suite. pip/easy_install may attempt to install it after matplotlib.]
         toolkits_tests: yes [using nose version 1.3.4 / mock is required to run the matplotlib test suite. pip/easy_install may attempt to install it after 

TimeStamp selection with SparkSQL

2014-09-04 Thread Benjamin Zaitlen
I may have missed this but is it possible to select on datetime in a
SparkSQL query?

jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'")

Additionally, is there a guide as to what SQL is valid?  The guide says,
"Note that Spark SQL currently uses a very basic SQL parser."  It would be
great to post what is currently supported.
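
To make the question concrete, here is a minimal PySpark sketch of the kind of
range filter I mean (it assumes a table named Stocks is already registered
with sqlContext and has a datetime column; whether the basic parser accepts
the comparison is exactly what I'm unsure of):

jan = sqlContext.sql(
    "SELECT * FROM Stocks WHERE datetime >= '2014-01-01' "
    "AND datetime < '2014-02-01'")
jan.count()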

--Ben


Re: Anaconda Spark AMI

2014-07-12 Thread Benjamin Zaitlen
Hi All,

Thanks to Jey's help, I have a release AMI candidate for
spark-1.0/anaconda-2.0 integration.  It's currently limited to availability
in US-EAST: ami-3ecd0c56

Give it a try if you have some time.  This should *just work* with Spark
1.0:

./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa  -a ami-3ecd0c56

If you have suggestions or run into trouble please email,

--Ben

PS:  I found that writing a noop map function is a decent way to install
pkgs on worker nodes (though most scientific pkgs are pre-installed with
Anaconda):

def subprocess_noop(x):
    import os
    os.system("/opt/anaconda/bin/conda install h5py")
    return 1

install_noop = rdd.map(subprocess_noop)
install_noop.count()
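
A variant of the same trick, expressed per partition so the install runs once
per task rather than once per record (a sketch -- it still assumes
/opt/anaconda exists on every worker and that conda's -y/--yes flag is
acceptable for unattended installs):

def install_pkgs(partition):
    import os
    os.system("/opt/anaconda/bin/conda install -y h5py")
    return [1]

sc.parallelize(range(1000), 100).mapPartitions(install_pkgs).count()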


On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Hi Ben,

 Has the PYSPARK_PYTHON environment variable been set in
 spark/conf/spark-env.sh to the path of the new python binary?

 FYI, there's a /root/copy-dirs script that can be handy when updating
 files on an already-running cluster. You'll want to restart the spark
 cluster for the changes to take effect, as described at
 https://spark.apache.org/docs/latest/ec2-scripts.html

 Hope that helps,
 -Jey

 On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com
 wrote:
  Hi All,
 
  I'm a dev a Continuum and we are developing a fair amount of tooling
 around
  Spark.  A few days ago someone expressed interest in numpy+pyspark and
  Anaconda came up as a reasonable solution.
 
  I spent a number of hours yesterday trying to rework the base Spark AMI
 on
  EC2 but sadly was defeated by a number of errors.
 
  Aggregations seemed to choke -- where as small takes executed as aspected
  (errors are linked to the gist):
 
  >>> sc.appName
  u'PySparkShell'
  >>> sc._conf.getAll()
  [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
  (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
  (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
  u'/root/ephemeral-hdfs/conf'), (u'spark.master',
  u'spark://.compute-1.amazonaws.com:7077')]
  >>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
  >>> file.take(2)
  [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
  u'']

  >>> lines = file.filter(lambda x: len(x) > 0)
  >>> lines.count()
  VARIOUS ERROS DISCUSSED BELOW
 
  My first thought was that I could simply get away with including
 anaconda on
  the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
  Doing so resulted in some strange py4j errors like the following:
 
  Py4JError: An error occurred while calling o17.partitions. Trace:
  py4j.Py4JException: Method partitions([]) does not exist
 
  At some point I also saw:
  SystemError: Objects/cellobject.c:24: bad argument to internal function
 
  which is really strange, possibly the result of a version mismatch?
 
  I had another thought of building spark from master on the AMI, leaving
 the
  spark directory in place, and removing the spark call from the modules
 list
  in spark-ec2 launch script. Unfortunately, this resulted in the following
  errors:
 
  https://gist.github.com/quasiben/da0f4778fbc87d02c088
 
  If a spark dev was willing to make some time in the near future, I'm sure
  she/he and I could sort out these issues and give the Spark community a
  python distro ready to go for numerical computing.  For instance, I'm not
  sure how pyspark calls out to launching a python session on a slave?  Is
  this done as root or as the hadoop user? (i believe i changed
 /etc/bashrc to
  point to my anaconda bin directory so it shouldn't really matter.  Is
 there
  something special about the py4j zip include in spark dir compared with
 the
  py4j in pypi?
 
  Thoughts?
 
  --Ben
 
 



Anaconda Spark AMI

2014-07-03 Thread Benjamin Zaitlen
Hi All,

I'm a dev at Continuum and we are developing a fair amount of tooling around
Spark.  A few days ago someone expressed interest in numpy+pyspark and
Anaconda came up as a reasonable solution.

I spent a number of hours yesterday trying to rework the base Spark AMI on
EC2 but sadly was defeated by a number of errors.

Aggregations seemed to choke -- whereas small takes executed as expected
(errors are linked to the gist):

>>> sc.appName
u'PySparkShell'
>>> sc._conf.getAll()
[(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
(u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
(u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
u'/root/ephemeral-hdfs/conf'), (u'spark.master',
u'spark://.compute-1.amazonaws.com:7077')]
>>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
>>> file.take(2)
[u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
u'']

>>> lines = file.filter(lambda x: len(x) > 0)
>>> lines.count()
VARIOUS ERRORS DISCUSSED BELOW

My first thought was that I could simply get away with including anaconda
on the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
 Doing so resulted in some strange py4j errors like the following:

Py4JError: An error occurred while calling o17.partitions. Trace:
py4j.Py4JException: Method partitions([]) does not exist

At some point I also saw:
SystemError: Objects/cellobject.c:24: bad argument to internal function

which is really strange, possibly the result of a version mismatch?

I had another thought of building spark from master on the AMI, leaving the
spark directory in place, and removing the spark call from the modules list
in the spark-ec2 launch script.  Unfortunately, this resulted in the following
errors:

https://gist.github.com/quasiben/da0f4778fbc87d02c088

If a spark dev was willing to make some time in the near future, I'm sure
she/he and I could sort out these issues and give the Spark community a
python distro ready to go for numerical computing.  For instance, I'm not
sure how pyspark calls out to launching a python session on a slave?  Is
this done as root or as the hadoop user?  (I believe I changed /etc/bashrc
to point to my anaconda bin directory, so it shouldn't really matter.)  Is
there something special about the py4j zip included in the spark dir compared
with the py4j in pypi?

Thoughts?

--Ben
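
(Jey's reply, quoted earlier in this archive, answers the interpreter
question: set PYSPARK_PYTHON in spark/conf/spark-env.sh and restart the
cluster.  A minimal sketch, assuming the Anaconda install path used elsewhere
in these threads:)

# in spark/conf/spark-env.sh on the master and every worker
export PYSPARK_PYTHON=/opt/anaconda/bin/python
# then restart the standalone cluster, e.g. sbin/stop-all.sh followed by
# sbin/start-all.sh, so executors pick up the new interpreter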