Hi Ben,

This is great! I just spun up an EC2 cluster and tested basic pyspark + ipython/numpy/scipy functionality, and all seems to be working so far. Will let you know if any issues arise.
We do a lot with pyspark + scientific computing, and for EC2 usage I think this is a terrific way to get the core libraries installed.

-- Jeremy

On Jul 12, 2014, at 4:25 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:

> Hi All,
>
> Thanks to Jey's help, I have a release AMI candidate for
> spark-1.0/anaconda-2.0 integration. It's currently limited to availability
> in US-EAST: ami-3ecd0c56
>
> Give it a try if you have some time. This should just work with spark 1.0:
>
> ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -a ami-3ecd0c56
>
> If you have suggestions or run into trouble please email,
>
> --Ben
>
> PS: I found that writing a noop map function is a decent way to install pkgs
> on worker nodes (though most scientific pkgs are pre-installed with anaconda):
>
> def subprocess_noop(x):
>     import os
>     os.system("/opt/anaconda/bin/conda install h5py")
>     return 1
>
> install_noop = rdd.map(subprocess_noop)
> install_noop.count()
>
>
> On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
> Hi Ben,
>
> Has the PYSPARK_PYTHON environment variable been set in
> spark/conf/spark-env.sh to the path of the new python binary?
>
> FYI, there's a /root/copy-dirs script that can be handy when updating
> files on an already-running cluster. You'll want to restart the spark
> cluster for the changes to take effect, as described at
> https://spark.apache.org/docs/latest/ec2-scripts.html
>
> Hope that helps,
> -Jey
>
> On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
> > Hi All,
> >
> > I'm a dev at Continuum and we are developing a fair amount of tooling around
> > Spark. A few days ago someone expressed interest in numpy+pyspark and
> > Anaconda came up as a reasonable solution.
> >
> > I spent a number of hours yesterday trying to rework the base Spark AMI on
> > EC2 but sadly was defeated by a number of errors.
> >
> > Aggregations seemed to choke -- whereas small takes executed as expected
> > (errors are linked to the gist):
> >
> > >>> sc.appName
> > u'PySparkShell'
> > >>> sc._conf.getAll()
> > [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
> > (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
> > (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
> > u'/root/ephemeral-hdfs/conf'), (u'spark.master',
> > u'spark://XXXXXXXX.compute-1.amazonaws.com:7077')]
> > >>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
> > >>> file.take(2)
> > [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
> > u'']
> >
> > >>> lines = file.filter(lambda x: len(x) > 0)
> > >>> lines.count()
> > VARIOUS ERRORS DISCUSSED BELOW
> >
> > My first thought was that I could simply get away with including anaconda on
> > the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
> > Doing so resulted in some strange py4j errors like the following:
> >
> > Py4JError: An error occurred while calling o17.partitions. Trace:
> > py4j.Py4JException: Method partitions([]) does not exist
> >
> > At some point I also saw:
> >
> > SystemError: Objects/cellobject.c:24: bad argument to internal function
> >
> > which is really strange, possibly the result of a version mismatch?
> >
> > I had another thought of building spark from master on the AMI, leaving the
> > spark directory in place, and removing the spark call from the modules list
> > in the spark-ec2 launch script. Unfortunately, this resulted in the following
> > errors:
> >
> > https://gist.github.com/quasiben/da0f4778fbc87d02c088
> >
> > If a spark dev was willing to make some time in the near future, I'm sure
> > she/he and I could sort out these issues and give the Spark community a
> > python distro ready to go for numerical computing. For instance, I'm not
> > sure how pyspark calls out to launching a python session on a slave?
> > Is this done as root or as the hadoop user? (I believe I changed /etc/bashrc to
> > point to my anaconda bin directory, so it shouldn't really matter.) Is there
> > something special about the py4j zip included in the spark dir compared with the
> > py4j on PyPI?
> >
> > Thoughts?
> >
> > --Ben
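
One way to test the version-mismatch theory raised in the thread is to ask each worker which Python interpreter it is actually running and compare that with the driver. The sketch below is only an illustration under stated assumptions: the helper names `describe_interpreter` and `find_mismatches` are hypothetical (they are not part of Spark or of the AMI), and the Spark calls appear only in comments because they need a live `sc` from a pyspark shell.

```python
import platform
import socket
import sys


def describe_interpreter(_=None):
    """Return (hostname, interpreter path, Python version) for this process.

    Hypothetical helper: the ignored argument lets it be passed to rdd.map().
    """
    return (socket.gethostname(), sys.executable, platform.python_version())


def find_mismatches(driver_info, worker_infos):
    """Return the distinct worker reports whose Python version differs
    from the driver's version (third element of each tuple)."""
    driver_version = driver_info[2]
    return sorted({w for w in worker_infos if w[2] != driver_version})


# Assumed usage inside a pyspark shell, where `sc` already exists:
#
#   workers = (sc.parallelize(range(100), 20)
#                .map(describe_interpreter)
#                .distinct()
#                .collect())
#   print(find_mismatches(describe_interpreter(), workers))
#
# An empty list would suggest the workers run the same Python version as the
# driver; any entries show which host and interpreter path differ.
```

If the workers report an interpreter other than the Anaconda one, that would point back to Jey's question about whether PYSPARK_PYTHON is set in spark/conf/spark-env.sh on every node.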