Hi Ben,

This is great! I just spun up an EC2 cluster and tested basic pyspark +
ipython/numpy/scipy functionality, and all seems to be working so far. Will let 
you know if any issues arise.

We do a lot with pyspark + scientific computing, and for EC2 usage I think this 
is a terrific way to get the core libraries installed.

-- Jeremy

On Jul 12, 2014, at 4:25 PM, Benjamin Zaitlen <quasi...@gmail.com> wrote:

> Hi All,
> 
> Thanks to Jey's help, I have a release AMI candidate for 
> spark-1.0/anaconda-2.0 integration.  It's currently only available 
> in US-EAST: ami-3ecd0c56
> 
> Give it a try if you have some time.  This should just work with spark 1.0:
> 
> ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa  -a ami-3ecd0c56
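> 
> (For completeness: a full launch invocation looks something like the line
> below; the cluster name and slave count are just placeholders.)
> 
> ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -s 2 -a ami-3ecd0c56 launch my-cluster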
> 
> If you have suggestions or run into trouble, please email me.
> 
> --Ben
> 
> PS:  I found that writing a noop map function is a decent way to install pkgs 
> on worker nodes (though most scientific pkgs are pre-installed with anaconda):
> 
> def subprocess_noop(x):
>     import os
>     # runs on whichever worker this task lands on; --yes skips conda's interactive prompt
>     os.system("/opt/anaconda/bin/conda install --yes h5py")
>     return 1
> 
> install_noop = rdd.map(subprocess_noop)
> install_noop.count()   # count() forces the map to actually execute
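> 
> (One caveat with the above: rdd needs enough partitions that tasks land on
> every worker. Something along these lines, where 100 is just a placeholder,
> usually does the trick:)
> 
> # throwaway RDD with plenty of partitions so the install hits all workers
> rdd = sc.parallelize(range(100), 100)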
> 
> 
> On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam <j...@cs.berkeley.edu> wrote:
> Hi Ben,
> 
> Has the PYSPARK_PYTHON environment variable been set in
> spark/conf/spark-env.sh to the path of the new python binary?
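> 
> For example, if anaconda is installed under /opt/anaconda, spark-env.sh
> would need something along the lines of:
> 
> export PYSPARK_PYTHON=/opt/anaconda/bin/python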
> 
> FYI, there's a /root/copy-dirs script that can be handy when updating
> files on an already-running cluster. You'll want to restart the spark
> cluster for the changes to take effect, as described at
> https://spark.apache.org/docs/latest/ec2-scripts.html
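> 
> As a rough sketch (the exact script names and paths may differ on the AMI):
> 
> /root/copy-dirs /root/spark/conf   # push the updated conf out to the slaves
> /root/spark/sbin/stop-all.sh       # then restart the standalone cluster
> /root/spark/sbin/start-all.sh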
> 
> Hope that helps,
> -Jey
> 
> On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
> > Hi All,
> >
> > I'm a dev at Continuum and we are developing a fair amount of tooling around
> > Spark.  A few days ago someone expressed interest in numpy+pyspark and
> > Anaconda came up as a reasonable solution.
> >
> > I spent a number of hours yesterday trying to rework the base Spark AMI on
> > EC2 but sadly was defeated by a number of errors.
> >
> > Aggregations seemed to choke, whereas small takes executed as expected
> > (errors are linked to the gist):
> >
> >>>> sc.appName
> > u'PySparkShell'
> >>>> sc._conf.getAll()
> > [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
> > (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
> > (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
> > u'/root/ephemeral-hdfs/conf'), (u'spark.master',
> > u'spark://XXXXXXXX.compute-1.amazonaws.com:7077')]
> >>>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
> >>>> file.take(2)
> > [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
> > u'']
> >
> >>>> lines = file.filter(lambda x: len(x) > 0)
> >>>> lines.count()
> > VARIOUS ERRORS DISCUSSED BELOW
> >
> > My first thought was that I could simply get away with including anaconda on
> > the base AMI, pointing the PATH at /dir/anaconda/bin, and baking a new one.
> > Doing so resulted in some strange py4j errors like the following:
> >
> > Py4JError: An error occurred while calling o17.partitions. Trace:
> > py4j.Py4JException: Method partitions([]) does not exist
> >
> > At some point I also saw:
> > SystemError: Objects/cellobject.c:24: bad argument to internal function
> >
> > which is really strange, possibly the result of a version mismatch?
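> >
> > One sanity check here would be to compare the driver's Python against what
> > the workers report, e.g.:
> >
> > import sys
> > print(sys.version)   # driver's python
> > print(sc.parallelize(range(2), 2).map(lambda _: __import__('sys').version).distinct().collect())   # workers' python(s)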
> >
> > I had another thought of building spark from master on the AMI, leaving the
> > spark directory in place, and removing the spark call from the modules list
> > in spark-ec2 launch script. Unfortunately, this resulted in the following
> > errors:
> >
> > https://gist.github.com/quasiben/da0f4778fbc87d02c088
> >
> > If a spark dev were willing to make some time in the near future, I'm sure
> > she/he and I could sort out these issues and give the Spark community a
> > python distro ready to go for numerical computing.  For instance, I'm not
> > sure how pyspark launches a python session on a slave. Is this done as
> > root or as the hadoop user? (I believe I changed /etc/bashrc to point to my
> > anaconda bin directory, so it shouldn't really matter.)  Is there anything
> > special about the py4j zip included in the spark dir compared with the
> > py4j on PyPI?
> >
> > Thoughts?
> >
> > --Ben
> >
> >
> 
