Re: Anaconda Spark AMI

2014-07-13 Thread Jeremy Freeman
Hi Ben,

This is great! I just spun up an EC2 cluster and tested basic pyspark +
ipython/numpy/scipy functionality, and all seems to be working so far. Will let
you know if any issues arise.

We do a lot with pyspark + scientific computing, and for EC2 usage I think this 
is a terrific way to get the core libraries installed.

-- Jeremy

On Jul 12, 2014, at 4:25 PM, Benjamin Zaitlen quasi...@gmail.com wrote:

 Hi All,
 
 Thanks to Jey's help, I have a release AMI candidate for 
 spark-1.0/anaconda-2.0 integration.  It's currently limited to availability 
 in US-EAST: ami-3ecd0c56
 
 Give it a try if you have some time.  This should just work with spark 1.0:
 
 ./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa  -a ami-3ecd0c56
 
 If you have suggestions or run into trouble, please email me.
 
 --Ben
 
 PS: I found that writing a noop map function is a decent way to install pkgs
 on worker nodes (though most scientific pkgs are pre-installed with Anaconda):
 
 def subprocess_noop(x):
     import os
     os.system("/opt/anaconda/bin/conda install h5py")
     return 1
 
 install_noop = rdd.map(subprocess_noop)
 install_noop.count()
 
 
 On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam j...@cs.berkeley.edu wrote:
 Hi Ben,
 
 Has the PYSPARK_PYTHON environment variable been set in
 spark/conf/spark-env.sh to the path of the new python binary?
 
 FYI, there's a /root/copy-dirs script that can be handy when updating
 files on an already-running cluster. You'll want to restart the spark
 cluster for the changes to take effect, as described at
 https://spark.apache.org/docs/latest/ec2-scripts.html
 
 Hope that helps,
 -Jey
 
 On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com wrote:
  Hi All,
 
  I'm a dev at Continuum and we are developing a fair amount of tooling around
  Spark.  A few days ago someone expressed interest in numpy+pyspark and
  Anaconda came up as a reasonable solution.
 
  I spent a number of hours yesterday trying to rework the base Spark AMI on
  EC2 but sadly was defeated by a number of errors.
 
  Aggregations seemed to choke -- whereas small takes executed as expected
  (errors are in the linked gist):
 
  sc.appName
  u'PySparkShell'
  sc._conf.getAll()
  [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
  (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
  (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
  u'/root/ephemeral-hdfs/conf'), (u'spark.master',
  u'spark://.compute-1.amazonaws.com:7077')]
  file = sc.textFile("hdfs:///user/root/chekhov.txt")
  file.take(2)
  [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
  u'']

  lines = file.filter(lambda x: len(x) > 0)
  lines.count()
  VARIOUS ERRORS DISCUSSED BELOW
 
  My first thought was that I could simply get away with including anaconda on
  the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
  Doing so resulted in some strange py4j errors like the following:
 
  Py4JError: An error occurred while calling o17.partitions. Trace:
  py4j.Py4JException: Method partitions([]) does not exist
 
  At some point I also saw:
  SystemError: Objects/cellobject.c:24: bad argument to internal function
 
  which is really strange, possibly the result of a version mismatch?
 
  I had another thought of building spark from master on the AMI, leaving the
  spark directory in place, and removing the spark call from the modules list
  in the spark-ec2 launch script. Unfortunately, this resulted in the following
  errors:
 
  https://gist.github.com/quasiben/da0f4778fbc87d02c088
 
  If a spark dev were willing to make some time in the near future, I'm sure
  she/he and I could sort out these issues and give the Spark community a
  python distro ready to go for numerical computing.  For instance, I'm not
  sure how pyspark launches a python session on a slave. Is this done as root
  or as the hadoop user? (I believe I changed /etc/bashrc to point to my
  anaconda bin directory, so it shouldn't really matter.) Is there something
  special about the py4j zip included in the spark dir compared with the py4j
  on pypi?
 
  Thoughts?
 
  --Ben
 
 
 



Re: Anaconda Spark AMI

2014-07-12 Thread Benjamin Zaitlen
Hi All,

Thanks to Jey's help, I have a release AMI candidate for
spark-1.0/anaconda-2.0 integration.  It's currently limited to availability
in US-EAST: ami-3ecd0c56

Give it a try if you have some time.  This should *just work* with spark 1.0:

./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa  -a ami-3ecd0c56
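
For anyone who hasn't driven spark-ec2 before, a full launch invocation would
look something like the line below (the slave count and cluster name are
placeholders of mine, not part of the announcement):

./spark-ec2 -k my_key -i ~/.ssh/mykey.rsa -s 2 -a ami-3ecd0c56 launch anaconda-test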

If you have suggestions or run into trouble, please email me.

--Ben

PS: I found that writing a noop map function is a decent way to install
pkgs on worker nodes (though most scientific pkgs are pre-installed with
Anaconda):

def subprocess_noop(x):
    import os
    os.system("/opt/anaconda/bin/conda install h5py")
    return 1

install_noop = rdd.map(subprocess_noop)
install_noop.count()
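
A small variation on the same trick may be a bit more robust (just a sketch,
not part of the AMI: it assumes conda lives at /opt/anaconda on every worker,
and that a dummy RDD with plenty of partitions gets scheduled across all the
nodes). mapPartitions runs the install once per partition instead of once per
record, and --yes keeps conda from blocking on an interactive prompt:

def install_h5py(partition):
    # assumes /opt/anaconda/bin/conda exists on every worker node
    import subprocess
    subprocess.check_call(["/opt/anaconda/bin/conda", "install", "--yes", "h5py"])
    return [1]

# use more partitions than total worker cores so every node is likely hit
sc.parallelize(range(100), 100).mapPartitions(install_h5py).count()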


On Thu, Jul 3, 2014 at 2:32 PM, Jey Kottalam j...@cs.berkeley.edu wrote:

 Hi Ben,

 Has the PYSPARK_PYTHON environment variable been set in
 spark/conf/spark-env.sh to the path of the new python binary?

 FYI, there's a /root/copy-dirs script that can be handy when updating
 files on an already-running cluster. You'll want to restart the spark
 cluster for the changes to take effect, as described at
 https://spark.apache.org/docs/latest/ec2-scripts.html

 Hope that helps,
 -Jey

 On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com
 wrote:
  Hi All,
 
 I'm a dev at Continuum and we are developing a fair amount of tooling
 around
  Spark.  A few days ago someone expressed interest in numpy+pyspark and
  Anaconda came up as a reasonable solution.
 
  I spent a number of hours yesterday trying to rework the base Spark AMI
 on
  EC2 but sadly was defeated by a number of errors.
 
 Aggregations seemed to choke -- whereas small takes executed as expected
 (errors are in the linked gist):
 
 sc.appName
 u'PySparkShell'
 sc._conf.getAll()
 [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
 (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
 (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
 u'/root/ephemeral-hdfs/conf'), (u'spark.master',
 u'spark://.compute-1.amazonaws.com:7077')]
 file = sc.textFile("hdfs:///user/root/chekhov.txt")
 file.take(2)
 [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
 u'']

 lines = file.filter(lambda x: len(x) > 0)
 lines.count()
 VARIOUS ERRORS DISCUSSED BELOW
 
  My first thought was that I could simply get away with including
 anaconda on
  the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
  Doing so resulted in some strange py4j errors like the following:
 
  Py4JError: An error occurred while calling o17.partitions. Trace:
  py4j.Py4JException: Method partitions([]) does not exist
 
  At some point I also saw:
  SystemError: Objects/cellobject.c:24: bad argument to internal function
 
  which is really strange, possibly the result of a version mismatch?
 
  I had another thought of building spark from master on the AMI, leaving
 the
 spark directory in place, and removing the spark call from the modules list
 in the spark-ec2 launch script. Unfortunately, this resulted in the following
  errors:
 
  https://gist.github.com/quasiben/da0f4778fbc87d02c088
 
 If a spark dev were willing to make some time in the near future, I'm sure
 she/he and I could sort out these issues and give the Spark community a
 python distro ready to go for numerical computing.  For instance, I'm not
 sure how pyspark launches a python session on a slave. Is this done as root
 or as the hadoop user? (I believe I changed /etc/bashrc to point to my
 anaconda bin directory, so it shouldn't really matter.) Is there something
 special about the py4j zip included in the spark dir compared with the py4j
 on pypi?
 
  Thoughts?
 
  --Ben
 
 



Re: Anaconda Spark AMI

2014-07-03 Thread Jey Kottalam
Hi Ben,

Has the PYSPARK_PYTHON environment variable been set in
spark/conf/spark-env.sh to the path of the new python binary?

FYI, there's a /root/copy-dirs script that can be handy when updating
files on an already-running cluster. You'll want to restart the spark
cluster for the changes to take effect, as described at
https://spark.apache.org/docs/latest/ec2-scripts.html

Hope that helps,
-Jey
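
Concretely, that sequence would look something like the following on the
master node (the Anaconda path and the exact copy-dirs invocation are
assumptions here -- check the script's usage -- but PYSPARK_PYTHON in
spark-env.sh is the key piece):

# point executors at the Anaconda interpreter (path is an assumption)
echo 'export PYSPARK_PYTHON=/opt/anaconda/bin/python' >> /root/spark/conf/spark-env.sh

# push the updated conf out to the slaves (invocation is a guess; see the script's help)
/root/copy-dirs /root/spark/conf

# restart Spark so the executors pick up the new interpreter
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh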

On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen quasi...@gmail.com wrote:
 Hi All,

 I'm a dev at Continuum and we are developing a fair amount of tooling around
 Spark.  A few days ago someone expressed interest in numpy+pyspark and
 Anaconda came up as a reasonable solution.

 I spent a number of hours yesterday trying to rework the base Spark AMI on
 EC2 but sadly was defeated by a number of errors.

 Aggregations seemed to choke -- whereas small takes executed as expected
 (errors are in the linked gist):

 sc.appName
 u'PySparkShell'
 sc._conf.getAll()
 [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
 (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
 (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
 u'/root/ephemeral-hdfs/conf'), (u'spark.master',
 u'spark://.compute-1.amazonaws.com:7077')]
 file = sc.textFile("hdfs:///user/root/chekhov.txt")
 file.take(2)
 [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
 u'']

 lines = file.filter(lambda x: len(x) > 0)
 lines.count()
 VARIOUS ERRORS DISCUSSED BELOW

 My first thought was that I could simply get away with including anaconda on
 the base AMI, point the path at /dir/anaconda/bin, and bake a new one.
 Doing so resulted in some strange py4j errors like the following:

 Py4JError: An error occurred while calling o17.partitions. Trace:
 py4j.Py4JException: Method partitions([]) does not exist

 At some point I also saw:
 SystemError: Objects/cellobject.c:24: bad argument to internal function

 which is really strange, possibly the result of a version mismatch?

 I had another thought of building spark from master on the AMI, leaving the
 spark directory in place, and removing the spark call from the modules list
 in the spark-ec2 launch script. Unfortunately, this resulted in the following
 errors:

 https://gist.github.com/quasiben/da0f4778fbc87d02c088

 If a spark dev were willing to make some time in the near future, I'm sure
 she/he and I could sort out these issues and give the Spark community a
 python distro ready to go for numerical computing.  For instance, I'm not
 sure how pyspark launches a python session on a slave. Is this done as root
 or as the hadoop user? (I believe I changed /etc/bashrc to point to my
 anaconda bin directory, so it shouldn't really matter.) Is there something
 special about the py4j zip included in the spark dir compared with the py4j
 on pypi?

 Thoughts?

 --Ben