Hi All,

I'm a dev at Continuum and we are developing a fair amount of tooling around
Spark.  A few days ago someone expressed interest in numpy+pyspark, and
Anaconda came up as a reasonable solution.

I spent a number of hours yesterday trying to rework the base Spark AMI on
EC2 but was sadly defeated by a series of errors.

Aggregations seemed to choke, whereas small takes executed as expected
(errors are linked in the gist below):

>>> sc.appName
u'PySparkShell'
>>> sc._conf.getAll()
[(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
(u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
(u'spark.app.name', u'PySparkShell'),
(u'spark.executor.extraClassPath', u'/root/ephemeral-hdfs/conf'),
(u'spark.master', u'spark://XXXXXXXX.compute-1.amazonaws.com:7077')]
>>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
>>> file.take(2)
[u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
u'']

>>> lines = file.filter(lambda x: len(x) > 0)
>>> lines.count()
VARIOUS ERRORS DISCUSSED BELOW
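
(To separate a broken Python environment on the workers from anything
HDFS-specific, a quick in-memory sanity check like the one below might be
useful -- just a sketch, the placeholder strings aren't from the real data:)

>>> # same filter/count pipeline, but over a tiny in-memory RDD instead of the HDFS file
>>> tiny = sc.parallelize([u"one line", u"", u"another line"])
>>> tiny.filter(lambda x: len(x) > 0).count()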

My first thought was that I could simply get away with including Anaconda
on the base AMI, pointing the PATH at /dir/anaconda/bin, and baking a new
one.  Doing so resulted in some strange py4j errors like the following:

Py4JError: An error occurred while calling o17.partitions. Trace:
py4j.Py4JException: Method partitions([]) does not exist

At some point I also saw:
SystemError: Objects/cellobject.c:24: bad argument to internal function

which is really strange, possibly the result of a version mismatch?
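
If it is a mismatch, maybe a check like the one below would narrow it down:
which interpreter and py4j the driver shell picked up, and which interpreter
the executors actually spawn (just a sketch; the small parallelize is purely
illustrative):

>>> import sys, py4j
>>> sys.executable, sys.version.split()[0]   # interpreter/version the driver shell is running
>>> py4j.__file__                            # bundled zip under $SPARK_HOME/python/lib vs. a conda/pip install
>>> def worker_env(_):
...     import sys
...     return (sys.executable, sys.version.split()[0])
...
>>> # ask a handful of executor tasks which interpreter they spawn
>>> sc.parallelize(range(4), 4).map(worker_env).distinct().collect()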

I had another thought: build Spark from master on the AMI, leave the Spark
directory in place, and remove the spark entry from the modules list in the
spark-ec2 launch script. Unfortunately, this resulted in the following
errors:

https://gist.github.com/quasiben/da0f4778fbc87d02c088

If a Spark dev is willing to make some time in the near future, I'm sure
she/he and I could sort out these issues and give the Spark community a
Python distro ready to go for numerical computing.  For instance, I'm not
sure how PySpark launches a Python session on a slave.  Is this done as
root or as the hadoop user?  (I believe I changed /etc/bashrc to point to
my Anaconda bin directory, so it shouldn't really matter.)  Is there
something special about the py4j zip included in the Spark dir compared
with the py4j on PyPI?
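
(One thought: the executors could probably answer the user question
themselves -- something like the sketch below should report which user the
Python workers run as, and whether the PATH edit actually reaches them; the
small parallelize is just for illustration:)

>>> def who_and_path(_):
...     import getpass, os
...     return (getpass.getuser(), os.environ.get("PATH", ""))
...
>>> # which user the python workers run as, and what PATH they see
>>> sc.parallelize(range(4), 4).map(who_and_path).distinct().collect()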

Thoughts?

--Ben
