Hi Ben,

Has the PYSPARK_PYTHON environment variable been set in
spark/conf/spark-env.sh to the path of the new python binary?
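If not, that's the first thing I'd check. Here's a minimal sketch of the
relevant line, assuming Anaconda is installed under /root/anaconda on every
node (substitute your actual install prefix):

    # spark/conf/spark-env.sh
    # Make both the driver and the workers use the Anaconda interpreter.
    # NOTE: /root/anaconda is an assumed install location.
    export PYSPARK_PYTHON=/root/anaconda/bin/python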
FYI, there's a /root/copy-dirs script that can be handy when updating files
on an already-running cluster. You'll want to restart the Spark cluster for
the changes to take effect, as described at
https://spark.apache.org/docs/latest/ec2-scripts.html (there's a sketch of
the commands in the P.S. below).

Hope that helps,
-Jey

On Thu, Jul 3, 2014 at 11:54 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:
> Hi All,
>
> I'm a dev at Continuum and we are developing a fair amount of tooling
> around Spark. A few days ago someone expressed interest in numpy+pyspark,
> and Anaconda came up as a reasonable solution.
>
> I spent a number of hours yesterday trying to rework the base Spark AMI
> on EC2 but was sadly defeated by a number of errors.
>
> Aggregations seemed to choke, whereas small takes executed as expected
> (errors are linked in the gist below):
>
> >>> sc.appName
> u'PySparkShell'
> >>> sc._conf.getAll()
> [(u'spark.executor.extraLibraryPath', u'/root/ephemeral-hdfs/lib/native/'),
> (u'spark.executor.memory', u'6154m'), (u'spark.submit.pyFiles', u''),
> (u'spark.app.name', u'PySparkShell'), (u'spark.executor.extraClassPath',
> u'/root/ephemeral-hdfs/conf'), (u'spark.master',
> u'spark://XXXXXXXX.compute-1.amazonaws.com:7077')]
> >>> file = sc.textFile("hdfs:///user/root/chekhov.txt")
> >>> file.take(2)
> [u"Project Gutenberg's Plays by Chekhov, Second Series, by Anton Chekhov",
> u'']
>
> >>> lines = file.filter(lambda x: len(x) > 0)
> >>> lines.count()
> VARIOUS ERRORS DISCUSSED BELOW
>
> My first thought was that I could simply get away with including Anaconda
> on the base AMI, pointing the PATH at /dir/anaconda/bin, and baking a new
> one. Doing so resulted in some strange py4j errors like the following:
>
> Py4JError: An error occurred while calling o17.partitions. Trace:
> py4j.Py4JException: Method partitions([]) does not exist
>
> At some point I also saw:
>
> SystemError: Objects/cellobject.c:24: bad argument to internal function
>
> which is really strange, and possibly the result of a version mismatch?
>
> I had another thought: building Spark from master on the AMI, leaving the
> Spark directory in place, and removing the spark call from the modules
> list in the spark-ec2 launch script. Unfortunately, this resulted in the
> following errors:
>
> https://gist.github.com/quasiben/da0f4778fbc87d02c088
>
> If a Spark dev were willing to make some time in the near future, I'm sure
> she/he and I could sort out these issues and give the Spark community a
> Python distro ready to go for numerical computing. For instance, I'm not
> sure how pyspark launches a Python session on a slave. Is this done as
> root or as the hadoop user? (I believe I changed /etc/bashrc to point to
> my anaconda bin directory, so it shouldn't really matter.) Is there
> something special about the py4j zip included in the spark dir compared
> with the py4j on PyPI?
>
> Thoughts?
>
> --Ben
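P.S. Here's a sketch of the update cycle I had in mind above, assuming the
stock spark-ec2 layout under /root. I believe copy-dirs takes the directory
to sync as its argument, and the start/stop scripts live under bin/ or sbin/
depending on the Spark version, so adjust to match your AMI:

    # On the master, after editing spark/conf/spark-env.sh:
    /root/copy-dirs /root/spark/conf    # push the conf dir out to the slaves

    # Restart the standalone cluster so executors pick up the new setting:
    /root/spark/sbin/stop-all.sh
    /root/spark/sbin/start-all.sh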
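P.P.S. Regarding the SystemError and the version-mismatch theory: a quick way
to see whether the driver and the executors are running different interpreters
is to ship a probe out with map(). This is just a diagnostic sketch, run from
the pyspark shell where sc is already defined:

    import sys

    # Interpreter the driver is running:
    print sys.executable, sys.version

    # Interpreters the executors are running (the lambda runs on the workers):
    versions = sc.parallelize(range(4), 4) \
                 .map(lambda _: (sys.executable, sys.version)) \
                 .distinct() \
                 .collect()
    print versions

If the two disagree (e.g. Anaconda's 2.7 on the driver vs. the system Python
on the workers), that would explain both the cellobject.c SystemError and why
takes succeed while aggregations choke.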