[
https://issues.apache.org/jira/browse/BIGTOP-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225091#comment-14225091
]
Matthew Drury commented on BIGTOP-1546:
---------------------------------------
I'm the original reporter (aka mdrus), so I can answer 2).
I work as a data scientist in a large organization, and the cluster I have
available is an enterprise wide resource. As such, controls over installing
software on all the nodes as root are extremely tight. Consequently, it's very
useful for me to be able to build software in my home directory, and be able to
hack cluster software configuration to point at my personal builds. We already
take advantage of this heavily with map-reduce and virtualenv setups; this came
out of my effort to have the same freedom with spark.
> The pyspark command, by default, points to a script that contains a bug
> -----------------------------------------------------------------------
>
> Key: BIGTOP-1546
> URL: https://issues.apache.org/jira/browse/BIGTOP-1546
> Project: Bigtop
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Joao Salcedo
> Fix For: 0.9.0
>
> Attachments: BIGTOP-1546.patch
>
>
> First, I want to point out that I am not using the os default python on my
> client side:
> $ which python
> ~/work/anaconda/bin/python
> This is my own build of python which includes all the numeric python
> libraries. Now let me show where pyspark points:
> $ which pyspark
> /usr/bin/pyspark
> This is a symlink:
> $ ls -l /usr/bin/ | grep pyspark
> lrwxrwxrwx 1 root root 25 Oct 26 10:58 pyspark -> /etc/alternatives/pyspark
> $ ls -l /etc/alternatives/ | grep pyspark
> lrwxrwxrwx 1 root root 60 Oct 26 10:58 pyspark ->
> /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/bin/pyspark
> which, if you follow it, links to the file I am claiming is buggy.
> Now let me show you the effect this setup has on pyspark:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> sc.parallelize([1, 2, 3]).count()
> <snip>
> 14/11/18 09:44:17 INFO SparkContext: Starting job: count at <stdin>:1
> 14/11/18 09:44:17 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2
> output partitions (allowLocal=false)
> 14/11/18 09:44:17 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
> 14/11/18 09:44:17 INFO DAGScheduler: Parents of final stage: List()
> 14/11/18 09:44:17 INFO DAGScheduler: Missing parents: List()
> 14/11/18 09:44:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD
> at PythonRDD.scala:40), which has no missing parents
> 14/11/18 09:44:17 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0
> (PythonRDD[1] at RDD at PythonRDD.scala:40)
> 14/11/18 09:44:17 INFO YarnClientClusterScheduler: Adding task set 0.0 with 2
> tasks
> 14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
> executor 2: lxe0389.allstate.com (PROCESS_LOCAL)
> 14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:0 as 2604 bytes in
> 3 ms
> 14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
> executor 1: lxe0553.allstate.com (PROCESS_LOCAL)
> 14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:1 as 2619 bytes in
> 1 ms
> 14/11/18 09:44:19 INFO RackResolver: Resolved lxe0389.allstate.com to
> /ro/rack18
> 14/11/18 09:44:19 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/11/18 09:44:19 WARN TaskSetManager: Loss was due to
> org.apache.spark.api.python.PythonException
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
> line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
> line 191, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
> line 123, in dump_stream
> for obj in iterator:
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
> line 180, in _batched
> for item in iterator:
> File
> "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py",
> line 613, in func
> if acc is None:
> TypeError: an integer is required
> OK, there's the trace. As I said, this is, to me, not that illuminating as to
> what is actually going on. Let me try to explain. Notice that, when I first
> boot up pyspark, the client python is my own personal installation; you can
> see this from the boot-up message the python interpreter gives. On the other
> hand, the workers are NOT using this python; they default to the system
> python. This is what the line of code from my prior email is about:
> PYSPARK_PYTHON="python"
> This forces the worker pythons to use the OS python in /usr/bin/python. Mine
> is py2.7 and the OS's is py2.6. I suspect there is an incompatibility when
> passing messages between the two interpreters with pickle (python object
> serialization) that causes the above error. For example, switching the python
> used by the client back to the OS python fixes this issue.
> pyspark provides two environment variables so that the end user can
> comfortably change which interpreter is used by the client and the workers:
> PYSPARK_PYTHON and SPARK_YARN_USER_ENV. The following setup should
> allow me to use my own install of the interpreter:
> $ export PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7
> $ export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7"
> But unfortunately, this does not work, and will give the same error as
> before. The reason is the line of code I pointed to before in
> /usr/bin/pyspark, which forcefully overrides my choice of PYSPARK_PYTHON at
> runtime. This is evidenced by the fact that the following:
> $ alias
> pyspark=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/bin/pyspark
> immediately fixes the issue:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> sc.parallelize([1, 2, 3]).count()
> <snip>
> 3
> and also allows me to do fun numpy things like:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> import numpy as np
> >>> sc.parallelize([np.array([1, 2, 3]), np.array([4, 5, 6])]).map(np.sum).sum()
> <snip>
> 21
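For reference, the root cause described above is a wrapper that assigns PYSPARK_PYTHON unconditionally. One common way for a wrapper script to set a default without clobbering a value the user has already exported is the shell default-if-unset idiom. This is a hypothetical sketch of that pattern, not the actual Bigtop patch:

```shell
#!/bin/sh
# Hypothetical wrapper fragment: fall back to the system interpreter
# only when the caller has NOT already chosen one via PYSPARK_PYTHON.
# (The buggy wrapper instead does PYSPARK_PYTHON="python" unconditionally.)
export PYSPARK_PYTHON="${PYSPARK_PYTHON:-python}"
echo "$PYSPARK_PYTHON"
```

With PYSPARK_PYTHON unset this prints "python"; with it pre-exported (e.g. to a user's Anaconda interpreter) the user's value is preserved, which is the behavior the reporter is asking for.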
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)