[
https://issues.apache.org/jira/browse/BIGTOP-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224923#comment-14224923
]
jay vyas edited comment on BIGTOP-1546 at 11/25/14 6:07 PM:
------------------------------------------------------------
[~manikandan.n]
1) Thanks for finding this. I wonder whether we should also add a smoke test
that confirms the PYSPARK_PYTHON environment variable is not overwritten when
we run {{pyspark .....}}? Just thinking out loud; it might not be a good idea,
but some kind of test would be great (a rough sketch follows below). Otherwise,
I can look into it more later.
2) This is an interesting scenario. As a side question: *Can I also ask why you
have a custom Python build? Can you just install numpy on different nodes in
the cluster...?*
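A rough sketch of what such a smoke test could look like, in shell; the
wrapper path {{/usr/bin/pyspark}} and the exact offending line are assumptions
taken from the report below:
{code}
#!/bin/bash
# Sketch of a smoke check: fail if the packaged pyspark wrapper
# unconditionally clobbers a user-supplied PYSPARK_PYTHON.
# The path and the grep pattern are assumptions from the report.
WRAPPER=$(readlink -f /usr/bin/pyspark)
if grep -q '^PYSPARK_PYTHON="python"' "$WRAPPER"; then
  echo "FAIL: $WRAPPER overrides PYSPARK_PYTHON unconditionally"
  exit 1
fi
echo "OK: PYSPARK_PYTHON is left alone"
{code}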
> The pyspark command, by default, points to a script that contains a bug
> -----------------------------------------------------------------------
>
> Key: BIGTOP-1546
> URL: https://issues.apache.org/jira/browse/BIGTOP-1546
> Project: Bigtop
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Joao Salcedo
> Fix For: 0.9.0
>
> Attachments: BIGTOP-1546.patch
>
>
> First, I want to point out that I am not using the os default python on my
> client side:
> $ which python
> ~/work/anaconda/bin/python
> This is my own build of python which includes all the numeric python
> libraries. Now let me show where pyspark points:
> $ which pyspark
> /usr/bin/pyspark
> This is a symlink:
> $ ls -l /usr/bin/ | grep pyspark
> lrwxrwxrwx 1 root root 25 Oct 26 10:58 pyspark -> /etc/alternatives/pyspark
> $ ls -l /etc/alternatives/ | grep pyspark
> lrwxrwxrwx 1 root root 60 Oct 26 10:58 pyspark ->
> /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/bin/pyspark
> which, if you follow it, leads to the file I am claiming is buggy.
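> The whole chain can be resolved in one step (assuming GNU readlink is
> available):
> {code}
> $ readlink -f "$(which pyspark)"
> /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/bin/pyspark
> {code}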
> Now let me show you the effect this setup has on pyspark:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> sc.parallelize([1, 2, 3]).count()
> <snip>
> 14/11/18 09:44:17 INFO SparkContext: Starting job: count at <stdin>:1
> 14/11/18 09:44:17 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2
> output partitions (allowLocal=false)
> 14/11/18 09:44:17 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
> 14/11/18 09:44:17 INFO DAGScheduler: Parents of final stage: List()
> 14/11/18 09:44:17 INFO DAGScheduler: Missing parents: List()
> 14/11/18 09:44:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD
> at PythonRDD.scala:40), which has no missing parents
> 14/11/18 09:44:17 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0
> (PythonRDD[1] at RDD at PythonRDD.scala:40)
> 14/11/18 09:44:17 INFO YarnClientClusterScheduler: Adding task set 0.0 with 2
> tasks
> 14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on
> executor 2: lxe0389.allstate.com (PROCESS_LOCAL)
> 14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:0 as 2604 bytes in
> 3 ms
> 14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on
> executor 1: lxe0553.allstate.com (PROCESS_LOCAL)
> 14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:1 as 2619 bytes in
> 1 ms
> 14/11/18 09:44:19 INFO RackResolver: Resolved lxe0389.allstate.com to
> /ro/rack18
> 14/11/18 09:44:19 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/11/18 09:44:19 WARN TaskSetManager: Loss was due to
> org.apache.spark.api.python.PythonException
> org.apache.spark.api.python.PythonException: Traceback (most recent call
> last):
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
> line 77, in main
> serializer.dump_stream(func(split_index, iterator), outfile)
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
> line 191, in dump_stream
> self.serializer.dump_stream(self._batched(iterator), stream)
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
> line 123, in dump_stream
> for obj in iterator:
> File
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
> line 180, in _batched
> for item in iterator:
> File
> "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py",
> line 613, in func
> if acc is None:
> TypeError: an integer is required
> Ok, there's the trace. As I said, this is, to me, not that illuminating as to
> what is actually going on. Let me try to explain. Notice that, when I first
> boot up pyspark, the client python is my own personal installation; you can
> see this from the boot-up message the python interpreter gives. On the other
> hand, the workers are NOT using this python; they default to the system
> python. This is what the line of code from my prior email is about:
> PYSPARK_PYTHON="python"
> This forces the worker pythons to use the OS python at /usr/bin/python. Mine
> is py2.7 and the OS's is py2.6. I suspect there is an incompatibility when
> passing messages between the two interpreters with pickle (python object
> serialization) that causes the above error. For example, switching the python
> used by the client back to the OS python fixes this issue.
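> One way to see which interpreter the workers actually pick up is to ask them
> directly from the shell; this is just an illustrative check, not part of the
> fix:
> {code}
> $ pyspark --master yarn
> >>> import sys
> >>> sys.version   # the driver's interpreter
> >>> sc.parallelize([1, 2]).map(lambda x: __import__('sys').version).collect()   # the workers'
> {code}
> With the buggy wrapper, the driver reports my 2.7 build while the workers
> report the OS's 2.6.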
> pyspark provides two environment variables so that the end user can
> comfortably change which interpreter is used by the client and the workers:
> PYSPARK_PYTHON and SPARK_YARN_USER_ENV. The following setup should
> allow me to use my own install of the interpreter:
> $ export PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7
> $ export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7"
> But unfortunately, this does not work and gives the same error as before. The
> reason is the line of code I pointed to before in /usr/bin/pyspark, which
> forcefully overrides my choice of PYSPARK_PYTHON at runtime. This is
> evidenced by the fact that the following alias:
> $ alias pyspark=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/bin/pyspark
> immediately fixes the issue:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> sc.parallelize([1, 2, 3]).count()
> <snip>
> 3
> and also allows me to do fun numpy things like:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> import numpy as np
> >>> sc.parallelize([np.array([1, 2, 3]), np.array([4, 5, 6])]).map(np.sum).sum()
> <snip>
> 21
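> For what it's worth, the usual fix for this kind of clobbering is to turn the
> wrapper's unconditional assignment into a default; a minimal sketch (I have
> not verified that this is exactly what the attached patch does):
> {code}
> # Only fall back to the system python when the user has not set one.
> PYSPARK_PYTHON=${PYSPARK_PYTHON:-python}
> export PYSPARK_PYTHON
> {code}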