[ https://issues.apache.org/jira/browse/BIGTOP-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224923#comment-14224923 ]

jay vyas commented on BIGTOP-1546:
----------------------------------

[~manikandan.n]

1) Thanks for finding this. I wonder, should we also add a smoke test for this, 
one which confirms that the user's PYSPARK_PYTHON is not overwritten when we run 
{{pyspark .....}}? Just thinking out loud. It might not be a good idea, but some 
kind of test would be great. Otherwise, I can look more into it later.
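Something along these lines, maybe (completely untested sketch, names are made 
up; the idea is just to point PYSPARK_PYTHON at a stub interpreter that prints a 
marker, and fail if the wrapper silently swaps in the system python):

{code}
# Untested sketch of a possible smoke test -- stub and marker are hypothetical.
STUB=$(mktemp)
cat > "$STUB" <<'EOF'
#!/bin/sh
echo PYSPARK_PYTHON_HONOURED
EOF
chmod +x "$STUB"

# If the wrapper honours PYSPARK_PYTHON, our stub runs and prints the marker.
if PYSPARK_PYTHON="$STUB" pyspark 2>&1 | grep -q PYSPARK_PYTHON_HONOURED; then
  echo "PASS: pyspark honoured PYSPARK_PYTHON"
else
  echo "FAIL: pyspark overrode PYSPARK_PYTHON"
fi
rm -f "$STUB"
{code}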

2) This is an interesting scenario.  As a side question: *can I also ask why you 
have a custom python build? Can you just install numpy on the different nodes in 
the cluster...?*



> The pyspark command, by default, points to a script that contains a bug
> -----------------------------------------------------------------------
>
>                 Key: BIGTOP-1546
>                 URL: https://issues.apache.org/jira/browse/BIGTOP-1546
>             Project: Bigtop
>          Issue Type: Bug
>    Affects Versions: 0.8.0
>            Reporter: Joao Salcedo
>             Fix For: 0.9.0
>
>         Attachments: BIGTOP-1546.patch
>
>
> First, I want to point out that I am not using the OS default python on my 
> client side:
> $ which python
> ~/work/anaconda/bin/python
> This is my own build of python which includes all the numeric python 
> libraries. Now let me show where pyspark points:
> $ which pyspark
> /usr/bin/pyspark
> This is a symlink:
> $ ls -l /usr/bin/ | grep pyspark
> lrwxrwxrwx 1 root root 25 Oct 26 10:58 pyspark -> /etc/alternatives/pyspark
>  $ ls -l /etc/alternatives/ | grep pyspark
> lrwxrwxrwx 1 root root 60 Oct 26 10:58 pyspark -> 
> /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/bin/pyspark
> which, if you follow it, links up to the file I am claiming is buggy.
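> (Side note: you can resolve the whole chain in one step with readlink; per the
> links above it bottoms out at the parcel's bin/pyspark wrapper:)
> $ readlink -f /usr/bin/pyspark
> /opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/bin/pyspark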
> Now let me show you the effect this setup has on pyspark:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> sc.parallelize([1, 2, 3]).count()
> <snip>
> 14/11/18 09:44:17 INFO SparkContext: Starting job: count at <stdin>:1
> 14/11/18 09:44:17 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 2 
> output partitions (allowLocal=false)
> 14/11/18 09:44:17 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
> 14/11/18 09:44:17 INFO DAGScheduler: Parents of final stage: List()
> 14/11/18 09:44:17 INFO DAGScheduler: Missing parents: List()
> 14/11/18 09:44:17 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at RDD 
> at PythonRDD.scala:40), which has no missing parents
> 14/11/18 09:44:17 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 
> (PythonRDD[1] at RDD at PythonRDD.scala:40)
> 14/11/18 09:44:17 INFO YarnClientClusterScheduler: Adding task set 0.0 with 2 
> tasks
> 14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:0 as TID 0 on 
> executor 2: lxe0389.allstate.com (PROCESS_LOCAL)
> 14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:0 as 2604 bytes in 
> 3 ms
> 14/11/18 09:44:17 INFO TaskSetManager: Starting task 0.0:1 as TID 1 on 
> executor 1: lxe0553.allstate.com (PROCESS_LOCAL)
> 14/11/18 09:44:17 INFO TaskSetManager: Serialized task 0.0:1 as 2619 bytes in 
> 1 ms
> 14/11/18 09:44:19 INFO RackResolver: Resolved lxe0389.allstate.com to 
> /ro/rack18
> 14/11/18 09:44:19 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
> 14/11/18 09:44:19 WARN TaskSetManager: Loss was due to 
> org.apache.spark.api.python.PythonException
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File 
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
>  line 77, in main
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File 
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>  line 191, in dump_stream
>     self.serializer.dump_stream(self._batched(iterator), stream)
>   File 
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>  line 123, in dump_stream
>     for obj in iterator:
>   File 
> "/hadoop05/yarn/nm/usercache/mdrus/filecache/237/spark-assembly-1.0.0-cdh5.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>  line 180, in _batched
>     for item in iterator:
>   File 
> "/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/python/pyspark/rdd.py",
>  line 613, in func
>     if acc is None:
> TypeError: an integer is required
> OK, there's the trace. As I said, this is, to me, not that illuminating as to 
> what is actually going on. Let me try to explain. Notice that, when I first 
> boot up pyspark, the client python is my own personal installation; you can 
> see this from the boot-up message the python interpreter gives. On the other 
> hand, the workers are NOT using this python; they default to the system 
> python. This is what the line of code from my prior email is about:
> PYSPARK_PYTHON="python"
> This forces the worker pythons to use the OS python in /usr/bin/python. Mine 
> is py2.7 and the OS's is py2.6. I suspect there is an incompatibility when 
> passing messages between the two interpreters with pickle (python object 
> serialization) that causes the above error. For example, switching the python 
> used by the client back to the OS python fixes this issue.
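> A guarded assignment in the wrapper would avoid the forced override; e.g.
> (sketch only, I have not compared this against the attached patch):
> export PYSPARK_PYTHON=${PYSPARK_PYTHON:-python}
> That is, fall back to the system python only when the user has not already
> chosen an interpreter.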
> pyspark provides two environment variables so that the end user can 
> comfortably change which interpreter is used by the client and the workers: 
> PYSPARK_PYTHON and SPARK_YARN_USER_ENV. The following setup should 
> allow me to use my own install of the interpreter:
> $ export PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7
> mdrus@lxe0038 [.../spark/bin] $ export SPARK_YARN_USER_ENV="PYSPARK_PYTHON=/home/mdrus/work/anaconda/bin/python2.7"
> But unfortunately, this does not work, and gives the same error as 
> before. The reason is the line of code I pointed to before in 
> /usr/bin/pyspark, which forcefully overrides my choice of PYSPARK_PYTHON at 
> runtime. This is evidenced by the fact that the following:
> $ alias pyspark=/opt/cloudera/parcels/CDH-5.1.0-1.cdh5.1.0.p0.53/lib/spark/bin/pyspark
> immediately fixes the issue:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> sc.parallelize([1, 2, 3]).count()
> <snip>
> 3
> and also allows me to do fun numpy things like:
> $ pyspark --master yarn
> Python 2.7.5 |Anaconda 1.7.0 (64-bit)| (default, Jun 28 2013, 22:10:09)
> [GCC 4.1.2 20080704 (Red Hat 4.1.2-54)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> <snip>
> >>> import numpy as np
> >>> sc.parallelize([np.array([1, 2, 3]), np.array([4, 5, 6])]).map(np.sum).sum()
> <snip>
> 21



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
