[ https://issues.apache.org/jira/browse/SPARK-3662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146639#comment-14146639 ]

Evan Samanas commented on SPARK-3662:
-------------------------------------

I wouldn't focus on the example, on the fact that I modified it, or on whether 
I should be importing a small portion of pandas instead.  The issue here is 
that Spark breaks in this case because of a name collision; modifying the 
example is simply the one reproducer I've found.

I was modifying the example to learn how Spark ships Python code to the 
cluster.  In this case, I expected pandas to be imported only in the driver 
program and not by any of the workers.  The workers do not have pandas 
installed, so the expected behavior is that the example runs to completion; an 
ImportError would mean that the workers are importing things they don't need 
for the task at hand.

The way I expected Spark to work IS actually how Spark works: modules are 
imported by workers only if a function shipped to them uses those modules, but 
this error showed me false evidence to the contrary.  I'm assuming the error is 
in Spark's modifications to CloudPickle, not in the way the example is set up.
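
For reference, here is a minimal sketch of the behavior I expected (the setup 
is illustrative, not taken from pi.py): a module imported only at the top of 
the driver script should never be imported on the workers, because the shipped 
function doesn't reference it:

{code}
from pyspark import SparkContext

import pandas  # driver-only import; the workers don't have pandas installed

sc = SparkContext(appName="DriverOnlyImport")

def f(x):
    # f doesn't reference pandas, so pickling f shouldn't drag pandas along,
    # and the workers shouldn't attempt to import it.
    return x * 2

# Expected to run to completion even though the workers lack pandas.
print sc.parallelize(range(10)).map(f).collect()
{code}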

> Importing pandas breaks included pi.py example
> ----------------------------------------------
>
>                 Key: SPARK-3662
>                 URL: https://issues.apache.org/jira/browse/SPARK-3662
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, YARN
>    Affects Versions: 1.1.0
>         Environment: Xubuntu 14.04.  Yarn cluster running on Ubuntu 12.04.
>            Reporter: Evan Samanas
>
> If I add "import pandas" at the top of the included pi.py example and submit 
> using "spark-submit --master yarn-client", I get this stack trace:
> {code}
> Traceback (most recent call last):
>   File "/home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi.py", line 
> 39, in <module>
>     count = sc.parallelize(xrange(1, n+1), slices).map(f).reduce(add)
>   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 759, in reduce
>     vals = self.mapPartitions(func).collect()
>   File "/home/evan/pub_src/spark/python/pyspark/rdd.py", line 723, in collect
>     bytesInJava = self._jrdd.collect().iterator()
>   File 
> "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 538, in __call__
>   File 
> "/home/evan/pub_src/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", 
> line 300, in get_return_value
> py4j.protocol.Py4JJavaError14/09/23 15:51:58 INFO TaskSetManager: Lost task 
> 2.3 in stage 0.0 (TID 10) on executor SERVERNAMEREMOVED: 
> org.apache.spark.api.python.PythonException (Traceback (most recent call 
> last):
>   File 
> "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
>  line 75, in main
>     command = pickleSer._read_with_length(infile)
>   File 
> "/yarn/nm/usercache/evan/filecache/173/spark-assembly-1.1.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>  line 150, in _read_with_length
>     return self.loads(obj)
> ImportError: No module named algos
> {code}
> The example works fine if I move the statement "from random import random" 
> from the top of the file into the function (def f(_)) defined in the 
> example.  As near as I can tell, "random" is getting confused with a 
> function of the same name within pandas.algos.
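> For reference, a minimal sketch of the workaround, based on the f defined in 
> pi.py (the "import pandas" line stays at the top of the file):
> {code}
> def f(_):
>     # Moving this import from module scope to function scope avoids the
>     # worker-side "ImportError: No module named algos" above.
>     from random import random
>     x = random() * 2 - 1
>     y = random() * 2 - 1
>     return 1 if x ** 2 + y ** 2 < 1 else 0
> {code}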
> Submitting the same script using --master local works, but it prints a 
> distressing amount of random characters to stdout or stderr and messes up my 
> terminal:
> {code}
> ...
> @J@J@J@J@J@J@J@J@J@J@J@J@J@JJ@J@J@J@J 
> @J!@J"@J#@J$@J%@J&@J'@J(@J)@J*@J+@J,@J-@J.@J/@J0@J1@J2@J3@J4@J5@J6@J7@J8@J9@J:@J;@J<@J=@J>@J?@J@@JA@JB@JC@JD@JE@JF@JG@JH@JI@JJ@JK@JL@JM@JN@JO@JP@JQ@JR@JS@JT@JU@JV@JW@JX@JY@JZ@J[@J\@J]@J^@J_@J`@Ja@Jb@Jc@Jd@Je@Jf@Jg@Jh@Ji@Jj@Jk@Jl@Jm@Jn@Jo@Jp@Jq@Jr@Js@Jt@Ju@Jv@Jw@Jx@Jy@Jz@J{@J|@J}@J~@J@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ�@J�@J�@J�@J�@J�@J�@J�@J�@J�@J�@JAJAJAJAJAJAJAJAAJ
>        AJ
> AJ
>   AJ
> AJAJAJAJAJAJAJAJAJAJAJAJAJAJJAJAJAJAJ 
> AJ!AJ"AJ#AJ$AJ%AJ&AJ'AJ(AJ)AJ*AJ+AJ,AJ-AJ.AJ/AJ0AJ1AJ2AJ3AJ4AJ5AJ6AJ7AJ8AJ9AJ:AJ;AJ<AJ=AJ>AJ?AJ@AJAAJBAJCAJDAJEAJFAJGAJHAJIAJJAJKAJLAJMAJNAJOAJPAJQAJRAJSAJTAJUAJVAJWAJXAJYAJZAJ[AJ\AJ]AJ^AJ_AJ`AJaAJbAJcAJdAJeAJfAJgAJhAJiAJjAJkAJlAJmAJnAJoAJpAJqAJrAJsAJtAJuAJvAJwAJxAJyAJzAJ{AJ|AJ}AJ~AJAJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�A14/09/23
>  15:42:09 INFO SparkContext: Job finished: reduce at 
> /home/evan/pub_src/spark-1.1.0/examples/src/main/python/pi_sframe.py:38, took 
> 11.276879779 s
> J�AJ�AJ�AJ�AJ�AJ�AJ�AJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJ�AJBJBJBJBJBJBJBJBBJ
>      BJ
> BJ
>   BJ
> BJBJBJBJBJBJBJBJBJBJBJBJBJBJJBJBJBJBJ 
> BJ!BJ"BJ#BJ$BJ%BJ&BJ'BJ(BJ)BJ*BJ+BJ,BJ-BJ.BJ/BJ0BJ1BJ2BJ3BJ4BJ5BJ6BJ7BJ8BJ9BJ:BJ;BJ<BJ=BJ>BJ?BJ@Be.
> �]qJ#1a.
> �]qJX4a.
> �]qJX4a.
> �]qJ#1a.
> �]qJX4a.
> �]qJX4a.
> �]qJ#1a.
> �]qJX4a.
> �]qJX4a.
> �]qJa.
> Pi is roughly 3.146136
> {code}
> No idea if that's related, but thought I'd include it.


