I think you should file a bug.
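
In the meantime, here is a quick way to see why rdd.py insists on PYTHONHASHSEED under Python 3.3+. This is just an illustration I put together, not code from Spark, and the helper name hash_of_a is made up. Each Python 3 process picks a random string-hash seed unless PYTHONHASHSEED is pinned, so keyed operations such as subtract(), which go through portable_hash(), could route the same key to different partitions on different executors:

    import os
    import subprocess
    import sys

    def hash_of_a(seed=None):
        # Launch a fresh interpreter and report hash("a") with the given seed.
        env = dict(os.environ)
        env.pop("PYTHONHASHSEED", None)
        if seed is not None:
            env["PYTHONHASHSEED"] = seed
        out = subprocess.check_output(
            [sys.executable, "-c", "print(hash('a'))"], env=env)
        return out.decode().strip()

    # Unpinned: two fresh processes almost always disagree.
    print(hash_of_a(), hash_of_a())

    # Pinned: reproducible across processes, which is what spark-env.sh
    # needs to guarantee on the driver and on every worker.
    print(hash_of_a("123"), hash_of_a("123"))

With no seed the two numbers should differ almost every run; with the seed set you get the same value every time, which is the property portable_hash() needs across the whole cluster.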
> On Nov 29, 2015, at 9:48 AM, Andy Davidson <a...@santacruzintegration.com> wrote:
>
> Hi Felix and Ted
>
> This is how I am starting spark
>
> Should I file a bug?
>
> Andy
>
>
> export PYSPARK_PYTHON=python3.4
> export PYSPARK_DRIVER_PYTHON=python3.4
> export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
>
> $SPARK_ROOT/bin/pyspark \
>     --master $MASTER_URL \
>     --total-executor-cores $numCores \
>     --driver-memory 2G \
>     --executor-memory 2G \
>     $extraPkgs \
>     $*
>
> From: Felix Cheung <felixcheun...@hotmail.com>
> Date: Saturday, November 28, 2015 at 12:11 AM
> To: Ted Yu <yuzhih...@gmail.com>
> Cc: Andrew Davidson <a...@santacruzintegration.com>, "user @spark" <user@spark.apache.org>
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
>
> Ah, it's there in spark-submit and pyspark.
> Seems like it should be added for spark_ec2
>
>
> _____________________________
> From: Ted Yu <yuzhih...@gmail.com>
> Sent: Friday, November 27, 2015 11:50 AM
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark <user@spark.apache.org>
>
> ec2/spark-ec2 calls ./ec2/spark_ec2.py
>
> I don't see PYTHONHASHSEED defined in any of these scripts.
>
> Andy reported this for ec2 cluster.
>
> I think a JIRA should be opened.
>
>> On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>> May I ask how you are starting Spark?
>> It looks like PYTHONHASHSEED is being set:
>> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>>
>>
>> Date: Thu, 26 Nov 2015 11:30:09 -0800
>> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
>> From: a...@santacruzintegration.com
>> To: user@spark.apache.org
>>
>> I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED?
>>
>> Should I file a bug?
>>
>> Kind regards
>>
>> Andy
>>
>> Details
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>>
>> The example does not work out of the box:
>>
>> subtract(other, numPartitions=None)
>>     Return each value in self that is not contained in other.
>>
>>     >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>     >>> y = sc.parallelize([("a", 3), ("c", None)])
>>     >>> sorted(x.subtract(y).collect())
>>     [('a', 1), ('b', 4), ('b', 5)]
>>
>> It raises
>>
>>     if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>>         raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>>
>> The following script fixes the problem:
>>
>>     sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>>
>>     sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>>
>>     sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
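
For what it's worth, once PYTHONHASHSEED is exported in spark-env.sh on every node (as in the workaround quoted above), a quick way to confirm the workers actually picked it up is to ask them from the driver. This is only a sketch of my own, not something from the Spark docs; it assumes pyspark is importable and that a healthy cluster should report exactly one value:

    import os
    from pyspark import SparkContext

    sc = SparkContext(appName="pythonhashseed-check")

    def seed_on_worker(_):
        # Report what each executor's Python process actually sees.
        return os.environ.get("PYTHONHASHSEED", "<unset>")

    seeds = set(sc.parallelize(range(32), 32).map(seed_on_worker).collect())
    print(seeds)  # expect a single value, e.g. {'123'}
    sc.stop()

If more than one value (or '<unset>') comes back, at least one worker is still missing the setting, and operations that rely on portable_hash() will either raise the exception above or partition keys inconsistently.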