I think you should file a bug.
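
In the meantime, here is a quick way to see why rdd.py insists on PYTHONHASHSEED under Python 3.3+. This is just an illustration I put together, not code from Spark, and the helper name hash_of_a is made up. Each Python 3 process picks a random string-hash seed unless PYTHONHASHSEED is pinned, so keyed operations such as subtract(), which go through portable_hash(), could route the same key to different partitions on different executors:

    import os
    import subprocess
    import sys

    def hash_of_a(seed=None):
        # Launch a fresh interpreter and report hash("a") with the given seed.
        env = dict(os.environ)
        env.pop("PYTHONHASHSEED", None)
        if seed is not None:
            env["PYTHONHASHSEED"] = seed
        out = subprocess.check_output(
            [sys.executable, "-c", "print(hash('a'))"], env=env)
        return out.decode().strip()

    # Unpinned: two fresh processes almost always disagree.
    print(hash_of_a(), hash_of_a())

    # Pinned: reproducible across processes, which is what spark-env.sh
    # needs to guarantee on the driver and on every worker.
    print(hash_of_a("123"), hash_of_a("123"))

With no seed the two numbers should differ almost every run; with the seed set you get the same value every time, which is the property portable_hash() needs across the whole cluster.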
> On Nov 29, 2015, at 9:48 AM, Andy Davidson <a...@santacruzintegration.com> wrote:
>
> Hi Felix and Ted
>
> This is how I am starting spark
>
> Should I file a bug?
>
> Andy
>
>
> export PYSPARK_PYTHON=python3.4
> export PYSPARK_DRIVER_PYTHON=python3.4
> export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"
>
> $SPARK_ROOT/bin/pyspark \
>     --master $MASTER_URL \
>     --total-executor-cores $numCores \
>     --driver-memory 2G \
>     --executor-memory 2G \
>     $extraPkgs \
>     $*
>
> From: Felix Cheung <felixcheun...@hotmail.com>
> Date: Saturday, November 28, 2015 at 12:11 AM
> To: Ted Yu <yuzhih...@gmail.com>
> Cc: Andrew Davidson <a...@santacruzintegration.com>, "user @spark" <user@spark.apache.org>
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
>
> Ah, it's there in spark-submit and pyspark.
> Seems like it should be added for spark_ec2
>
>
> _____________________________
> From: Ted Yu <yuzhih...@gmail.com>
> Sent: Friday, November 27, 2015 11:50 AM
> Subject: Re: possible bug spark/python/pyspark/rdd.py portable_hash()
> To: Felix Cheung <felixcheun...@hotmail.com>
> Cc: Andy Davidson <a...@santacruzintegration.com>, user @spark <user@spark.apache.org>
>
> ec2/spark-ec2 calls ./ec2/spark_ec2.py
>
> I don't see PYTHONHASHSEED defined in any of these scripts.
>
> Andy reported this for ec2 cluster.
>
> I think a JIRA should be opened.
>
>> On Fri, Nov 27, 2015 at 11:01 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
>> May I ask how you are starting Spark?
>> It looks like PYTHONHASHSEED is being set:
>> https://github.com/apache/spark/search?utf8=%E2%9C%93&q=PYTHONHASHSEED
>>
>>
>> Date: Thu, 26 Nov 2015 11:30:09 -0800
>> Subject: possible bug spark/python/pyspark/rdd.py portable_hash()
>> From: a...@santacruzintegration.com
>> To: user@spark.apache.org
>>
>> I am using spark-1.5.1-bin-hadoop2.6. I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured spark-env to use python3. I get an exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not just set PYTHONHASHSEED?
>>
>> Should I file a bug?
>>
>> Kind regards
>>
>> Andy
>>
>> Details
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract
>>
>> The example does not work out of the box:
>>
>> subtract(other, numPartitions=None)
>>     Return each value in self that is not contained in other.
>>
>>     >>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>     >>> y = sc.parallelize([("a", 3), ("c", None)])
>>     >>> sorted(x.subtract(y).collect())
>>     [('a', 1), ('b', 4), ('b', 5)]
>>
>> It raises
>>
>>     if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
>>         raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")
>>
>> The following script fixes the problem:
>>
>>     sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh
>>
>>     sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`
>>
>>     sudo for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
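
For what it's worth, once PYTHONHASHSEED is exported in spark-env.sh on every node (as in the workaround quoted above), a quick way to confirm the workers actually picked it up is to ask them from the driver. This is only a sketch of my own, not something from the Spark docs; it assumes pyspark is importable and that a healthy cluster should report exactly one value:

    import os
    from pyspark import SparkContext

    sc = SparkContext(appName="pythonhashseed-check")

    def seed_on_worker(_):
        # Report what each executor's Python process actually sees.
        return os.environ.get("PYTHONHASHSEED", "<unset>")

    seeds = set(sc.parallelize(range(32), 32).map(seed_on_worker).collect())
    print(seeds)  # expect a single value, e.g. {'123'}
    sc.stop()

If more than one value (or '<unset>') comes back, at least one worker is still missing the setting, and operations that rely on portable_hash() will either raise the exception above or partition keys inconsistently.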