[ https://issues.apache.org/jira/browse/SPARK-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065671#comment-14065671 ]
Matthew Farrellee commented on SPARK-2494:
------------------------------------------

Thank you. I've confirmed this:

{code}
>>> rdd.groupByKey(10).collect()
[((None, 1), <pyspark.resultiterable.ResultIterable object at 0x19d4410>),
 ((None, 1), <pyspark.resultiterable.ResultIterable object at 0x19d4310>),
 ((None, 1), <pyspark.resultiterable.ResultIterable object at 0x19d7290>)]
{code}

I have 3 workers in my cluster.

> Hash of None is different cross machines in CPython
> ---------------------------------------------------
>
>                 Key: SPARK-2494
>                 URL: https://issues.apache.org/jira/browse/SPARK-2494
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.0, 1.0.1
>         Environment: CPython 2.x
>            Reporter: Davies Liu
>            Priority: Blocker
>              Labels: pyspark, shuffle
>             Fix For: 1.0.0, 1.0.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The hash of None, and of any tuple containing None, differs across machines,
> so the result will be wrong if None appears in a key passed to partitionBy().
> PySpark should use a portable hash function as the default partition function,
> one that generates the same hash for all the builtin immutable types,
> especially tuple.
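
For illustration only, here is a minimal sketch of what a portable hash along the lines described in the issue might look like. The names `portable_hash` and `partition_for` are hypothetical and chosen for this example; the sketch is not taken from the Spark source. It assumes that the builtin `hash()` is already consistent across workers for every key type other than None and tuple.

```python
import sys


def portable_hash(x):
    """Hypothetical helper: hash None and tuples without relying on object addresses."""
    if x is None:
        # Use a fixed value rather than whatever hash(None) returns on this machine.
        return 0
    if isinstance(x, tuple):
        # Combine element hashes recursively so a None nested in a tuple is stable too.
        h = 0x345678
        for item in x:
            h ^= portable_hash(item)
            h *= 1000003
            h &= sys.maxsize  # keep the running value within the native int range
        h ^= len(x)
        return h
    # Assumption: the builtin hash is consistent across workers for other key types.
    return hash(x)


def partition_for(key, num_partitions):
    """Hypothetical partition function: map a key to a partition index."""
    return portable_hash(key) % num_partitions
```

With a function like this, every worker would send the key (None, 1) to the same partition, so groupByKey() should return it once rather than once per worker. Note that the fallback to `hash()` is not safe for strings on Python 3 unless PYTHONHASHSEED is set to the same value on all workers, because string hashes are randomized per process there.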