[
https://issues.apache.org/jira/browse/SPARK-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matei Zaharia resolved SPARK-1468.
----------------------------------
Resolution: Fixed
> The hash method used by partitionBy in PySpark doesn't deal with None
> correctly.
> --------------------------------------------------------------------------------
>
> Key: SPARK-1468
> URL: https://issues.apache.org/jira/browse/SPARK-1468
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 0.9.0
> Reporter: Erik Selin
> Assignee: Erik Selin
> Fix For: 0.9.2, 1.0.1
>
>
> In Python, the default hash method is based on an object's memory address.
> Since None is an object, None keys can be sent to different partitions
> depending on which Python process computes the hash. This causes some really
> odd results when None keys are used in partitionBy.
> I've created a fix using a consistent hashing method that sends None to 0.
> That PR lives at https://github.com/apache/spark/pull/371
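
For illustration only, the idea described above (a hash that always sends None to 0) might look roughly like the sketch below. The names portable_hash and partition_for are hypothetical and are not taken from the linked PR; the actual patch may differ.

```python
def portable_hash(key):
    """Hypothetical helper: hash a key so that None is stable across processes."""
    if key is None:
        # Pin None to a fixed value instead of relying on the built-in hash,
        # which the issue reports as varying between Python processes.
        return 0
    # All other keys fall back to the built-in hash in this sketch.
    return hash(key)


def partition_for(key, num_partitions):
    """Hypothetical helper: pick a partition index from the stable hash."""
    return portable_hash(key) % num_partitions
```

With a function like this, every worker process would place None keys in partition 0. In PySpark, partitionBy accepts a partition function as its optional second argument, so a call along the lines of rdd.partitionBy(4, portable_hash) is one way such a helper could be wired in.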