[ 
https://issues.apache.org/jira/browse/SPARK-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Selin updated SPARK-1468:
------------------------------

    Description: 
In Python, the default hash method uses the memory address of an object. Since 
None is an object, it can be assigned to different partitions depending on 
which Python process computes the hash. This causes odd results, such as 
records with a None key landing in different partitions, when None keys are 
used in partitionBy.

I've created a fix that uses a hash method that is consistent across processes 
and maps None to 0. The PR lives at https://github.com/apache/spark/pull/371
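
A minimal sketch of the idea, not the code from the pull request: the names 
consistent_hash and partition_for are hypothetical helpers made up for this 
illustration and are not part of the PySpark API. It assumes the partition 
index is the key's hash modulo the number of partitions, and it only 
special-cases None; every other key still goes through the built-in hash().

```python
def consistent_hash(key):
    """Return a hash that is the same in every Python process for None.

    Hypothetical helper for illustration only.
    """
    if key is None:
        # None gets a fixed value instead of an address-derived one.
        return 0
    # All other keys fall back to the built-in hash.
    return hash(key)


def partition_for(key, num_partitions):
    """Pick a partition index in the range [0, num_partitions)."""
    # Python's % returns a non-negative result for a positive modulus.
    return consistent_hash(key) % num_partitions


print(partition_for(None, 4))  # prints 0
```

With this in place, every worker process computes the same partition for a 
None key, so all records keyed by None end up together.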

  was: In Python, the default hash method uses the memory address of an object. 
Since None is an object, it can be assigned to different partitions depending 
on which Python process computes the hash. This causes odd results when None 
keys are used in partitionBy.


> The hash method used by partitionBy in PySpark doesn't deal with None 
> correctly.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-1468
>                 URL: https://issues.apache.org/jira/browse/SPARK-1468
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.0
>            Reporter: Erik Selin
>
> In Python, the default hash method uses the memory address of an object. 
> Since None is an object, it can be assigned to different partitions 
> depending on which Python process computes the hash. This causes odd 
> results, such as records with a None key landing in different partitions, 
> when None keys are used in partitionBy.
> I've created a fix that uses a hash method that is consistent across 
> processes and maps None to 0. The PR lives at 
> https://github.com/apache/spark/pull/371



--
This message was sent by Atlassian JIRA
(v6.2#6252)
