[jira] [Resolved] (SPARK-1468) The hash method used by partitionBy in Pyspark doesn't deal with None correctly.

Matei Zaharia (JIRA) Tue, 03 Jun 2014 13:34:26 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matei Zaharia resolved SPARK-1468.
----------------------------------

    Resolution: Fixed

> The hash method used by partitionBy in Pyspark doesn't deal with None 
> correctly.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-1468
>                 URL: https://issues.apache.org/jira/browse/SPARK-1468
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 0.9.0
>            Reporter: Erik Selin
>            Assignee: Erik Selin
>             Fix For: 0.9.2, 1.0.1
>
>
> In Python, the default hash method uses the memory address of objects. Since 
> None is an object, a None key can be assigned to different partitions 
> depending on which Python process computes the hash. This causes some really 
> odd results when None keys are used in partitionBy.
> I've created a fix using a consistent hashing method that sends None to 0. 
> The PR lives at https://github.com/apache/spark/pull/371
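
The idea described above can be sketched roughly as follows. This is only an illustration of "a consistent hashing method that sends None to 0", not the code from the pull request; the names none_safe_hash and partition_for are hypothetical and are not part of the PySpark API.

```python
def none_safe_hash(key):
    """Hash a key, mapping None to a fixed value.

    Hypothetical helper: None always hashes to 0, so the result does
    not depend on which Python process evaluates it. Every other key
    falls back to the built-in hash().
    """
    if key is None:
        return 0
    return hash(key)


def partition_for(key, num_partitions):
    """Pick a partition index in [0, num_partitions) for a key.

    Hypothetical helper showing how a partitioner could use the hash.
    """
    return none_safe_hash(key) % num_partitions


if __name__ == "__main__":
    # None lands in partition 0 regardless of the process.
    print(partition_for(None, 8))
```

This sketch only special-cases a bare None key; a None nested inside a composite key such as a tuple would still go through the built-in hash().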



--
This message was sent by Atlassian JIRA
(v6.2#6252)
