[ https://issues.apache.org/jira/browse/SPARK-27560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-27560:
------------------------------------

    Assignee:     (was: Apache Spark)

> HashPartitioner uses Object.hashCode which is not seeded
> --------------------------------------------------------
>
>                 Key: SPARK-27560
>                 URL: https://issues.apache.org/jira/browse/SPARK-27560
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.0
>         Environment: Notebook is running spark v2.4.0 local[*]
> Python 3.6.6 (default, Sep  6 2018, 13:10:03)
> [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
> I imagine this would reproduce on all operating systems and most versions of
> Spark, though.
>            Reporter: Andrew McHarg
>            Priority: Minor
>
> Forgive the quality of the bug report here; I am a pyspark user and not super
> familiar with the internals of Spark, yet I seem to have hit a strange corner
> case in the HashPartitioner.
> This may already be known, but repartition with HashPartitioner seems to
> assign everything to the same partition if data that was partitioned by the
> same column is only partially read (say, one partition). I suppose it is an
> obvious consequence of Object.hashCode being deterministic, but it took a
> while to track down.
> Steps to repro:
>  # Get a dataframe with a bunch of UUIDs, say 10000
>  # repartition(100, 'uuid_column')
>  # Save to parquet
>  # Read from parquet
>  # collect()[:100], then filter using pyspark.sql.functions isin (yes, I know
> this is bad and sampleBy should probably be used here)
>  # repartition(10, 'uuid_column')
>  # The resulting dataframe will have all of its data in one single partition,
> presumably because the hash is deterministic: rows read back from a single
> one of the 100 original partitions all have hashes that agree modulo 100,
> and since 10 divides 100 they also agree modulo 10. A rough PySpark sketch
> of these steps follows the notebook link below.
> Jupyter notebook for the above: 
> https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e
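> For concreteness, here is a minimal PySpark sketch of the repro. It is a
> hedged illustration, not the notebook's exact code: the generated data, the
> /tmp/uuids path, and the partition-size check via glom() are assumptions.
>
> import uuid
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.master("local[*]").getOrCreate()
>
> # 1. A dataframe with 10000 uuids (illustrative data)
> df = spark.createDataFrame(
>     [(str(uuid.uuid4()),) for _ in range(10000)], ["uuid_column"])
>
> # 2-3. Partition by the uuid column and save to parquet
> df100 = df.repartition(100, "uuid_column")
> df100.write.mode("overwrite").parquet("/tmp/uuids")  # path is an assumption
>
> # 4-5. Read back, take the first 100 collected rows, filter with isin
> df2 = spark.read.parquet("/tmp/uuids")
> keep = [r["uuid_column"] for r in df2.collect()[:100]]
> subset = df2.filter(F.col("uuid_column").isin(keep))
>
> # 6-7. Repartition by the same column; per the report, every row should
> # land in a single one of the 10 partitions
> sizes = subset.repartition(10, "uuid_column").rdd.glom().map(len).collect()
> print(sizes)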
> I think an easy fix would be to seed the HashPartitioner, like many hashtable
> libraries do to avoid denial-of-service attacks (a sketch of the idea is
> below). It also might be the case that this is obvious behavior for more
> experienced Spark users :)
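> To illustrate the seeding suggestion (this is not Spark's actual API): the
> RDD API already accepts a custom partition function in partitionBy, so a
> per-job salt can be mixed into the hash. The names salt and
> salted_partitioner are hypothetical, and the sketch reuses the spark
> session from above.
>
> import random
> import zlib
>
> # A per-job salt; it is captured in the closure and shipped to workers.
> salt = random.randrange(2 ** 32)
>
> def salted_partitioner(key):
>     # crc32 is deterministic across worker processes (unlike built-in
>     # hash() on strings when PYTHONHASHSEED varies), while the salt makes
>     # the key-to-partition mapping differ from job to job.
>     return zlib.crc32(("%d:%s" % (salt, key)).encode())
>
> rdd = spark.sparkContext.parallelize([(str(i), i) for i in range(1000)])
> parts = rdd.partitionBy(10, salted_partitioner)
> print(parts.glom().map(len).collect())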



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
