[ https://issues.apache.org/jira/browse/SPARK-27560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-27560:
------------------------------------

    Assignee:     (was: Apache Spark)

> HashPartitioner uses Object.hashCode which is not seeded
> --------------------------------------------------------
>
>                 Key: SPARK-27560
>                 URL: https://issues.apache.org/jira/browse/SPARK-27560
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.0
>         Environment: Notebook is running spark v2.4.0 local[*]
> Python 3.6.6 (default, Sep 6 2018, 13:10:03)
> [GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
> I imagine this would reproduce on all operating systems and most versions of
> Spark though.
>            Reporter: Andrew McHarg
>            Priority: Minor
>
> Forgive the quality of the bug report here. I am a PySpark user and not very
> familiar with the internals of Spark, yet it seems I have hit a strange corner
> case with the HashPartitioner.
> This may already be known, but repartition with HashPartitioner seems to
> assign everything to the same partition if data that was previously
> partitioned by the same column is only partially read (say, one partition).
> I suppose it is an obvious consequence of Object.hashCode being deterministic,
> but it took a while to track down.
> Steps to repro:
> # Get a DataFrame with a bunch of UUIDs, say 10000
> # repartition(100, 'uuid_column')
> # save to parquet
> # read from parquet
> # collect()[:100] then filter using pyspark.sql.functions isin (yes, I know
> this is bad and sampleBy should probably be used here)
> # repartition(10, 'uuid_column')
> # The resulting DataFrame will have all of its data in one single partition
> Jupyter notebook for the above:
> https://gist.github.com/robo-hamburger/4752a40cb643318464e58ab66cf7d23e
> I think an easy fix would be to seed the HashPartitioner, like many hashtable
> libraries do to avoid denial of service attacks.
> It also might be the case that this is obvious behavior for more experienced
> Spark users :)
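
The repro steps in the report can be illustrated without Spark. The following is a minimal pure-Python sketch, not the reporter's notebook and not Spark's actual hash function: `partition_of` is a hypothetical stand-in that hashes a key with BLAKE2b and takes the result modulo the partition count, and the `seed` argument, the `row-{i}` key names, and the choice of partition 0 are assumptions made for illustration only.

```python
import hashlib
import uuid
from collections import Counter


def partition_of(key: str, num_partitions: int, seed: bytes = b"") -> int:
    """Hypothetical stand-in for a hash partitioner: hash the key, take it modulo n."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=seed).digest()
    return int.from_bytes(digest, "big") % num_partitions


# Steps 1-2: 10000 UUID strings, assigned to 100 partitions by hash.
keys = [str(uuid.uuid5(uuid.NAMESPACE_DNS, f"row-{i}")) for i in range(10_000)]

# Steps 4-5: keep only the rows that landed in one partition
# (partition 0 here, an arbitrary choice for the sketch).
subset = [k for k in keys if partition_of(k, 100) == 0]

# Step 6: repartition the subset into 10 partitions using the same hash.
same_hash = Counter(partition_of(k, 10) for k in subset)

# The same step again, but with a different seed for the second hash.
reseeded = Counter(partition_of(k, 10, seed=b"second-pass") for k in subset)

print("rows in subset:", len(subset))
print("same hash, partitions used:", sorted(same_hash))
print("different seed, partitions used:", sorted(reseeded))
```

Because 10 divides 100, every key whose hash is 0 modulo 100 is also 0 modulo 10, so reusing the same hash sends the whole subset to a single partition. Hashing with a different seed on the second pass breaks that relationship in this sketch; whether seeding is the right fix inside Spark itself is a separate question that this example does not settle.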