[ 
https://issues.apache.org/jira/browse/SPARK-18479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673800#comment-15673800
 ] 

Hamel Ajay Kothari commented on SPARK-18479:
--------------------------------------------

Hi [~sowen], thanks for the quick response. What do you mean by "hash the whole
time"? If you're referring to hashing the java.sql.Timestamp column instead of
the Long, I'd just like to point out that java.sql.Timestamp's hashCode is
defined as the hashCode of its internal millisecond value, so the behavior is
identical. Is there something I'm missing here?
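
As a quick sanity check (my own sketch, not anything taken from the Spark
code), the two hashes line up exactly for whole-millisecond values:

    // java.sql.Timestamp inherits java.util.Date's hashCode, which mixes the
    // millisecond value the same way java.lang.Long.hashCode does
    val ms = 1479168000000L                  // an arbitrary midnight, epoch millis
    val ts = new java.sql.Timestamp(ms)
    val longHash = (ms ^ (ms >>> 32)).toInt  // java.lang.Long.hashCode formula
    assert(ts.hashCode == longHash)          // same hash, so same bucketing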

Can you also clarify what you mean by "later stages will run one partition"? I
understand that 199 tasks on 200 cores would leave one core idle, but who is
assuming that you have exactly 200 cores? sql.shuffle.partitions is hard-coded
independently of the core count; it's just set to a static 200. Odds are it's
not going to match your exact number of cores anyway.
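
(For what it's worth, the default is easy to override per session; the value
below is just an example, assuming the usual spark-shell "spark" session:)

    // override the default shuffle partition count for this session
    spark.conf.set("spark.sql.shuffle.partitions", "199")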

One more point regarding "there is no problem regardless of the number of
buckets if the hash function is good": Java's own HashMap implementation
doesn't take that for granted. If you look at the source, it applies a second,
well-known hash function to the provided hash value in order to ensure that
bucketing behaves well, and even then it requires the number of buckets to be
a power of two.
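
Roughly what that looks like (my paraphrase of the JDK 8 source, not a copy):

    // supplemental mix applied by java.util.HashMap before indexing
    def spread(h: Int): Int = h ^ (h >>> 16)
    // table index; n is the table size and must be a power of two for this to work
    def bucket(hashCode: Int, n: Int): Int = (n - 1) & spread(hashCode)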

If we're not going to rehash (and thereby explicitly guard against a weak hash
function the way HashMap does), it's much safer to use a prime number of
buckets, or at least something other than such a round number.
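
To make the midnight-timestamp example from the ticket concrete, here's a rough
back-of-the-envelope sketch. It uses plain Long-style hashing and a simple mod,
which is only an approximation of what Spark's hash partitioning actually does:

    // one midnight per day for ten years, hashed Long-style and bucketed by mod
    def nonNegativeMod(x: Int, mod: Int): Int = {
      val r = x % mod
      if (r < 0) r + mod else r
    }
    def bucketsHit(numPartitions: Int): Int = {
      val midnights = (0 until 3650).map(_ * 86400000L)  // epoch millis at 00:00 each day
      midnights
        .map(ms => nonNegativeMod((ms ^ (ms >>> 32)).toInt, numPartitions))
        .distinct
        .size
    }
    println(s"buckets hit out of 200: ${bucketsHit(200)}")
    println(s"buckets hit out of 199: ${bucketsHit(199)}")

If the skew described above is real, the first count should come out well under
200, while the prime case should cover close to the full range.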

> spark.sql.shuffle.partitions defaults should be a prime number
> --------------------------------------------------------------
>
>                 Key: SPARK-18479
>                 URL: https://issues.apache.org/jira/browse/SPARK-18479
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Hamel Ajay Kothari
>
> For most hash bucketing use cases it is my understanding that a prime value, 
> such as 199, would be a safer value than the existing value of 200. Using a 
> non-prime value makes the likelihood of collisions much higher when the hash 
> function isn't great.
> Consider the case where you've got a Timestamp or Long column with 
> millisecond times at midnight each day. With the default value for 
> spark.sql.shuffle.partitions, you'll end up with 120/200 partitions being 
> completely empty.
> Looking around, there doesn't seem to be a good reason why we chose 200, so I
> don't see a huge risk in changing it.


