[GitHub] [spark] nikitagkonda commented on a change in pull request #21149: [SPARK-24076][SQL] Use different seed in HashAggregate to avoid hash conflict

GitBox Tue, 05 Mar 2019 17:05:24 -0800

nikitagkonda commented on a change in pull request #21149: [SPARK-24076][SQL] Use different seed in HashAggregate to avoid hash conflict
URL: https://github.com/apache/spark/pull/21149#discussion_r262755515


 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala
 ##########
 @@ -755,7 +755,10 @@ case class HashAggregateExec(
     }
 
     // generate hash code for key
-    val hashExpr = Murmur3Hash(groupingExpressions, 42)
+    // SPARK-24076: HashAggregate uses the same hash algorithm on the same expressions
+    // as ShuffleExchange, it may lead to bad hash conflict when shuffle.partitions=8192*n,
+    // pick a different seed to avoid this conflict
+    val hashExpr = Murmur3Hash(groupingExpressions, 48)
 
 Review comment:
   @cloud-fan Would this be slower, since we are now moving to the interpreted 
version of hash code generation? If not, why didn't we use 
`unsafeRowKeys.hashCode()` in the first place?
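
   For context, the conflict that the code comment in the diff describes can be reproduced with a small standalone sketch. It is not Spark code: `scala.util.hashing.MurmurHash3` stands in for Spark's `Murmur3Hash` expression, and the object name, helper names, and the bucket count of 1024 are hypothetical choices for illustration. The only assumption taken from Spark is that the shuffle picks a partition as a non-negative modulo of the seed-42 hash.

```scala
import scala.util.hashing.MurmurHash3

// Simplified model, not Spark internals: all names here are hypothetical.
object SeedCollisionSketch {
  // Non-negative modulo, in the spirit of Spark's Pmod expression.
  def pmod(a: Int, n: Int): Int = { val r = a % n; if (r < 0) r + n else r }

  // Stand-in hash: the Scala stdlib MurmurHash3, not Spark's Murmur3Hash.
  def hash(key: Int, seed: Int): Int = MurmurHash3.stringHash(key.toString, seed)

  // The shuffle side always hashes with seed 42 in this model.
  def partitionOf(key: Int, numPartitions: Int): Int =
    pmod(hash(key, 42), numPartitions)

  // Number of distinct hash-map buckets the keys occupy for a given seed.
  def distinctBuckets(keys: Seq[Int], seed: Int, buckets: Int): Int =
    keys.map(k => pmod(hash(k, seed), buckets)).distinct.size

  def main(args: Array[String]): Unit = {
    val numPartitions = 8192 // the shuffle.partitions=8192*n case with n = 1
    val buckets = 1024       // hypothetical map size; a power of two dividing 8192

    // Keys that the shuffle routed to partition 0.
    val keys = (0 until 1000000).filter(k => partitionOf(k, numPartitions) == 0)

    // Same seed as the shuffle: every key shares its low bits, so one bucket.
    println(s"seed 42: ${distinctBuckets(keys, 42, buckets)} bucket(s) for ${keys.size} keys")
    // Different seed: the keys spread over many buckets.
    println(s"seed 48: ${distinctBuckets(keys, 48, buckets)} bucket(s) for ${keys.size} keys")
  }
}
```

   In this model every key in a partition satisfies `hash % 8192 == partition`, so reusing seed 42 for a table whose size divides 8192 puts all of them into a single bucket, while seed 48 produces a hash that is unrelated to the partition id.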
