[GitHub] [spark] sumeetgajjar commented on a change in pull request #35047: [SPARK-37175][SQL] Performance improvement to hash joins with many duplicate keys

GitBox Mon, 31 Jan 2022 22:51:21 -0800


sumeetgajjar commented on a change in pull request #35047:
URL: https://github.com/apache/spark/pull/35047#discussion_r796302266




##########
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3380,6 +3380,17 @@ object SQLConf {
       .checkValue(_ >= 0, "The value must be non-negative.")
       .createWithDefault(8)
 
+  val HASHED_RELATION_REORDER_FACTOR = 
buildConf("spark.sql.hashedRelationReorderFactor")
+    .doc("The HashedRelation will be reordered if the number of unique keys 
times this factor is " +
+      "less than equal to the total number of rows in the HashedRelation. " +
+      "The reordering places all rows with the same key adjacent to each other 
to improve " +
+      "spatial locality. This provides a performance boost while iterating 
over the rows for a " +
+      "given key due to increased cache hits")
+    .version("3.3.0")
+    .doubleConf
+    .checkValue(_ > 1, "The value must be greater than 1.")
+    .createOptional

Review comment:
       Apologies for the delayed response, I was stuck with some work stuff 
followed by a sick week due to covid.
   
   > @sumeetgajjar - could you help elaborate more why a global default value 
is sufficient per my question above?
   
   @c21 My rationale behind suggesting a global value of 4 was based on the 
experiment that I ran. I ran a synthetic workload with the HashOptimization 
enabled no matter the duplication factor of the keys. I gradually iterated over 
the duplication factor from 1 to 20, I noticed the optimization to be 
beneficial right after the duplication factor crossed a value of 4. Thus based 
on the experiment I conducted locally, I suggested a value of 4.
   
   > If the probe side has many looks up of keys with a lot of values, then we 
can see the improvement. But if the probe side does not look up much for these 
keys, then we probably cannot see the benefit. 
   
   I agree, the synthetic workload I was running queried the probe side such 
that the majority of the keys had multiple values. 
   
   Anyways, due to concerns over the added memory pressure introduced by this 
Optimization and the feedback received on the config being difficult to tweak, 
I've decided to close the PR. In case I find a better solution, I'll reopen the 
PR.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] sumeetgajjar commented on a change in pull request #35047: [SPARK-37175][SQL] Performance improvement to hash joins with many duplicate keys

Reply via email to