sumeetgajjar commented on a change in pull request #35047:
URL: https://github.com/apache/spark/pull/35047#discussion_r796302266
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
##########
@@ -3380,6 +3380,17 @@ object SQLConf {
.checkValue(_ >= 0, "The value must be non-negative.")
.createWithDefault(8)
+ val HASHED_RELATION_REORDER_FACTOR =
buildConf("spark.sql.hashedRelationReorderFactor")
+ .doc("The HashedRelation will be reordered if the number of unique keys
times this factor is " +
+ "less than equal to the total number of rows in the HashedRelation. " +
+ "The reordering places all rows with the same key adjacent to each other
to improve " +
+ "spatial locality. This provides a performance boost while iterating
over the rows for a " +
+ "given key due to increased cache hits")
+ .version("3.3.0")
+ .doubleConf
+ .checkValue(_ > 1, "The value must be greater than 1.")
+ .createOptional
Review comment:
Apologies for the delayed response, I was stuck with some work stuff
followed by a sick week due to covid.
> @sumeetgajjar - could you help elaborate more why a global default value
is sufficient per my question above?
@c21 My rationale behind suggesting a global value of 4 was based on the
experiment that I ran. I ran a synthetic workload with the HashOptimization
enabled no matter the duplication factor of the keys. I gradually iterated over
the duplication factor from 1 to 20, I noticed the optimization to be
beneficial right after the duplication factor crossed a value of 4. Thus based
on the experiment I conducted locally, I suggested a value of 4.
> If the probe side has many looks up of keys with a lot of values, then we
can see the improvement. But if the probe side does not look up much for these
keys, then we probably cannot see the benefit.
I agree, the synthetic workload I was running queried the probe side such
that the majority of the keys had multiple values.
Anyways, due to concerns over the added memory pressure introduced by this
Optimization and the feedback received on the config being difficult to tweak,
I've decided to close the PR. In case I find a better solution, I'll reopen the
PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]