[GitHub] [spark] HeartSaVioR edited a comment on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

GitBox Thu, 27 Jan 2022 14:10:37 -0800


HeartSaVioR edited a comment on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1023682745



   > Could you give me a concrete example of this?
   
   Actually it's hypothetical one; I tried to reproduce it but you may see 
#35341 that I failed to reproduce with built-in sources.
   
   > Currently the rule only skips shuffle in join if both sides report the 
same distribution
   
   If there is any chance for both sides to change the distributions 
altogether, rule will skip shuffle in join since they are already having same 
distribution, but stream-stream join should read from state as well which may 
be partitioned in different way and the partitioning is not flexible. 
   
   That said, state partitioning cannot be changed during query lifetime, at 
least for now. And we don't have a way to maintain the information of state 
partition (we only ensure the number of partitions of state, via sticking the 
config value for the number of "shuffle" partition), so the safest way for 
state is to follow the Spark's hash partitioning.
   
   The thing is the hash function - even if source is partitioned/bucketed in 
same criteria (columns), the hash function and the number of partitions must be 
same as Spark's state as well for stateful operators. That said, at least as of 
now, stateful operators cannot leverage the benefits on source side 
distribution. There's a room to improve this, like let state follow the 
distribution of inputs, but also store the distribution info to metadata so 
that the stateful operator forces the state distribution (triggering shuffle) 
if they are different in future runs.
   
   > with the first follow-up by @cloud-fan I think we've already restored the 
previous behavior.
   
   Could you please refer the commit/PR for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR edited a comment on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

Reply via email to