[GitHub] [spark] HeartSaVioR commented on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

GitBox Wed, 26 Jan 2022 22:38:51 -0800


HeartSaVioR commented on pull request #32875:
URL: https://github.com/apache/spark/pull/32875#issuecomment-1022894109



   @sunchao 
   
   Sorry for the post-review. I didn't know this PR may affect streaming query 
and indicated later. 
   
   I discussed with @cloud-fan about this change, and we are concerned about 
any possibility on skipping shuffle against grouping keys in stateful 
operators, "including stream-stream join".
   
   In Structured Streaming, state is partitioned with grouping keys based on 
Spark's internal hash function, and the number of partition is static. That 
said, if Spark does not respect the distribution of state against stateful 
operator, it leads to correctness problem.
   
   So please consider that same key is co-located for three aspects (left, 
right, state) in stream-stream join. It's going to apply the same for non-join 
case, e.g. aggregation against bucket table. other stateful operators will have 
two aspects, (key, state). In short, state must be considered.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR commented on pull request #32875: [SPARK-35703][SQL] Relax constraint for bucket join and remove HashClusteredDistribution

Reply via email to