HeartSaVioR commented on pull request #32875: URL: https://github.com/apache/spark/pull/32875#issuecomment-1022894109
@sunchao Sorry for the post-review. I didn't know this PR may affect streaming query and indicated later. I discussed with @cloud-fan about this change, and we are concerned about any possibility on skipping shuffle against grouping keys in stateful operators, "including stream-stream join". In Structured Streaming, state is partitioned with grouping keys based on Spark's internal hash function, and the number of partition is static. That said, if Spark does not respect the distribution of state against stateful operator, it leads to correctness problem. So please consider that same key is co-located for three aspects (left, right, state) in stream-stream join. It's going to apply the same for non-join case, e.g. aggregation against bucket table. other stateful operators will have two aspects, (key, state). In short, state must be considered. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
