HeartSaVioR edited a comment on pull request #32875: URL: https://github.com/apache/spark/pull/32875#issuecomment-1023682745
> Could you give me a concrete example of this?

Actually it's a hypothetical one; I tried to reproduce it, but you may see #35341, where I failed to reproduce it with built-in sources.

> Currently the rule only skips shuffle in join if both sides report the same distribution

If there is any chance for both sides to change their distributions altogether, the rule will still skip the shuffle in the join, since both sides still report the same distribution. But a stream-stream join must also read from state, which may be partitioned in a different way, and the state partitioning is not flexible. That said, state partitioning cannot be changed during the lifetime of a query, at least for now. We also have no way to maintain information about the state partitioning (we only ensure the number of state partitions, by pinning the config value for the number of "shuffle" partitions), so the safest approach is to constrain the state partitioning to Spark's hash partitioning.

The crux is the hash function: even if the source is partitioned/bucketed on the same criteria (columns), the hash function and the number of partitions must also match Spark's state for stateful operators. So, at least as of now, stateful operators cannot leverage the benefits of source-side distribution.

There is room to improve this, e.g. letting the state follow the distribution of the inputs while also storing the distribution info in metadata, so that the stateful operator can force the state's distribution (triggering a shuffle) if they differ in future runs.

> with the first follow-up by @cloud-fan I think we've already restored the previous behavior.

Could you please point me to the commit/PR for this?
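To illustrate why matching key columns alone is not enough, here is a minimal Python sketch (not Spark code; both hash functions below are arbitrary stand-ins, not Spark's Murmur3-based `HashPartitioning`). It shows that two partitioners over the same key can still route that key to different partitions whenever the hash function or the partition count differs:

```python
import zlib

def source_hash(key: str) -> int:
    # Stand-in for the hash a source used for bucketing (assumption: byte sum).
    return sum(key.encode())

def engine_hash(key: str) -> int:
    # Stand-in for the engine's own hash (assumption: CRC32, just to be stable).
    return zlib.crc32(key.encode())

def partition_for(key: str, hash_fn, num_partitions: int) -> int:
    # Partition assignment depends on BOTH the hash function and the count.
    return hash_fn(key) % num_partitions

for k in ["user-1", "user-2", "user-3"]:
    # Same key, same column, yet the assigned partition can differ; rows would
    # land on tasks whose state store holds a different slice of the keys.
    print(k, partition_for(k, source_hash, 8), partition_for(k, engine_hash, 8))
```

Note that even keeping the hash function fixed, changing only the partition count already moves keys, which is why the state sticks to the configured number of "shuffle" partitions.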
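The improvement idea (persist the state's distribution into metadata, then force a shuffle on mismatch in later runs) could be sketched roughly as below; every name here is hypothetical, none of this is an actual Spark API:

```python
import json
import os
import tempfile

# Hypothetical metadata record describing how the state was partitioned when
# the query first ran. File name and schema are made up for illustration.
def write_state_partitioning(checkpoint_dir: str, info: dict) -> None:
    with open(os.path.join(checkpoint_dir, "state-partitioning.json"), "w") as f:
        json.dump(info, f)

def requires_shuffle(checkpoint_dir: str, current: dict) -> bool:
    """Return True if the current plan's distribution differs from the one
    recorded for the state, in which case the operator must force a shuffle."""
    path = os.path.join(checkpoint_dir, "state-partitioning.json")
    if not os.path.exists(path):
        return False  # first run: state adopts the current input distribution
    with open(path) as f:
        recorded = json.load(f)
    return recorded != current

# First run records hash partitioning on the join keys.
ckpt = tempfile.mkdtemp()
write_state_partitioning(
    ckpt, {"keys": ["userId"], "numPartitions": 200, "hash": "murmur3"})

# A later run whose input arrives with a different partition count must shuffle.
print(requires_shuffle(
    ckpt, {"keys": ["userId"], "numPartitions": 100, "hash": "murmur3"}))
```

This only sketches the comparison step; the hard part in practice is wiring such a check into planning so the stateful operator can require the recorded distribution instead of trusting the child's reported one.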
