HeartSaVioR commented on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1043865864
My bad, I should have picked the simpler case; the one I described is a complicated one. I'll update the PR description with a simpler one.

There is a much simpler case: suppose a table is hash-partitioned by k1 with a small number of partitions, and the data is skewed. If we run `GROUP BY k1, k2` against the table, Spark does not add a shuffle (as expected), and the query runs extremely slowly. We seem to consider this a "trade-off", but it is not acceptable when it pushes the query's elapsed time from minutes to hours.

We are not yet smart enough to choose the best behavior automatically for a specific query, so we would like to provide a new config that lets end users tune the behavior manually. The new config was missing from this PR (my bad again); I'll add it to the PR.

We'd like to narrow the scope of impact to aggregation for now, since the case is obvious for aggregation. We could consider another config, or a broader scope of impact, if we encounter more cases.
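To make the scenario concrete, here is a minimal sketch (not from the PR): the table name `t`, the skew pattern, and the bucket count of 4 are illustrative assumptions. Bucketing stands in for "hash-partitioned by k1"; with default settings at the time of this PR, the scan's `HashPartitioning(k1)` satisfies the clustered distribution required by `GROUP BY k1, k2`, so no shuffle is planned.

```scala
import org.apache.spark.sql.SparkSession

object SkewedGroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-groupby-sketch")
      .master("local[*]")
      .getOrCreate()

    // Skewed data: almost every row has k1 = 0; only a handful of rows
    // carry distinct k1 values.
    val df = spark.range(1000000L)
      .selectExpr("CASE WHEN id % 100000 = 0 THEN id ELSE 0 END AS k1",
                  "id AS k2")

    // Hash-partition (bucket) the table by k1 into a small number of buckets.
    df.write.bucketBy(4, "k1").mode("overwrite").saveAsTable("t")

    // The scan already reports HashPartitioning on k1, which satisfies the
    // distribution required by the aggregation, so the plan should show no
    // Exchange before the HashAggregate nodes.
    spark.table("t").groupBy("k1", "k2").count().explain()
  }
}
```

With the skew above, the bucket holding `k1 = 0` contains nearly all rows, so the single task reading it dominates the query's runtime; that is exactly the minutes-to-hours regression described above, and what the proposed config would let users avoid by forcing a shuffle on all grouping keys.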
