HeartSaVioR commented on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1043865864
My bad, I should have picked the simpler case; the one I described is a complicated one. I'll update the PR description with a simpler one.

There is a much simpler case: suppose a table is hash-partitioned by k1 with a small number of partitions, and the data is skewed. If we run `GROUP BY k1, k2` against the table, Spark does not add a shuffle (as expected), and the query runs extremely slowly. We seem to consider this a "trade-off", but it is not acceptable when it pushes the query's elapsed time from minutes to hours.

We are not yet smart enough to choose the best behavior automatically for a specific query, so we would like to provide a new config that lets end users tune the behavior manually. The new config was missing from this PR (my bad again); I'll add it to the PR.

We'd like to narrow the scope of impact to aggregation for now, since the case is obvious for aggregation. We could consider another config, or a broader scope of impact, if we encounter more cases.
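To make the scenario concrete, here is a minimal sketch (not from the PR): the table name `t`, the skew pattern, and the bucket count of 4 are illustrative assumptions. Bucketing stands in for "hash-partitioned by k1"; with default settings at the time of this PR, the scan's `HashPartitioning(k1)` satisfies the clustered distribution required by `GROUP BY k1, k2`, so no shuffle is planned.

```scala
import org.apache.spark.sql.SparkSession

object SkewedGroupBySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("skewed-groupby-sketch")
      .master("local[*]")
      .getOrCreate()

    // Skewed data: almost every row has k1 = 0; only a handful of rows
    // carry distinct k1 values.
    val df = spark.range(1000000L)
      .selectExpr("CASE WHEN id % 100000 = 0 THEN id ELSE 0 END AS k1",
                  "id AS k2")

    // Hash-partition (bucket) the table by k1 into a small number of buckets.
    df.write.bucketBy(4, "k1").mode("overwrite").saveAsTable("t")

    // The scan already reports HashPartitioning on k1, which satisfies the
    // distribution required by the aggregation, so the plan should show no
    // Exchange before the HashAggregate nodes.
    spark.table("t").groupBy("k1", "k2").count().explain()
  }
}
```

With the skew above, the bucket holding `k1 = 0` contains nearly all rows, so the single task reading it dominates the query's runtime; that is exactly the minutes-to-hours regression described above, and what the proposed config would let users avoid by forcing a shuffle on all grouping keys.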
