[GitHub] [spark] HeartSaVioR edited a comment on pull request #35552: [SPARK-38237][SQL][SS] Introduce a new config to require all cluster keys on Aggregate

GitBox Fri, 18 Feb 2022 15:48:15 -0800


HeartSaVioR edited a comment on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1045348369



   Thanks all for the valuable inputs. I really appreciated all the inputs!
   
   First of all, I have to admit I missed the partial aggregation. My bad.
   
   If I understand correctly, we seem to be on the same page that 
   
   1) There are various cases skew happens before aggregation (e.g. join), and 
ClusteredDistribution wouldn't deal with skew since it is already clustered as 
it is required.
   2) In above cases, there are some cases Spark cannot deal automatically. 
(e.g. insufficient stats)
   3) We do partial aggregate hence simply changing the required child 
distribution to HashClusteredDistribution wouldn't help.
   
   (Please correct me if there is something we don't agree upon.)
   
   I agree this fix doesn't seem to go with right way, but seems like it is 
still valuable to discuss further for the better fix, regardless of the 
condition we address it in this PR or defer to other PR in the future.
   
   A.
   (First of all, sorry for the ignorance. I haven't dealt with pure SQL 
statement, so my question could be very silly.)
   
   I'd like to verify that there are alternatives end users can always leverage 
them for any cases. I agree DataFrame has no problem on this given existence of 
`repartition()`. But how about SQL statement? Do we provide the same for SQL 
statement? Could end users inject the hint anywhere before operator and it will 
take effect on the exact place?
   
   B.
   3) is valid and a great point I totally missed. But then there would be 
another question, "why we do partial aggregate even users express the intention 
they want to do full shuffle because they indicate a skew?" I suspect we have 
to skip it.
   
   If we go with skipping the partial aggregate altogether when end users give 
a hint (e.g. config, or better way if exists), would it work and would it be 
the acceptable solution for us?
   
   C. (out of scope on this PR, but to be future proof)
   Let's expand this to the broader operators leveraging ClusteredDistribution. 
3) doesn't even exist for them and we have a problem there as well. What would 
be the way to fix it if users indicate the issue and want to deal with?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR edited a comment on pull request #35552: [SPARK-38237][SQL][SS] Introduce a new config to require all cluster keys on Aggregate

Reply via email to