c21 commented on pull request #35552: URL: https://github.com/apache/spark/pull/35552#issuecomment-1045502055
> the same problem also exists in join(t1.x = t2.x) followed by window(t1.x, t1.y) or join(t1.x = t3.x and t1.y = t3.y). Note that AQE doesn't have a chance to kick in because there's no shuffle between those operators.

@sigmod - I concur that we don't have an existing way to work around this issue if you cannot change the query (as you said, for auto-generated queries in reporting or visualization software). If you can change the query, though, you can add a `DISTRIBUTE BY (x, y)` between the `join` and the `window` to work around it.

> I'd like to verify that there are alternatives end users can always leverage for any case. I agree DataFrame has no problem here given the existence of repartition(). But how about SQL statements? Do we provide the same for SQL statements? Could end users inject the hint anywhere before an operator and have it take effect at the exact place?

@HeartSaVioR - besides the hint @sunchao mentioned in https://github.com/apache/spark/pull/35552#issuecomment-1045438357, you can add a `DISTRIBUTE BY (columns)` clause when you want to repartition within a SQL query - https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html .

> is valid and a great point I totally missed. But then there would be another question: "why do we do a partial aggregate even when users express the intention that they want a full shuffle because they indicate a skew?" I suspect we have to skip it. If we go with skipping the partial aggregate altogether when end users give a hint (e.g. a config, or a better way if one exists), would it work, and would it be an acceptable solution for us?

I feel that introducing yet another manually tuned config to let users disable the partial aggregate does not work at scale. I am in favor of the query engine adaptively making this kind of optimization under the hood, instead of leaving users to tune it. A better approach, I feel, is to adaptively disable the partial aggregate at runtime if the reduction ratio is low - https://github.com/apache/spark/pull/28804#issuecomment-854089520 .

> Let's expand this to the broader operators leveraging ClusteredDistribution. 3) doesn't even exist for them and we have a problem there as well. What would be the way to fix it if users indicate the issue and want to deal with it?

I think data skew can happen, and we should allow users to work around it. But introducing `HashClusteredDistribution` back feels like overkill; how about the approach in https://github.com/apache/spark/pull/35574 instead? Would love more opinions on it.
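For illustration, here is a minimal sketch of the `DISTRIBUTE BY` workaround described above. The tables (`t1`, `t2`) and columns (`x`, `y`, `v`) are hypothetical stand-ins; the point is that `DISTRIBUTE BY x, y` in the subquery forces a shuffle on both keys before the window, rather than letting the window reuse the join output partitioned on `x` alone:

```scala
// Hypothetical tables t1(x, y, v) and t2(x); `spark` is a SparkSession.
// DISTRIBUTE BY x, y in the subquery inserts an explicit repartition on
// (x, y), so the window does not reuse the join's partitioning on x only.
val result = spark.sql("""
  SELECT x, y,
         SUM(v) OVER (PARTITION BY x, y) AS total
  FROM (
    SELECT t1.x, t1.y, t1.v
    FROM t1 JOIN t2 ON t1.x = t2.x
    DISTRIBUTE BY x, y
  ) sub
""")
```

The `/*+ REPARTITION(x, y) */` hint referenced above is another way to request the same shuffle from inside the SELECT clause.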
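On the DataFrame side, the equivalent escape hatch is `repartition()`, as the quoted comment notes. A minimal sketch, again with hypothetical DataFrames and column names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical DataFrames t1(x, y, v) and t2(x). The explicit repartition
// on (x, y) guarantees a shuffle between the join and the window, so skew
// on x alone cannot carry through to the window operator.
val joined = t1.join(t2, t1("x") === t2("x"))
  .select(t1("x"), t1("y"), t1("v"))
val result = joined
  .repartition(col("x"), col("y"))
  .withColumn("total", sum(col("v")).over(Window.partitionBy("x", "y")))
```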
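To make the "adaptively disable partial aggregate" idea concrete: the sketch below is not Spark's implementation (the linked PR apache/spark#28804 contains the actual proposal), just an illustration of a runtime reduction-ratio check, with the sampling window and threshold as made-up parameters:

```scala
// Illustrative only -- not Spark internals. After consuming the first
// `sampleRows` input rows, compare distinct groups seen to rows consumed;
// if grouping barely reduces the data, stop partially aggregating and
// stream rows straight to the shuffle instead.
val sampleRows = 100000          // hypothetical sampling window
val minReduction = 0.5           // hypothetical threshold

def keepPartialAgg(rowsConsumed: Long, groupsSeen: Long): Boolean = {
  if (rowsConsumed < sampleRows) true   // not enough evidence yet
  else {
    val reductionRatio = 1.0 - groupsSeen.toDouble / rowsConsumed
    reductionRatio >= minReduction      // e.g. at least 2x fewer rows out
  }
}
```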
