c21 commented on pull request #35552: URL: https://github.com/apache/spark/pull/35552#issuecomment-1045502055
> the same problem also exists in join(t1.x = t2.x) followed by window(t1.x, t1.y) or join(t1.x = t3.x and t1.y = t3.y). Note that AQE doesn't have a chance to kick in because there's no shuffle between those operators.

@sigmod - I concur that we don't have an existing way to work around this issue if you cannot change the query (as you said, for auto-generated queries in reporting or visualization software). If you can change the query, though, you can add a `DISTRIBUTE BY (x, y)` between the `join` and the `window` to work around it.

> I'd like to verify that there are alternatives end users can always leverage for any case. I agree DataFrame has no problem here given the existence of repartition(). But how about SQL statements? Do we provide the same for SQL statements? Could end users inject the hint anywhere before an operator and have it take effect at the exact place?

@HeartSaVioR - besides the hint @sunchao mentioned in https://github.com/apache/spark/pull/35552#issuecomment-1045438357, you can add a `DISTRIBUTE BY (columns)` clause when you want to repartition within a SQL query - https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-distribute-by.html .

> is valid and a great point I totally missed. But then there would be another question: "why do we do a partial aggregate even when users express the intention that they want a full shuffle because they indicate a skew?" I suspect we have to skip it. If we go with skipping the partial aggregate altogether when end users give a hint (e.g. a config, or a better way if one exists), would it work, and would it be an acceptable solution for us?

I feel that introducing yet another manually tuned config to let users disable the partial aggregate does not work at scale. I am in favor of the query engine adaptively making this kind of optimization under the hood, instead of leaving users to tune it. A better approach, I feel, is to adaptively disable the partial aggregate at runtime if the reduction ratio is low - https://github.com/apache/spark/pull/28804#issuecomment-854089520 .

> Let's expand this to the broader operators leveraging ClusteredDistribution. 3) doesn't even exist for them and we have a problem there as well. What would be the way to fix it if users indicate the issue and want to deal with it?

I think data skew can happen, and we should allow users to work around it. But introducing `HashClusteredDistribution` back feels like overkill; how about the approach in https://github.com/apache/spark/pull/35574 instead? Would love more opinions on it.
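For illustration, here is a minimal sketch of the `DISTRIBUTE BY` workaround described above. The tables (`t1`, `t2`) and columns (`x`, `y`, `v`) are hypothetical stand-ins; the point is that `DISTRIBUTE BY x, y` in the subquery forces a shuffle on both keys before the window, rather than letting the window reuse the join output partitioned on `x` alone:

```scala
// Hypothetical tables t1(x, y, v) and t2(x); `spark` is a SparkSession.
// DISTRIBUTE BY x, y in the subquery inserts an explicit repartition on
// (x, y), so the window does not reuse the join's partitioning on x only.
val result = spark.sql("""
  SELECT x, y,
         SUM(v) OVER (PARTITION BY x, y) AS total
  FROM (
    SELECT t1.x, t1.y, t1.v
    FROM t1 JOIN t2 ON t1.x = t2.x
    DISTRIBUTE BY x, y
  ) sub
""")
```

The `/*+ REPARTITION(x, y) */` hint referenced above is another way to request the same shuffle from inside the SELECT clause.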
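On the DataFrame side, the equivalent escape hatch is `repartition()`, as the quoted comment notes. A minimal sketch, again with hypothetical DataFrames and column names:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

// Hypothetical DataFrames t1(x, y, v) and t2(x). The explicit repartition
// on (x, y) guarantees a shuffle between the join and the window, so skew
// on x alone cannot carry through to the window operator.
val joined = t1.join(t2, t1("x") === t2("x"))
  .select(t1("x"), t1("y"), t1("v"))
val result = joined
  .repartition(col("x"), col("y"))
  .withColumn("total", sum(col("v")).over(Window.partitionBy("x", "y")))
```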
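To make the "adaptively disable partial aggregate" idea concrete: the sketch below is not Spark's implementation (the linked PR apache/spark#28804 contains the actual proposal), just an illustration of a runtime reduction-ratio check, with the sampling window and threshold as made-up parameters:

```scala
// Illustrative only -- not Spark internals. After consuming the first
// `sampleRows` input rows, compare distinct groups seen to rows consumed;
// if grouping barely reduces the data, stop partially aggregating and
// stream rows straight to the shuffle instead.
val sampleRows = 100000          // hypothetical sampling window
val minReduction = 0.5           // hypothetical threshold

def keepPartialAgg(rowsConsumed: Long, groupsSeen: Long): Boolean = {
  if (rowsConsumed < sampleRows) true   // not enough evidence yet
  else {
    val reductionRatio = 1.0 - groupsSeen.toDouble / rowsConsumed
    reductionRatio >= minReduction      // e.g. at least 2x fewer rows out
  }
}
```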
