fangshil edited a comment on issue #20303: [SPARK-23128][SQL] A new approach to do adaptive execution in Spark SQL URL: https://github.com/apache/spark/pull/20303#issuecomment-469151637 Excited to see AE making progress in upstream:) We have used the new AE framework to add SQL optimization rules and the result looks very promising. We have a few comments for this patch in general: 1. The current patch handles shuffle parallelism on reducer side, as it starts with a relatively large number of mapper partitions(500), and merge into fewer reducer partitions by allowing each reducer to read multiple mappers. For large data scale, setting 10K to spark.sql.shuffle.partitions in non-AE VS maxNumPostShufflePartitions in AE should have same results since the reducer number won't change when data is large. I think with this patch, we haven't got the optimal performance since we only save the overhead of launching a certain number reduce tasks. A better approach would be dynamically estimating the initial/mapper parallelism between 0 and maxNumPostShufflePartitions. This should be made possible by AE as well, while this patch should be a solid foundation for future improvements. Hope we can merge it soon! 2. This patch uses submitMapStage API. The API would submit each stage as a new job, so AE breaks Spark's vanilla definition of a job. This is an issue we inherit from the original AE, not originating from this new AE.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
