fangshil edited a comment on issue #20303: [SPARK-23128][SQL] A new approach to 
do adaptive execution in Spark SQL
URL: https://github.com/apache/spark/pull/20303#issuecomment-469151637
 
 
   Excited to see AE making progress in upstream:) We have used the new AE 
framework to add SQL optimization rules and the result looks very promising. We 
have a few comments for this patch in general:
   
   1. The current patch handles shuffle parallelism on reducer side, as it 
starts with a relatively large number of mapper partitions(500), and merge into 
fewer reducer partitions by allowing each reducer to read multiple mappers. For 
large data scale, setting 10K to spark.sql.shuffle.partitions in non-AE VS 
maxNumPostShufflePartitions in AE should have same results since the reducer 
number won't change when data is large. I think with this patch, we haven't got 
the optimal performance since we only save the overhead of launching a certain 
number reduce tasks. A better approach would be dynamically estimating the 
initial/mapper parallelism between 0 and maxNumPostShufflePartitions. This 
should be made possible by AE as well, while this patch should be a solid 
foundation for future improvements. Hope we can merge it soon!
   
   2. This patch uses submitMapStage API. The API would submit each stage as a 
new job, so AE breaks Spark's vanilla definition of a job. This is an issue we 
inherit from the original AE, not originating from this new AE.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to