manuzhang edited a comment on pull request #29540:
URL: https://github.com/apache/spark/pull/29540#issuecomment-681198316


   @Dooyoung-Hwang Thanks for the comments.
   
   > At first glance, performance is more important than having a small number of files.
   > We had better be careful because we don't want any performance regression.
   
   I thought so too, but let me ask: how much of a performance difference counts as a regression? Will a user notice a 5-minute increase in running time when the whole job takes more than 30 minutes? Probably not, in my experience, but **users have immediately reported an increased number of small files compared with the previous day, because it had an impact on their downstream jobs in the pipeline**.
   
   Moreover, I'm pointing out the uncertainty and complexity the fallback mechanism brings in. Even if the job runs slower, or even crashes, due to coalescing, we can suggest that the user tune down `advisoryTargetSize` or tune up executor memory. Once it works, it's done; it won't change across runs, as it does when falling back to the default parallelism. With the fallback, meanwhile, we have to explain why `advisoryTargetSize` isn't taking effect and why today's result differs from yesterday's.
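   
   To make the tuning concrete, here is a minimal sketch (illustrative only, assuming `advisoryTargetSize` corresponds to the public config `spark.sql.adaptive.advisoryPartitionSizeInBytes`, with arbitrary example values):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Illustrative values only: tune the advisory partition size down, or
   // executor memory up, until coalesced partitions fit comfortably.
   val spark = SparkSession.builder()
     .appName("aqe-coalesce-tuning")
     .config("spark.sql.adaptive.enabled", "true")
     .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
     // Smaller advisory size => smaller (but more) coalesced partitions.
     .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
     // Alternatively, give each executor more headroom.
     .config("spark.executor.memory", "8g")
     .getOrCreate()
   
   // Once these values work, they keep working: the partition sizes they
   // produce don't vary from run to run the way a fallback to the default
   // parallelism can.
   ```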

