manuzhang commented on pull request #29540:
URL: https://github.com/apache/spark/pull/29540#issuecomment-681198316


   @Dooyoung-Hwang Thanks for the comments.
   
   > At first glance, performance is more important than having a small 
number of files.
   > We had better be careful because we don't want any performance regression.
   
   I thought so too, but let me ask: how much of a performance difference counts 
as a regression? Will a user notice a 5-minute increase in running time when the 
whole job takes more than 30 minutes? Probably not, in my experience, but 
**users have immediately reported an increased number of small files compared 
with the previous day, because it impacted their downstream jobs in the 
pipeline**. 
   
   Moreover, my point is about the uncertainty and complexity that the fallback 
mechanism brings in. Even if the job runs slower or crashes due to coalescing, 
we can suggest that the user tune down `advisoryTargetSize` or tune up executor 
memory. Once it works, it's done; the behavior won't change across runs, as it 
does when falling back to default parallelism. With the fallback, by contrast, 
we have to explain why `advisoryTargetSize` isn't taking effect and why today's 
result differs from yesterday's.
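   
   For concreteness, here is a minimal sketch of the tuning I have in mind, 
assuming `advisoryTargetSize` maps to the AQE setting 
`spark.sql.adaptive.advisoryPartitionSizeInBytes` and that the 32MB value is 
purely illustrative:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Sketch: enable AQE with post-shuffle coalescing, and tune *down* the
   // advisory partition size if coalesced partitions overwhelm the executors.
   val spark = SparkSession.builder()
     .appName("coalesce-tuning-sketch") // hypothetical app name
     .config("spark.sql.adaptive.enabled", "true")
     .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
     // Default is 64MB; 32MB here is an illustrative smaller target.
     .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32MB")
     .getOrCreate()
   
   // Executor memory is fixed at launch, so "tune up executor memory" means
   // e.g. `spark-submit --executor-memory 8g ...`, not a runtime conf change.
   ```
   
   Once values like these are settled for a job, they produce the same 
coalescing across runs, which is exactly the predictability I'm after.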

