manuzhang edited a comment on pull request #29540: URL: https://github.com/apache/spark/pull/29540#issuecomment-681198316
@Dooyoung-Hwang Thanks for the comments.

> At the first glance, performance is more important than having small number of files.
> We had better be careful because we don't want any performance regression.

I thought so too, but let me ask: how much of a performance difference counts as a regression? Will a user notice a 5-minute increase in running time when the whole job takes more than 30 minutes? Probably not, in my experience, but **users have immediately reported an increased number of small files compared with the previous day, because it impacted their downstream jobs in the pipeline**.

Moreover, I'm concerned about the uncertainty and complexity the fallback mechanism brings in. Even if the job runs slower or even crashes because of coalescing, we can suggest that the user tune down `advisoryTargetSize` or tune up executor memory. Once it works, it's done; it won't change across runs, as it does when falling back to the default parallelism. Otherwise, we have to explain why `advisoryTargetSize` isn't taking effect and why the behavior differs from yesterday's.
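
To make the tuning suggestion concrete, here is a minimal sketch, assuming Spark 3.x AQE and that `advisoryTargetSize` in the code maps to the public conf `spark.sql.adaptive.advisoryPartitionSizeInBytes`; the concrete values are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the knobs discussed above, not a recommended setting.
val spark = SparkSession.builder()
  .appName("coalesce-tuning-sketch")
  // Enable AQE and post-shuffle partition coalescing.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // If coalesced partitions come out too large (slow tasks or OOM),
  // tune the advisory target size down ...
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "32m")
  // ... or give each executor more memory instead (must be set before
  // the application starts, e.g. via spark-submit --conf).
  .config("spark.executor.memory", "8g")
  .getOrCreate()
```

Once a combination like this works for a job, it keeps producing the same partitioning run after run, which is the predictability I'm arguing for.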
