squito commented on issue #24817: [WIP][SPARK-27963][core] Allow dynamic allocation without a shuffle service.
URL: https://github.com/apache/spark/pull/24817#issuecomment-504584260

> What if the executor is idle currently (no active job, can be removed), but later on the jobs depend on the previous shuffle stage? From my understanding that job will fail to fetch the shuffle data and rerun the parent stages. My feeling is that if the idle time is short, this exception will be quite common and confuse the users.

I think that is correct. To be clear, this isn't meant to be a perfect replacement for dynamic allocation + external shuffle service, but it's a reasonable heuristic. It might make sense to add a separate config for the shuffle timeout (longer than the cache timeout), since recomputing lost shuffle data is actually much more expensive.

I'm actually more worried about the opposite problem: that it will not scale down aggressively enough. If a long ETL job has an early stage which is really big (say it uses 1000 executors), followed by many stages which need only a small number of executors (say 10 executors), you'll hold on to all 1000 executors for the whole job. But I don't think you can do any better without having something else to serve that shuffle data (a la SPARK-25299).
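
A rough sketch of the timeout heuristic discussed above, to make the trade-off concrete. Everything here is an assumption for illustration: the class, the function names, and the timeout values are hypothetical and are not taken from the PR or from Spark's implementation. The idea is only that an idle executor is held for the longest timeout that applies to the data it still stores, with shuffle data held longer than cached blocks.

```python
from dataclasses import dataclass


@dataclass
class ExecutorState:
    """Hypothetical snapshot of what an idle executor is still holding."""
    idle_seconds: float
    has_cached_blocks: bool = False
    has_shuffle_data: bool = False


def idle_timeout(state, base=60.0, cache=600.0, shuffle=3600.0):
    """Return the longest timeout that applies to the data the executor holds.

    The default values are made up; the only property that matters for the
    sketch is base < cache < shuffle.
    """
    timeout = base
    if state.has_cached_blocks:
        timeout = max(timeout, cache)
    if state.has_shuffle_data:
        timeout = max(timeout, shuffle)
    return timeout


def can_remove(state, **timeouts):
    """An executor is removable once it has been idle for its full timeout."""
    return state.idle_seconds >= idle_timeout(state, **timeouts)


def removable_count(executors, **timeouts):
    """Count how many executors in a collection could be released right now."""
    return sum(1 for e in executors if can_remove(e, **timeouts))
```

Under these assumed values, 1000 executors that each hold shuffle output from an early stage and have been idle for ten minutes give `removable_count(...) == 0`, which is the scale-down problem described above: nothing is released until the shuffle timeout expires or something else serves that shuffle data.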
