squito commented on issue #24817: [WIP][SPARK-27963][core] Allow dynamic 
allocation without a shuffle service.
URL: https://github.com/apache/spark/pull/24817#issuecomment-504584260
 
 
   > What if the executor is currently idle (no active job, so it can be 
removed), but later jobs depend on the previous shuffle stage? From my 
understanding, that job will fail to fetch the shuffle data and rerun the 
parent stages. My feeling is that if the idle time is short, this exception will 
be quite common and confuse users.
   
   I think that is correct. To be clear, this isn't meant to be a perfect 
replacement for dynamic allocation + external shuffle service, but it's a 
reasonable heuristic. It might make sense to add a separate config for the 
shuffle timeout (longer than the cache timeout), since recomputing lost shuffle 
data is actually much more expensive.
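
   A rough sketch of how a separate shuffle timeout could feed into the removal 
decision. This is a standalone toy model, not code from the PR: 
`ExecutorState`, `removal_timeout`, `can_remove` and every default value are 
hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutorState:
    """Hypothetical view of one executor as seen by the allocation logic."""
    idle_seconds: float
    has_cached_blocks: bool = False
    has_shuffle_data: bool = False


def removal_timeout(state, idle_timeout=60.0, cache_timeout=600.0,
                    shuffle_timeout=3600.0):
    """Return how long the executor must stay idle before it may be removed.

    All three defaults are made-up values. The only point is that the
    shuffle timeout is the longest, because lost shuffle data is costly
    to recompute.
    """
    timeout = idle_timeout
    if state.has_cached_blocks:
        timeout = max(timeout, cache_timeout)
    if state.has_shuffle_data:
        timeout = max(timeout, shuffle_timeout)
    return timeout


def can_remove(state, **timeouts):
    """True once the executor has been idle for at least its timeout."""
    return state.idle_seconds >= removal_timeout(state, **timeouts)
```

   With these assumed defaults, an executor idle for two minutes is removable 
if it holds nothing, but not if it still holds shuffle data.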
   
   I'm actually more worried about the opposite problem: that it will not scale 
down aggressively enough. Take a long ETL job with an early stage that is really 
big (say it uses 1000 executors), followed by many stages that need only a small 
number of executors (say 10). You'll hold on to all 1000 executors for the whole 
job. But I don't think you can do any better without having something else to 
serve that shuffle data (a la SPARK-25299).
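
   To put numbers on that scenario, here is a deliberately simplified model. 
It assumes every executor used by a stage ends up holding shuffle output that 
stays live until the job finishes; `executors_held` and its parameters are 
hypothetical and do not correspond to anything in Spark.

```python
def executors_held(stage_demands, shuffle_pinned):
    """Return the number of executors held during each stage of one job.

    stage_demands:  executors each stage actually needs, in order.
    shuffle_pinned: if True, executors cannot be released while the job
                    still needs their shuffle output, so the held count
                    never drops below the peak seen so far.
    """
    held = []
    peak = 0
    for demand in stage_demands:
        peak = max(peak, demand)
        held.append(peak if shuffle_pinned else demand)
    return held


# One big early stage followed by three small ones.
demands = [1000, 10, 10, 10]
pinned = executors_held(demands, shuffle_pinned=True)
unpinned = executors_held(demands, shuffle_pinned=False)
```

   Under this assumption the pinned case holds 1000 executors in every stage, 
while something else serving the shuffle data would let later stages drop to 10.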

