[ 
https://issues.apache.org/jira/browse/SPARK-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967193#comment-13967193
 ] 

Mridul Muralidharan commented on SPARK-1453:
--------------------------------------------

The timeout gets hit only when we don't get the requested executors, right? So 
it is more like a max timeout (controlled by the number of times we loop, IIRC).
The reason for keeping it this simple was that we have no guarantees on the 
number of containers that might be available to Spark in a busy cluster: at 
times, it might not be practically possible to get even a fraction of the 
requested nodes (either because the cluster is busy or because of a lack of 
resources), so waiting for all of them could mean waiting forever.

Ideally, I should have exposed the number of containers allocated, so that 
at least user code could use it as an SPI and decide how to proceed in more 
complex cases. I missed that one.

I am not sure which use cases make sense:
a) Wait for X seconds or until all requested containers are allocated.
b) Wait until a minimum of Y containers are allocated (out of X requested).
c) (b) with (a) - that is, a minimum number of containers plus a timeout on 
   that wait.
d) (c), with exit if the minimum number of containers is not allocated?

(d) is something I keep running into (if I don't get my required minimum 
nodes and the job proceeds, I usually end up bringing down those nodes :-( )
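
To make options (a) through (d) above concrete, here is a rough sketch of a 
single wait loop that covers all four. It is plain Python rather than the 
actual Spark/YARN code, and every name in it (wait_for_executors, 
get_allocated, min_required, fail_if_short, and so on) is hypothetical and 
only for illustration. It assumes there is some callable that reports how many 
containers have been allocated so far.

```python
import time
from typing import Callable, Optional


def wait_for_executors(
    get_allocated: Callable[[], int],   # hypothetical: current allocated count
    requested: int,
    min_required: Optional[int] = None,  # Y in options (b)-(d); None means all
    timeout_s: Optional[float] = None,   # X seconds; None means wait forever
    fail_if_short: bool = False,         # option (d): give up instead of proceeding
    poll_interval_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until enough containers are allocated or the timeout expires.

    Returns the allocated count seen when waiting stopped. Raises
    RuntimeError on timeout when fail_if_short is set.
    """
    # Never wait for more containers than were requested.
    target = requested if min_required is None else min(min_required, requested)
    deadline = None if timeout_s is None else clock() + timeout_s

    while True:
        allocated = get_allocated()
        if allocated >= target:
            return allocated
        if deadline is not None and clock() >= deadline:
            break
        sleep(poll_interval_s)

    if fail_if_short:
        raise RuntimeError(
            f"only {allocated} of {target} required containers allocated"
        )
    # Proceed with whatever was allocated, as in options (a) and (c).
    return allocated
```

With this shape, (a) is timeout_s only, (b) is min_required only, (c) is both, 
and (d) is both plus fail_if_short=True. Note that (b) without a timeout can 
block indefinitely on a busy cluster, which is the infinite wait mentioned 
earlier.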

> Improve the way Spark on Yarn waits for executors before starting
> -----------------------------------------------------------------
>
>                 Key: SPARK-1453
>                 URL: https://issues.apache.org/jira/browse/SPARK-1453
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.0.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Currently Spark on Yarn just delays a few seconds between when the Spark 
> context is initialized and when it allows the job to start.  If you are on a 
> busy Hadoop cluster it might take longer to get the number of executors. 
> At the very least we could make this timeout a configurable value.  It's 
> currently hardcoded to 3 seconds.  
> Better yet would be to allow the user to give a minimum number of executors 
> to wait for, but that looks much more complex. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
