GitHub user nchammas commented on the pull request:

    https://github.com/apache/spark/pull/2339#issuecomment-55482806
  
    > Do you have any idea how long it takes to fork the sub-process and SSH 
into the machine?
    
    Ah, this is a valid concern. I've tested this by launching 50-node 
clusters, but not with, say, a 500-node cluster. 
    
    `all()` is [short-circuit evaluated](http://bugs.python.org/issue17255), so 
[this line of 
code](https://github.com/apache/spark/pull/2339/files#diff-ada66bbeb2f1327b508232ef6c3805a5R637)
 will only fork one more process than the number of nodes that have SSH 
available. So in your example, if I'm launching a 300-node cluster and only 10 
of them have SSH available when I test, I'll only fork 11 processes, assuming 
I'm lucky enough to hit the 10 nodes with SSH available first.
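    
    A minimal sketch of the counting argument above. The `is_ssh_available` helper and the `node-N` host names are hypothetical stand-ins for the real SSH probe, which would fork a subprocess; only the number of probes matters here. Note that the early exit depends on passing a generator expression to `all()`, not a list.

```python
# Hypothetical stand-in for the real probe; records each host it is asked about.
calls = []

def is_ssh_available(host, ready_hosts):
    calls.append(host)
    return host in ready_hosts

hosts = ["node-%d" % i for i in range(300)]
ready = set(hosts[:10])  # assume the first 10 hosts already accept SSH

# Generator expression: all() consumes results one at a time and
# stops at the first False, so only 10 + 1 hosts are probed.
cluster_ready = all(is_ssh_available(h, ready) for h in hosts)
probes_lazy = len(calls)

del calls[:]

# List comprehension: every host is probed before all() runs,
# so short-circuiting saves nothing.
cluster_ready_eager = all([is_ssh_available(h, ready) for h in hosts])
probes_eager = len(calls)

print(probes_lazy, probes_eager)
```

    If the ready hosts are not at the front of the list, the generator version stops even earlier, at the first host that is not ready.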
    
    To be extra safe, I can rewrite this `all()` call as an explicit loop, 
since the short-circuiting behavior is not guaranteed on Python 2.6.
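    
    One way the explicit loop could look, as a sketch only. The function name `all_hosts_ssh_ready` and the `probe` callable are assumptions for illustration, not names from the pull request. The early `return` makes the stop-at-first-failure behavior visible in the code instead of relying on `all()`.

```python
def all_hosts_ssh_ready(hosts, probe):
    """Return True only if probe(host) succeeds for every host.

    Stops at the first host that fails, so at most one failing
    probe is ever issued per call.
    """
    for host in hosts:
        if not probe(host):
            return False
    return True


# Usage with a hypothetical probe that records what it was asked.
probed = []

def fake_probe(host):
    probed.append(host)
    return host != "node-2"  # pretend node-2 is not up yet

result = all_hosts_ssh_ready(["node-0", "node-1", "node-2", "node-3"], fake_probe)
```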
    
    In addition to that, I can implement a simple, linear backoff on the SSH 
testing. For example, test SSH every `3 * num_attempts` seconds.
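    
    A rough sketch of that linear backoff, assuming a hypothetical `check_cluster` callable that returns `True` once every node accepts SSH. The `max_attempts`, `step_seconds`, and `sleep` parameters are illustrative; `sleep` is injectable only so the delays can be inspected without actually waiting.

```python
import time


def wait_for_ssh(check_cluster, max_attempts=10, step_seconds=3, sleep=time.sleep):
    """Poll check_cluster(), waiting step_seconds * num_attempts between tries.

    Returns True as soon as a check succeeds, or False once
    max_attempts checks have all failed.
    """
    for num_attempts in range(1, max_attempts + 1):
        if check_cluster():
            return True
        # No point in sleeping after the final failed attempt.
        if num_attempts < max_attempts:
            sleep(step_seconds * num_attempts)
    return False


# Usage: the cluster becomes ready on the fourth check.
delays = []
outcomes = iter([False, False, False, True])
ready = wait_for_ssh(lambda: next(outcomes), sleep=delays.append)
```

    With the defaults, the waits grow as 3, 6, 9, ... seconds, so a cluster that comes up quickly is detected quickly while a slow one is polled less and less often.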
    
    How does that sound? Hopefully not too complex.
    
    > And I'm not sure whether it's too big of a deal.
    
    This is definitely a convenience feature. But I can share from my own 
experience of regularly spinning up 20-50 node clusters with `spark-ec2` that I 
often find myself restarting the launch with `--resume` because SSH took too 
long to come online, or I find myself waiting impatiently because I think I set 
`--wait` to too high a value. 
[Others](http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3ccank3dlkzlt2wtugo6oavpw2ckgfbhkbmdjxxtr7_ccvhdb8...@mail.gmail.com%3E) 
[have](http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAPtvcLjVPMrRuCH2+_xkRSxp1-=u-oxp+c32sj3rxxxvzw6...@mail.gmail.com%3E) 
posted to the user list in confusion, thinking that something is broken, when 
it is just that they didn't know to `--wait` long enough.
    
    It would be nice if `spark-ec2` just took care of this detail for the user.

