GitHub user nchammas commented on the pull request:
https://github.com/apache/spark/pull/2339#issuecomment-55482806
> Do you have any idea how long it takes to fork the sub-process and SSH
> into the machine?

Ah, this is a valid concern. I've tested this with launching 50-node
clusters, but not, say, with a 500-node cluster.
`all()` is [short-circuit evaluated](http://bugs.python.org/issue17255), so
[this line of
code](https://github.com/apache/spark/pull/2339/files#diff-ada66bbeb2f1327b508232ef6c3805a5R637)
will fork at most one more process than the number of nodes that have SSH
available. So in your example, if I'm launching a 300-node cluster and only 10
of them have SSH available when I test, I'll fork at most 11 processes. That is
the worst case, where the 10 nodes with SSH available happen to be checked
first; if an unavailable node is checked earlier, the test stops right there.
To be extra safe, I can rewrite this `all()` statement as an explicit loop
since the short-circuiting behavior is not guaranteed on Python 2.6.
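As a rough sketch of what the explicit loop could look like (the names `all_ssh_available`, `is_ssh_available`, and `fake_check` are hypothetical and are not the code in the PR; the stand-in check only records which hosts were probed instead of forking `ssh`):

```python
def all_ssh_available(hosts, is_ssh_available):
    """Return True only if every host passes the check.

    The loop returns at the first failing host, so no check (and therefore
    no forked process) is spent on the hosts after it.
    """
    for host in hosts:
        if not is_ssh_available(host):
            return False
    return True


# Hypothetical stand-in for the real SSH test: it records each probed host
# and pretends that node-3 is not reachable yet.
probed = []


def fake_check(host):
    probed.append(host)
    return host != "node-3"


hosts = ["node-%d" % i for i in range(1, 301)]
result = all_ssh_available(hosts, fake_check)
```

With this stand-in, only the first three hosts are probed even though the list holds 300, which is the behavior I'm relying on from `all()`.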
In addition to that, I can implement a simple, linear backoff on the SSH
testing. For example, test SSH every `3 * num_attempts` seconds.
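Roughly, and only as a sketch under assumptions: `wait_for_ssh`, `check_all`, `max_attempts`, and `step_seconds` are hypothetical names, and the cap on attempts is my own addition for illustration rather than something already in the PR.

```python
import time


def wait_for_ssh(check_all, max_attempts=10, step_seconds=3, sleep=time.sleep):
    """Poll check_all() until it returns True, backing off linearly.

    After the n-th failed attempt, wait step_seconds * n seconds
    (3s, 6s, 9s, ... with the defaults). Returns the number of attempts
    used, or raises RuntimeError once max_attempts is exhausted.
    max_attempts is an assumed safety cap, not part of the original proposal.
    """
    num_attempts = 0
    while True:
        num_attempts += 1
        if check_all():
            return num_attempts
        if num_attempts >= max_attempts:
            raise RuntimeError(
                "SSH not available after %d attempts" % max_attempts)
        sleep(step_seconds * num_attempts)
```

The `sleep` parameter is only there so the delay can be swapped out when trying the function without actually waiting.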
How does that sound? Hopefully not too complex.
> And I'm not sure whether it's too big of a deal.

This is definitely a convenience feature. But I can share from my own
experience of regularly spinning up 20-50 node clusters with `spark-ec2` that I
often find myself restarting the launch with `--resume` because SSH took too
long to come online, or I find myself waiting impatiently because I think I set
`--wait` to too high a value.
[Others](http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3ccank3dlkzlt2wtugo6oavpw2ckgfbhkbmdjxxtr7_ccvhdb8...@mail.gmail.com%3E)
[have](http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCAPtvcLjVPMrRuCH2+_xkRSxp1-=u-oxp+c32sj3rxxxvzw6...@mail.gmail.com%3E)
posted to the user list in confusion, thinking that something is broken, when
it is just that they didn't know to `--wait` long enough.
It would be nice if `spark-ec2` just took care of this detail for the user.