[
https://issues.apache.org/jira/browse/STORM-150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14333431#comment-14333431
]
ASF GitHub Bot commented on STORM-150:
--------------------------------------
Github user d2r commented on the pull request:
https://github.com/apache/storm/pull/71#issuecomment-75572223
This pull request has been open for most of a year. Should we close this
for now if we are not planning to do anything about it?
> Replace jar distribution strategy with bittorent
> ------------------------------------------------
>
> Key: STORM-150
> URL: https://issues.apache.org/jira/browse/STORM-150
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: James Xu
> Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/435
> Consider using http://turn.github.com/ttorrent/
> ----------
> ptgoetz: I've been looking into implementing this, but have a design question
> that boils down to this:
> Should the client (storm jar command) be the initial seeder, or should we
> wait and let nimbus do the seeding?
> The benefit of doing the seeding client-side is that we would only have to
> transfer a small .torrent through the nimbus thrift API. But I can imagine
> situations where the network environment would prevent BitTorrent clients
> from connecting back to the machine that's submitting the topology. This
> would create an indefinitely "stalled submission" since none of the cluster
> nodes would be able to connect to the seeder.
> The alternative would be to use the current technique of uploading the jar to
> nimbus, and have nimbus generate and distribute the .torrent file, and
> provide the initial seed. If the cluster is properly configured, we're pretty
> much guaranteed connectivity between nimbus and supervisor nodes.
> I'm leaning toward the latter approach, but would be interested in others'
> opinions.
> ----------
> nathanmarz: @ptgoetz I think Nimbus should do the seeding. That ensures that
> when the client finishes submitting, it can disconnect/go away without having
> to worry about making the topology unlaunchable.
> ----------
> jasonjckn: @nathanmarz How does this solve the nimbuses dependency on
> reliable local disk state (as you talked about in person)?
> What happens when zookeeper is offline for 1 hour? All the workers will die,
> and nimbus will be continually restarting. The onus is still on nimbus to
> store topology jars on local disk, so that when the workers and supervisors
> reboots it can seed all this again.
> You -can- solve the local disk persistence problem with replicated state to
> the non-elected nimbuses, but that's orthogonal to a distribution strategy.
> Yes there is some replication going on in bittorrent, but it's not really a
> protocol that delivers reliable persistence of state.
> I think it's still a good feature if it gives us performant topology submit
> times even with 500 workers, which take 3 minutes for us.
> Particularly with the worker heartbeat start-up timeout of 120s, you want to
> be able to start 500 workers within 120s, or even 1500 workers within 120s,
> the current distribution strategy is not scalable in that way.
> ----------
> nathanmarz: @jasonjckn On top of the bittorrent stuff we can ensure that a
> topology is considered submitted only when the artifacts exist on at least N
> nodes. Nimbus would only be the initial seed for topology files. Also, it
> wouldn't have to only be Nimbus that acts as a seed, that work could be
> shared by the supervisors. That's less relevant in the storm-mesos world, but
> you could still fairly easily run multiple Nimbus's to get replication.
> ----------
> jasonjckn: This PR might be aided by "topology packages" #557, as it bundles
> all the state that needs to be replicated.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)