[
https://issues.apache.org/jira/browse/STORM-151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214148#comment-14214148
]
Jungtaek Lim commented on STORM-151:
------------------------------------
I'm interested in this issue. Is it still valid?
Btw, I read the description of this issue, but I don't understand what Nathan said.
(Currently I can only understand colinsurprenant's first approach, and it could
be valid if we don't think about performance.)
Could you share some sketches if you don't mind?
Thanks in advance!
> Support multilang in Trident
> ----------------------------
>
> Key: STORM-151
> URL: https://issues.apache.org/jira/browse/STORM-151
> Project: Apache Storm
> Issue Type: Improvement
> Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/353
> Since it's impractical to have a separate subprocess for every operation
> packed into a bolt, I think the best tradeoff is to allow multilang
> operations that will just be their own bolt. This is a low level exposure of
> the API and the multilang program will have to make sure to emit the batch id
> as the first field. Multilang operations would be treated similarly to
> aggregators in that the tuples they produce are brand new with brand new
> fields.
> ----------
> colinsurprenant: I am trying to wrap my head around this issue. Would it make
> sense to expose classes like ShellFunction, ShellAggregator, ShellFilter, etc?
> A topology definition could be something like:
> topology.newDRPCStream("words")
>     .each(new Fields("args"), new ShellFunction("ruby", "split.rb"), new Fields("word"))
>     .groupBy(new Fields("word"))
>     .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
>     .each(new Fields("count"), new ShellFilter("ruby", "somefilter.rb"))
>     .aggregate(new Fields("count"), new ShellCombinerAggregator("ruby", "someaggregator.rb"), new Fields("sum"));
> Then the builder would assign a new shell bolt for each of these shell
> operations.
> ----------
> nathanmarz: That would be one way to go about it – though that will have a
> lot of overhead with serializing back and forth between Java-land and
> multilang-land. This is especially true if you have many operations running
> in the same bolt (like 5 eaches in a row). Another way to go about it would
> be to expose a lower level facility, where the shell process is a regular
> bolt and has full control over the input and output tuples. Whatever fields
> it emits will be the fields available in the output tuples. Additionally, one
> of Trident's constraints is that the first field of output tuples contains the
> "batch id".
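The "batch id as the first field" constraint described above can be pictured with a minimal sketch. This is plain Java with no Storm dependency; the class `BatchIdFirst` and the method `withBatchId` are hypothetical names used only for illustration, and the actual tuple layout inside Trident may carry more than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Storm: shows the "batch id first" layout only.
class BatchIdFirst {
    // Builds an output tuple whose first field is the batch id,
    // followed by whatever fields the multilang process chose to emit.
    static List<Object> withBatchId(Object batchId, List<Object> emittedFields) {
        List<Object> out = new ArrayList<>(emittedFields.size() + 1);
        out.add(batchId);
        out.addAll(emittedFields);
        return out;
    }

    public static void main(String[] args) {
        // The multilang side decides the fields; the batch id is always prepended.
        List<Object> tuple = withBatchId(42L, List.of("storm", 3));
        System.out.println(tuple);
    }
}
```

In the lower-level approach, the multilang program itself would be responsible for doing the equivalent of `withBatchId` on every tuple it emits.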
> ----------
> colinsurprenant: Right. My proposal is basically what you described as
> "impractical to have a separate subprocess for every operation" in the issue
> description.
> So what you are suggesting is to just expose a ShellBolt, which would
> basically follow the same semantics as the general Aggregator operation, in
> that it can emit any number of tuples with any number of fields for a given
> input batch.
> ----------
> joshbronson: I'm probably missing something. But would exposing an interface
> for an aggregator and using it carefully to avoid spawning too many processes
> accomplish something similar, without having to worry about batch ids?
> We've run up against this problem recently and come up with a couple of ways
> to get around the many-processes issue. We've considered spawning a server
> that would listen on a standard TCP port range or Unix sockets. Is there a
> place to hook in, other than the prepare method, to launch such a beast?
> Other functions would presumably use their prepare method to discover and
> connect to the server, and it would be nice to avoid races or locks.
> Also, we're working on an adapter for "BaseFunction" at the moment, though
> it's not quite ready for public use, and I wonder if anything will go wrong
> if a function continues to emit tuples to its collector after its execute
> method has returned. We're operating under the assumption that something
> will, so we've required the shell command to tell us when it's done
> responding to the input we just gave it. Currently we do that by requiring it
> to emit a JSON array. If we go with the server approach, we may also require
> the script to handle a tab-delimited ID telling the server to delegate to the
> appropriate handler for a function. Presumably this would require a thin
> library in languages used to script storm. There seems to be an appetite here
> for open-sourcing the components required to do this in Ruby; we definitely
> want to make this easy for folks.
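One way to read the "tell us when it's done" idea above is that the adapter blocks inside execute until it has a complete reply, so nothing is emitted after execute returns. The sketch below is an assumption about how that could look, not joshbronson's adapter: it assumes the script writes exactly one line per input, that the line is a JSON array, and that it contains no raw newlines. `ReplyReader` and `readReply` are hypothetical names, and the script output is simulated with a `StringReader`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical adapter piece, not part of Storm: waits for one complete reply per input.
class ReplyReader {
    // Blocks until the script has written one line that looks like a JSON array,
    // then returns that line without parsing it.
    static String readReply(BufferedReader fromScript) throws IOException {
        String line;
        while ((line = fromScript.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.startsWith("[") && trimmed.endsWith("]")) {
                return trimmed; // reply complete; the caller can now return from execute
            }
            // Anything else (blank lines, stray log output) is skipped.
        }
        throw new IOException("script exited before sending a reply");
    }

    public static void main(String[] args) throws IOException {
        // Simulated script output: a stray log line, then the reply.
        BufferedReader fake = new BufferedReader(
                new StringReader("starting up\n[[\"hello\"],[\"world\"]]\n"));
        System.out.println(readReply(fake));
    }
}
```

A real adapter would read from the subprocess's stdout and parse the array with a JSON library before emitting tuples to the collector.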
> ----------
> quintona: Hi guys, I need multilang support. From what I gather,
> the idea would be to hook the shell bolt into the trident topology, and have
> it ignore and handle the batch id field?
> If that understanding is correct, then it raises the interesting question
> around support for any bolt within a Trident topology.
> ----------
> nathanmarz: Well, the right way to go is to essentially expose the Aggregator
> interface via multilang. This is general enough to encompass functions,
> filters, and aggregation. We could also allow users to mark multilang
> aggregators as committers so that they can achieve exactly-once semantics
> when interacting with external state.
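The claim that an Aggregator-style interface is general enough to cover functions, filters, and aggregation can be illustrated with a small sketch. The `Aggregator` interface below is a simplified stand-in with an init/aggregate/complete shape; it is not the real Trident interface, and `filterPositive`, `splitWords`, `sum`, `tuple`, and `runBatch` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Simplified stand-ins for illustration; the real Trident interfaces differ in detail.
class AggregatorSketch {
    interface Aggregator<S> {
        S init(Object batchId);
        void aggregate(S state, List<Object> tuple, Consumer<List<Object>> emit);
        void complete(S state, Consumer<List<Object>> emit);
    }

    static List<Object> tuple(Object... values) { return List.of(values); }

    // A filter: re-emit the input tuple only when it passes the check.
    static Aggregator<Void> filterPositive() {
        return new Aggregator<Void>() {
            public Void init(Object batchId) { return null; }
            public void aggregate(Void s, List<Object> t, Consumer<List<Object>> emit) {
                if ((Integer) t.get(0) > 0) emit.accept(t);
            }
            public void complete(Void s, Consumer<List<Object>> emit) { }
        };
    }

    // A function: emit zero or more derived tuples per input tuple.
    static Aggregator<Void> splitWords() {
        return new Aggregator<Void>() {
            public Void init(Object batchId) { return null; }
            public void aggregate(Void s, List<Object> t, Consumer<List<Object>> emit) {
                for (String w : ((String) t.get(0)).split(" ")) emit.accept(tuple(w));
            }
            public void complete(Void s, Consumer<List<Object>> emit) { }
        };
    }

    // An aggregation: accumulate per-batch state, emit once when the batch ends.
    static Aggregator<int[]> sum() {
        return new Aggregator<int[]>() {
            public int[] init(Object batchId) { return new int[] {0}; }
            public void aggregate(int[] s, List<Object> t, Consumer<List<Object>> emit) {
                s[0] += (Integer) t.get(0);
            }
            public void complete(int[] s, Consumer<List<Object>> emit) {
                emit.accept(tuple(s[0]));
            }
        };
    }

    // Drives one batch through an aggregator and collects everything it emits.
    static <S> List<List<Object>> runBatch(Aggregator<S> agg, Object batchId,
                                           List<List<Object>> batch) {
        List<List<Object>> out = new ArrayList<>();
        S state = agg.init(batchId);
        for (List<Object> t : batch) agg.aggregate(state, t, out::add);
        agg.complete(state, out::add);
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> numbers = List.of(tuple(3), tuple(-1), tuple(4));
        System.out.println(runBatch(filterPositive(), 1L, numbers));
        System.out.println(runBatch(sum(), 1L, numbers));
        System.out.println(runBatch(splitWords(), 1L, List.of(tuple("a b"))));
    }
}
```

If a multilang process implemented these three callbacks, a single subprocess per operation could express any of the three operation kinds, which seems to be the point of the suggestion.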
> ----------
> quintona: I wasn't intending to allow the updating of external state via a
> multilang function. Surely that is a multilang state discussion, which is
> fundamentally different? Or am I missing something?