James Xu created STORM-151:
------------------------------
Summary: Support multilang in Trident
Key: STORM-151
URL: https://issues.apache.org/jira/browse/STORM-151
Project: Apache Storm (Incubating)
Issue Type: Improvement
Reporter: James Xu
https://github.com/nathanmarz/storm/issues/353
Since it's impractical to have a separate subprocess for every operation packed
into a bolt, I think the best tradeoff is to allow multilang operations that
will just be their own bolt. This is a low-level exposure of the API, and the
multilang program will have to make sure to emit the batch id as the first
field. Multilang operations would be treated similarly to aggregators, in that
the tuples they produce are brand new, with brand new fields.
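For illustration, a minimal sketch of that low-level exposure, assuming the existing
ShellBolt is wired in directly: the multilang operation is its own bolt, and the
wrapped script (the "tridentsplit.rb" name and the "id"/"word" field names here are
made up) is responsible for emitting the batch id as the first field:

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;
import java.util.Map;

// Wraps a multilang script as a standalone bolt, per the proposal above.
public class TridentShellSplit extends ShellBolt implements IRichBolt {
    public TridentShellSplit() {
        super("ruby", "tridentsplit.rb");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The first declared field carries the batch id; the script must emit it first.
        declarer.declare(new Fields("id", "word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}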
----------
colinsurprenant: I am trying to wrap my head around this issue. Would it make
sense to expose classes like ShellFunction, ShellAggregator, ShellFilter, etc?
A topology definition could be something like:
topology.newDRPCStream("words")
    .each(new Fields("args"), new ShellFunction("ruby", "split.rb"), new Fields("word"))
    .groupBy(new Fields("word"))
    .stateQuery(wordCounts, new Fields("word"), new MapGet(), new Fields("count"))
    .each(new Fields("count"), new ShellFilter("ruby", "somefilter.rb"))
    .aggregate(new Fields("count"), new ShellCombinerAggregator("ruby", "someaggregator.rb"), new Fields("sum"));
Then the builder would assign a new shell bolt for each of these shell
operations.
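To make the ShellFunction idea concrete, a hypothetical sketch (none of these classes
exist yet, and the one-line-in / one-tab-separated-line-out framing is just an
assumption for the example):

import backtype.storm.tuple.Values;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.TridentOperationContext;
import storm.trident.tuple.TridentTuple;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Map;

// Hypothetical: delegates each input tuple to a shell subprocess started in prepare().
public class ShellFunction extends BaseFunction {
    private final String[] command;
    private transient BufferedReader in;
    private transient BufferedWriter out;

    public ShellFunction(String... command) {
        this.command = command;
    }

    @Override
    public void prepare(Map conf, TridentOperationContext context) {
        try {
            Process proc = new ProcessBuilder(command).start();
            in = new BufferedReader(new InputStreamReader(proc.getInputStream()));
            out = new BufferedWriter(new OutputStreamWriter(proc.getOutputStream()));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) {
        try {
            // Assumed framing: one input line per tuple, one tab-separated line back;
            // every returned value becomes a new single-field output tuple.
            out.write(tuple.getString(0));
            out.newLine();
            out.flush();
            for (String value : in.readLine().split("\t")) {
                collector.emit(new Values(value));
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Note that this spawns one subprocess per operation instance, which is exactly the
overhead concern raised in the issue description and in the reply below.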
----------
nathanmarz: That would be one way to go about it, though it would add a lot
of overhead from serializing back and forth between Java-land and
multilang-land, especially if you have many operations running in the same bolt
(like 5 eaches in a row). Another way would be to expose a lower-level facility,
where the shell process is a regular bolt and
has full control over the input and output tuples. Whatever fields it emits
will be the fields available in the output tuples. Additionally, one of
Trident's constraints is that the first field of output tuples contain the
"batch id".
----------
colinsurprenant: Right. My proposal is basically what you described as
"impractical to have a separate subprocess for every operation" in the issue
description.
So what you are suggesting is to just expose a ShellBolt, which would basically
respect the same semantics as the general Aggregator operation, in that it can
emit any number of tuples with any number of fields with regard to the input batch.
----------
joshbronson: I'm probably missing something, but would exposing an interface
for an aggregator, and using it carefully to avoid spawning too many processes,
accomplish something similar without having to worry about batch ids?
We've run up against this problem recently and come up with a couple of ways to
get around the many-processes issue. We've considered spawning a server that
would listen on a standard TCP port range or Unix sockets. Is there a place to
hook in, other than the prepare method, to launch such a beast? Other functions
would presumably use their prepare method to discover and connect to the
server, and it would be nice to avoid races or locks.
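One way to avoid those races would be a per-worker, lazily started server process
guarded by a class lock that every function's prepare() calls into; a rough sketch
(the ShellServer name is made up for illustration):

import java.io.IOException;

// Hypothetical per-worker (per-JVM) holder for the shared multilang server process.
public final class ShellServer {
    private static Process serverProcess;

    private ShellServer() {}

    // Called from each function's prepare(); only the first caller in the worker
    // actually spawns the server, later callers reuse it and just connect.
    public static synchronized void ensureStarted(String... command) throws IOException {
        if (serverProcess == null) {
            serverProcess = new ProcessBuilder(command).start();
        }
    }
}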
Also, we're working on an adapter for "BaseFunction" at the moment, though it's
not quite ready for public use, and I wonder if anything will go wrong if a
function continues to emit tuples to its collector after its execute method has
returned. We're operating under the assumption that something will, so we've
required the shell command to tell us when it's done responding to the input
we just gave it. Currently we do that by requiring it to emit a JSON array. If
we go with the server approach, we may also require the script to handle a
tab-delimited ID telling the server to delegate to the appropriate handler for
a function. Presumably this would require a thin library in the languages used to
script Storm. There seems to be an appetite here for open-sourcing the
components required to do this in Ruby; we definitely want to make this easy
for folks.
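A rough sketch of that done-signal convention, assuming an adapter that already holds
a reader on the script's stdout (the ShellResponses name is made up, and checking for
a leading '[' stands in for real JSON parsing):

import backtype.storm.tuple.Values;
import storm.trident.operation.TridentCollector;

import java.io.BufferedReader;
import java.io.IOException;

public final class ShellResponses {
    private ShellResponses() {}

    // Emit one single-field tuple per line until the script signals end-of-output
    // by printing a JSON array.
    public static void emitUntilDone(BufferedReader fromScript, TridentCollector collector)
            throws IOException {
        String line;
        while ((line = fromScript.readLine()) != null) {
            if (line.trim().startsWith("[")) {
                break; // JSON array => the script is done responding to the last input
            }
            collector.emit(new Values(line));
        }
    }
}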
----------
quintona: Hi guys, I need multilang support. From what I gather, the idea would be
to hook the shell bolt into the Trident topology and have it handle (or ignore) the
batch id field?
If that understanding is correct, it raises the interesting question of supporting
arbitrary bolts within a Trident topology.
----------
nathanmarz: Well, the right way to go is to essentially expose the Aggregator
interface via multilang. This is general enough to encompass functions,
filters, and aggregation. We could also allow users to mark multilang
aggregators as committers so that they can achieve exactly-once semantics when
interacting with external state.
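For reference, the Trident Aggregator interface being referred to looks roughly like
this; exposing it via multilang would mean mirroring init/aggregate/complete as
protocol messages (how a multilang aggregator gets marked as a committer is left
open here):

import storm.trident.operation.Operation;
import storm.trident.operation.TridentCollector;
import storm.trident.tuple.TridentTuple;

public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);                    // start of a batch
    void aggregate(T val, TridentTuple tuple, TridentCollector collector); // one call per input tuple
    void complete(T val, TridentCollector collector);                      // end of the batch
}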
----------
quintona: I wasn't intending to allow the updating of external state via a
multilang function. Surely that is a multilang state discussion, which is
fundamentally different? Or am I missing something?