James Xu created STORM-151:
------------------------------

             Summary: Support multilang in Trident
                 Key: STORM-151
                 URL: https://issues.apache.org/jira/browse/STORM-151
             Project: Apache Storm (Incubating)
          Issue Type: Improvement
            Reporter: James Xu


https://github.com/nathanmarz/storm/issues/353

Since it's impractical to have a separate subprocess for every operation packed 
into a bolt, I think the best tradeoff is to allow multilang operations that 
will just be their own bolt. This is a low level exposure of the API and the 
multilang program will have to make sure to emit the batch id as the first 
field. Multilang operations would be treated similar to aggregators in that the 
tuples they produce are brand new with brand new fields.

----------
colinsurprenant: I am trying to wrap my head around this issue. Would it make 
sense to expose classes like ShellFunction, ShellAggregator, ShellFilter, etc?

A topology definition could be something like:

topology.newDRPCStream("words")
      .each(new Fields("args"), new ShellFunction("ruby", "split.rb"), new 
Fields("word"))
      .groupBy(new Fields("word"))
      .stateQuery(wordCounts, new Fields("word"), new MapGet(), new 
Fields("count"))
      .each(new Fields("count"), new ShellFilter("ruby", "somefilter.rb"))
      .aggregate(new Fields("count"), new ShellCombinerAggregator("ruby", 
"someaggregator.rb"), new Fields("sum"));

Then the builder would assign a new shell bolt for each of these shell 
operations.

----------
nathanmarz: That would be one way to go about it – though that will have a lot 
of overhead with serializing back and forth between Java-land and 
multilang-land. This is especially true if you have many operations running in 
the same bolt (like 5 eaches in a row). Another way to go about it would be to 
expose a lower level facility, where the shell process is a regular bolt and 
has full control over the input and output tuples. Whatever fields it emits 
will be the fields available in the output tuples. Additionally, one of 
Trident's constraints is that the first field of output tuples contain the 
"batch id".

----------
colinsurprenant:Right. My proposition is basically what you described as 
"impractical to have a separate subprocess for every operation" in the issue 
description.

So what you are suggesting is to just expose a ShellBolt, which would basically 
respect the same semantic as the general Aggregator operation in that it can 
emit any number of tuples with any number of fields in regard to to input batch.

----------
joshbronson: I'm probably missing something. But would exposing an interface 
for an aggregator and using it carefully to avoid spawning too many processes 
accomplish something similar, without having to worry about batch ids?

We've run up against this problem recently and come up with a couple of ways to 
get around the many-processes issue. We've considered spawning a server that 
would listen on a standard TCP port range or Unix sockets. Is there a place to 
hook in, other than the prepare method, to launch such a beast? Other functions 
would presumably use their prepare method to discover and connect to the 
server, and it would be nice to avoid races or locks.

Also, we're working on an adapter for "BaseFunction" at the moment, though it's 
not quite ready for public use, and I wonder if anything will go wrong if a 
function continues to emit tuples to its collector after its execute method has 
returned. We're operating under the assumption that something will, so we've 
required the shell command the tell me when it's done responding to the input 
we just gave it. Currently we do that by requiring it to emit a JSON array. If 
we go with the server approach, we may also require the script to handle a 
tab-delimited ID telling the server to delegate to the appropriate handler for 
a function. Presumably this would require a thin library in languages used to 
script storm. There seems to be an appetite here for open-sourcing the 
components required to do this in Ruby; we definitely want to make this easy 
for folks.

----------
quintona: Hi Guys, I am needing the multilang support. From what I gather, the 
idea would be to hook the shell bolt into the trident topology, and have it 
ignore and handle the batch id field?

If that understanding is correct, then it raises the interesting question 
around support for any bolt within a Trident topology.

----------
nathanmarz: Well, the right way to go is to essentially expose the Aggregator 
interface via multilang. This is general enough to encompass functions, 
filters, and aggregation. We could also allow users to mark multilang 
aggregators as committers so that they can achieve exactly-once semantics when 
interacting with external state.

----------
quintona: I wasn't intending to allow the updating of external state via a 
multilang function. Surely that is a multilang state discussion, which is 
fundamentally different? Or am I missing something?




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Reply via email to