[jira] [Commented] (STORM-151) Support multilang in Trident

Jungtaek Lim (JIRA) Sun, 16 Nov 2014 17:00:06 -0800

    [ 
https://issues.apache.org/jira/browse/STORM-151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214148#comment-14214148
 ]


Jungtaek Lim commented on STORM-151:
------------------------------------

I'm interested to this issue. Is it still valid?

Btw, I read description of this issue, but I don't understand what Nathan said.
(Currently I can only understand colinsurprenant's first approach, and it could 
be valid if we don't think about performance.)

Could you share some sketches if you don't mind?
Thanks in advance!

> Support multilang in Trident
> ----------------------------
>
>                 Key: STORM-151
>                 URL: https://issues.apache.org/jira/browse/STORM-151
>             Project: Apache Storm
>          Issue Type: Improvement
>            Reporter: James Xu
>
> https://github.com/nathanmarz/storm/issues/353
> Since it's impractical to have a separate subprocess for every operation 
> packed into a bolt, I think the best tradeoff is to allow multilang 
> operations that will just be their own bolt. This is a low level exposure of 
> the API and the multilang program will have to make sure to emit the batch id 
> as the first field. Multilang operations would be treated similar to 
> aggregators in that the tuples they produce are brand new with brand new 
> fields.
> ----------
> colinsurprenant: I am trying to wrap my head around this issue. Would it make 
> sense to expose classes like ShellFunction, ShellAggregator, ShellFilter, etc?
> A topology definition could be something like:
> topology.newDRPCStream("words")
>       .each(new Fields("args"), new ShellFunction("ruby", "split.rb"), new 
> Fields("word"))
>       .groupBy(new Fields("word"))
>       .stateQuery(wordCounts, new Fields("word"), new MapGet(), new 
> Fields("count"))
>       .each(new Fields("count"), new ShellFilter("ruby", "somefilter.rb"))
>       .aggregate(new Fields("count"), new ShellCombinerAggregator("ruby", 
> "someaggregator.rb"), new Fields("sum"));
> Then the builder would assign a new shell bolt for each of these shell 
> operations.
> ----------
> nathanmarz: That would be one way to go about it – though that will have a 
> lot of overhead with serializing back and forth between Java-land and 
> multilang-land. This is especially true if you have many operations running 
> in the same bolt (like 5 eaches in a row). Another way to go about it would 
> be to expose a lower level facility, where the shell process is a regular 
> bolt and has full control over the input and output tuples. Whatever fields 
> it emits will be the fields available in the output tuples. Additionally, one 
> of Trident's constraints is that the first field of output tuples contain the 
> "batch id".
> ----------
> colinsurprenant:Right. My proposition is basically what you described as 
> "impractical to have a separate subprocess for every operation" in the issue 
> description.
> So what you are suggesting is to just expose a ShellBolt, which would 
> basically respect the same semantic as the general Aggregator operation in 
> that it can emit any number of tuples with any number of fields in regard to 
> to input batch.
> ----------
> joshbronson: I'm probably missing something. But would exposing an interface 
> for an aggregator and using it carefully to avoid spawning too many processes 
> accomplish something similar, without having to worry about batch ids?
> We've run up against this problem recently and come up with a couple of ways 
> to get around the many-processes issue. We've considered spawning a server 
> that would listen on a standard TCP port range or Unix sockets. Is there a 
> place to hook in, other than the prepare method, to launch such a beast? 
> Other functions would presumably use their prepare method to discover and 
> connect to the server, and it would be nice to avoid races or locks.
> Also, we're working on an adapter for "BaseFunction" at the moment, though 
> it's not quite ready for public use, and I wonder if anything will go wrong 
> if a function continues to emit tuples to its collector after its execute 
> method has returned. We're operating under the assumption that something 
> will, so we've required the shell command the tell me when it's done 
> responding to the input we just gave it. Currently we do that by requiring it 
> to emit a JSON array. If we go with the server approach, we may also require 
> the script to handle a tab-delimited ID telling the server to delegate to the 
> appropriate handler for a function. Presumably this would require a thin 
> library in languages used to script storm. There seems to be an appetite here 
> for open-sourcing the components required to do this in Ruby; we definitely 
> want to make this easy for folks.
> ----------
> quintona: Hi Guys, I am needing the multilang support. From what I gather, 
> the idea would be to hook the shell bolt into the trident topology, and have 
> it ignore and handle the batch id field?
> If that understanding is correct, then it raises the interesting question 
> around support for any bolt within a Trident topology.
> ----------
> nathanmarz: Well, the right way to go is to essentially expose the Aggregator 
> interface via multilang. This is general enough to encompass functions, 
> filters, and aggregation. We could also allow users to mark multilang 
> aggregators as committers so that they can achieve exactly-once semantics 
> when interacting with external state.
> ----------
> quintona: I wasn't intending to allow the updating of external state via a 
> multilang function. Surely that is a multilang state discussion, which is 
> fundamentally different? Or am I missing something?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (STORM-151) Support multilang in Trident

Reply via email to