[ 
https://issues.apache.org/jira/browse/SAMZA-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938082#comment-13938082
 ] 

Chris Riccomini commented on SAMZA-184:
---------------------------------------

My initial instinct is that we should favor simple, convenient solutions over 
more performant solutions with this implementation. If performance is a major 
concern, developers should drop down into the JVM. That said, I think we're 
going to need to be able to write non-JVM libraries (Python, Go, etc) that can 
handle at least a few thousand messages per second (per container) in order for 
this to be at all useful.

bq. Should we start one subprocess per SamzaContainer, or one subprocess per 
StreamTask?

Starting one subprocess per StreamTask fits Samza's processing model a lot 
better than starting one subprocess per container. The problem with this approach is 
that it could potentially start 100s or even 1000s of processes on a single 
node in cases where a SamzaContainer is consuming from a large number of 
partitions. One could argue that a single container consuming 100s or 1000s of 
partitions should be written on the JVM to get proper performance. In that 
case, we have to be up front that this implementation is about convenience and 
not performance.

On the flip side, if we start one subprocess per SamzaContainer, we need a way 
to share the subprocess's input/output transport connections between different 
StreamTask instances, all of which would need to send messages to the 
subprocess. This could be done with a static variable, but that seems a bit 
hacky. If we agree on an HTTP/TCP based transport, we could use the 
TaskLifecycleListener (or add a ContainerLifecycleListener) to start the 
subprocess on container start. In this case, the StreamTasks just need to know 
which port to connect to in order to start sending messages to the process. This still 
requires getting the StreamTasks the port information, but I think we could 
come up with a way to do that.
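
To make the one-subprocess-per-container option a bit more concrete, here is a 
rough sketch of what the child side might look like with a TCP transport. 
Python stands in for whatever non-JVM language the child is written in. 
Everything here is an assumption for illustration and not an agreed design: the 
newline-delimited JSON framing, the "task" and "offset" field names, the choice 
to let the child bind an ephemeral port, and the helper names start_server and 
send. The send function plays the role of a StreamTask connecting to the shared 
subprocess.

```python
import json
import socket
import socketserver
import threading


class LineHandler(socketserver.StreamRequestHandler):
    """Handles one connection, i.e. one (hypothetical) StreamTask."""

    def handle(self):
        # One JSON object per line in, one JSON acknowledgement per line out.
        for raw in self.rfile:
            message = json.loads(raw)
            reply = {"task": message["task"], "ack": message["offset"]}
            self.wfile.write((json.dumps(reply) + "\n").encode("utf-8"))


def start_server():
    # Port 0 asks the OS for a free port; the chosen port then has to be
    # handed to the StreamTasks somehow (that part is not sketched here).
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), LineHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send(port, message):
    # Stand-in for a StreamTask: connect, send one message, read one reply.
    with socket.create_connection(("127.0.0.1", port)) as conn:
        with conn.makefile("rwb") as stream:
            stream.write((json.dumps(message) + "\n").encode("utf-8"))
            stream.flush()
            return json.loads(stream.readline())
```

With this shape, every StreamTask in the container only needs the value of 
server.server_address[1] to reach the shared subprocess, and the "task" field 
lets the child tell the StreamTasks apart.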

bq. How should the parent interact with the subprocess at both the transport 
(stdin/stdout, unix sockets, TCP, HTTP, Thrift, etc) and serialization level 
(protobuf, json, etc)?

It seems to me that the main trade-off here is really performance vs. 
convenience. On the transport side, stdin/stdout or HTTP seem the most 
convenient. On the serde side, JSON is clearly the most convenient, but definitely 
not the most performant. This decision is closely tied to the one-at-a-time vs. 
batching question (below).
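
As an illustration of the "most convenient" end of that trade-off, a child 
process using stdin/stdout with JSON could be as small as the sketch below. 
This assumes one JSON object per line in each direction; the "key" and "value" 
field names and the run_child and upper_case helpers are made up for the 
example and are not part of any Samza API.

```python
import json


def run_child(inp, out, handler):
    """Read one JSON object per line from inp, write one JSON reply per line."""
    for line in inp:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        message = json.loads(line)
        reply = handler(message)
        out.write(json.dumps(reply) + "\n")
        out.flush()  # flush per message so the parent is never left waiting


def upper_case(message):
    # Hypothetical handler: returns the message with an upper-cased value.
    return {"key": message["key"], "value": message["value"].upper()}
```

A real child would call run_child(sys.stdin, sys.stdout, upper_case). The 
per-message flush is what keeps this simple, and it is also part of why this 
style is unlikely to be the most performant.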

bq. What should the protocol look like? We should ideally support all of the 
operations in StreamTask, InitableTask, WindowableTask, ClosableTask, etc.

Ideally, we'd support all operations in all existing interfaces. 
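
One possible shape for this, sketched from the child's point of view: each 
request carries an "op" field that mirrors a lifecycle call (init, process, 
window, close), and the child dispatches on it. The "op", "payload", "status" 
and "emit" names, the CountingTask class and the dispatch function are all 
hypothetical; the actual protocol is still to be decided.

```python
import json

SUPPORTED_OPS = ("init", "process", "window", "close")


class CountingTask:
    """Hypothetical child-side task that counts messages per window."""

    def __init__(self):
        self.count = 0
        self.closed = False

    def init(self, payload):
        self.count = 0
        return {"status": "ok"}

    def process(self, payload):
        self.count += 1
        return {"status": "ok"}

    def window(self, payload):
        # Emit the count seen since the last window, then reset it.
        emitted, self.count = self.count, 0
        return {"status": "ok", "emit": [{"count": emitted}]}

    def close(self, payload):
        self.closed = True
        return {"status": "ok"}


def dispatch(task, raw):
    """Decode one JSON request, call the matching method, encode the reply."""
    request = json.loads(raw)
    op = request.get("op")
    if op not in SUPPORTED_OPS:
        return json.dumps({"status": "error", "reason": "unknown op: %r" % op})
    return json.dumps(getattr(task, op)(request.get("payload")))
```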

bq. Should the child process receive the messages in batches, or one at a time?

The one-at-a-time approach fits Samza's processing model much better. This 
affects our decision about transport and serde. If we stick with one message at 
a time, and pay heavy transport RPC and serde costs (HTTP/JSON), this might end 
up being too slow for even trivial use cases.
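
To show where the cost difference comes from, the sketch below frames the same 
messages both ways, assuming newline-delimited JSON. The function names and the 
batch_size parameter are invented for the example. With one-at-a-time delivery 
the number of frames (and so round trips) equals the number of messages; with 
batching it drops to roughly messages / batch_size, at the price of a protocol 
that no longer lines up one-to-one with process() calls.

```python
import json


def frames_one_at_a_time(messages):
    # One frame (one write, one round trip) per message.
    return [json.dumps(message) + "\n" for message in messages]


def frames_batched(messages, batch_size):
    # One frame per group of up to batch_size messages, sent as a JSON array.
    frames = []
    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        frames.append(json.dumps(batch) + "\n")
    return frames
```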

> Add thin multi-language support for SamzaContainer
> --------------------------------------------------
>
>                 Key: SAMZA-184
>                 URL: https://issues.apache.org/jira/browse/SAMZA-184
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>
> There has been some interest in supporting languages other than Java (or 
> JVM-based languages). We have already opened up SAMZA-18, which proposes 
> supporting a C implementation of SamzaContainer.
> A second solution to this problem is to have a StreamTask implementation that 
> starts a child process in another language, and acts as a bridge between the 
> child process and the java-based Samza APIs. This is the way that both Storm 
> [1] and Hadoop work.
> A lot of design decisions need to be fleshed out to support this, but most 
> people on the mailing list were very supportive of this approach. [2]
> Things that need to be decided:
> 1. Should we start one subprocess per SamzaContainer, or one subprocess per 
> StreamTask?
> 2. How should the parent interact with the subprocess at both the transport 
> (stdin/stdout, unix sockets, TCP, HTTP, Thrift, etc) and serialization level 
> (protobuf, json, etc)?
> 3. What should the protocol look like? We should ideally support all of the 
> operations in StreamTask, InitableTask, WindowableTask, ClosableTask, etc.
> 4. Should the child process receive the messages in batches, or one at a time?
> It'd be good to get a draft proposal up on the Wiki, so we can all discuss 
> this and converge on an implementation.
> [1] https://github.com/nathanmarz/storm/wiki/Multilang-protocol
> [2] 
> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201403.mbox/%3CCAB%2B2NVXX2Fq_61WfvH%2BAfW8ZW7vQbVfTN-JPGU%2Bd7AdZ73oPDQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.2#6252)
