[
https://issues.apache.org/jira/browse/SAMZA-184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938082#comment-13938082
]
Chris Riccomini commented on SAMZA-184:
---------------------------------------
My initial instinct is that we should favor simple, convenient solutions over
more performant ones for this implementation. If performance is a major
concern, developers should drop down into the JVM. That said, I think we're
going to need to be able to write non-JVM libraries (Python, Go, etc.) that can
handle at least a few thousand messages per second (per container) in order for
this to be at all useful.
bq. Should we start one subprocess per SamzaContainer, or one subprocess per
StreamTask?
Starting one subprocess per StreamTask fits Samza's processing model much
better than starting one subprocess per container. The problem with this
approach is that it could start hundreds or even thousands of processes on a
single node when a SamzaContainer is consuming from a large number of
partitions. One could argue that a single container consuming hundreds or
thousands of partitions should be written on the JVM to get proper
performance. In that case, we have to be up front that this implementation is
about convenience and not performance.
On the flip side, if we start one subprocess per SamzaContainer, we need a way
to share the subprocess's input/output transport connections between the
different StreamTask instances, all of which would need to send messages to
the subprocess. This could be done with a static variable, but that seems a bit
hacky. If we agree on an HTTP/TCP based transport, we could use the
TaskLifecycleListener (or add a ContainerLifecycleListener) to start the
subprocess on container start. In this case, the StreamTasks just need to know
which port to connect to before they can start sending messages to the
process. This still requires getting the StreamTasks the port information, but
I think we could come up with a way to do that.
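
To make the one-subprocess-per-container idea concrete, here is a minimal
sketch of a child process that listens on a single TCP port and serves several
connections at once, one per StreamTask. Everything in it is an assumption for
illustration only: the newline-delimited JSON framing, the "task" and "offset"
fields, and the helper names start_child_server and send_one. Python stands in
for both sides here; in practice the client side would live in the JVM.

```python
import json
import socket
import socketserver
import threading


class TaskHandler(socketserver.StreamRequestHandler):
    """Serves one connection, i.e. one hypothetical StreamTask bridge."""

    def handle(self):
        # One JSON message per line; reply with a small ack per message.
        for raw in self.rfile:
            msg = json.loads(raw)
            ack = {"task": msg["task"], "ack": msg["offset"]}
            self.wfile.write((json.dumps(ack) + "\n").encode("utf-8"))
            self.wfile.flush()


def start_child_server():
    """Bind to an ephemeral port and serve connections in background threads."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), TaskHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_one(port, task, offset):
    """What a per-task bridge might do: connect, send a message, read the ack."""
    with socket.create_connection(("127.0.0.1", port)) as conn:
        with conn.makefile("rwb") as stream:
            payload = json.dumps({"task": task, "offset": offset}) + "\n"
            stream.write(payload.encode("utf-8"))
            stream.flush()
            return json.loads(stream.readline())
```

The port chosen by the OS is available as server.server_address[1]. How that
value would reach each StreamTask is exactly the open question above, and the
sketch does not answer it.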
bq. How should the parent interact with the subprocess at both the transport
(stdin/stdout, unix sockets, TCP, HTTP, Thrift, etc) and serialization level
(protobuf, json, etc)?
It seems to me that the main trade-off here is really performance vs.
convenience. On the transport side, stdin/stdout or HTTP seem the most
convenient. On the serde side, JSON is clearly the most convenient, but definitely
not the most performant. This decision is closely tied to the one-at-a-time vs.
batching question (below).
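
As a rough illustration of the most convenient combination (stdin/stdout plus
JSON), the child side could be as small as the loop below. This is only a
sketch under assumptions: one JSON object per line, hypothetical "key" and
"value" fields, and a toy transformation (upper-casing the value) standing in
for real task logic.

```python
import io
import json
import sys


def run_child(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON message per line and write one JSON reply per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between messages
        msg = json.loads(line)
        reply = {"key": msg.get("key"), "value": str(msg.get("value")).upper()}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # flush per message so the parent is not left waiting
```

The streams are parameters so the loop can be exercised without a real parent
process, for example with io.StringIO objects in place of stdin and stdout.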
bq. What should the protocol look like? We should ideally support all of the
operations in StreamTask, InitableTask, WindowableTask, ClosableTask, etc.
Ideally, we'd support all operations in all existing interfaces.
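
One possible shape for such a protocol is a small envelope with an operation
name and a payload, dispatched to the matching method on the child side. The
sketch below assumes hypothetical op names ("init", "process", "window",
"close") that mirror InitableTask, StreamTask, WindowableTask and ClosableTask.
The envelope fields and the CountingTask class are made up for illustration and
are not a proposed wire format.

```python
import json

# Hypothetical op names, one per task interface mentioned in the question.
SUPPORTED_OPS = ("init", "process", "window", "close")


class CountingTask:
    """Toy task used only to exercise the dispatcher."""

    def __init__(self):
        self.count = 0
        self.closed = False

    def init(self, payload):
        self.count = payload.get("start", 0)
        return {}

    def process(self, payload):
        self.count += 1
        return {"send": [{"stream": "out", "value": payload["value"]}]}

    def window(self, payload):
        return {"send": [{"stream": "counts", "value": self.count}]}

    def close(self, payload):
        self.closed = True
        return {}


def dispatch(task, frame):
    """Decode one JSON frame, call the matching task method, encode the reply."""
    request = json.loads(frame)
    op = request["op"]
    if op not in SUPPORTED_OPS:
        return json.dumps({"op": op, "error": "unknown op"})
    result = getattr(task, op)(request.get("payload", {}))
    return json.dumps({"op": op, "result": result})
```

Returning outgoing messages in the reply (the "send" list) is one way to stand
in for the collector; coordinator actions such as commit or shutdown would need
a similar field.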
bq. Should the child process receive the messages in batches, or one at a time?
The one-at-a-time approach fits Samza's processing model much better. This
affects our decision about transport and serde. If we stick with one message at
a time, and pay heavy transport RPC and serde costs (HTTP/JSON), this might end
up being too slow for even trivial use cases.
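
To see what is at stake, the sketch below frames the same messages both ways.
With one message per frame, the number of round trips equals the number of
messages; with batching, it drops to roughly messages / batch_size. The helper
names and the "offset" field are hypothetical, and whether the per-frame
overhead actually dominates would need to be measured.

```python
import json


def frames_one_at_a_time(messages):
    """One JSON frame, and so one round trip, per message."""
    return [json.dumps(m) for m in messages]


def frames_batched(messages, batch_size):
    """One JSON frame per group of up to batch_size messages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        json.dumps(messages[i:i + batch_size])
        for i in range(0, len(messages), batch_size)
    ]
```

For 10 messages and a batch size of 4, this gives 10 frames one at a time
versus 3 frames batched, with the last batch carrying the remaining 2
messages.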
> Add thin multi-language support for SamzaContainer
> --------------------------------------------------
>
> Key: SAMZA-184
> URL: https://issues.apache.org/jira/browse/SAMZA-184
> Project: Samza
> Issue Type: Bug
> Components: container
> Affects Versions: 0.6.0
> Reporter: Chris Riccomini
>
> There has been some interest in supporting languages other than Java (or
> JVM-based languages). We have already opened up SAMZA-18, which proposes
> supporting a C implementation of SamzaContainer.
> A second solution to this problem is to have a StreamTask implementation that
> starts a child process in another language, and acts as a bridge between the
> child process and the Java-based Samza APIs. This is the way that both Storm
> [1] and Hadoop work.
> A lot of design decisions need to be fleshed out to support this, but most
> people on the mailing list were very supportive of this approach. [2]
> Things that need to be decided:
> 1. Should we start one subprocess per SamzaContainer, or one subprocess per
> StreamTask?
> 2. How should the parent interact with the subprocess at both the transport
> (stdin/stdout, unix sockets, TCP, HTTP, Thrift, etc) and serialization level
> (protobuf, json, etc)?
> 3. What should the protocol look like? We should ideally support all of the
> operations in StreamTask, InitableTask, WindowableTask, ClosableTask, etc.
> 4. Should the child process receive the messages in batches, or one at a time?
> It'd be good to get a draft proposal up on the Wiki, so we can all discuss
> this and converge on an implementation.
> [1] https://github.com/nathanmarz/storm/wiki/Multilang-protocol
> [2]
> http://mail-archives.apache.org/mod_mbox/incubator-samza-dev/201403.mbox/%3CCAB%2B2NVXX2Fq_61WfvH%2BAfW8ZW7vQbVfTN-JPGU%2Bd7AdZ73oPDQ%40mail.gmail.com%3E