Hi Harlan and Vladimir, I think the idea of serving data directly off of Samza has been mentioned a few times, but there are certain caveats that make this a risky proposition. For example:
* Samza does not have the same uptime constraints as a dedicated data serving platform. While I'm a big fan of the fault-tolerant state provided by Samza (especially when compared to Storm and Spark Streaming), we have to realize that the Samza model effectively is to have one instance (or container) up and running for each partition, and to regenerate the state of that instance elsewhere if the first one goes down. This means that while the fault-tolerance is automatic, it is not instantaneous; it may take a minute or more, depending on the size of the state to regenerate. This is okay for the vast majority of nearline stream processing use cases, but not so much for a serving platform, which typically expects more redundancy. This is related to https://issues.apache.org/jira/browse/SAMZA-406, though even with those changes, it's not clear whether the fail-over would be considered fast enough for online data serving needs.

* Samza is meant to support spiky workloads without too many problems, whether it is because of an actual spike in input traffic, or because of a re-processing use case (i.e.: point #5 in the OP, which I also clarified in my last email). A data serving system needs to be much more wary of spikes, because it typically needs to maintain good p99 latency. Therefore, Samza and the serving system may have very different JVM tunings (if they even run on the JVM at all). Even if RocksDB were used in both Samza and the serving system, it might benefit from being tuned differently (i.e.: one for write throughput and the other for read performance).

--
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Harlan Iverson [har...@pubnub.com]
Sent: Saturday, March 28, 2015 5:30 PM
To: dev@samza.apache.org
Subject: Re: How do you serve the data computed by Samza?
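Felix's first caveat above — that a failed container's state is rebuilt by replaying a changelog, so failover time grows with state size — can be sketched conceptually. The snippet below is only an illustration of the replay semantics (a dict standing in for a local store, `None` standing in for a tombstone); it is not the Samza API.

```python
def restore_state(changelog):
    """Rebuild an in-memory view by replaying (key, value) changelog entries.

    A value of None models a delete marker (tombstone). Restore cost is
    proportional to the number of retained changelog entries, which is why
    failover is automatic but not instantaneous.
    """
    store = {}
    for key, value in changelog:
        if value is None:
            store.pop(key, None)  # tombstone: remove the key
        else:
            store[key] = value    # latest write wins
    return store

# A toy changelog: later entries supersede earlier ones for the same key.
changelog = [
    ("user:1", {"clicks": 1}),
    ("user:2", {"clicks": 5}),
    ("user:1", {"clicks": 2}),  # overwrites the first entry for user:1
    ("user:2", None),           # user:2 was deleted
]

print(restore_state(changelog))  # {'user:1': {'clicks': 2}}
```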
Felix/Shekar,

Given that Samza itself uses RocksDB to create a materialized view of a partition from earliest to latest offset, I imagine that to be a good choice to begin evaluating. Two parts:

1. A method recommended in this article/video <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/> seems to suggest building a full in-process materialized view in each consuming process. This increases performance at the cost of space, so if that is an acceptable tradeoff then it should be golden. One approach may be writing to a Samza topic-backed key-value (KV) store in one upstream task and consuming it from one or more downstream KV-stores on the same topic (e.g. a task to look up derived info by sub_key in Jordan's example above, and consume it later, keying the messages by sub_key). The reasoning behind this is that the KV-store backing is simply a Kafka topic partition that leverages the compaction model (keep only/at least the last message for a given key), which is replayed sequentially to populate the KV-store in each consuming process upon process creation and is kept updated during operation. I believe a caveat is that any consuming tasks would then have to use the KV-store as read-only.

Given the single-threaded model of Samza and the partition-level ordering of Kafka, I think that if a) all topics in the pipeline have the same number of partitions, b) messages are published with the same key at each step, and c) Kafka leader write acking is used, then a store value should always be committed before the downstream task consumes it, though I'm not sure how to guarantee that it is actually consumed first downstream, given that consumers may be slightly behind. Is simply relying on "behind high watermark" = 0 on the KV consumer topic reliable enough to ensure this? If not, could the latest Kafka offset of the KV stream be tagged onto the message upstream and then held downstream until the consumer's KV-store reaches that offset?
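The two mechanisms described in part 1 — a compacted topic as the KV-store backing, and gating a message on the store offset tagged onto it upstream — can be sketched roughly as follows. All names here are hypothetical illustrations, not a Samza or Kafka API.

```python
def compacted_view(messages):
    """Model of Kafka log compaction: keep (at least) the last value per key.

    Replaying the compacted topic in order yields the full materialized view.
    """
    view = {}
    for key, value in messages:
        view[key] = value  # later writes for a key shadow earlier ones
    return view

def ready_to_process(message_store_offset, local_store_offset):
    """Gating check: only process a message once the local KV-store has
    replayed at least as far as the store offset stamped on it upstream."""
    return local_store_offset >= message_store_offset

# Toy store topic: two writes to sub_key:a, one to sub_key:b.
store_topic = [("sub_key:a", "v1"), ("sub_key:b", "v1"), ("sub_key:a", "v2")]

view = compacted_view(store_topic)
print(view["sub_key:a"])       # 'v2' — only the latest value per key survives
print(ready_to_process(3, 2))  # False: local view is still behind, hold the message
print(ready_to_process(3, 3))  # True: the lookup is now safe
```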
2. As for how to then serve it, would it be a bad idea to embed a REST/HTTP server in a Samza task itself (effectively one per container/task instance) and put them behind a dynamically updated load balancer, updating mappings when the containers are launched/destroyed?

One could also directly materialize one's own RocksDB (or similar) in the language/platform of choice by following the same protocol of feeding a Kafka topic from the oldest offset into an in-process KV store and serving against it, creating a materialized view in the same fashion. I imagine that a (forthcoming?) transaction model in Kafka <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka> could provide a consistent view across all consumers, but until then there may need to be tolerance for them to be slightly out of sync.

--

We've not implemented any of this, but it's the approach that I'd first think to take. To begin with the task KV-store route, there are some scripts included with Samza to test KV-store performance that would give some concrete r/w performance numbers, and this could probably be further explored by implementing the KV-store interfaces for any given JVM storage engine/client.

Cheers

On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctip...@gmail.com> wrote:

> Felix/Jordan,
>
> 1 - 2 is exactly what I was looking for as well. I want to expose web
> services calls to Kafka/Samza. As there is no concept of a session, I was
> wondering how to send back enriched data to the web services request.
> Or am I way off on this? Meaning, is this a completely wrong use case to
> use Kafka/Samza?
>
> - Shekar
>
> On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:
>
> > Felix,
> > Here are my thoughts below.
> >
> > 1 - 2) I think so far a majority of Samza applications are internal.
> > However, I've developed a Samza Publisher for PubNub that would allow
> > you to send data from process or window out over a Data Stream Network.
> > Right now it looks something like this:
> >
> > (.send collector (OutgoingMessageEnvelope. (SystemStream.
> >   "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
> >
> > At smaller scale you could do the same with socket.io etc. If you're
> > interested in this I can send you the src or jar. If there is wider
> > interest I can open source it on GitHub, but it needs some cleanup first.
> >
> > 3) We currently don't have the need to warehouse our stream, but we have
> > thought about piping Samza-generated data into some Hadoop-based system
> > for longer-term analysis, then running Hive queries or something alike
> > over that data.
> >
> > 4) I can't comment on the throughput of the other systems (HBase etc.),
> > but our Kafka/Samza throughput is pretty impressive considering the
> > single-threaded nature of the system. We are seeing raw throughput per
> > partition of well over 10 MB/s.
> >
> > 5) I haven't run into this. To prevent data loss/backup, if we can't
> > process a message we have considered dropping it into an "unprocessed
> > topic", but we haven't really run into this need. If you needed to
> > reprocess all raw data it would be pretty straightforward; you could
> > just add a partition to support the extra load.
> >
> > 6) Kafka is pretty good at ingesting things, so could you elaborate
> > more on this?
> >
> > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <fville...@linkedin.com.invalid>
> > wrote:
> >
> > > Hi Samza devs, users and enthusiasts,
> > >
> > > I've kept an eye on the Samza project for a while and I think it's
> > > super cool! I hope it continues to mature and expand as it seems very
> > > promising (:
> > >
> > > One thing I've been wondering for a while is: how do people serve the
> > > data they computed on Samza? More specifically:
> > >
> > > 1. How do you expose the output of Samza jobs to online applications
> > > that need low-latency reads?
> > > 2.
> > > Are these online apps mostly internal (i.e.: analytics, dashboards,
> > > etc.) or public/user-facing?
> > > 3. What systems do you currently use (or plan to use in the
> > > short-term) to host the data generated in Samza? HBase? Cassandra?
> > > MySQL? Druid? Others?
> > > 4. Are you satisfied, or are you facing challenges, in terms of the
> > > write throughput supported by these storage/serving systems? What
> > > about read throughput?
> > > 5. Are there situations where you wish to re-process all historical
> > > data when making improvements to your Samza job, which results in the
> > > need to re-ingest all of the Samza output into your online serving
> > > system (as described in the Kappa Architecture
> > > <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)?
> > > Is this easy breezy or painful? Do you need to throttle it lest your
> > > serving system fall over?
> > > 6. If there were a highly-optimized and reliable way of ingesting
> > > partitioned streams quickly into your online serving system, would
> > > that help you leverage Samza more effectively?
> > >
> > > Your insights would be much appreciated!
> > >
> > > Thanks (:
> > >
> > > --
> > > Felix
> > >
> >
> > --
> > Jordan Shaw
> > Full Stack Software Engineer
> > PubNub Inc
> > 1045 17th St
> > San Francisco, CA 94107
> >
>
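Harlan's serving idea earlier in the thread — embedding a small REST/HTTP endpoint next to each in-process materialized view and putting the instances behind a load balancer — can be sketched with just the Python standard library. This is an illustration of the shape of the idea only; the `/lookup` route, the toy view contents, and the use of an ephemeral port are all made up for the example, and none of this is Samza's API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# In-process materialized view (toy data); in the real setup this would be
# populated by replaying a compacted Kafka topic from the oldest offset.
VIEW = {"sub_key:a": "v2"}

class LookupHandler(BaseHTTPRequestHandler):
    """Answers GET /lookup?key=... from the local materialized view."""

    def do_GET(self):
        key = parse_qs(urlparse(self.path).query).get("key", [None])[0]
        if key in VIEW:
            body = json.dumps({"key": key, "value": VIEW[key]}).encode()
            self.send_response(200)
        else:
            body = json.dumps({"error": "not found"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging for the demo
        pass

def start_server():
    # Port 0 asks the OS for any free port; a load balancer would be told
    # the chosen port when the container registers itself.
    server = HTTPServer(("127.0.0.1", 0), LookupHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def lookup(port, key):
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/lookup?key={key}") as r:
        return json.loads(r.read())

server = start_server()
print(lookup(server.server_port, "sub_key:a"))  # {'key': 'sub_key:a', 'value': 'v2'}
server.shutdown()
```

Each serving process owns only its partition's view, so the load balancer (or a partition-aware router) is what stitches the partitions into a single logical keyspace.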