Hi Vladimir,

It seems like you have decided on an architecture which is fairly similar to the one I suggested (:
Out of curiosity, I have the following questions regarding your system:

* Are you saying the Samza tasks co-located with your k/v store nodes are also
  doing processing, and not just ingesting? And if they are doing processing:
    * Would you mind sharing what kind of processing they do? (i.e.: joining
      streams, counting stuff, etc.?)
    * Does the processing they do require local state? And does that local
      state and its resulting IO compete for resources with the co-located
      k/v store?
* Would you mind also sharing which external k/v store you are ingesting into?

Shekar,

I'm not 100% sure I understand your question... what do you mean by "the
producer has no control over the consumer"? Partitioning can allow you to
decide which specific consumer will get a given key, and the same partitioning
strategy can also be used by the web service to query the correct k/v store
shard/node, right?

--
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Shekar Tippur [ctip...@gmail.com]
Sent: Wednesday, April 01, 2015 6:54 AM
To: dev@samza.apache.org
Subject: Re: How do you serve the data computed by Samza?

I am still not fully sure how this would pan out, since at each stage the
producer sends an event and has no control over the consumer.

{Web services call} -> {samza enrichment1} -> {samza enrichment2}
                             ||                     ||
                             V                      V
                        {kv store1}            {kv store2}

In this case, how do we tie the web services call back to the data in
kv store 2?

- Shekar

On Wed, Apr 1, 2015 at 1:49 AM, Vladimir Lebedev <w...@fastmail.fm> wrote:

> Dear Harlan and Felix,
>
> Many thanks for your input! In my particular case I decided to use an
> external KV-store sharded by the same key as the partition key in the
> Kafka topic. KV-store shards will be colocated with the Samza tasks
> processing the corresponding topic partitions in order to minimize latency.
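Felix's point about sharing the partitioning strategy between the producer and the query layer is exactly what Vladimir's same-key sharding relies on, and can be sketched as follows. This is a simplified stand-in for Kafka's default key partitioner (the real Java client uses murmur2), not its actual hash; the only point it illustrates is that publish-time and read-time routing must apply the identical function to the key:

```java
import java.nio.charset.StandardCharsets;

public class KeyRouter {
    static final int NUM_PARTITIONS = 8;

    // Both the producer and the web service's read path must use this exact
    // function; any mismatch would send reads to the wrong KV shard.
    // (Simplified stand-in -- Kafka's real default partitioner uses murmur2.)
    static int partitionFor(String key) {
        int h = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        return (h & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        // Publish time: the event for this key lands on this partition, so
        // the co-located KV shard for that partition holds the enriched result.
        int writeShard = partitionFor("user-42");
        // Read time: the web service routes the query with the same function.
        int readShard = partitionFor("user-42");
        if (writeShard != readShard) throw new AssertionError("routing mismatch");
        System.out.println("user-42 is served by shard " + writeShard);
    }
}
```

The key name `user-42` and the shard count are hypothetical; the invariant is only that the two call sites share one function and one partition count.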
> Best regards,
> Vladimir
>
> --
> Vladimir Lebedev
> http://linkedin.com/in/vlebedev
>
> On 03/31/2015 07:39 PM, Felix GV wrote:
>
>> Hi Harlan and Vladimir,
>>
>> I think the idea of serving data directly off of Samza has been mentioned
>> a few times, but there are certain caveats that make this a risky
>> proposition. For example:
>>
>> * Samza does not have the same uptime constraints as a dedicated data
>>   serving platform. While I'm a big fan of the fault-tolerant state
>>   provided by Samza (especially when compared to Storm and Spark
>>   Streaming), we have to realize that the Samza model effectively is to
>>   have one instance (or container) up and running for each partition, and
>>   to regenerate the state of that instance elsewhere if the first one
>>   goes down. This means that while the fault-tolerance is automatic, it
>>   is not instantaneous: it may take a minute or more, depending on the
>>   size of the state to regenerate. This is okay for the vast majority of
>>   nearline stream processing use cases, but not so much for a serving
>>   platform, which typically expects more redundancy. This is related to
>>   https://issues.apache.org/jira/browse/SAMZA-406, though even with those
>>   changes, it's not clear whether the fail-over would be considered fast
>>   enough for online data serving needs.
>> * Samza is meant to support spiky workloads without too many problems,
>>   whether because of an actual spike in input traffic or because of a
>>   re-processing use case (i.e.: point #5 in the OP, which I also
>>   clarified in my last email). A data serving system needs to be much
>>   more wary of spikes, because it typically needs to maintain good p99
>>   latency. Therefore, Samza and the serving system may have very
>>   different JVM tunings (if they even run on the JVM at all). Even if
>>   RocksDB were used in both Samza and the serving system, it may benefit
>>   from being tuned differently (i.e.: one for write throughput and the
>>   other for read performance).
>>
>> --
>> Felix GV
>> Data Infrastructure Engineer
>> Distributed Data Systems
>> LinkedIn
>>
>> f...@linkedin.com
>> linkedin.com/in/felixgv
>>
>> ________________________________________
>> From: Harlan Iverson [har...@pubnub.com]
>> Sent: Saturday, March 28, 2015 5:30 PM
>> To: dev@samza.apache.org
>> Subject: Re: How do you serve the data computed by Samza?
>>
>> Felix/Shekar,
>>
>> Given that Samza itself uses RocksDB to create a materialized view of a
>> partition from earliest to latest offset, I imagine that to be a good
>> choice to begin evaluating. Two parts:
>>
>> 1. A method recommended in this article/video
>> <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/>
>> seems to suggest building a full in-process materialized view in each
>> consuming process. This increases performance at the cost of space, so if
>> that is an acceptable tradeoff then it should be golden.
>>
>> One approach may be writing to a Samza topic-backed key-value (KV) store
>> in one upstream task and consuming it from one-or-more downstream
>> KV-stores on the same topic (e.g. a task to look up derived info by
>> sub_key in Jordan's example above, and consume it later, keying the
>> messages by sub_key). The reasoning behind this is that the KV-store
>> backing is simply a Kafka topic partition that leverages the compaction
>> model (keep only/at-least the last message for a given key); each
>> consuming process then sequentially populates its KV-store upon process
>> creation and keeps it updated during operation. I believe a caveat is
>> that any consuming tasks would then have to use the KV-store as read-only.
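The compacted-topic bootstrap Harlan describes can be sketched as follows. This is a self-contained illustration of the log-compaction replay semantics (last write per key wins, null value acts as a tombstone), not Samza's actual KeyValueStore or changelog API; the `Entry` type and `sub_key` names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogBootstrap {
    // A changelog entry: the last write for a key wins; a null value is a
    // delete marker ("tombstone"), mirroring Kafka's log-compaction semantics.
    record Entry(String key, String value) {}

    // Replay a (possibly compacted) changelog partition from oldest to newest
    // offset to rebuild the materialized KV view on process creation.
    static Map<String, String> materialize(List<Entry> changelog) {
        Map<String, String> store = new HashMap<>();
        for (Entry e : changelog) {
            if (e.value() == null) store.remove(e.key());
            else store.put(e.key(), e.value());
        }
        return store;
    }

    public static void main(String[] args) {
        List<Entry> log = List.of(
            new Entry("sub_key_1", "v1"),
            new Entry("sub_key_2", "v1"),
            new Entry("sub_key_1", "v2"),   // later write overwrites the earlier one
            new Entry("sub_key_2", null));  // tombstone: deletes the key
        Map<String, String> store = materialize(log);
        System.out.println(store); // prints {sub_key_1=v2}
    }
}
```

Because replay is idempotent and ordered within a partition, the same loop serves both the initial bootstrap and the keep-it-updated phase during operation.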
>> Given the single-threaded model of Samza and partition-level ordering of
>> Kafka, I think that if a) all topics in the pipeline have the same number
>> of partitions, b) messages are published with the same key at each step,
>> and c) Kafka leader write acking is used, then a store value should always
>> be committed before the downstream task consumes it, though I'm not sure
>> how to guarantee that it is actually consumed first downstream, given that
>> consumers may be slightly behind. Is simply relying on "behind high
>> watermark" = 0 on the KV consumer topic reliable enough to ensure this?
>> If not, could the latest Kafka offset of the KV stream be tagged onto the
>> message upstream and then held downstream until the consumer's KV-store
>> reaches that offset?
>>
>> 2. As for how to then serve it, would it be a bad idea to embed a
>> REST/HTTP server in a Samza task itself (effectively one per
>> container/task instance) and put them behind a dynamically updated load
>> balancer, updating mappings when the containers are launched/destroyed?
>>
>> One could also directly materialize their own RocksDB or similar in the
>> language/platform of choice by following the same protocol of feeding a
>> Kafka topic from the oldest offset into an in-process KV store and
>> serving against it, creating a materialized view in the same fashion.
>>
>> I imagine that a (forthcoming?) transaction model in Kafka
>> <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka>
>> could provide a consistent view across all consumers, but until then
>> there may need to be tolerance for them to be slightly out of sync.
>>
>> --
>>
>> We've not implemented any of this, but it's the approach that I'd first
>> think to take.
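Harlan's offset-tagging idea (hold a message downstream until the local KV replica has caught up to the changelog offset the upstream task observed) can be sketched as follows. The class, method names, and offsets here are all hypothetical; this only illustrates the gating logic, not a real Samza/Kafka consumer:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class OffsetGate {
    // A message tagged upstream with the KV changelog offset it depends on.
    record Tagged(long requiredOffset, String payload) {}

    // Offset up to which the local KV-store replica has been populated.
    private long appliedOffset = -1;
    private final Queue<Tagged> held = new ArrayDeque<>();

    // Buffer the message until the local store has caught up to the offset
    // the upstream task tagged it with, then release it for processing.
    void onMessage(Tagged msg, Queue<String> processed) {
        held.add(msg);
        drain(processed);
    }

    // Called as the local KV replica consumes more of its backing topic.
    void onStoreProgress(long newAppliedOffset, Queue<String> processed) {
        appliedOffset = newAppliedOffset;
        drain(processed);
    }

    private void drain(Queue<String> processed) {
        while (!held.isEmpty() && held.peek().requiredOffset() <= appliedOffset) {
            processed.add(held.poll().payload());
        }
    }

    public static void main(String[] args) {
        OffsetGate gate = new OffsetGate();
        Queue<String> processed = new ArrayDeque<>();
        gate.onMessage(new Tagged(5, "enrich user-42"), processed);
        System.out.println(processed); // prints [] -- store is behind, message held
        gate.onStoreProgress(5, processed);
        System.out.println(processed); // prints [enrich user-42] -- released
    }
}
```

This buys read-your-writes ordering at the cost of added latency whenever the replica lags, which is the trade-off Harlan's question is probing.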
>> To begin with the task KV-store route, there are some scripts included
>> with Samza to test KV-store performance that would give some concrete
>> r/w performance numbers, and this could probably be explored further by
>> implementing the KV-store interfaces for any given JVM storage
>> engine/client.
>>
>> Cheers
>>
>> On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctip...@gmail.com> wrote:
>>
>> > Felix/Jordan,
>> >
>> > 1 - 2 is exactly what I was looking for as well. I want to expose a
>> > web services call to Kafka/Samza. As there is no concept of a session,
>> > I was wondering how to send the enriched data back to the web services
>> > request. Or am I way off on this? Meaning, is this a completely wrong
>> > use case for Kafka/Samza?
>> >
>> > - Shekar
>> >
>> > On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:
>> >
>> > > Felix,
>> > > Here are my thoughts below.
>> > >
>> > > 1 - 2) I think a majority of Samza applications are internal so far.
>> > > However, I've developed a Samza publisher for PubNub that would allow
>> > > you to send data from process or window out over a data stream
>> > > network. Right now it looks something like this:
>> > >
>> > > (.send collector (OutgoingMessageEnvelope. (SystemStream.
>> > > "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
>> > >
>> > > At smaller scale you could do the same with socket.io, etc. If you're
>> > > interested in this I can send you the src or jar. If there is wider
>> > > interest I can open source it on GitHub, but it needs some cleanup
>> > > first.
>> > >
>> > > 3) We currently don't have the need to warehouse our stream, but we
>> > > have thought about piping Samza-generated data into some Hadoop-based
>> > > system for longer-term analysis, then running Hive queries or
>> > > something alike over that data.
>> > >
>> > > 4) I can't comment on the throughput of the other systems (HBase,
>> > > etc.), but our Kafka/Samza throughput is pretty impressive
>> > > considering the single-threaded nature of the system. We are seeing
>> > > raw throughput per partition of well over 10MB/s.
>> > >
>> > > 5) I haven't run into this. To prevent data loss/backup when we
>> > > can't process a message, we have considered dropping it into an
>> > > "unprocessed topic", but we haven't really run into this need. If you
>> > > needed to reprocess all raw data it would be pretty straightforward;
>> > > you could just add a partition to support the extra load.
>> > >
>> > > 6) Kafka is pretty good at ingesting things, so could you elaborate
>> > > more on this?
>> > >
>> > > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV
>> > > <fville...@linkedin.com.invalid> wrote:
>> > >
>> > > > Hi Samza devs, users and enthusiasts,
>> > > >
>> > > > I've kept an eye on the Samza project for a while and I think it's
>> > > > super cool! I hope it continues to mature and expand as it seems
>> > > > very promising (:
>> > > >
>> > > > One thing I've been wondering for a while is: how do people serve
>> > > > the data they computed on Samza? More specifically:
>> > > >
>> > > > 1. How do you expose the output of Samza jobs to online
>> > > > applications that need low-latency reads?
>> > > > 2. Are these online apps mostly internal (i.e.: analytics,
>> > > > dashboards, etc.) or public/user-facing?
>> > > > 3. What systems do you currently use (or plan to use in the
>> > > > short term) to host the data generated in Samza? HBase? Cassandra?
>> > > > MySQL? Druid? Others?
>> > > > 4. Are you satisfied, or are you facing challenges, in terms of
>> > > > the write throughput supported by these storage/serving systems?
>> > > > What about read throughput?
>> > > > 5. Are there situations where you wish to re-process all
>> > > > historical data when making improvements to your Samza job, which
>> > > > results in the need to re-ingest all of the Samza output into your
>> > > > online serving system (as described in the Kappa Architecture
>> > > > <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)?
>> > > > Is this easy breezy or painful? Do you need to throttle it lest
>> > > > your serving system fall over?
>> > > > 6. If there was a highly-optimized and reliable way of ingesting
>> > > > partitioned streams quickly into your online serving system, would
>> > > > that help you leverage Samza more effectively?
>> > > >
>> > > > Your insights would be much appreciated!
>> > > >
>> > > > Thanks (:
>> > > >
>> > > > --
>> > > > Felix
>> > >
>> > > --
>> > > Jordan Shaw
>> > > Full Stack Software Engineer
>> > > PubNub Inc
>> > > 1045 17th St
>> > > San Francisco, CA 94107