Felix,

Since the web services call is not maintained in a single session, how do we tie the incoming request back to the enriched KV pair?
For example, let's take a use case: a given user (partitioned by user ID) needs to be presented with all the different regions he/she has been in within a given time range. Let's say this involves two Samza jobs populating the KV store (with a small time lag); should the web services layer keep querying the KV store until the value is populated? Maybe I am not thinking about this right?

- Shekar

On Wed, Apr 1, 2015 at 10:37 AM, Felix GV <fville...@linkedin.com.invalid> wrote:

> Hi Vladimir,
>
> It seems like you have decided on an architecture which is fairly similar to the one I suggested (:
>
> Out of curiosity, I have the following questions regarding your system:
>
> * Are you saying the Samza tasks co-located with your k/v store nodes are also doing processing, and not just ingesting? And if they are doing processing:
>   * Would you mind sharing what kind of processing they do? (i.e.: joining streams, counting stuff, etc.?)
>   * Does the processing they do require local state? And does that local state and resulting IO compete with the resources of the co-located k/v store?
> * Would you mind also sharing which external k/v store you are ingesting into?
>
> Shekar,
>
> I'm not 100% sure I understand your question... what do you mean, the producer has no control over the consumer? Partitioning can allow you to decide which specific consumer will get a given key, and the same partitioning strategy can also be used by the web service to query the correct k/v store shard/node, right?
>
> --
>
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
>
> f...@linkedin.com
> linkedin.com/in/felixgv
>
> ________________________________________
> From: Shekar Tippur [ctip...@gmail.com]
> Sent: Wednesday, April 01, 2015 6:54 AM
> To: dev@samza.apache.org
> Subject: Re: How do you serve the data computed by Samza?
>
> I am still not fully sure how this would pan out.
> Since at each stage, the producer sends an event and has no control over the consumer:
>
> {Web services call} -> {samza enrichment1} -> {samza enrichment2}
>                              ||                       ||
>                              V                        V
>                         {kv store1}              {kv store2}
>
> In this case, how do we tie back the web services call to data in kv store 2?
>
> - Shekar
>
> On Wed, Apr 1, 2015 at 1:49 AM, Vladimir Lebedev <w...@fastmail.fm> wrote:
>
> > Dear Harlan and Felix,
> >
> > Many thanks for your input! In my particular case I decided to use an external KV-store sharded by the same key as the partition key in the Kafka topic. KV-store shards will be colocated with the Samza tasks processing the corresponding topic partitions in order to minimize latency.
> >
> > Best regards,
> > Vladimir
> >
> > --
> > Vladimir Lebedev
> > http://linkedin.com/in/vlebedev
> >
> > On 03/31/2015 07:39 PM, Felix GV wrote:
> >
> >> Hi Harlan and Vladimir,
> >>
> >> I think the idea of serving data directly off of Samza has been mentioned a few times, but there are certain caveats that make this a risky proposition. For example:
> >>
> >> * Samza does not have the same uptime constraints as a dedicated data serving platform. While I'm a big fan of the fault-tolerant state provided by Samza (especially when compared to Storm and Spark Streaming), we have to realize that the Samza model effectively is to have one instance (or container) up and running for each partition, and to regenerate the state of that instance elsewhere if the first one goes down. This means that while the fault-tolerance is automatic, it is not instantaneous; it may take a minute or more, depending on the size of the state to regenerate. This is okay for the vast majority of nearline stream processing use cases, but not so much for a serving platform, which typically expects more redundancy.
> >> This is related to https://issues.apache.org/jira/browse/SAMZA-406, though even with those changes, it's not clear whether the fail-over would be considered fast enough for online data serving needs.
> >>
> >> * Samza is meant to support spiky workloads without too many problems, whether it is because of an actual spike in input traffic, or because of a re-processing use case (i.e.: point #5 in the OP, which I also clarified in my last email). A data serving system needs to be much more wary of spikes, because it typically needs to maintain good p99 latency. Therefore, Samza and the serving system may have very different JVM tunings (if they even run on the JVM at all). Even if RocksDB were used in both Samza and the serving system, it may benefit from being tuned differently (i.e.: one for write throughput and the other for read performance).
> >>
> >> --
> >>
> >> Felix GV
> >> Data Infrastructure Engineer
> >> Distributed Data Systems
> >> LinkedIn
> >>
> >> f...@linkedin.com
> >> linkedin.com/in/felixgv
> >>
> >> ________________________________________
> >> From: Harlan Iverson [har...@pubnub.com]
> >> Sent: Saturday, March 28, 2015 5:30 PM
> >> To: dev@samza.apache.org
> >> Subject: Re: How do you serve the data computed by Samza?
> >>
> >> Felix/Shekar,
> >>
> >> Given that Samza itself uses RocksDB to create a materialized view of a partition from earliest to latest offset, I imagine that to be a good choice to begin evaluating. Two parts:
> >>
> >> 1. A method recommended in this article/video <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/> seems to suggest building a full in-process materialized view in each consuming process. This increases performance at the cost of space, so if that is an acceptable tradeoff then it should be golden.
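The in-process materialized view described above amounts to replaying a compacted changelog from the earliest offset into local memory, with the last record per key winning. A minimal sketch of that replay logic, where the `records` iterable stands in for a real Kafka consumer (all names are illustrative, not a real Samza or Kafka API):

```python
def build_materialized_view(records):
    """Replay (key, value) changelog records into an in-memory view.

    `records` stands in for a Kafka consumer reading a compacted topic
    from the earliest offset; a None value is treated as a tombstone.
    """
    view = {}
    for key, value in records:
        if value is None:
            view.pop(key, None)   # tombstone: the key was deleted upstream
        else:
            view[key] = value     # later records overwrite earlier ones
    return view

# Compaction keeps at least the last record per key, so replaying the
# full log yields the same view as replaying only the compacted tail.
changelog = [
    ("user:1", {"region": "us-east"}),
    ("user:2", {"region": "eu-west"}),
    ("user:1", {"region": "us-west"}),  # last write for user:1 wins
    ("user:2", None),                   # tombstone removes user:2
]
print(build_materialized_view(changelog))  # {'user:1': {'region': 'us-west'}}
```

Keeping the view updated during operation is then just continuing the same loop on newly arriving records.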
> >> One approach may be writing to a Samza topic-backed key-value (KV) store in one upstream task and consuming it from one or more downstream KV-stores on the same topic (e.g. a task to look up derived info by sub_key in Jordan's example above, and consume it later, keying the messages by sub_key). The reasoning behind this is that the KV-store backing is simply a Kafka topic partition that leverages the compaction model (keep only/at least the last message for a given key), and then sequentially populates the KV-store in each consuming process upon process creation and keeps it updated during operation. I believe a caveat is that any consuming tasks would then have to use the KV-store as read-only.
> >>
> >> Given the single-threaded model of Samza and the partition-level ordering of Kafka, I think that if a) all topics in the pipeline have the same number of partitions, b) messages are published with the same key at each step, and c) Kafka leader write acking is used, then a store value should always be committed before the downstream task consumes it, though I'm not sure how to guarantee that it is actually consumed first downstream, given that consumers may be slightly behind. Is simply relying on "behind high watermark" = 0 on the KV consumer topic reliable enough to ensure this? If not, could the latest Kafka offset of the KV stream be tagged onto the message upstream and then held downstream until the consumer's KV-store reaches that offset?
> >>
> >> 2. As for how to then serve it, would it be a bad idea to embed a REST/HTTP server in a Samza task itself (effectively one per container/task instance) and put them behind a dynamically updated load balancer, updating mappings when the containers are launched/destroyed?
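The offset-tagging idea above can be made concrete: the upstream task stamps each message with the changelog offset its KV write landed at, and the downstream task holds the message until its local view has replayed past that offset. A minimal sketch, with all class, method, and field names illustrative rather than any real Samza API:

```python
from collections import deque

class OffsetGatedConsumer:
    """Defers message processing until the local KV view has caught up
    to the changelog offset each message was tagged with upstream."""

    def __init__(self):
        self.applied_offset = -1   # highest changelog offset replayed locally
        self.view = {}             # local materialized KV view
        self.pending = deque()     # messages waiting for the view to catch up

    def on_changelog_record(self, offset, key, value):
        """Apply one KV changelog record, then drain any unblocked messages."""
        self.view[key] = value
        self.applied_offset = offset
        self._drain()

    def on_message(self, msg):
        """`msg['requires_offset']` was set by the upstream task at send time."""
        if msg["requires_offset"] <= self.applied_offset:
            return self._process(msg)
        self.pending.append(msg)   # hold until the KV view catches up

    def _drain(self):
        while self.pending and self.pending[0]["requires_offset"] <= self.applied_offset:
            self._process(self.pending.popleft())

    def _process(self, msg):
        # Join the message against the local view; placeholder enrichment logic.
        msg["enriched"] = self.view.get(msg["key"])
        return msg
```

The "behind high watermark" = 0 check is the degenerate case of this scheme, where the required offset is simply the partition's latest offset at send time.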
> >> One could also directly materialize their own RocksDB or similar in the language/platform of their choice by following the same protocol: feeding a Kafka topic from the oldest offset into an in-process KV store and serving against it, creating a materialized view in the same fashion.
> >>
> >> I imagine that a (forthcoming?) transaction model in Kafka <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka> could provide a consistent view across all consumers, but until then there may need to be tolerance for them to be slightly out of sync.
> >>
> >> --
> >>
> >> We've not implemented any of this, but it's the approach that I'd first think to take. To begin with the task KV-store route, there are some scripts included with Samza to test KV-store performance that would give some concrete r/w performance numbers, and it could probably be further explored by implementing the KV-store interfaces for any given JVM storage engine/client.
> >>
> >> Cheers
> >>
> >> On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctip...@gmail.com> wrote:
> >>
> >> > Felix/Jordan,
> >> >
> >> > 1 - 2 is exactly what I was looking for as well. I want to expose a web services call to Kafka/Samza. As there is no concept of a session, I was wondering how to send back enriched data to the web services request. Or am I way off on this? Meaning, is this a completely wrong use case for Kafka/Samza?
> >> >
> >> > - Shekar
> >> >
> >> > On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:
> >> >
> >> > > Felix,
> >> > > Here are my thoughts below.
> >> > >
> >> > > 1 - 2) I think a majority of Samza applications are internal so far. However, I've developed a Samza Publisher for PubNub that would allow you to send data from process or window out over a Data Stream Network.
> >> > > Right now it looks something like this:
> >> > >
> >> > > (.send collector (OutgoingMessageEnvelope. (SystemStream. "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
> >> > >
> >> > > At smaller scale you could do the same with socket.io etc... If you're interested in this I can send you the src or jar. If there is wider interest I can open-source it on GitHub, but it needs some cleanup first.
> >> > >
> >> > > 3) We currently don't have the need to warehouse our stream, but we have thought about piping Samza-generated data into some Hadoop-based system for longer-term analysis, then running Hive queries or something similar over that data.
> >> > >
> >> > > 4) I can't comment on the throughput of the other systems (HBase etc.), but our Kafka/Samza throughput is pretty impressive considering the single-threaded nature of the system. We are seeing raw throughput per partition of well over 10 MB/s.
> >> > >
> >> > > 5) I haven't run into this. To prevent data loss/backup if we can't process a message, we have considered dropping it into an "unprocessed" topic, but we haven't really run into this need. If you needed to reprocess all raw data it would be pretty straightforward; you could just add a partition to support the extra load.
> >> > >
> >> > > 6) Kafka is pretty good at ingesting things, so could you elaborate more on this?
> >> > >
> >> > > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <fville...@linkedin.com.invalid> wrote:
> >> > >
> >> > > > Hi Samza devs, users and enthusiasts,
> >> > > >
> >> > > > I've kept an eye on the Samza project for a while and I think it's super cool!
> >> > > > I hope it continues to mature and expand, as it seems very promising (:
> >> > > >
> >> > > > One thing I've been wondering for a while is: how do people serve the data they computed on Samza? More specifically:
> >> > > >
> >> > > > 1. How do you expose the output of Samza jobs to online applications that need low-latency reads?
> >> > > > 2. Are these online apps mostly internal (i.e.: analytics, dashboards, etc.) or public/user-facing?
> >> > > > 3. What systems do you currently use (or plan to use in the short term) to host the data generated in Samza? HBase? Cassandra? MySQL? Druid? Others?
> >> > > > 4. Are you satisfied, or are you facing challenges in terms of the write throughput supported by these storage/serving systems? What about read throughput?
> >> > > > 5. Are there situations where you wish to re-process all historical data when making improvements to your Samza job, which results in the need to re-ingest all of the Samza output into your online serving system (as described in the Kappa Architecture <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)? Is this easy breezy or painful? Do you need to throttle it lest your serving system fall over?
> >> > > > 6. If there was a highly optimized and reliable way of ingesting partitioned streams quickly into your online serving system, would that help you leverage Samza more effectively?
> >> > > >
> >> > > > Your insights would be much appreciated!
> >> > > > Thanks (:
> >> > > >
> >> > > > --
> >> > > > Felix
> >> > >
> >> > > --
> >> > > Jordan Shaw
> >> > > Full Stack Software Engineer
> >> > > PubNub Inc
> >> > > 1045 17th St
> >> > > San Francisco, CA 94107
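A note on the question that opens this thread: because the web request and the Samza enrichment are decoupled, one common pattern is for the serving layer to poll the KV store with a bounded timeout, returning a 404 or 202 to the client if the pipeline has not caught up yet. A minimal sketch of such a poll loop, where `get` stands in for whatever lookup callable the KV store exposes (illustrative, not from the thread):

```python
import time

def poll_kv_store(get, key, timeout_s=2.0, interval_s=0.05):
    """Poll a KV store until the enriched value appears or the timeout
    elapses. Returns None if the pipeline has not caught up in time,
    letting the caller answer the HTTP request with a 404/202."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = get(key)
        if value is not None:
            return value
        time.sleep(interval_s)  # back off before retrying
    return None

# Simulated store that only becomes populated after an enrichment lag.
store = {}
assert poll_kv_store(store.get, "user:42", timeout_s=0.2) is None
store["user:42"] = ["us-east", "eu-west"]
assert poll_kv_store(store.get, "user:42") == ["us-east", "eu-west"]
```

The timeout bounds how long a request can wait on the enrichment lag; clients that can tolerate eventual consistency can instead retry with the 202/Retry-After pattern rather than holding the connection open.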