Felix,

Since the web services call is not maintained in a single session, how do we tie the incoming request back to the enriched KV pair?
For example, let's take a use case: a given user (partitioned by user ID) needs to be presented with all the different regions he/she has been in within a given time range. Let's say this involves two Samza jobs populating the KV store (with a small time lag); should the web services layer keep querying the KV store until the value is populated? Maybe I am not thinking about this right?

- Shekar

On Wed, Apr 1, 2015 at 10:37 AM, Felix GV <fville...@linkedin.com.invalid> wrote:

> Hi Vladimir,
>
> It seems like you have decided on an architecture which is fairly similar to the one I suggested (:
>
> Out of curiosity, I have the following questions regarding your system:
>
> * Are you saying the Samza tasks co-located with your k/v store nodes are also doing processing, and not just ingesting? And if they are doing processing:
>   * Would you mind sharing what kind of processing they do? (i.e.: joining streams, counting stuff, etc.?)
>   * Does the processing they do require local state? And does that local state and resulting IO compete with the resources of the co-located k/v store?
> * Would you mind also sharing which external k/v store you are ingesting into?
>
> Shekar,
>
> I'm not 100% sure I understand your question... what do you mean, the producer has no control over the consumer? Partitioning can allow you to decide which specific consumer will get a given key, and the same partitioning strategy can also be used by the web service to query the correct k/v store shard/node, right?
>
> --
>
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
>
> f...@linkedin.com
> linkedin.com/in/felixgv
>
> ________________________________________
> From: Shekar Tippur [ctip...@gmail.com]
> Sent: Wednesday, April 01, 2015 6:54 AM
> To: dev@samza.apache.org
> Subject: Re: How do you serve the data computed by Samza?
>
> I am still not fully sure how this would pan out.
> Since at each stage, the producer sends an event and has no control over the consumer:
>
> {Web services call} -> {samza enrichment1} -> {samza enrichment2}
>                              ||                       ||
>                              V                        V
>                         {kv store1}              {kv store2}
>
> In this case, how do we tie back the web services call to data in kv store 2?
>
> - Shekar
>
> On Wed, Apr 1, 2015 at 1:49 AM, Vladimir Lebedev <w...@fastmail.fm> wrote:
>
> > Dear Harlan and Felix,
> >
> > Many thanks for your input! In my particular case I decided to use an external KV-store sharded by the same key as the partition key in the Kafka topic. KV-store shards will be colocated with the Samza tasks processing the corresponding topic partitions in order to minimize latency.
> >
> > Best regards,
> > Vladimir
> >
> > --
> > Vladimir Lebedev
> > http://linkedin.com/in/vlebedev
> >
> > On 03/31/2015 07:39 PM, Felix GV wrote:
> >
> >> Hi Harlan and Vladimir,
> >>
> >> I think the idea of serving data directly off of Samza has been mentioned a few times, but there are certain caveats that make this a risky proposition. For example:
> >>
> >> * Samza does not have the same uptime constraints as a dedicated data serving platform. While I'm a big fan of the fault-tolerant state provided by Samza (especially when compared to Storm and Spark Streaming), we have to realize that the Samza model effectively is to have one instance (or container) up and running for each partition, and to regenerate the state of that instance elsewhere if the first one goes down. This means that while the fault-tolerance is automatic, it is not instantaneous; it may take a minute or more, depending on the size of the state to regenerate. This is okay for the vast majority of nearline stream processing use cases, but not so much for a serving platform, which typically expects more redundancy.
> >> This is related to https://issues.apache.org/jira/browse/SAMZA-406, though even with those changes, it's not clear whether the fail-over would be considered fast enough for online data serving needs.
> >>
> >> * Samza is meant to support spiky workloads without too many problems, whether it is because of an actual spike in input traffic, or because of a re-processing use case (i.e.: point #5 in the OP, which I also clarified in my last email). A data serving system needs to be much more wary of spikes, because it typically needs to maintain good p99 latency. Therefore, Samza and the serving system may have very different JVM tunings (if they even run on the JVM at all). Even if RocksDB were used in both Samza and the serving system, it may benefit from being tuned differently (i.e.: one for write throughput and the other for read performance).
> >>
> >> --
> >>
> >> Felix GV
> >> Data Infrastructure Engineer
> >> Distributed Data Systems
> >> LinkedIn
> >>
> >> f...@linkedin.com
> >> linkedin.com/in/felixgv
> >>
> >> ________________________________________
> >> From: Harlan Iverson [har...@pubnub.com]
> >> Sent: Saturday, March 28, 2015 5:30 PM
> >> To: dev@samza.apache.org
> >> Subject: Re: How do you serve the data computed by Samza?
> >>
> >> Felix/Shekar,
> >>
> >> Given that Samza itself uses RocksDB to create a materialized view of a partition from earliest to latest offset, I imagine that to be a good choice to begin evaluating. Two parts:
> >>
> >> 1. A method recommended in this article/video <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/> seems to suggest building a full in-process materialized view in each consuming process. This increases performance at the cost of space, so if that is an acceptable tradeoff then it should be golden.
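The in-process materialized view described above amounts to replaying a compacted changelog from the earliest offset into local memory, with the last record per key winning. A minimal sketch of that replay logic, where the `records` iterable stands in for a real Kafka consumer (all names are illustrative, not a real Samza or Kafka API):

```python
def build_materialized_view(records):
    """Replay (key, value) changelog records into an in-memory view.

    `records` stands in for a Kafka consumer reading a compacted topic
    from the earliest offset; a None value is treated as a tombstone.
    """
    view = {}
    for key, value in records:
        if value is None:
            view.pop(key, None)   # tombstone: the key was deleted upstream
        else:
            view[key] = value     # later records overwrite earlier ones
    return view

# Compaction keeps at least the last record per key, so replaying the
# full log yields the same view as replaying only the compacted tail.
changelog = [
    ("user:1", {"region": "us-east"}),
    ("user:2", {"region": "eu-west"}),
    ("user:1", {"region": "us-west"}),  # last write for user:1 wins
    ("user:2", None),                   # tombstone removes user:2
]
print(build_materialized_view(changelog))  # {'user:1': {'region': 'us-west'}}
```

Keeping the view updated during operation is then just continuing the same loop on newly arriving records.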
> >> One approach may be writing to a Samza topic-backed key-value (KV) store in one upstream task and consuming it from one or more downstream KV-stores on the same topic (e.g. a task to look up derived info by sub_key in Jordan's example above, and consume it later, keying the messages by sub_key). The reasoning behind this is that the KV-store backing is simply a Kafka topic partition that leverages the compaction model (keep only/at least the last message for a given key), and then sequentially populates the KV-store in each consuming process upon process creation and keeps it updated during operation. I believe a caveat is that any consuming tasks would then have to use the KV-store as read-only.
> >>
> >> Given the single-threaded model of Samza and the partition-level ordering of Kafka, I think that if a) all topics in the pipeline have the same number of partitions, b) messages are published with the same key at each step, and c) Kafka leader write acking is used, then a store value should always be committed before the downstream task consumes it, though I'm not sure how to guarantee that it is actually consumed first downstream, given that consumers may be slightly behind. Is simply relying on "behind high watermark" = 0 on the KV consumer topic reliable enough to ensure this? If not, could the latest Kafka offset of the KV stream be tagged onto the message upstream and then held downstream until the consumer's KV-store reaches that offset?
> >>
> >> 2. As for how to then serve it, would it be a bad idea to embed a REST/HTTP server in a Samza task itself (effectively one per container/task instance) and put them behind a dynamically updated load balancer, updating mappings when the containers are launched/destroyed?
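The offset-tagging idea above can be made concrete: the upstream task stamps each message with the changelog offset its KV write landed at, and the downstream task holds the message until its local view has replayed past that offset. A minimal sketch, with all class, method, and field names illustrative rather than any real Samza API:

```python
from collections import deque

class OffsetGatedConsumer:
    """Defers message processing until the local KV view has caught up
    to the changelog offset each message was tagged with upstream."""

    def __init__(self):
        self.applied_offset = -1   # highest changelog offset replayed locally
        self.view = {}             # local materialized KV view
        self.pending = deque()     # messages waiting for the view to catch up

    def on_changelog_record(self, offset, key, value):
        """Apply one KV changelog record, then drain any unblocked messages."""
        self.view[key] = value
        self.applied_offset = offset
        self._drain()

    def on_message(self, msg):
        """`msg['requires_offset']` was set by the upstream task at send time."""
        if msg["requires_offset"] <= self.applied_offset:
            return self._process(msg)
        self.pending.append(msg)   # hold until the KV view catches up

    def _drain(self):
        while self.pending and self.pending[0]["requires_offset"] <= self.applied_offset:
            self._process(self.pending.popleft())

    def _process(self, msg):
        # Join the message against the local view; placeholder enrichment logic.
        msg["enriched"] = self.view.get(msg["key"])
        return msg
```

The "behind high watermark" = 0 check is the degenerate case of this scheme, where the required offset is simply the partition's latest offset at send time.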
> >> One could also directly materialize their own RocksDB or similar in the language/platform of their choice by following the same protocol: feeding a Kafka topic from the oldest offset into an in-process KV store and serving against it, creating a materialized view in the same fashion.
> >>
> >> I imagine that a (forthcoming?) transaction model in Kafka <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka> could provide a consistent view across all consumers, but until then there may need to be tolerance for them to be slightly out of sync.
> >>
> >> --
> >>
> >> We've not implemented any of this, but it's the approach that I'd first think to take. To begin with the task KV-store route, there are some scripts included with Samza to test KV-store performance that would give some concrete r/w performance numbers, and it could probably be further explored by implementing the KV-store interfaces for any given JVM storage engine/client.
> >>
> >> Cheers
> >>
> >> On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctip...@gmail.com> wrote:
> >>
> >> > Felix/Jordan,
> >> >
> >> > 1 - 2 is exactly what I was looking for as well. I want to expose a web services call to Kafka/Samza. As there is no concept of a session, I was wondering how to send back enriched data to the web services request. Or am I way off on this? Meaning, is this a completely wrong use case for Kafka/Samza?
> >> >
> >> > - Shekar
> >> >
> >> > On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:
> >> >
> >> > > Felix,
> >> > > Here are my thoughts below.
> >> > >
> >> > > 1 - 2) I think a majority of Samza applications are internal so far. However, I've developed a Samza Publisher for PubNub that would allow you to send data from process or window out over a Data Stream Network.
> >> > > Right now it looks something like this:
> >> > >
> >> > > (.send collector (OutgoingMessageEnvelope. (SystemStream. "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
> >> > >
> >> > > At smaller scale you could do the same with socket.io etc... If you're interested in this I can send you the src or jar. If there is wider interest I can open-source it on GitHub, but it needs some cleanup first.
> >> > >
> >> > > 3) We currently don't have the need to warehouse our stream, but we have thought about piping Samza-generated data into some Hadoop-based system for longer-term analysis, then running Hive queries or something similar over that data.
> >> > >
> >> > > 4) I can't comment on the throughput of the other systems (HBase etc.), but our Kafka/Samza throughput is pretty impressive considering the single-threaded nature of the system. We are seeing raw throughput per partition of well over 10 MB/s.
> >> > >
> >> > > 5) I haven't run into this. To prevent data loss/backup if we can't process a message, we have considered dropping it into an "unprocessed" topic, but we haven't really run into this need. If you needed to reprocess all raw data it would be pretty straightforward; you could just add a partition to support the extra load.
> >> > >
> >> > > 6) Kafka is pretty good at ingesting things, so could you elaborate more on this?
> >> > >
> >> > > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <fville...@linkedin.com.invalid> wrote:
> >> > >
> >> > > > Hi Samza devs, users and enthusiasts,
> >> > > >
> >> > > > I've kept an eye on the Samza project for a while and I think it's super cool!
> >> > > > I hope it continues to mature and expand, as it seems very promising (:
> >> > > >
> >> > > > One thing I've been wondering for a while is: how do people serve the data they computed on Samza? More specifically:
> >> > > >
> >> > > > 1. How do you expose the output of Samza jobs to online applications that need low-latency reads?
> >> > > > 2. Are these online apps mostly internal (i.e.: analytics, dashboards, etc.) or public/user-facing?
> >> > > > 3. What systems do you currently use (or plan to use in the short term) to host the data generated in Samza? HBase? Cassandra? MySQL? Druid? Others?
> >> > > > 4. Are you satisfied, or are you facing challenges in terms of the write throughput supported by these storage/serving systems? What about read throughput?
> >> > > > 5. Are there situations where you wish to re-process all historical data when making improvements to your Samza job, which results in the need to re-ingest all of the Samza output into your online serving system (as described in the Kappa Architecture <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)? Is this easy breezy or painful? Do you need to throttle it lest your serving system fall over?
> >> > > > 6. If there was a highly optimized and reliable way of ingesting partitioned streams quickly into your online serving system, would that help you leverage Samza more effectively?
> >> > > >
> >> > > > Your insights would be much appreciated!
> >> > > > Thanks (:
> >> > > >
> >> > > > --
> >> > > > Felix
> >> > >
> >> > > --
> >> > > Jordan Shaw
> >> > > Full Stack Software Engineer
> >> > > PubNub Inc
> >> > > 1045 17th St
> >> > > San Francisco, CA 94107
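A note on the question that opens this thread: because the web request and the Samza enrichment are decoupled, one common pattern is for the serving layer to poll the KV store with a bounded timeout, returning a 404 or 202 to the client if the pipeline has not caught up yet. A minimal sketch of such a poll loop, where `get` stands in for whatever lookup callable the KV store exposes (illustrative, not from the thread):

```python
import time

def poll_kv_store(get, key, timeout_s=2.0, interval_s=0.05):
    """Poll a KV store until the enriched value appears or the timeout
    elapses. Returns None if the pipeline has not caught up in time,
    letting the caller answer the HTTP request with a 404/202."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        value = get(key)
        if value is not None:
            return value
        time.sleep(interval_s)  # back off before retrying
    return None

# Simulated store that only becomes populated after an enrichment lag.
store = {}
assert poll_kv_store(store.get, "user:42", timeout_s=0.2) is None
store["user:42"] = ["us-east", "eu-west"]
assert poll_kv_store(store.get, "user:42") == ["us-east", "eu-west"]
```

The timeout bounds how long a request can wait on the enrichment lag; clients that can tolerate eventual consistency can instead retry with the 202/Retry-After pattern rather than holding the connection open.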