Hi Vladimir,

It seems like you have decided on an architecture which is fairly similar to the one I suggested (:
Out of curiosity, I have the following questions regarding your system:

* Are you saying the Samza tasks co-located with your k/v store nodes are also
  doing processing, and not just ingesting? And if they are doing processing:
    * Would you mind sharing what kind of processing they do? (i.e.: joining
      streams, counting stuff, etc.?)
    * Does the processing they do require local state? And does that local
      state and its resulting IO compete for resources with the co-located
      k/v store?
* Would you mind also sharing which external k/v store you are ingesting into?

Shekar,

I'm not 100% sure I understand your question... what do you mean by "the
producer has no control over the consumer"? Partitioning can allow you to
decide which specific consumer will get a given key, and the same partitioning
strategy can also be used by the web service to query the correct k/v store
shard/node, right?

--
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Shekar Tippur [ctip...@gmail.com]
Sent: Wednesday, April 01, 2015 6:54 AM
To: dev@samza.apache.org
Subject: Re: How do you serve the data computed by Samza?

I am still not fully sure how this would pan out, since at each stage the
producer sends an event and has no control over the consumer.

{Web services call} -> {samza enrichment1} -> {samza enrichment2}
                             ||                     ||
                             V                      V
                        {kv store1}            {kv store2}

In this case, how do we tie the web services call back to the data in
kv store 2?

- Shekar

On Wed, Apr 1, 2015 at 1:49 AM, Vladimir Lebedev <w...@fastmail.fm> wrote:

> Dear Harlan and Felix,
>
> Many thanks for your input! In my particular case I decided to use an
> external KV-store sharded by the same key as the partition key in the
> Kafka topic. KV-store shards will be colocated with the Samza tasks
> processing the corresponding topic partitions in order to minimize latency.
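Felix's point about sharing the partitioning strategy between the producer and the query layer is exactly what Vladimir's same-key sharding relies on, and can be sketched as follows. This is a simplified stand-in for Kafka's default key partitioner (the real Java client uses murmur2), not its actual hash; the only point it illustrates is that publish-time and read-time routing must apply the identical function to the key:

```java
import java.nio.charset.StandardCharsets;

public class KeyRouter {
    static final int NUM_PARTITIONS = 8;

    // Both the producer and the web service's read path must use this exact
    // function; any mismatch would send reads to the wrong KV shard.
    // (Simplified stand-in -- Kafka's real default partitioner uses murmur2.)
    static int partitionFor(String key) {
        int h = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + (b & 0xff);
        }
        return (h & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        // Publish time: the event for this key lands on this partition, so
        // the co-located KV shard for that partition holds the enriched result.
        int writeShard = partitionFor("user-42");
        // Read time: the web service routes the query with the same function.
        int readShard = partitionFor("user-42");
        if (writeShard != readShard) throw new AssertionError("routing mismatch");
        System.out.println("user-42 is served by shard " + writeShard);
    }
}
```

The key name `user-42` and the shard count are hypothetical; the invariant is only that the two call sites share one function and one partition count.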
> Best regards,
> Vladimir
>
> --
> Vladimir Lebedev
> http://linkedin.com/in/vlebedev
>
> On 03/31/2015 07:39 PM, Felix GV wrote:
>
>> Hi Harlan and Vladimir,
>>
>> I think the idea of serving data directly off of Samza has been mentioned
>> a few times, but there are certain caveats that make this a risky
>> proposition. For example:
>>
>> * Samza does not have the same uptime constraints as a dedicated data
>>   serving platform. While I'm a big fan of the fault-tolerant state
>>   provided by Samza (especially when compared to Storm and Spark
>>   Streaming), we have to realize that the Samza model effectively is to
>>   have one instance (or container) up and running for each partition, and
>>   to regenerate the state of that instance elsewhere if the first one
>>   goes down. This means that while the fault-tolerance is automatic, it
>>   is not instantaneous: it may take a minute or more, depending on the
>>   size of the state to regenerate. This is okay for the vast majority of
>>   nearline stream processing use cases, but not so much for a serving
>>   platform, which typically expects more redundancy. This is related to
>>   https://issues.apache.org/jira/browse/SAMZA-406, though even with those
>>   changes, it's not clear whether the fail-over would be considered fast
>>   enough for online data serving needs.
>> * Samza is meant to support spiky workloads without too many problems,
>>   whether because of an actual spike in input traffic or because of a
>>   re-processing use case (i.e.: point #5 in the OP, which I also
>>   clarified in my last email). A data serving system needs to be much
>>   more wary of spikes, because it typically needs to maintain good p99
>>   latency. Therefore, Samza and the serving system may have very
>>   different JVM tunings (if they even run on the JVM at all). Even if
>>   RocksDB were used in both Samza and the serving system, it may benefit
>>   from being tuned differently (i.e.: one for write throughput and the
>>   other for read performance).
>>
>> --
>> Felix GV
>> Data Infrastructure Engineer
>> Distributed Data Systems
>> LinkedIn
>>
>> f...@linkedin.com
>> linkedin.com/in/felixgv
>>
>> ________________________________________
>> From: Harlan Iverson [har...@pubnub.com]
>> Sent: Saturday, March 28, 2015 5:30 PM
>> To: dev@samza.apache.org
>> Subject: Re: How do you serve the data computed by Samza?
>>
>> Felix/Shekar,
>>
>> Given that Samza itself uses RocksDB to create a materialized view of a
>> partition from earliest to latest offset, I imagine that to be a good
>> choice to begin evaluating. Two parts:
>>
>> 1. A method recommended in this article/video
>> <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/>
>> seems to suggest building a full in-process materialized view in each
>> consuming process. This increases performance at the cost of space, so if
>> that is an acceptable tradeoff then it should be golden.
>>
>> One approach may be writing to a Samza topic-backed key-value (KV) store
>> in one upstream task and consuming it from one-or-more downstream
>> KV-stores on the same topic (e.g. a task to look up derived info by
>> sub_key in Jordan's example above, and consume it later, keying the
>> messages by sub_key). The reasoning behind this is that the KV-store
>> backing is simply a Kafka topic partition that leverages the compaction
>> model (keep only/at-least the last message for a given key); each
>> consuming process then sequentially populates its KV-store upon process
>> creation and keeps it updated during operation. I believe a caveat is
>> that any consuming tasks would then have to use the KV-store as read-only.
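The compacted-topic bootstrap Harlan describes can be sketched as follows. This is a self-contained illustration of the log-compaction replay semantics (last write per key wins, null value acts as a tombstone), not Samza's actual KeyValueStore or changelog API; the `Entry` type and `sub_key` names are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangelogBootstrap {
    // A changelog entry: the last write for a key wins; a null value is a
    // delete marker ("tombstone"), mirroring Kafka's log-compaction semantics.
    record Entry(String key, String value) {}

    // Replay a (possibly compacted) changelog partition from oldest to newest
    // offset to rebuild the materialized KV view on process creation.
    static Map<String, String> materialize(List<Entry> changelog) {
        Map<String, String> store = new HashMap<>();
        for (Entry e : changelog) {
            if (e.value() == null) store.remove(e.key());
            else store.put(e.key(), e.value());
        }
        return store;
    }

    public static void main(String[] args) {
        List<Entry> log = List.of(
            new Entry("sub_key_1", "v1"),
            new Entry("sub_key_2", "v1"),
            new Entry("sub_key_1", "v2"),   // later write overwrites the earlier one
            new Entry("sub_key_2", null));  // tombstone: deletes the key
        Map<String, String> store = materialize(log);
        System.out.println(store); // prints {sub_key_1=v2}
    }
}
```

Because replay is idempotent and ordered within a partition, the same loop serves both the initial bootstrap and the keep-it-updated phase during operation.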
>> Given the single-threaded model of Samza and partition-level ordering of
>> Kafka, I think that if a) all topics in the pipeline have the same number
>> of partitions, b) messages are published with the same key at each step,
>> and c) Kafka leader write acking is used, then a store value should always
>> be committed before the downstream task consumes it, though I'm not sure
>> how to guarantee that it is actually consumed first downstream, given that
>> consumers may be slightly behind. Is simply relying on "behind high
>> watermark" = 0 on the KV consumer topic reliable enough to ensure this?
>> If not, could the latest Kafka offset of the KV stream be tagged onto the
>> message upstream and then held downstream until the consumer's KV-store
>> reaches that offset?
>>
>> 2. As for how to then serve it, would it be a bad idea to embed a
>> REST/HTTP server in a Samza task itself (effectively one per
>> container/task instance) and put them behind a dynamically updated load
>> balancer, updating mappings when the containers are launched/destroyed?
>>
>> One could also directly materialize their own RocksDB or similar in the
>> language/platform of choice by following the same protocol of feeding a
>> Kafka topic from the oldest offset into an in-process KV store and
>> serving against it, creating a materialized view in the same fashion.
>>
>> I imagine that a (forthcoming?) transaction model in Kafka
>> <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka>
>> could provide a consistent view across all consumers, but until then
>> there may need to be tolerance for them to be slightly out of sync.
>>
>> --
>>
>> We've not implemented any of this, but it's the approach that I'd first
>> think to take.
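Harlan's offset-tagging idea (hold a message downstream until the local KV replica has caught up to the changelog offset the upstream task observed) can be sketched as follows. The class, method names, and offsets here are all hypothetical; this only illustrates the gating logic, not a real Samza/Kafka consumer:

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class OffsetGate {
    // A message tagged upstream with the KV changelog offset it depends on.
    record Tagged(long requiredOffset, String payload) {}

    // Offset up to which the local KV-store replica has been populated.
    private long appliedOffset = -1;
    private final Queue<Tagged> held = new ArrayDeque<>();

    // Buffer the message until the local store has caught up to the offset
    // the upstream task tagged it with, then release it for processing.
    void onMessage(Tagged msg, Queue<String> processed) {
        held.add(msg);
        drain(processed);
    }

    // Called as the local KV replica consumes more of its backing topic.
    void onStoreProgress(long newAppliedOffset, Queue<String> processed) {
        appliedOffset = newAppliedOffset;
        drain(processed);
    }

    private void drain(Queue<String> processed) {
        while (!held.isEmpty() && held.peek().requiredOffset() <= appliedOffset) {
            processed.add(held.poll().payload());
        }
    }

    public static void main(String[] args) {
        OffsetGate gate = new OffsetGate();
        Queue<String> processed = new ArrayDeque<>();
        gate.onMessage(new Tagged(5, "enrich user-42"), processed);
        System.out.println(processed); // prints [] -- store is behind, message held
        gate.onStoreProgress(5, processed);
        System.out.println(processed); // prints [enrich user-42] -- released
    }
}
```

This buys read-your-writes ordering at the cost of added latency whenever the replica lags, which is the trade-off Harlan's question is probing.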
>> To begin with the task KV-store route, there are some scripts included
>> with Samza to test KV-store performance that would give some concrete
>> r/w performance numbers, and this could probably be explored further by
>> implementing the KV-store interfaces for any given JVM storage
>> engine/client.
>>
>> Cheers
>>
>> On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctip...@gmail.com> wrote:
>>
>> > Felix/Jordan,
>> >
>> > 1 - 2 is exactly what I was looking for as well. I want to expose a
>> > web services call to Kafka/Samza. As there is no concept of a session,
>> > I was wondering how to send the enriched data back to the web services
>> > request. Or am I way off on this? Meaning, is this a completely wrong
>> > use case for Kafka/Samza?
>> >
>> > - Shekar
>> >
>> > On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:
>> >
>> > > Felix,
>> > > Here are my thoughts below.
>> > >
>> > > 1 - 2) I think a majority of Samza applications are internal so far.
>> > > However, I've developed a Samza publisher for PubNub that would allow
>> > > you to send data from process or window out over a data stream
>> > > network. Right now it looks something like this:
>> > >
>> > > (.send collector (OutgoingMessageEnvelope. (SystemStream.
>> > > "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
>> > >
>> > > At smaller scale you could do the same with socket.io, etc. If you're
>> > > interested in this I can send you the src or jar. If there is wider
>> > > interest I can open source it on GitHub, but it needs some cleanup
>> > > first.
>> > >
>> > > 3) We currently don't have the need to warehouse our stream, but we
>> > > have thought about piping Samza-generated data into some Hadoop-based
>> > > system for longer-term analysis, then running Hive queries or
>> > > something alike over that data.
>> > >
>> > > 4) I can't comment on the throughput of the other systems (HBase,
>> > > etc.), but our Kafka/Samza throughput is pretty impressive
>> > > considering the single-threaded nature of the system. We are seeing
>> > > raw throughput per partition of well over 10MB/s.
>> > >
>> > > 5) I haven't run into this. To prevent data loss/backup when we
>> > > can't process a message, we have considered dropping it into an
>> > > "unprocessed topic", but we haven't really run into this need. If you
>> > > needed to reprocess all raw data it would be pretty straightforward;
>> > > you could just add a partition to support the extra load.
>> > >
>> > > 6) Kafka is pretty good at ingesting things, so could you elaborate
>> > > more on this?
>> > >
>> > > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV
>> > > <fville...@linkedin.com.invalid> wrote:
>> > >
>> > > > Hi Samza devs, users and enthusiasts,
>> > > >
>> > > > I've kept an eye on the Samza project for a while and I think it's
>> > > > super cool! I hope it continues to mature and expand as it seems
>> > > > very promising (:
>> > > >
>> > > > One thing I've been wondering for a while is: how do people serve
>> > > > the data they computed on Samza? More specifically:
>> > > >
>> > > > 1. How do you expose the output of Samza jobs to online
>> > > > applications that need low-latency reads?
>> > > > 2. Are these online apps mostly internal (i.e.: analytics,
>> > > > dashboards, etc.) or public/user-facing?
>> > > > 3. What systems do you currently use (or plan to use in the
>> > > > short term) to host the data generated in Samza? HBase? Cassandra?
>> > > > MySQL? Druid? Others?
>> > > > 4. Are you satisfied, or are you facing challenges, in terms of
>> > > > the write throughput supported by these storage/serving systems?
>> > > > What about read throughput?
>> > > > 5. Are there situations where you wish to re-process all
>> > > > historical data when making improvements to your Samza job, which
>> > > > results in the need to re-ingest all of the Samza output into your
>> > > > online serving system (as described in the Kappa Architecture
>> > > > <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)?
>> > > > Is this easy breezy or painful? Do you need to throttle it lest
>> > > > your serving system fall over?
>> > > > 6. If there was a highly-optimized and reliable way of ingesting
>> > > > partitioned streams quickly into your online serving system, would
>> > > > that help you leverage Samza more effectively?
>> > > >
>> > > > Your insights would be much appreciated!
>> > > >
>> > > > Thanks (:
>> > > >
>> > > > --
>> > > > Felix
>> > >
>> > > --
>> > > Jordan Shaw
>> > > Full Stack Software Engineer
>> > > PubNub Inc
>> > > 1045 17th St
>> > > San Francisco, CA 94107