Hi Harlan and Vladimir, I think the idea of serving data directly off of Samza has been mentioned a few times, but there are certain caveats that make this a risky proposition. For example:
* Samza does not have the same uptime constraints as a dedicated data serving platform. While I'm a big fan of the fault-tolerant state provided by Samza (especially when compared to Storm and Spark Streaming), we have to realize that the Samza model effectively is to have one instance (or container) up and running for each partition, and to regenerate the state of that instance elsewhere if the first one goes down. This means that while the fault-tolerance is automatic, it is not instantaneous; it may take a minute or more, depending on the size of the state to regenerate. This is okay for the vast majority of nearline stream processing use cases, but not so much for a serving platform, which typically expects more redundancy. This is related to https://issues.apache.org/jira/browse/SAMZA-406, though even with those changes, it's not clear whether the fail-over would be considered fast enough for online data serving needs.

* Samza is meant to support spiky workloads without too many problems, whether it is because of an actual spike in input traffic, or because of a re-processing use case (i.e.: point #5 in the OP, which I also clarified in my last email). A data serving system needs to be much more wary of spikes, because it typically needs to maintain good p99 latency. Therefore, Samza and the serving system may have very different JVM tunings (if they even run on the JVM at all). Even if RocksDB were used in both Samza and the serving system, it might benefit from being tuned differently (i.e.: one for write throughput and the other for read performance).

--
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Harlan Iverson [har...@pubnub.com]
Sent: Saturday, March 28, 2015 5:30 PM
To: dev@samza.apache.org
Subject: Re: How do you serve the data computed by Samza?
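Felix's first caveat above — that a failed container's state is rebuilt by replaying a changelog, so failover time grows with state size — can be sketched conceptually. The snippet below is only an illustration of the replay semantics (a dict standing in for a local store, `None` standing in for a tombstone); it is not the Samza API.

```python
def restore_state(changelog):
    """Rebuild an in-memory view by replaying (key, value) changelog entries.

    A value of None models a delete marker (tombstone). Restore cost is
    proportional to the number of retained changelog entries, which is why
    failover is automatic but not instantaneous.
    """
    store = {}
    for key, value in changelog:
        if value is None:
            store.pop(key, None)  # tombstone: remove the key
        else:
            store[key] = value    # latest write wins
    return store

# A toy changelog: later entries supersede earlier ones for the same key.
changelog = [
    ("user:1", {"clicks": 1}),
    ("user:2", {"clicks": 5}),
    ("user:1", {"clicks": 2}),  # overwrites the first entry for user:1
    ("user:2", None),           # user:2 was deleted
]

print(restore_state(changelog))  # {'user:1': {'clicks': 2}}
```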
Felix/Shekar,

Given that Samza itself uses RocksDB to create a materialized view of a partition from earliest to latest offset, I imagine that to be a good choice to begin evaluating. Two parts:

1. A method recommended in this article/video <http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/> seems to suggest building a full in-process materialized view in each consuming process. This increases performance at the cost of space, so if that is an acceptable tradeoff then it should be golden. One approach may be writing to a Samza topic-backed key-value (KV) store in one upstream task and consuming it from one or more downstream KV-stores on the same topic (e.g. a task to look up derived info by sub_key in Jordan's example above, and consume it later, keying the messages by sub_key). The reasoning behind this is that the KV-store backing is simply a Kafka topic partition that leverages the compaction model (keep only/at least the last message for a given key), which is replayed sequentially to populate the KV-store in each consuming process upon process creation and is kept updated during operation. I believe a caveat is that any consuming tasks would then have to use the KV-store as read-only.

Given the single-threaded model of Samza and the partition-level ordering of Kafka, I think that if a) all topics in the pipeline have the same number of partitions, b) messages are published with the same key at each step, and c) Kafka leader write acking is used, then a store value should always be committed before the downstream task consumes it, though I'm not sure how to guarantee that it is actually consumed first downstream, given that consumers may be slightly behind. Is simply relying on "behind high watermark" = 0 on the KV consumer topic reliable enough to ensure this? If not, could the latest Kafka offset of the KV stream be tagged onto the message upstream and then held downstream until the consumer's KV-store reaches that offset?
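The two mechanisms described in part 1 — a compacted topic as the KV-store backing, and gating a message on the store offset tagged onto it upstream — can be sketched roughly as follows. All names here are hypothetical illustrations, not a Samza or Kafka API.

```python
def compacted_view(messages):
    """Model of Kafka log compaction: keep (at least) the last value per key.

    Replaying the compacted topic in order yields the full materialized view.
    """
    view = {}
    for key, value in messages:
        view[key] = value  # later writes for a key shadow earlier ones
    return view

def ready_to_process(message_store_offset, local_store_offset):
    """Gating check: only process a message once the local KV-store has
    replayed at least as far as the store offset stamped on it upstream."""
    return local_store_offset >= message_store_offset

# Toy store topic: two writes to sub_key:a, one to sub_key:b.
store_topic = [("sub_key:a", "v1"), ("sub_key:b", "v1"), ("sub_key:a", "v2")]

view = compacted_view(store_topic)
print(view["sub_key:a"])       # 'v2' — only the latest value per key survives
print(ready_to_process(3, 2))  # False: local view is still behind, hold the message
print(ready_to_process(3, 3))  # True: the lookup is now safe
```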
2. As for how to then serve it, would it be a bad idea to embed a REST/HTTP server in a Samza task itself (effectively one per container/task instance) and put them behind a dynamically updated load balancer, updating mappings when the containers are launched/destroyed?

One could also directly materialize one's own RocksDB (or similar) in the language/platform of choice by following the same protocol of feeding a Kafka topic from the oldest offset into an in-process KV store and serving against it, creating a materialized view in the same fashion. I imagine that a (forthcoming?) transaction model in Kafka <https://cwiki.apache.org/confluence/display/KAFKA/Transactional+Messaging+in+Kafka> could provide a consistent view across all consumers, but until then there may need to be tolerance for them to be slightly out of sync.

--

We've not implemented any of this, but it's the approach that I'd first think to take. To begin with the task KV-store route, there are some scripts included with Samza to test KV-store performance that would give some concrete r/w performance numbers, and this could probably be further explored by implementing the KV-store interfaces for any given JVM storage engine/client.

Cheers

On Fri, Mar 27, 2015 at 7:06 PM, Shekar Tippur <ctip...@gmail.com> wrote:

> Felix/Jordan,
>
> 1 - 2 is exactly what I was looking for as well. I want to expose web
> services calls to Kafka/Samza. As there is no concept of a session, I was
> wondering how to send back enriched data to the web services request.
> Or am I way off on this? Meaning, is this a completely wrong use case to
> use Kafka/Samza?
>
> - Shekar
>
> On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:
>
> > Felix,
> > Here are my thoughts below.
> >
> > 1 - 2) I think so far a majority of Samza applications are internal.
> > However, I've developed a Samza Publisher for PubNub that would allow
> > you to send data from process or window out over a Data Stream Network.
> > Right now it looks something like this:
> >
> > (.send collector (OutgoingMessageEnvelope. (SystemStream.
> >   "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
> >
> > At smaller scale you could do the same with socket.io etc. If you're
> > interested in this I can send you the src or jar. If there is wider
> > interest I can open source it on GitHub, but it needs some cleanup first.
> >
> > 3) We currently don't have the need to warehouse our stream, but we have
> > thought about piping Samza-generated data into some Hadoop-based system
> > for longer-term analysis, then running Hive queries or something alike
> > over that data.
> >
> > 4) I can't comment on the throughput of the other systems (HBase etc.),
> > but our Kafka/Samza throughput is pretty impressive considering the
> > single-threaded nature of the system. We are seeing raw throughput per
> > partition of well over 10 MB/s.
> >
> > 5) I haven't run into this. To prevent data loss/backup, if we can't
> > process a message we have considered dropping it into an "unprocessed
> > topic", but we haven't really run into this need. If you needed to
> > reprocess all raw data it would be pretty straightforward; you could
> > just add a partition to support the extra load.
> >
> > 6) Kafka is pretty good at ingesting things, so could you elaborate
> > more on this?
> >
> > On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <fville...@linkedin.com.invalid>
> > wrote:
> >
> > > Hi Samza devs, users and enthusiasts,
> > >
> > > I've kept an eye on the Samza project for a while and I think it's
> > > super cool! I hope it continues to mature and expand as it seems very
> > > promising (:
> > >
> > > One thing I've been wondering for a while is: how do people serve the
> > > data they computed on Samza? More specifically:
> > >
> > > 1. How do you expose the output of Samza jobs to online applications
> > > that need low-latency reads?
> > > 2.
> > > Are these online apps mostly internal (i.e.: analytics, dashboards,
> > > etc.) or public/user-facing?
> > > 3. What systems do you currently use (or plan to use in the
> > > short-term) to host the data generated in Samza? HBase? Cassandra?
> > > MySQL? Druid? Others?
> > > 4. Are you satisfied, or are you facing challenges, in terms of the
> > > write throughput supported by these storage/serving systems? What
> > > about read throughput?
> > > 5. Are there situations where you wish to re-process all historical
> > > data when making improvements to your Samza job, which results in the
> > > need to re-ingest all of the Samza output into your online serving
> > > system (as described in the Kappa Architecture
> > > <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)?
> > > Is this easy breezy or painful? Do you need to throttle it lest your
> > > serving system fall over?
> > > 6. If there were a highly-optimized and reliable way of ingesting
> > > partitioned streams quickly into your online serving system, would
> > > that help you leverage Samza more effectively?
> > >
> > > Your insights would be much appreciated!
> > >
> > > Thanks (:
> > >
> > > --
> > > Felix
> > >
> >
> > --
> > Jordan Shaw
> > Full Stack Software Engineer
> > PubNub Inc
> > 1045 17th St
> > San Francisco, CA 94107
> >
>
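Harlan's serving idea earlier in the thread — embedding a small REST/HTTP endpoint next to each in-process materialized view and putting the instances behind a load balancer — can be sketched with just the Python standard library. This is an illustration of the shape of the idea only; the `/lookup` route, the toy view contents, and the use of an ephemeral port are all made up for the example, and none of this is Samza's API.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

# In-process materialized view (toy data); in the real setup this would be
# populated by replaying a compacted Kafka topic from the oldest offset.
VIEW = {"sub_key:a": "v2"}

class LookupHandler(BaseHTTPRequestHandler):
    """Answers GET /lookup?key=... from the local materialized view."""

    def do_GET(self):
        key = parse_qs(urlparse(self.path).query).get("key", [None])[0]
        if key in VIEW:
            body = json.dumps({"key": key, "value": VIEW[key]}).encode()
            self.send_response(200)
        else:
            body = json.dumps({"error": "not found"}).encode()
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging for the demo
        pass

def start_server():
    # Port 0 asks the OS for any free port; a load balancer would be told
    # the chosen port when the container registers itself.
    server = HTTPServer(("127.0.0.1", 0), LookupHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def lookup(port, key):
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/lookup?key={key}") as r:
        return json.loads(r.read())

server = start_server()
print(lookup(server.server_port, "sub_key:a"))  # {'key': 'sub_key:a', 'value': 'v2'}
server.shutdown()
```

Each serving process owns only its partition's view, so the load balancer (or a partition-aware router) is what stitches the partitions into a single logical keyspace.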