Hi, Renato, Thanks a lot! Please feel free to distribute this message if you encountered this question again. :)
Looking forward to your Kinesis and ActiveMQ patches! Cheers! -Yi On Sun, Oct 4, 2015 at 9:45 PM, Renato Marroquín Mogrovejo < renatoj.marroq...@gmail.com> wrote: > Hi Yi, > > Thanks a lot for sharing this POV, makes a lot more sense to me now what > both (KStreams and Samza) are targeting to. > > > Best, > > Renato M. > > 2015-10-03 20:13 GMT+02:00 Roger Hoover <roger.hoo...@gmail.com>: > > > Hi Yi, > > > > Thank you for sharing this update and perspective. I tend to agree that > > for simple, stateless cases, things could be easier and hopefully > KStreams > > may help with that. I also appreciate a lot of features that Samza > already > > supports for operations. As previously discussed, the biggest request I > > have is being able to run Samza without YARN, under something like > > Kubernetes instead. > > > > Also, I'm curious. What's the current state of the Samza SQL physical > > operators? Are they used in production yet? Is there a doc on how to > use > > them? > > > > Thanks, > > > > Roger > > > > > > > > On Fri, Oct 2, 2015 at 1:54 PM, Yi Pan <nickpa...@gmail.com> wrote: > > > > > Hi, all Samza-lovers, > > > > > > This question on the relationship of Kafka KStream (KIP-28) and Samza > has > > > come up a couple times recently. So we wanted to clarify where we stand > > at > > > LinkedIn in terms of this discussion. > > > > > > Samza has historically had a symbiotic relationship with Kafka and will > > > continue to work very well with Kafka. Earlier in the year, we had an > > > in-depth discussion exploring an even deeper integration with Kafka. > > After > > > hitting multiple practical issues (e.g. Apache rules) and technical > > issues > > > we had to give up on that idea. As a fall out of the discussion, the > > Kafka > > > community is adding some of the basic event processing capabilities > into > > > Kafka core directly. The basic callback/push style programming model by > > > itself is a great addition to the Kafka API set. > > > > > > However at LinkedIn, we continue to be firmly committed to Samza as our > > > stream processing framework. Although KStream is a nice addition to > Kafka > > > stack, our goals for Samza are broader. There are some key technical > > > differences that makes Samza the right strategy for us. > > > > > > 1. Support for non-kafka systems : > > > > > > At LinkedIn a larger percentage of our streaming jobs use Databus as an > > > input source. For any such non-Kafka source, although the CopyCat > > > connector framework gives a common model for pulling data out of a > source > > > and pushing it into Kafka, it introduces yet another piece of > > > infrastructure that we have to operate and manage. Also for any > > companies > > > who are already on AWS, Google Compute, Azure etc. asking them to > deploy > > > and operate kafka in AWS instead of using the natively supported > services > > > like Kinesis, Google Cloud pub-sub, etc. etc. can potentially be a > > > non-starter. With more acquisitions at LinkedIn that use AWS we are > also > > > facing this first hand. The Samza community has a healthy set of > system > > > producers/consumers which are in the works (Kinesis, ActiveMQ, > > > ElasticSearch, HDFS, etc.). > > > > > > 2. We run Samza as a Stream Processing Service at LinkedIn. This is > > > fundamentally different from KStream. > > > > > > This is similar to AWS Lambda and Google Cloud Dataflow, Azure Stream > > > Insight and similar services. The service makes it much easier for > > > developers to get their stream processing jobs up and running in > > production > > > by helping with the most common problems like monitoring, dashboards, > > > auto-scale, rolling upgrades and such. > > > > > > The truth is that if the stream processing application is stateless > then > > > some of these common problems are not as involved and can be solved > even > > on > > > regular IaaS platforms like EC2 and such. Arguably stateless > > applications > > > can be built easily on top of the native APIs from the input source > like > > > kafka, kinesis etc. > > > > > > The place where Samza shines is with stateful stream processing > > > applications. When a Samza application uses the local rocks DB based > > > state, the application needs special care in terms of rolling upgrades, > > > addition/removal of containers/machines, temporary machine failures, > > > capacity management. We have already done some really good work in > Samza > > > 0.10 to make sure that we don't reseed the state from kafka (i.e. > > > host-affinity feature that allows to reuse the local states). In the > > > absence of this feature, we had multiple production issues caused due > to > > > network saturation during state reseeding from kafka. The problems > with > > > stateful applications are similar to problems encountered when building > > > no-sql databases and other data systems. > > > > > > There are surely some scenarios where customers don't want any YARN > > > dependency and want to run their stream processing application on a > > > dedicated set of nodes. This is where KStream clearly has an advantage > > > over current Samza. Prior to KStream we had a patch in Samza which also > > > solved the same problem (SAMZA-516). We do expect to finish this patch > > soon > > > and formally support Stand Alone Samza. > > > > > > 3. Operators for Stream Processing and SQL : > > > > > > At LinkedIn, there is a growing need to iterate Samza jobs faster in > the > > > loop of implement, deploy, revise the code, and deploy again. A key > > > bottleneck that slows down this iteration is the implementation of a > > Samza > > > job. It has long-been recognized in the Samza community that there is a > > > strong need for a high-level language support to shorten this iterative > > > process. Since last year, we have identified SQL as the user-facing > > > high-level language and completed the high-level design and started > > > prototyping it in Samza. The prototype starts with a set of physical > > > operators which are crucial to the correctness of streaming processing, > > > namely, the window operator, aggregation, and join. KStream adopts some > > of > > > these core ideas. However, our view in Samza’s SQL support goes beyond > > > what’s covered in KStream. We want Samza’s SQL support to be as easy as > > > Google Dataflow and Azure Stream Analytics, in which a user can upload > a > > > query statement and the system will parse the query, translate it into > a > > > distributed execution plan, allocate the containers and stream > resources > > in > > > a cluster, and deploy it. To support this grand vision, our effort in > > > building the SQL operators API, the query planner and optimizers is > > vastly > > > different from what KStream covers, which only covers a single node > > > programming interface. > > > > > > Independent of these strategic differences, one big aspect for us is > also > > > the fact that Samza is an established and mature system which we have > > > successfully operationalized and has been running in production for a > few > > > years. > > > > > > > > > > > > Thanks! > > > > > > -Yi Pan > > > Samza @ LinkedIn > > > > > >