Just to make it explicitly clear what I am proposing, here is a version of more detailed description:
The fourth option (in addition to what Jakob summarized) we are proposing is: - Recharter Samza to “stream processing as a service” - The current Samza core (the basic transformation API w/ basic partition and offset management build-in) will be moved to Kafka Streams (i.e. part of Kafka) and supports “run-as-a-library” - Deprecate the SystemConsumers and SystemProducers APIs and move them to Copycat - The current SQL development: * physical operators and a Trident-like stream API should stay in Kafka Streams as libraries, enabling any standalone deployment to use the core window/join functions * query parser/planner and execution on top of a distributed service should stay in new Samza (i.e. “stream processing as a service”) - Advanced features related to job scheduling/state management stays in new Samza (i.e. “streaming processing as a service”) * Any advanced PartitionManager implementation that can be plugged into Kafka Streams * Any auto-scaling, dynamic configuration via coordinator stream * Any advanced state management s.t. host-affinity etc. Pros: - W/ the current Samza core as Kafka Streams and move the ingestion to Copycat, we achieved most of the goals in the initial proposal: * Tighter coupling w/ Kafka * Reuse Kafka’s build-in functionalities, such as offset manager, basic partition distribution * Separation of ingestion vs transformation APIs * offload a lot of system-specific configuration to Kafka Streams and Copycat (i.e. SystemFactory configure, serde configure, etc.) * remove YARN dependency and make standalone deployment easy. As Guozhang mentioned, it would be really easy to start a process that internally run Kafka Streams as library. - By re-chartering Samza as “stream processing as a service”, we address the concern regarding to * Pluggable partition management * Running in a distributed cluster to manage process lifecycle, fault-tolerance, resource-allocation, etc. * More advanced features s.t. host-affinity, auto-scaling, and dynamic configure changes, etc. Regarding to the code and community organization, I think the following may be the best: Code: - A Kafka sub-project Kafka Streams to hold samza-core, samza-kv-store, and the physical operator layer as library in SQL: this would allow better alignment w/ Kafka, in code, doc, and branding - Retain the current Samza project just to keep * A pluggable explicit partition management in Kafka Streams client * Integration w/ cluster-management systems for advanced features: * host-affinity, auto-scaling,, dynamic configuration, etc. * It will fully depend on the Kafka Streams API and remove all support for SystemConsumers/SystemProducers in the future Community: (this is almost the same as what Chris proposed) - Kafka Streams: the current Samza community should be supporting this effort together with some Kafka members, since most of the code here will be from samza-core, samza-kv-store, and samza-sql. - new Samza: the current Samza community should continue serve the course to support more advanced features to run Kafka Streams as a service. Arguably, the new Samza framework may be used to run Copycat workers as well, at least to manage Copycat worker’s lifecycle in a clustered environment. Hence, it would stay as a general stream processing framework that takes in any source and output to any destination, just the transport system is fixed to Kafka. On Sun, Jul 12, 2015 at 7:29 PM, Yi Pan <nickpa...@gmail.com> wrote: > Hi, Chris, > > Thanks for sending out this concrete set of points here. I agree w/ all > but have a slight different point view on 8). > > My view on this is: instead of sunset Samza as TLP, can we re-charter the > scope of Samza to be the home for "running streaming process as a service"? > > My main motivation is from the following points from a long internal > discussion in LinkedIn: > > - There is a clear ask for pluggable partition management, like we do in > LinkedIn, and as Ben Kirwin has mentioned in > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3ccacux-d-yjx++2gnf_1laf10kyuvyamg7up_dt19v0znmmhb...@mail.gmail.com%3E > - There are concerns on lack of support for running stream processing in a > cluster: lifecycle management, resource allocation, fault tolerance, etc. > - There is a question to how to support more advanced features s.t. > host-affinity, auto-scaling, and dynamic configuration in Samza jobs, as > raised by Martin here: > http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3c0d66efd0-b7cd-4e4e-8b2f-2716167c3...@kleppmann.com%3E > > We have use cases that need to address all the above three cases and most > of the functions are all in the current Samza project, in some flavor. We > are all supporting to merge the samza-core functionalities into Kafka > Streams, but there is a question where we keep these functions in the > future. One option is to start a new project that includes these functions > that are closely related w/ "run stream-processing as-a-service", while > another personally more attractive option is to re-charter Samza project > just do "run stream processing as-a-service". We can avoid the overhead of > re-starting another community for this project. Personally, I felt that > here are the benefits we should be getting: > > 1. We have already agreed mostly that Kafka Streams API would allow some > pluggable partition management functions. Hence, the advanced partition > management can live out-side the new Kafka Streams core w/o affecting the > run-as-a-library model in Kafka Streams. > 2. The integration w/ cluster management system and advanced features > listed above stays in the same project and allow existing users enjoy > no-impact migration to Kafka Stream as the core. That also addresses Tim's > question on "removing the support for YARN". > 3. A separate project for stream-processing-as-a-service also allow the > new Kafka Streams being independent to any cluster management and just > focusing on stream process core functions, while leaving the functions that > requires cluster-resource and state management to a separate layer. > > Please feel free to comment. Thanks! > > On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <criccom...@apache.org> > wrote: > >> Hey all, >> >> I want to start by saying that I'm absolutely thrilled to be a part of >> this >> community. The amount of level-headed, thoughtful, educated discussion >> that's gone on over the past ~10 days is overwhelming. Wonderful. >> >> It seems like discussion is waning a bit, and we've reached some >> conclusions. There are several key emails in this threat, which I want to >> call out: >> >> 1. Jakob's summary of the three potential ways forward. >> >> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E >> 2. Julian's call out that we should be focusing on community over code. >> >> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E >> 3. Martin's summary about the benefits of merging communities. >> >> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E >> 4. Jakob's comments about the distinction between community and code >> paths. >> >> >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E >> >> I agree with the comments on all of these emails. I think Martin's summary >> of his position aligns very closely with my own. To that end, I think we >> should get concrete about what the proposal is, and call a vote on it. >> Given that Jay, Martin, and I seem to be aligning fairly closely, I think >> we should start with: >> >> 1. [community] Make Samza a subproject of Kafka. >> 2. [community] Make all Samza PMC/committers committers of the subproject. >> 3. [community] Migrate Samza's website/documentation into Kafka's. >> 4. [code] Have the Samza community and the Kafka community start a >> from-scratch reboot together in the new Kafka subproject. We can >> borrow/copy & paste significant chunks of code from Samza's code base. >> 5. [code] The subproject would intentionally eliminate support for both >> other streaming systems and all deployment systems. >> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26 >> (copy cat) >> 7. [code] Attempt to provide a bridge from the new subproject's processor >> interface to our legacy StreamTask interface. >> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka >> subproject that has a fault-tolerant container with state management. >> >> It's likely that (6) and (7) won't be fully drop-in. Still, the closer we >> can get, the better it's going to be for our existing community. >> >> One thing that I didn't touch on with (2) is whether any Samza PMC members >> should be rolled into Kafka PMC membership as well (though, Jay and Jakob >> are already PMC members on both). I think that Samza's community deserves >> a >> voice on the PMC, so I'd propose that we roll at least a few PMC members >> into the Kafka PMC, but I don't have a strong framework for which people >> to >> pick. >> >> Before (8), I think that Samza's TLP can continue to commit bug fixes and >> patches as it sees fit, provided that we openly communicate that we won't >> necessarily migrate new features to the new subproject, and that the TLP >> will be shut down after the migration to the Kafka subproject occurs. >> >> Jakob, I could use your guidance here about about how to achieve this from >> an Apache process perspective (sorry). >> >> * Should I just call a vote on this proposal? >> * Should it happen on dev or private? >> * Do committers have binding votes, or just PMC? >> >> Having trouble finding much detail on the Apache wikis. :( >> >> Cheers, >> Chris >> >> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <yanfang...@gmail.com> wrote: >> >> > Thanks, Jay. This argument persuaded me actually. :) >> > >> > Fang, Yan >> > yanfang...@gmail.com >> > >> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <j...@confluent.io> wrote: >> > >> > > Hey Yan, >> > > >> > > Yeah philosophically I think the argument is that you should capture >> the >> > > stream in Kafka independent of the transformation. This is obviously a >> > > Kafka-centric view point. >> > > >> > > Advantages of this: >> > > - In practice I think this is what e.g. Storm people often end up >> doing >> > > anyway. You usually need to throttle any access to a live serving >> > database. >> > > - Can have multiple subscribers and they get the same thing without >> > > additional load on the source system. >> > > - Applications can tap into the stream if need be by subscribing. >> > > - You can debug your transformation by tailing the Kafka topic with >> the >> > > console consumer >> > > - Can tee off the same data stream for batch analysis or Lambda arch >> > style >> > > re-processing >> > > >> > > The disadvantage is that it will use Kafka resources. But the idea is >> > > eventually you will have multiple subscribers to any data source (at >> > least >> > > for monitoring) so you will end up there soon enough anyway. >> > > >> > > Down the road the technical benefit is that I think it gives us a good >> > path >> > > towards end-to-end exactly once semantics from source to destination. >> > > Basically the connectors need to support idempotence when talking to >> > Kafka >> > > and we need the transactional write feature in Kafka to make the >> > > transformation atomic. This is actually pretty doable if you separate >> > > connector=>kafka problem from the generic transformations which are >> > always >> > > kafka=>kafka. However I think it is quite impossible to do in a >> > all_things >> > > => all_things environment. Today you can say "well the semantics of >> the >> > > Samza APIs depend on the connectors you use" but it is actually worse >> > then >> > > that because the semantics actually depend on the pairing of >> > connectors--so >> > > not only can you probably not get a usable "exactly once" guarantee >> > > end-to-end it can actually be quite hard to reverse engineer what >> > property >> > > (if any) your end-to-end flow has if you have heterogenous systems. >> > > >> > > -Jay >> > > >> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <yanfang...@gmail.com> >> wrote: >> > > >> > > > {quote} >> > > > maintained in a separate repository and retaining the existing >> > > > committership but sharing as much else as possible (website, etc) >> > > > {quote} >> > > > >> > > > Overall, I agree on this idea. Now the question is more about "how >> to >> > do >> > > > it". >> > > > >> > > > On the other hand, one thing I want to point out is that, if we >> decide >> > to >> > > > go this way, how do we want to support >> > > > otherSystem-transformation-otherSystem use case? >> > > > >> > > > Basically, there are four user groups here: >> > > > >> > > > 1. Kafka-transformation-Kafka >> > > > 2. Kafka-transformation-otherSystem >> > > > 3. otherSystem-transformation-Kafka >> > > > 4. otherSystem-transformation-otherSystem >> > > > >> > > > For group 1, they can easily use the new Samza library to achieve. >> For >> > > > group 2 and 3, they can use copyCat -> transformation -> Kafka or >> > Kafka-> >> > > > transformation -> copyCat. >> > > > >> > > > The problem is for group 4. Do we want to abandon this or still >> support >> > > it? >> > > > Of course, this use case can be achieved by using copyCat -> >> > > transformation >> > > > -> Kafka -> transformation -> copyCat, the thing is how we persuade >> > them >> > > to >> > > > do this long chain. If yes, it will also be a win for Kafka too. Or >> if >> > > > there is no one in this community actually doing this so far, maybe >> ok >> > to >> > > > not support the group 4 directly. >> > > > >> > > > Thanks, >> > > > >> > > > Fang, Yan >> > > > yanfang...@gmail.com >> > > > >> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> >> wrote: >> > > > >> > > > > Yeah I agree with this summary. I think there are kind of two >> > questions >> > > > > here: >> > > > > 1. Technically does alignment/reliance on Kafka make sense >> > > > > 2. Branding wise (naming, website, concepts, etc) does alignment >> with >> > > > Kafka >> > > > > make sense >> > > > > >> > > > > Personally I do think both of these things would be really >> valuable, >> > > and >> > > > > would dramatically alter the trajectory of the project. >> > > > > >> > > > > My preference would be to see if people can mostly agree on a >> > direction >> > > > > rather than splintering things off. From my point of view the >> ideal >> > > > outcome >> > > > > of all the options discussed would be to make Samza a closely >> aligned >> > > > > subproject, maintained in a separate repository and retaining the >> > > > existing >> > > > > committership but sharing as much else as possible (website, >> etc). No >> > > > idea >> > > > > about how these things work, Jacob, you probably know more. >> > > > > >> > > > > No discussion amongst the Kafka folks has happened on this, but >> > likely >> > > we >> > > > > should figure out what the Samza community actually wants first. >> > > > > >> > > > > I admit that this is a fairly radical departure from how things >> are. >> > > > > >> > > > > If that doesn't fly, I think, yeah we could leave Samza as it is >> and >> > do >> > > > the >> > > > > more radical reboot inside Kafka. From my point of view that does >> > leave >> > > > > things in a somewhat confusing state since now there are two >> stream >> > > > > processing systems more or less coupled to Kafka in large part >> made >> > by >> > > > the >> > > > > same people. But, arguably that might be a cleaner way to make the >> > > > cut-over >> > > > > and perhaps less risky for Samza community since if it works >> people >> > can >> > > > > switch and if it doesn't nothing will have changed. Dunno, how do >> > > people >> > > > > feel about this? >> > > > > >> > > > > -Jay >> > > > > >> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com> >> > > wrote: >> > > > > >> > > > > > > This leads me to thinking that merging projects and >> communities >> > > > might >> > > > > > be a good idea: with the union of experience from both >> communities, >> > > we >> > > > > will >> > > > > > probably build a better system that is better for users. >> > > > > > Is this what's being proposed though? Merging the projects seems >> > like >> > > > > > a consequence of at most one of the three directions under >> > > discussion: >> > > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka >> for >> > > > > > configuration, etc. (to a greater or lesser extent to be >> > determined) >> > > > > > but the Samza community would not automatically merge withe >> Kafka >> > > > > > community (the Phoenix/HBase example is a good one here). >> > > > > > 2) Samza Reboot: The Samza community continues to exist with a >> > > limited >> > > > > > project scope, but similarly would not need to be part of the >> Kafka >> > > > > > community (ie given committership) to progress. Here, maybe the >> > > Samza >> > > > > > team would become a subproject of Kafka (the Board frowns on >> > > > > > subprojects at the moment, so I'm not sure if that's even >> > feasible), >> > > > > > but that would not be required. >> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the >> > Kafka >> > > > > > team builds its own streaming library, possibly off of Jay's >> > > > > > prototype, which has not direct lineage to the Samza team. >> There's >> > > no >> > > > > > reason for the Kafka team to bring in the Samza team. >> > > > > > >> > > > > > Is the Kafka community on board with this? >> > > > > > >> > > > > > To be clear, all three options under discussion are interesting, >> > > > > > technically valid and likely healthy directions for the project. >> > > > > > Also, they are not mutually exclusive. The Samza community >> could >> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka community >> went >> > > > > > forward with 'Hey Samza!' My points above are directed >> entirely at >> > > > > > the community aspect of these choices. >> > > > > > -Jakob >> > > > > > >> > > > > > On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com> >> > > wrote: >> > > > > > > That's great. Thanks, Jay. >> > > > > > > >> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io> >> > > wrote: >> > > > > > > >> > > > > > >> Yeah totally agree. I think you have this issue even today, >> > right? >> > > > > I.e. >> > > > > > if >> > > > > > >> you need to make a simple config change and you're running in >> > YARN >> > > > > today >> > > > > > >> you end up bouncing the job which then rebuilds state. I >> think >> > the >> > > > fix >> > > > > > is >> > > > > > >> exactly what you described which is to have a long timeout on >> > > > > partition >> > > > > > >> movement for stateful jobs so that if a job is just getting >> > > bounced, >> > > > > and >> > > > > > >> the cluster manager (or admin) is smart enough to restart it >> on >> > > the >> > > > > same >> > > > > > >> host when possible, it can optimistically reuse any existing >> > state >> > > > it >> > > > > > finds >> > > > > > >> on disk (if it is valid). >> > > > > > >> >> > > > > > >> So in this model the charter of the CM is to place processes >> as >> > > > > > stickily as >> > > > > > >> possible and to restart or re-place failed processes. The >> > charter >> > > of >> > > > > the >> > > > > > >> partition management system is to control the assignment of >> work >> > > to >> > > > > > these >> > > > > > >> processes. The nice thing about this is that the work >> > assignment, >> > > > > > timeouts, >> > > > > > >> behavior, configs, and code will all be the same across all >> > > cluster >> > > > > > >> managers. >> > > > > > >> >> > > > > > >> So I think that prototype would actually give you exactly >> what >> > you >> > > > > want >> > > > > > >> today for any cluster manager (or manual placement + restart >> > > script) >> > > > > > that >> > > > > > >> was sticky in terms of host placement since there is already >> a >> > > > > > configurable >> > > > > > >> partition movement timeout and task-by-task state reuse with >> a >> > > check >> > > > > on >> > > > > > >> state validity. >> > > > > > >> >> > > > > > >> -Jay >> > > > > > >> >> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover < >> > > > roger.hoo...@gmail.com >> > > > > > >> > > > > > >> wrote: >> > > > > > >> >> > > > > > >> > That would be great to let Kafka do as much heavy lifting >> as >> > > > > possible >> > > > > > and >> > > > > > >> > make it easier for other languages to implement Samza apis. >> > > > > > >> > >> > > > > > >> > One thing to watch out for is the interplay between Kafka's >> > > group >> > > > > > >> > management and the external scheduler/process manager's >> fault >> > > > > > tolerance. >> > > > > > >> > If a container dies, the Kafka group membership protocol >> will >> > > try >> > > > to >> > > > > > >> assign >> > > > > > >> > it's tasks to other containers while at the same time the >> > > process >> > > > > > manager >> > > > > > >> > is trying to relaunch the container. Without some >> > consideration >> > > > for >> > > > > > this >> > > > > > >> > (like a configurable amount of time to wait before Kafka >> > alters >> > > > the >> > > > > > group >> > > > > > >> > membership), there may be thrashing going on which is >> > especially >> > > > bad >> > > > > > for >> > > > > > >> > containers with large amounts of local state. >> > > > > > >> > >> > > > > > >> > Someone else pointed this out already but I thought it >> might >> > be >> > > > > worth >> > > > > > >> > calling out again. >> > > > > > >> > >> > > > > > >> > Cheers, >> > > > > > >> > >> > > > > > >> > Roger >> > > > > > >> > >> > > > > > >> > >> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps < >> j...@confluent.io> >> > > > > wrote: >> > > > > > >> > >> > > > > > >> > > Hey Roger, >> > > > > > >> > > >> > > > > > >> > > I couldn't agree more. We spent a bunch of time talking >> to >> > > > people >> > > > > > and >> > > > > > >> > that >> > > > > > >> > > is exactly the stuff we heard time and again. What makes >> it >> > > > hard, >> > > > > of >> > > > > > >> > > course, is that there is some tension between >> compatibility >> > > with >> > > > > > what's >> > > > > > >> > > there now and making things better for new users. >> > > > > > >> > > >> > > > > > >> > > I also strongly agree with the importance of >> multi-language >> > > > > > support. We >> > > > > > >> > are >> > > > > > >> > > talking now about Java, but for application development >> use >> > > > cases >> > > > > > >> people >> > > > > > >> > > want to work in whatever language they are using >> elsewhere. >> > I >> > > > > think >> > > > > > >> > moving >> > > > > > >> > > to a model where Kafka itself does the group membership, >> > > > lifecycle >> > > > > > >> > control, >> > > > > > >> > > and partition assignment has the advantage of putting all >> > that >> > > > > > complex >> > > > > > >> > > stuff behind a clean api that the clients are already >> going >> > to >> > > > be >> > > > > > >> > > implementing for their consumer, so the added >> functionality >> > > for >> > > > > > stream >> > > > > > >> > > processing beyond a consumer becomes very minor. >> > > > > > >> > > >> > > > > > >> > > -Jay >> > > > > > >> > > >> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover < >> > > > > > roger.hoo...@gmail.com> >> > > > > > >> > > wrote: >> > > > > > >> > > >> > > > > > >> > > > Metamorphosis...nice. :) >> > > > > > >> > > > >> > > > > > >> > > > This has been a great discussion. As a user of Samza >> > who's >> > > > > > recently >> > > > > > >> > > > integrated it into a relatively large organization, I >> just >> > > > want >> > > > > to >> > > > > > >> add >> > > > > > >> > > > support to a few points already made. >> > > > > > >> > > > >> > > > > > >> > > > The biggest hurdles to adoption of Samza as it >> currently >> > > > exists >> > > > > > that >> > > > > > >> > I've >> > > > > > >> > > > experienced are: >> > > > > > >> > > > 1) YARN - YARN is overly complex in many environments >> > where >> > > > > Puppet >> > > > > > >> > would >> > > > > > >> > > do >> > > > > > >> > > > just fine but it was the only mechanism to get fault >> > > > tolerance. >> > > > > > >> > > > 2) Configuration - I think I like the idea of >> configuring >> > > most >> > > > > of >> > > > > > the >> > > > > > >> > job >> > > > > > >> > > > in code rather than config files. In general, I think >> the >> > > > goal >> > > > > > >> should >> > > > > > >> > be >> > > > > > >> > > > to make it harder to make mistakes, especially of the >> kind >> > > > where >> > > > > > the >> > > > > > >> > code >> > > > > > >> > > > expects something and the config doesn't match. The >> > current >> > > > > > config >> > > > > > >> is >> > > > > > >> > > > quite intricate and error-prone. For example, the >> > > application >> > > > > > logic >> > > > > > >> > may >> > > > > > >> > > > depend on bootstrapping a topic but rather than >> asserting >> > > that >> > > > > in >> > > > > > the >> > > > > > >> > > code, >> > > > > > >> > > > you have to rely on getting the config right. Likewise >> > with >> > > > > > serdes, >> > > > > > >> > the >> > > > > > >> > > > Java representations produced by various serdes (JSON, >> > Avro, >> > > > > etc.) >> > > > > > >> are >> > > > > > >> > > not >> > > > > > >> > > > equivalent so you cannot just reconfigure a serde >> without >> > > > > changing >> > > > > > >> the >> > > > > > >> > > > code. It would be nice for jobs to be able to assert >> > what >> > > > they >> > > > > > >> expect >> > > > > > >> > > > from their input topics in terms of partitioning. >> This is >> > > > > > getting a >> > > > > > >> > > little >> > > > > > >> > > > off topic but I was even thinking about creating a >> "Samza >> > > > config >> > > > > > >> > linter" >> > > > > > >> > > > that would sanity check a set of configs. Especially >> in >> > > > > > >> organizations >> > > > > > >> > > > where config is managed by a different team than the >> > > > application >> > > > > > >> > > developer, >> > > > > > >> > > > it's very hard to get avoid config mistakes. >> > > > > > >> > > > 3) Java/Scala centric - for many teams (especially >> > > DevOps-type >> > > > > > >> folks), >> > > > > > >> > > the >> > > > > > >> > > > pain of the Java toolchain (maven, slow builds, weak >> > command >> > > > > line >> > > > > > >> > > support, >> > > > > > >> > > > configuration over convention) really inhibits >> > productivity. >> > > > As >> > > > > > more >> > > > > > >> > and >> > > > > > >> > > > more high-quality clients become available for Kafka, I >> > hope >> > > > > > they'll >> > > > > > >> > > follow >> > > > > > >> > > > Samza's model. Not sure how much it affects the >> proposals >> > > in >> > > > > this >> > > > > > >> > thread >> > > > > > >> > > > but please consider other languages in the ecosystem as >> > > well. >> > > > > > From >> > > > > > >> > what >> > > > > > >> > > > I've heard, Spark has more Python users than >> Java/Scala. >> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > >> > > > > > >> > >> > > > > > >> >> > > > > > >> > > > > >> > > > >> > > >> > >> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza >> > > > > > >> > > > and are working on a Yeoman generator >> > > > > > >> > > > https://github.com/Quantiply/generator-rico for >> > > Jython/Samza >> > > > > > >> projects >> > > > > > >> > to >> > > > > > >> > > > alleviate some of the pain) >> > > > > > >> > > > >> > > > > > >> > > > I also want to underscore Jay's point about improving >> the >> > > user >> > > > > > >> > > experience. >> > > > > > >> > > > That's a very important factor for adoption. I think >> the >> > > goal >> > > > > > should >> > > > > > >> > be >> > > > > > >> > > to >> > > > > > >> > > > make Samza as easy to get started with as something >> like >> > > > > Logstash. >> > > > > > >> > > > Logstash is vastly inferior in terms of capabilities to >> > > Samza >> > > > > but >> > > > > > >> it's >> > > > > > >> > > easy >> > > > > > >> > > > to get started and that makes a big difference. >> > > > > > >> > > > >> > > > > > >> > > > Cheers, >> > > > > > >> > > > >> > > > > > >> > > > Roger >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > >> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci >> > > > Morales < >> > > > > > >> > > > g...@apache.org> wrote: >> > > > > > >> > > > >> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka >> Metamorphosis >> > > is >> > > > a >> > > > > > clear >> > > > > > >> > > > winner >> > > > > > >> > > > > :) >> > > > > > >> > > > > >> > > > > > >> > > > > -- >> > > > > > >> > > > > Gianmarco >> > > > > > >> > > > > >> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci >> Morales >> > < >> > > > > > >> > > g...@apache.org >> > > > > > >> > > > > >> > > > > > >> > > > > wrote: >> > > > > > >> > > > > >> > > > > > >> > > > > > Hi, >> > > > > > >> > > > > > >> > > > > > >> > > > > > @Martin, thanks for you comments. >> > > > > > >> > > > > > Maybe I'm missing some important point, but I think >> > > > coupling >> > > > > > the >> > > > > > >> > > > releases >> > > > > > >> > > > > > is actually a *good* thing. >> > > > > > >> > > > > > To make an example, would it be better if the MR >> and >> > > HDFS >> > > > > > >> > components >> > > > > > >> > > of >> > > > > > >> > > > > > Hadoop had different release schedules? >> > > > > > >> > > > > > >> > > > > > >> > > > > > Actually, keeping the discussion in a single place >> > would >> > > > > make >> > > > > > >> > > agreeing >> > > > > > >> > > > on >> > > > > > >> > > > > > releases (and backwards compatibility) much >> easier, as >> > > > > > everybody >> > > > > > >> > > would >> > > > > > >> > > > be >> > > > > > >> > > > > > responsible for the whole codebase. >> > > > > > >> > > > > > >> > > > > > >> > > > > > That said, I like the idea of absorbing samza-core >> as >> > a >> > > > > > >> > sub-project, >> > > > > > >> > > > and >> > > > > > >> > > > > > leave the fancy stuff separate. >> > > > > > >> > > > > > It probably gives 90% of the benefits we have been >> > > > > discussing >> > > > > > >> here. >> > > > > > >> > > > > > >> > > > > > >> > > > > > Cheers, >> > > > > > >> > > > > > >> > > > > > >> > > > > > -- >> > > > > > >> > > > > > Gianmarco >> > > > > > >> > > > > > >> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps < >> > jay.kr...@gmail.com >> > > > >> > > > > > wrote: >> > > > > > >> > > > > > >> > > > > > >> > > > > >> Hey Martin, >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> I agree coupling release schedules is a downside. >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> Definitely we can try to solve some of the >> > integration >> > > > > > problems >> > > > > > >> in >> > > > > > >> > > > > >> Confluent Platform or in other distributions. But >> I >> > > think >> > > > > > this >> > > > > > >> > ends >> > > > > > >> > > up >> > > > > > >> > > > > >> being really shallow. I guess I feel to really >> get a >> > > good >> > > > > > user >> > > > > > >> > > > > experience >> > > > > > >> > > > > >> the two systems have to kind of feel like part of >> the >> > > > same >> > > > > > thing >> > > > > > >> > and >> > > > > > >> > > > you >> > > > > > >> > > > > >> can't really add that in later--you can put both >> in >> > the >> > > > > same >> > > > > > >> > > > > downloadable >> > > > > > >> > > > > >> tar file but it doesn't really give a very >> cohesive >> > > > > feeling. >> > > > > > I >> > > > > > >> > agree >> > > > > > >> > > > > that >> > > > > > >> > > > > >> ultimately any of the project stuff is as much >> social >> > > and >> > > > > > naming >> > > > > > >> > as >> > > > > > >> > > > > >> anything else--theoretically two totally >> independent >> > > > > projects >> > > > > > >> > could >> > > > > > >> > > > work >> > > > > > >> > > > > >> to >> > > > > > >> > > > > >> tightly align. In practice this seems to be quite >> > > > difficult >> > > > > > >> > though. >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> For the frameworks--totally agree it would be >> good to >> > > > > > maintain >> > > > > > >> the >> > > > > > >> > > > > >> framework support with the project. In some cases >> > there >> > > > may >> > > > > > not >> > > > > > >> be >> > > > > > >> > > too >> > > > > > >> > > > > >> much >> > > > > > >> > > > > >> there since the integration gets lighter but I >> think >> > > > > whatever >> > > > > > >> > stubs >> > > > > > >> > > > you >> > > > > > >> > > > > >> need should be included. So no I definitely wasn't >> > > trying >> > > > > to >> > > > > > >> imply >> > > > > > >> > > > > >> dropping >> > > > > > >> > > > > >> support for these frameworks, just making the >> > > integration >> > > > > > >> lighter >> > > > > > >> > by >> > > > > > >> > > > > >> separating process management from partition >> > > management. >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> You raise two good points we would have to figure >> out >> > > if >> > > > we >> > > > > > went >> > > > > > >> > > down >> > > > > > >> > > > > the >> > > > > > >> > > > > >> alignment path: >> > > > > > >> > > > > >> 1. With respect to the name, yeah I think the >> first >> > > > > question >> > > > > > is >> > > > > > >> > > > whether >> > > > > > >> > > > > >> some "re-branding" would be worth it. If so then I >> > > think >> > > > we >> > > > > > can >> > > > > > >> > > have a >> > > > > > >> > > > > big >> > > > > > >> > > > > >> thread on the name. I'm definitely not set on >> Kafka >> > > > > > Streaming or >> > > > > > >> > > Kafka >> > > > > > >> > > > > >> Streams I was just using them to be kind of >> > > > illustrative. I >> > > > > > >> agree >> > > > > > >> > > with >> > > > > > >> > > > > >> your >> > > > > > >> > > > > >> critique of these names, though I think people >> would >> > > get >> > > > > the >> > > > > > >> idea. >> > > > > > >> > > > > >> 2. Yeah you also raise a good point about how to >> > > "factor" >> > > > > it. >> > > > > > >> Here >> > > > > > >> > > are >> > > > > > >> > > > > the >> > > > > > >> > > > > >> options I see (I could get enthusiastic about any >> of >> > > > them): >> > > > > > >> > > > > >> a. One repo for both Kafka and Samza >> > > > > > >> > > > > >> b. Two repos, retaining the current seperation >> > > > > > >> > > > > >> c. Two repos, the equivalent of samza-api and >> > > > samza-core >> > > > > > is >> > > > > > >> > > > absorbed >> > > > > > >> > > > > >> almost like a third client >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> Cheers, >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> -Jay >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann < >> > > > > > >> > > > mar...@kleppmann.com> >> > > > > > >> > > > > >> wrote: >> > > > > > >> > > > > >> >> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few >> > > follow-up >> > > > > > >> > comments. >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka or >> > becoming >> > > a >> > > > > > >> > subproject: >> > > > > > >> > > > the >> > > > > > >> > > > > >> > reasons you mention are good. The risk I see is >> > that >> > > > > > release >> > > > > > >> > > > schedules >> > > > > > >> > > > > >> > become coupled to each other, which can slow >> > everyone >> > > > > down, >> > > > > > >> and >> > > > > > >> > > > large >> > > > > > >> > > > > >> > projects with many contributors are harder to >> > manage. >> > > > > > (Jakob, >> > > > > > >> > can >> > > > > > >> > > > you >> > > > > > >> > > > > >> speak >> > > > > > >> > > > > >> > from experience, having seen a wider range of >> > Hadoop >> > > > > > ecosystem >> > > > > > >> > > > > >> projects?) >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > Some of the goals of a better unified developer >> > > > > experience >> > > > > > >> could >> > > > > > >> > > > also >> > > > > > >> > > > > be >> > > > > > >> > > > > >> > solved by integrating Samza nicely into a Kafka >> > > > > > distribution >> > > > > > >> > (such >> > > > > > >> > > > as >> > > > > > >> > > > > >> > Confluent's). I'm not against merging projects >> if >> > we >> > > > > decide >> > > > > > >> > that's >> > > > > > >> > > > the >> > > > > > >> > > > > >> way >> > > > > > >> > > > > >> > to go, just pointing out the same goals can >> perhaps >> > > > also >> > > > > be >> > > > > > >> > > achieved >> > > > > > >> > > > > in >> > > > > > >> > > > > >> > other ways. >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > - With regard to dropping the YARN dependency: >> are >> > > you >> > > > > > >> proposing >> > > > > > >> > > > that >> > > > > > >> > > > > >> > Samza doesn't give any help to people wanting to >> > run >> > > on >> > > > > > >> > > > > >> YARN/Mesos/AWS/etc? >> > > > > > >> > > > > >> > So the docs would basically have a link to >> Slider >> > and >> > > > > > nothing >> > > > > > >> > > else? >> > > > > > >> > > > Or >> > > > > > >> > > > > >> > would we maintain integrations with a bunch of >> > > popular >> > > > > > >> > deployment >> > > > > > >> > > > > >> methods >> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to >> make >> > > > Samza >> > > > > > work >> > > > > > >> > with >> > > > > > >> > > > > >> Slider)? >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > I absolutely think it's a good idea to have the >> > "as a >> > > > > > library" >> > > > > > >> > and >> > > > > > >> > > > > "as a >> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for >> people >> > who >> > > > > want >> > > > > > >> them, >> > > > > > >> > > > but I >> > > > > > >> > > > > >> > think there should also be a low-friction path >> for >> > > > common >> > > > > > "as >> > > > > > >> a >> > > > > > >> > > > > service" >> > > > > > >> > > > > >> > deployment methods, for which we probably need >> to >> > > > > maintain >> > > > > > >> > > > > integrations. >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to >> me, >> > > > > because >> > > > > > >> Kafka >> > > > > > >> > > is >> > > > > > >> > > > > all >> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka >> Transformers" >> > > or >> > > > > > "Kafka >> > > > > > >> > > > Filters" >> > > > > > >> > > > > >> > would be more apt? >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza >> (stream >> > > > > > >> transformation >> > > > > > >> > > > with >> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a >> library" >> > > bit) >> > > > > > could >> > > > > > >> > > become >> > > > > > >> > > > > >> part of >> > > > > > >> > > > > >> > Kafka, while higher-level tools such as >> streaming >> > SQL >> > > > and >> > > > > > >> > > > integrations >> > > > > > >> > > > > >> with >> > > > > > >> > > > > >> > deployment frameworks remain in a separate >> project? >> > > In >> > > > > > other >> > > > > > >> > > words, >> > > > > > >> > > > > >> Kafka >> > > > > > >> > > > > >> > would absorb the proven, stable core of Samza, >> > which >> > > > > would >> > > > > > >> > become >> > > > > > >> > > > the >> > > > > > >> > > > > >> > "third Kafka client" mentioned early in this >> > thread. >> > > > The >> > > > > > Samza >> > > > > > >> > > > project >> > > > > > >> > > > > >> > would then target that third Kafka client as its >> > base >> > > > > API, >> > > > > > and >> > > > > > >> > the >> > > > > > >> > > > > >> project >> > > > > > >> > > > > >> > would be freed up to explore more experimental >> new >> > > > > > horizons. >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > Martin >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps < >> > > > jay.kr...@gmail.com> >> > > > > > >> wrote: >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > > Hey Martin, >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually >> > don't >> > > > > think >> > > > > > it >> > > > > > >> > ties >> > > > > > >> > > > our >> > > > > > >> > > > > >> > hands >> > > > > > >> > > > > >> > > at all, all it does is refactor things. The >> > > division >> > > > of >> > > > > > >> > > > > >> responsibility is >> > > > > > >> > > > > >> > > that Samza core is responsible for task >> > lifecycle, >> > > > > state, >> > > > > > >> and >> > > > > > >> > > > > >> partition >> > > > > > >> > > > > >> > > management (using the Kafka co-ordinator) but >> it >> > is >> > > > NOT >> > > > > > >> > > > responsible >> > > > > > >> > > > > >> for >> > > > > > >> > > > > >> > > packaging, configuration deployment or >> execution >> > of >> > > > > > >> processes. >> > > > > > >> > > The >> > > > > > >> > > > > >> > problem >> > > > > > >> > > > > >> > > of packaging and starting these processes is >> > > > > > >> > > > > >> > > framework/environment-specific. This leaves >> > > > individual >> > > > > > >> > > frameworks >> > > > > > >> > > > to >> > > > > > >> > > > > >> be >> > > > > > >> > > > > >> > as >> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you can get >> > > simple >> > > > > > >> stateless >> > > > > > >> > > > > >> support in >> > > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app >> > > > > framework >> > > > > > >> > > (Slider, >> > > > > > >> > > > > >> > Marathon, >> > > > > > >> > > > > >> > > etc). These are well known by people and have >> > nice >> > > > UIs >> > > > > > and a >> > > > > > >> > lot >> > > > > > >> > > > of >> > > > > > >> > > > > >> > > flexibility. I don't think they have node >> > affinity >> > > > as a >> > > > > > >> built >> > > > > > >> > in >> > > > > > >> > > > > >> option >> > > > > > >> > > > > >> > > (though I could be wrong). So if we want that >> we >> > > can >> > > > > > either >> > > > > > >> > wait >> > > > > > >> > > > for >> > > > > > >> > > > > >> them >> > > > > > >> > > > > >> > > to add it or do a custom framework to add that >> > > > feature >> > > > > > (as >> > > > > > >> > now). >> > > > > > >> > > > > >> > Obviously >> > > > > > >> > > > > >> > > if you manage things with old-school ops tools >> > > > > > >> > (puppet/chef/etc) >> > > > > > >> > > > you >> > > > > > >> > > > > >> get >> > > > > > >> > > > > >> > > locality easily. The nice thing, though, is >> that >> > > all >> > > > > the >> > > > > > >> samza >> > > > > > >> > > > > >> "business >> > > > > > >> > > > > >> > > logic" around partition management and fault >> > > > tolerance >> > > > > > is in >> > > > > > >> > > Samza >> > > > > > >> > > > > >> core >> > > > > > >> > > > > >> > so >> > > > > > >> > > > > >> > > it is shared across frameworks and the >> framework >> > > > > specific >> > > > > > >> bit >> > > > > > >> > is >> > > > > > >> > > > > just >> > > > > > >> > > > > >> > > whether it is smart enough to try to get the >> same >> > > > host >> > > > > > when >> > > > > > >> a >> > > > > > >> > > job >> > > > > > >> > > > is >> > > > > > >> > > > > >> > > restarted. >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I >> think >> > > the >> > > > > > goal >> > > > > > >> > would >> > > > > > >> > > > be >> > > > > > >> > > > > >> (a) >> > > > > > >> > > > > >> > > actually get better alignment in user >> experience, >> > > and >> > > > > (b) >> > > > > > >> > > express >> > > > > > >> > > > > >> this in >> > > > > > >> > > > > >> > > the naming and project branding. Specifically: >> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the >> > > > > > "transformation" >> > > > > > >> api >> > > > > > >> > > to >> > > > > > >> > > > be >> > > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be >> able >> > > to >> > > > > > explain >> > > > > > >> > > when >> > > > > > >> > > > to >> > > > > > >> > > > > >> use >> > > > > > >> > > > > >> > > the consumer and when to use the stream >> > processing >> > > > > > >> > functionality >> > > > > > >> > > > and >> > > > > > >> > > > > >> lead >> > > > > > >> > > > > >> > > people into that experience. >> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2 >> (or >> > > > > > whatever) >> > > > > > >> > that >> > > > > > >> > > > has >> > > > > > >> > > > > >> both >> > > > > > >> > > > > >> > > Kafka and the stream processing part and they >> > > > actually >> > > > > > work >> > > > > > >> > > > > together. >> > > > > > >> > > > > >> > > 3. Unify the programming experience so the >> client >> > > and >> > > > > > Samza >> > > > > > >> > api >> > > > > > >> > > > > share >> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc. >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > I think sub-projects keep separate committers >> and >> > > can >> > > > > > have a >> > > > > > >> > > > > separate >> > > > > > >> > > > > >> > repo, >> > > > > > >> > > > > >> > > but I'm actually not really sure (I can't >> find a >> > > > > > definition >> > > > > > >> > of a >> > > > > > >> > > > > >> > subproject >> > > > > > >> > > > > >> > > in Apache). >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > Basically at a high-level you want the >> experience >> > > to >> > > > > > "feel" >> > > > > > >> > > like a >> > > > > > >> > > > > >> single >> > > > > > >> > > > > >> > > system, not to relatively independent things >> that >> > > are >> > > > > > kind >> > > > > > >> of >> > > > > > >> > > > > >> awkwardly >> > > > > > >> > > > > >> > > glued together. >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > I think if we did that they having naming or >> > > branding >> > > > > > like >> > > > > > >> > > "kafka >> > > > > > >> > > > > >> > > streaming" or "kafka streams" or something >> like >> > > that >> > > > > > would >> > > > > > >> > > > actually >> > > > > > >> > > > > >> do a >> > > > > > >> > > > > >> > > good job of conveying what it is. I do that >> this >> > > > would >> > > > > > help >> > > > > > >> > > > adoption >> > > > > > >> > > > > >> > quite >> > > > > > >> > > > > >> > > a lot as it would correctly convey that using >> > Kafka >> > > > > > >> Streaming >> > > > > > >> > > with >> > > > > > >> > > > > >> Kafka >> > > > > > >> > > > > >> > is >> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka is >> pretty >> > > > > heavily >> > > > > > >> > adopted >> > > > > > >> > > > at >> > > > > > >> > > > > >> this >> > > > > > >> > > > > >> > > point. >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > Fwiw we actually considered this model >> originally >> > > > when >> > > > > > open >> > > > > > >> > > > sourcing >> > > > > > >> > > > > >> > Samza, >> > > > > > >> > > > > >> > > however at that time Kafka was relatively >> unknown >> > > and >> > > > > we >> > > > > > >> > decided >> > > > > > >> > > > not >> > > > > > >> > > > > >> to >> > > > > > >> > > > > >> > do >> > > > > > >> > > > > >> > > it since we felt it would be limiting. From my >> > > point >> > > > of >> > > > > > view >> > > > > > >> > the >> > > > > > >> > > > > three >> > > > > > >> > > > > >> > > things have changed (1) Kafka is now really >> > heavily >> > > > > used >> > > > > > for >> > > > > > >> > > > stream >> > > > > > >> > > > > >> > > processing, (2) we learned that abstracting >> out >> > the >> > > > > > stream >> > > > > > >> > well >> > > > > > >> > > is >> > > > > > >> > > > > >> > > basically impossible, (3) we learned it is >> really >> > > > hard >> > > > > to >> > > > > > >> keep >> > > > > > >> > > the >> > > > > > >> > > > > two >> > > > > > >> > > > > >> > > things feeling like a single product. >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > -Jay >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin >> Kleppmann >> > < >> > > > > > >> > > > > >> mar...@kleppmann.com> >> > > > > > >> > > > > >> > > wrote: >> > > > > > >> > > > > >> > > >> > > > > > >> > > > > >> > >> Hi all, >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> Lots of good thoughts here. >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> I agree with the general philosophy of tying >> > Samza >> > > > > more >> > > > > > >> > firmly >> > > > > > >> > > to >> > > > > > >> > > > > >> Kafka. >> > > > > > >> > > > > >> > >> After I spent a while looking at integrating >> > other >> > > > > > message >> > > > > > >> > > > brokers >> > > > > > >> > > > > >> (e.g. >> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the >> > > > conclusion >> > > > > > that >> > > > > > >> > > > > >> > SystemConsumer >> > > > > > >> > > > > >> > >> tacitly assumes a model so much like Kafka's >> > that >> > > > > pretty >> > > > > > >> much >> > > > > > >> > > > > nobody >> > > > > > >> > > > > >> but >> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus is >> > perhaps >> > > an >> > > > > > >> > exception, >> > > > > > >> > > > but >> > > > > > >> > > > > >> it >> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.) Thus, >> > > making >> > > > > > Samza >> > > > > > >> > > fully >> > > > > > >> > > > > >> > dependent >> > > > > > >> > > > > >> > >> on Kafka acknowledges that the >> > system-independence >> > > > was >> > > > > > >> never >> > > > > > >> > as >> > > > > > >> > > > > real >> > > > > > >> > > > > >> as >> > > > > > >> > > > > >> > we >> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of code >> > reuse >> > > > are >> > > > > > >> real. >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has >> also >> > > > always >> > > > > > been >> > > > > > >> > > > > >> appealing to >> > > > > > >> > > > > >> > >> me, for various reasons already mentioned in >> > this >> > > > > > thread. >> > > > > > >> > > > Although >> > > > > > >> > > > > >> > making >> > > > > > >> > > > > >> > >> Samza jobs deployable on anything >> > > > (YARN/Mesos/AWS/etc) >> > > > > > >> seems >> > > > > > >> > > > > >> laudable, >> > > > > > >> > > > > >> > I am >> > > > > > >> > > > > >> > >> a little concerned that it will restrict us >> to a >> > > > > lowest >> > > > > > >> > common >> > > > > > >> > > > > >> > denominator. >> > > > > > >> > > > > >> > >> For example, would host affinity (SAMZA-617) >> > still >> > > > be >> > > > > > >> > possible? >> > > > > > >> > > > For >> > > > > > >> > > > > >> jobs >> > > > > > >> > > > > >> > >> with large amounts of state, I think >> SAMZA-617 >> > > would >> > > > > be >> > > > > > a >> > > > > > >> big >> > > > > > >> > > > boon, >> > > > > > >> > > > > >> > since >> > > > > > >> > > > > >> > >> restoring state off the changelog on every >> > single >> > > > > > restart >> > > > > > >> is >> > > > > > >> > > > > painful, >> > > > > > >> > > > > >> > due >> > > > > > >> > > > > >> > >> to long recovery times. It would be a shame >> if >> > the >> > > > > > >> decoupling >> > > > > > >> > > > from >> > > > > > >> > > > > >> YARN >> > > > > > >> > > > > >> > >> made host affinity impossible. >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> Jay, a question about the proposed API for >> > > > > > instantiating a >> > > > > > >> > job >> > > > > > >> > > in >> > > > > > >> > > > > >> code >> > > > > > >> > > > > >> > >> (rather than a properties file): when >> > submitting a >> > > > job >> > > > > > to a >> > > > > > >> > > > > cluster, >> > > > > > >> > > > > >> is >> > > > > > >> > > > > >> > the >> > > > > > >> > > > > >> > >> idea that the instantiation code runs on a >> > client >> > > > > > >> somewhere, >> > > > > > >> > > > which >> > > > > > >> > > > > >> then >> > > > > > >> > > > > >> > >> pokes the necessary endpoints on >> > > YARN/Mesos/AWS/etc? >> > > > > Or >> > > > > > >> does >> > > > > > >> > > that >> > > > > > >> > > > > >> code >> > > > > > >> > > > > >> > run >> > > > > > >> > > > > >> > >> on each container that is part of the job (in >> > > which >> > > > > > case, >> > > > > > >> how >> > > > > > >> > > > does >> > > > > > >> > > > > >> the >> > > > > > >> > > > > >> > job >> > > > > > >> > > > > >> > >> submission to the cluster work)? >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel >> right to >> > > > make >> > > > > a >> > > > > > 1.0 >> > > > > > >> > > > release >> > > > > > >> > > > > >> > with a >> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete. So if >> > this >> > > > is >> > > > > > going >> > > > > > >> > to >> > > > > > >> > > > > >> happen, I >> > > > > > >> > > > > >> > >> think it would be more honest to stick with >> 0.* >> > > > > version >> > > > > > >> > numbers >> > > > > > >> > > > > until >> > > > > > >> > > > > >> > the >> > > > > > >> > > > > >> > >> library-ified Samza has been implemented, is >> > > stable >> > > > > and >> > > > > > >> > widely >> > > > > > >> > > > > used. >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of >> Kafka? >> > > There >> > > > > is >> > > > > > >> > > precedent >> > > > > > >> > > > > for >> > > > > > >> > > > > >> > >> tight coupling between different Apache >> projects >> > > > (e.g. >> > > > > > >> > Curator >> > > > > > >> > > > and >> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think >> > > remaining >> > > > > > >> separate >> > > > > > >> > > > would >> > > > > > >> > > > > >> be >> > > > > > >> > > > > >> > ok. >> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka, >> there >> > > is >> > > > > > enough >> > > > > > >> > > > > substance >> > > > > > >> > > > > >> in >> > > > > > >> > > > > >> > >> Samza that it warrants being a separate >> project. >> > > An >> > > > > > >> argument >> > > > > > >> > in >> > > > > > >> > > > > >> favour >> > > > > > >> > > > > >> > of >> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a much >> > > > stronger >> > > > > > >> "brand >> > > > > > >> > > > > >> presence" >> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If >> the >> > > Kafka >> > > > > > >> project >> > > > > > >> > is >> > > > > > >> > > > > >> willing >> > > > > > >> > > > > >> > to >> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of doing >> > > > stateful >> > > > > > >> stream >> > > > > > >> > > > > >> > >> transformations, that would probably have >> much >> > the >> > > > > same >> > > > > > >> > effect >> > > > > > >> > > as >> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream >> Processors" >> > or >> > > > > > suchlike. >> > > > > > >> > > Close >> > > > > > >> > > > > >> > >> collaboration between the two projects will >> be >> > > > needed >> > > > > in >> > > > > > >> any >> > > > > > >> > > > case. >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> From a project management perspective, I >> guess >> > the >> > > > > "new >> > > > > > >> > Samza" >> > > > > > >> > > > > would >> > > > > > >> > > > > >> > have >> > > > > > >> > > > > >> > >> to be developed on a branch alongside ongoing >> > > > > > maintenance >> > > > > > >> of >> > > > > > >> > > the >> > > > > > >> > > > > >> current >> > > > > > >> > > > > >> > >> line of development? I think it would be >> > important >> > > > to >> > > > > > >> > continue >> > > > > > >> > > > > >> > supporting >> > > > > > >> > > > > >> > >> existing users, and provide a graceful >> migration >> > > > path >> > > > > to >> > > > > > >> the >> > > > > > >> > > new >> > > > > > >> > > > > >> > version. >> > > > > > >> > > > > >> > >> Leaving the current versions unsupported and >> > > forcing >> > > > > > people >> > > > > > >> > to >> > > > > > >> > > > > >> rewrite >> > > > > > >> > > > > >> > >> their jobs would send a bad signal. >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> Best, >> > > > > > >> > > > > >> > >> Martin >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps < >> > > > j...@confluent.io> >> > > > > > >> wrote: >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >>> Hey Garry, >> > > > > > >> > > > > >> > >>> >> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy >> to >> > > chat >> > > > > > more >> > > > > > >> > about >> > > > > > >> > > > > this >> > > > > > >> > > > > >> if >> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I >> > started >> > > > with >> > > > > > the >> > > > > > >> > idea >> > > > > > >> > > > of >> > > > > > >> > > > > >> "what >> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass >> > ingestion >> > > > > tool" >> > > > > > but >> > > > > > >> > > > > >> ultimately >> > > > > > >> > > > > >> > we >> > > > > > >> > > > > >> > >>> kind of came around to the idea that >> ingestion >> > > and >> > > > > > >> > > > transformation >> > > > > > >> > > > > >> had >> > > > > > >> > > > > >> > >>> pretty different needs and coupling the two >> > made >> > > > > things >> > > > > > >> > hard. >> > > > > > >> > > > > >> > >>> >> > > > > > >> > > > > >> > >>> For what it's worth I think copycat (KIP-26) >> > > > actually >> > > > > > will >> > > > > > >> > do >> > > > > > >> > > > what >> > > > > > >> > > > > >> you >> > > > > > >> > > > > >> > >> are >> > > > > > >> > > > > >> > >>> looking for. >> > > > > > >> > > > > >> > >>> >> > > > > > >> > > > > >> > >>> With regard to your point about slider, I >> don't >> > > > > > >> necessarily >> > > > > > >> > > > > >> disagree. >> > > > > > >> > > > > >> > >> But I >> > > > > > >> > > > > >> > >>> think getting good YARN support is quite >> doable >> > > > and I >> > > > > > >> think >> > > > > > >> > we >> > > > > > >> > > > can >> > > > > > >> > > > > >> make >> > > > > > >> > > > > >> > >>> that work well. I think the issue this >> proposal >> > > > > solves >> > > > > > is >> > > > > > >> > that >> > > > > > >> > > > > >> > >> technically >> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple >> cluster >> > > > > > management >> > > > > > >> > > systems >> > > > > > >> > > > > the >> > > > > > >> > > > > >> > way >> > > > > > >> > > > > >> > >>> things are now, you need to write an "app >> > master" >> > > > or >> > > > > > >> > > "framework" >> > > > > > >> > > > > for >> > > > > > >> > > > > >> > each >> > > > > > >> > > > > >> > >>> and they are all a little different so >> testing >> > is >> > > > > > really >> > > > > > >> > hard. >> > > > > > >> > > > In >> > > > > > >> > > > > >> the >> > > > > > >> > > > > >> > >>> absence of this we have been stuck with just >> > YARN >> > > > > which >> > > > > > >> has >> > > > > > >> > > > > >> fantastic >> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the org, >> but >> > > > zero >> > > > > > >> > > penetration >> > > > > > >> > > > > >> > >> elsewhere. >> > > > > > >> > > > > >> > >>> Given the huge amount of work being put in >> to >> > > > slider, >> > > > > > >> > > marathon, >> > > > > > >> > > > > aws >> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen related >> > > > packaging >> > > > > > >> > > > technologies >> > > > > > >> > > > > >> > people >> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various >> > > > > cloud-specific >> > > > > > >> > deploy >> > > > > > >> > > > > >> tools, >> > > > > > >> > > > > >> > >> etc) >> > > > > > >> > > > > >> > >>> I really think it is important to get this >> > right. >> > > > > > >> > > > > >> > >>> >> > > > > > >> > > > > >> > >>> -Jay >> > > > > > >> > > > > >> > >>> >> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry >> > Turkington >> > > < >> > > > > > >> > > > > >> > >>> g.turking...@improvedigital.com> wrote: >> > > > > > >> > > > > >> > >>> >> > > > > > >> > > > > >> > >>>> Hi all, >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> I think the question below re does Samza >> > become >> > > a >> > > > > > >> > sub-project >> > > > > > >> > > > of >> > > > > > >> > > > > >> Kafka >> > > > > > >> > > > > >> > >>>> highlights the broader point around >> migration. >> > > > Chris >> > > > > > >> > mentions >> > > > > > >> > > > > >> Samza's >> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1 release >> but >> > I'm >> > > > not >> > > > > > sure >> > > > > > >> > it >> > > > > > >> > > > > feels >> > > > > > >> > > > > >> > >> right to >> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to >> deprecate >> > > > most >> > > > > of >> > > > > > >> it. >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some guys >> > who >> > > > have >> > > > > > >> > started >> > > > > > >> > > > > >> working >> > > > > > >> > > > > >> > >> with >> > > > > > >> > > > > >> > >>>> Samza and building some new >> > consumers/producers >> > > > was >> > > > > > next >> > > > > > >> > up. >> > > > > > >> > > > > Sounds >> > > > > > >> > > > > >> > like >> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to >> go. I >> > > need >> > > > > to >> > > > > > >> look >> > > > > > >> > > into >> > > > > > >> > > > > the >> > > > > > >> > > > > >> > KIP >> > > > > > >> > > > > >> > >> in >> > > > > > >> > > > > >> > >>>> more detail but for me the attractiveness >> of >> > > > adding >> > > > > > new >> > > > > > >> > Samza >> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all they >> > were >> > > > > doing >> > > > > > was >> > > > > > >> > > > really >> > > > > > >> > > > > >> > getting >> > > > > > >> > > > > >> > >>>> data into and out of Kafka -- was to avoid >> > > > having >> > > > > to >> > > > > > >> > worry >> > > > > > >> > > > > about >> > > > > > >> > > > > >> the >> > > > > > >> > > > > >> > >>>> lifecycle management of external clients. >> If >> > > there >> > > > > is >> > > > > > a >> > > > > > >> > > generic >> > > > > > >> > > > > >> Kafka >> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a new >> > > > connector >> > > > > > into >> > > > > > >> > and >> > > > > > >> > > > > have >> > > > > > >> > > > > >> a >> > > > > > >> > > > > >> > >> lot of >> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and reliability >> > done >> > > > for >> > > > > me >> > > > > > >> then >> > > > > > >> > > it >> > > > > > >> > > > > >> gives >> > > > > > >> > > > > >> > me >> > > > > > >> > > > > >> > >> all >> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers would. >> If >> > > not >> > > > > > then it >> > > > > > >> > > > > >> complicates >> > > > > > >> > > > > >> > my >> > > > > > >> > > > > >> > >>>> operational deployments. >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> Which is similar to my other question with >> the >> > > > > > proposal >> > > > > > >> -- >> > > > > > >> > if >> > > > > > >> > > > we >> > > > > > >> > > > > >> > build a >> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus the >> > > > requisite >> > > > > > >> shims >> > > > > > >> > to >> > > > > > >> > > > > >> > integrate >> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former may >> be a >> > > lot >> > > > > more >> > > > > > >> work >> > > > > > >> > > > than >> > > > > > >> > > > > we >> > > > > > >> > > > > >> > >> think. >> > > > > > >> > > > > >> > >>>> We may make it much easier for a newcomer >> to >> > get >> > > > > > >> something >> > > > > > >> > > > > running >> > > > > > >> > > > > >> but >> > > > > > >> > > > > >> > >>>> having them step up and get a reliable >> > > production >> > > > > > >> > deployment >> > > > > > >> > > > may >> > > > > > >> > > > > >> still >> > > > > > >> > > > > >> > >>>> dominate mailing list traffic, if for >> > different >> > > > > > reasons >> > > > > > >> > than >> > > > > > >> > > > > >> today. >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with >> > > making >> > > > > the >> > > > > > >> Samza >> > > > > > >> > > > > >> dependency >> > > > > > >> > > > > >> > >> on >> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely >> see >> > > the >> > > > > > >> benefits >> > > > > > >> > > in >> > > > > > >> > > > > the >> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing >> > > > > > >> > > > terminologies/abstractions >> > > > > > >> > > > > >> that >> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library >> would >> > > > likely >> > > > > > be a >> > > > > > >> > very >> > > > > > >> > > > > nice >> > > > > > >> > > > > >> > tool >> > > > > > >> > > > > >> > >> to >> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the >> > > > concerns >> > > > > > >> above >> > > > > > >> > re >> > > > > > >> > > > the >> > > > > > >> > > > > >> > >>>> operational side. >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> Garry >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> -----Original Message----- >> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales >> [mailto: >> > > > > > >> > g...@apache.org >> > > > > > >> > > ] >> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56 >> > > > > > >> > > > > >> > >>>> To: dev@samza.apache.org >> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on >> > Samza >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> Very interesting thoughts. >> > > > > > >> > > > > >> > >>>> From outside, I have always perceived Samza >> > as a >> > > > > > >> computing >> > > > > > >> > > > layer >> > > > > > >> > > > > >> over >> > > > > > >> > > > > >> > >>>> Kafka. >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is >> > > "should >> > > > > > Samza >> > > > > > >> be >> > > > > > >> > a >> > > > > > >> > > > > >> > sub-project >> > > > > > >> > > > > >> > >>>> of Kafka then?" >> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a >> separate >> > > > > project >> > > > > > >> > with a >> > > > > > >> > > > > >> separate >> > > > > > >> > > > > >> > >>>> governance? >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> Cheers, >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> -- >> > > > > > >> > > > > >> > >>>> Gianmarco >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang < >> > > > > > yanfang...@gmail.com> >> > > > > > >> > > > wrote: >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more >> > > > tightly. >> > > > > > >> > Because >> > > > > > >> > > > > Samza >> > > > > > >> > > > > >> de >> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should >> > leverage >> > > > > what >> > > > > > >> Kafka >> > > > > > >> > > > has. >> > > > > > >> > > > > At >> > > > > > >> > > > > >> > the >> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to reinvent >> > what >> > > > > Samza >> > > > > > >> > > already >> > > > > > >> > > > > >> has. I >> > > > > > >> > > > > >> > >>>>> also like the idea of separating the >> > ingestion >> > > > and >> > > > > > >> > > > > transformation. >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to >> image >> > > how >> > > > > the >> > > > > > >> Samza >> > > > > > >> > > > will >> > > > > > >> > > > > >> look >> > > > > > >> > > > > >> > >>>> like. >> > > > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little >> > > difference >> > > > > in >> > > > > > >> terms >> > > > > > >> > > of >> > > > > > >> > > > > how >> > > > > > >> > > > > >> > >>>>> Samza should look like. >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code >> shows >> > (A >> > > > > > client of >> > > > > > >> > > > Kakfa) >> > > > > > >> > > > > ? >> > > > > > >> > > > > >> And >> > > > > > >> > > > > >> > >>>>> user's application code calls this client? >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka >> > (like >> > > > > what >> > > > > > the >> > > > > > >> > > code >> > > > > > >> > > > > >> shows), >> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and >> > > > > fault-tolerance? >> > > > > > >> Are >> > > > > > >> > > they >> > > > > > >> > > > > >> taken >> > > > > > >> > > > > >> > >>>>> care by the Kafka broker or other >> mechanism, >> > > such >> > > > > as >> > > > > > >> > "Samza >> > > > > > >> > > > > >> worker" >> > > > > > >> > > > > >> > >>>>> (just make up the name) ? >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as >> > > > auto-scaling, >> > > > > > >> shared >> > > > > > >> > > > > state, >> > > > > > >> > > > > >> > >>>>> monitoring? >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is this >> > what >> > > > > Chris >> > > > > > >> > > > suggests?) >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from Kakfa >> > and >> > > > > > produce >> > > > > > >> to >> > > > > > >> > > it. >> > > > > > >> > > > > >> Then it >> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks like >> > now, >> > > > > > except it >> > > > > > >> > > does >> > > > > > >> > > > > not >> > > > > > >> > > > > >> > rely >> > > > > > >> > > > > >> > >>>>> on Yarn anymore. >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it >> leverage >> > > > Kafka's >> > > > > > >> > metrics, >> > > > > > >> > > > > logs, >> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency? >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> Thanks, >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> Fang, Yan >> > > > > > >> > > > > >> > >>>>> yanfang...@gmail.com >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang >> > Wang < >> > > > > > >> > > > > wangg...@gmail.com >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > >>>> wrote: >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>>>> Read through the code example and it >> looks >> > > good >> > > > to >> > > > > > me. >> > > > > > >> A >> > > > > > >> > > few >> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment: >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable >> runnable >> > > like: >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh >> > > --config-factory=... >> > > > > > >> > > > > >> > >>>> --config-path=file://... >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> And this proposal advocate for deploying >> > Samza >> > > > > more >> > > > > > as >> > > > > > >> > > > embedded >> > > > > > >> > > > > >> > >>>>>> libraries in user application code >> (ignoring >> > > the >> > > > > > >> > > terminology >> > > > > > >> > > > > >> since >> > > > > > >> > > > > >> > >>>>>> it is not the >> > > > > > >> > > > > >> > >>>>> same >> > > > > > >> > > > > >> > >>>>>> as the prototype code): >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> StreamTask task = new >> MyStreamTask(configs); >> > > > > Thread >> > > > > > >> > thread >> > > > > > >> > > = >> > > > > > >> > > > > new >> > > > > > >> > > > > >> > >>>>>> Thread(task); thread.start(); >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> I think both of these deployment modes >> are >> > > > > important >> > > > > > >> for >> > > > > > >> > > > > >> different >> > > > > > >> > > > > >> > >>>>>> types >> > > > > > >> > > > > >> > >>>>> of >> > > > > > >> > > > > >> > >>>>>> users. That said, I think making Samza >> > purely >> > > > > > >> standalone >> > > > > > >> > is >> > > > > > >> > > > > still >> > > > > > >> > > > > >> > >>>>>> sufficient for either runnable or library >> > > modes. >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> Guozhang >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay >> Kreps >> > < >> > > > > > >> > > > j...@confluent.io> >> > > > > > >> > > > > >> > wrote: >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code >> example, >> > it >> > > > was >> > > > > > >> > supposed >> > > > > > >> > > > to >> > > > > > >> > > > > >> look >> > > > > > >> > > > > >> > >>>>>>> like >> > > > > > >> > > > > >> > >>>>>>> this: >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties(); >> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers", >> > > > "localhost:4242"); >> > > > > > >> > > > > >> StreamingConfig >> > > > > > >> > > > > >> > >>>>>>> config = new StreamingConfig(props); >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", >> > > > "test-topic-2"); >> > > > > > >> > > > > >> > >>>>>>> >> > > config.processor(ExampleStreamProcessor.class); >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new >> > StringSerializer(), >> > > > new >> > > > > > >> > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming >> > > > container = >> > > > > > new >> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming(config); container.run(); >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>>> -Jay >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay >> > Kreps < >> > > > > > >> > > > j...@confluent.io >> > > > > > >> > > > > > >> > > > > > >> > > > > >> > >>>> wrote: >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> Hey guys, >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations >> Chris >> > > and >> > > > I >> > > > > > were >> > > > > > >> > > having >> > > > > > >> > > > > >> > >>>>>>>> around >> > > > > > >> > > > > >> > >>>>>>> whether >> > > > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a >> kind >> > > of >> > > > > data >> > > > > > >> > > > ingestion >> > > > > > >> > > > > >> > >>>>> framework >> > > > > > >> > > > > >> > >>>>>>> for >> > > > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26 >> > > > > "copycat"). >> > > > > > >> This >> > > > > > >> > > > kind >> > > > > > >> > > > > of >> > > > > > >> > > > > >> > >>>>>> combined >> > > > > > >> > > > > >> > >>>>>>>> with complaints around config and YARN >> and >> > > the >> > > > > > >> > discussion >> > > > > > >> > > > > >> around >> > > > > > >> > > > > >> > >>>>>>>> how >> > > > > > >> > > > > >> > >>>>> to >> > > > > > >> > > > > >> > >>>>>>>> best do a standalone mode. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given >> that >> > > > Samza >> > > > > > was >> > > > > > >> > > > basically >> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what if >> > you >> > > > just >> > > > > > >> > embraced >> > > > > > >> > > > > that >> > > > > > >> > > > > >> > >>>>>>>> and turned it >> > > > > > >> > > > > >> > >>>>>> into >> > > > > > >> > > > > >> > >>>>>>>> something less like a heavyweight >> > framework >> > > > and >> > > > > > more >> > > > > > >> > > like a >> > > > > > >> > > > > >> > >>>>>>>> third >> > > > > > >> > > > > >> > >>>>> Kafka >> > > > > > >> > > > > >> > >>>>>>>> client--a kind of "producing consumer" >> > with >> > > > > state >> > > > > > >> > > > management >> > > > > > >> > > > > >> > >>>>>> facilities. >> > > > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a >> complex >> > > > stream >> > > > > > >> > > processing >> > > > > > >> > > > > >> > >>>>>>>> framework >> > > > > > >> > > > > >> > >>>>>>> this >> > > > > > >> > > > > >> > >>>>>>>> would actually be a very simple thing, >> not >> > > > much >> > > > > > more >> > > > > > >> > > > > >> complicated >> > > > > > >> > > > > >> > >>>>>>>> to >> > > > > > >> > > > > >> > >>>>> use >> > > > > > >> > > > > >> > >>>>>>> or >> > > > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris >> > said >> > > > we >> > > > > > >> thought >> > > > > > >> > > > about >> > > > > > >> > > > > >> it >> > > > > > >> > > > > >> > >>>>>>>> a >> > > > > > >> > > > > >> > >>>>> lot >> > > > > > >> > > > > >> > >>>>>> of >> > > > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream >> > processing >> > > > > > systems >> > > > > > >> > were >> > > > > > >> > > > > doing) >> > > > > > >> > > > > >> > >>>>> seemed >> > > > > > >> > > > > >> > >>>>>>> like >> > > > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output >> data >> > to >> > > > and >> > > > > > from >> > > > > > >> > the >> > > > > > >> > > > > stream >> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually looked >> > into >> > > > how >> > > > > > that >> > > > > > >> > > would >> > > > > > >> > > > > >> > >>>>>>>> work, >> > > > > > >> > > > > >> > >>>>> Samza >> > > > > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion >> > > framework >> > > > > > for a >> > > > > > >> > > bunch >> > > > > > >> > > > of >> > > > > > >> > > > > >> > >>>>> reasons. >> > > > > > >> > > > > >> > >>>>>> To >> > > > > > >> > > > > >> > >>>>>>>> really do that right you need a pretty >> > > > different >> > > > > > >> > internal >> > > > > > >> > > > > data >> > > > > > >> > > > > >> > >>>>>>>> model >> > > > > > >> > > > > >> > >>>>>> and >> > > > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split them >> and >> > > had >> > > > > an >> > > > > > api >> > > > > > >> > for >> > > > > > >> > > > > Kafka >> > > > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26) >> and a >> > > > > separate >> > > > > > >> api >> > > > > > >> > > for >> > > > > > >> > > > > >> Kafka >> > > > > > >> > > > > >> > >>>>>>>> transformation (Samza). >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> This would also allow really embracing >> the >> > > > same >> > > > > > >> > > terminology >> > > > > > >> > > > > and >> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about the >> > current >> > > > > > state is >> > > > > > >> > > that >> > > > > > >> > > > > the >> > > > > > >> > > > > >> > >>>>>>>> two >> > > > > > >> > > > > >> > >>>>>>> systems >> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology >> like >> > > > > "stream" >> > > > > > vs >> > > > > > >> > > > "topic" >> > > > > > >> > > > > >> and >> > > > > > >> > > > > >> > >>>>>>> different >> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems means you >> > kind >> > > > of >> > > > > > have >> > > > > > >> to >> > > > > > >> > > > learn >> > > > > > >> > > > > >> > >>>>>>>> Kafka's >> > > > > > >> > > > > >> > >>>>>>> way, >> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different >> way, >> > > > then >> > > > > > kind >> > > > > > >> of >> > > > > > >> > > > > >> > >>>>>>>> understand >> > > > > > >> > > > > >> > >>>>> how >> > > > > > >> > > > > >> > >>>>>>> they >> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having walked >> a >> > few >> > > > > > people >> > > > > > >> > > through >> > > > > > >> > > > > >> this >> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to >> get. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of >> time >> > on >> > > > > > >> airplanes I >> > > > > > >> > > > > hacked >> > > > > > >> > > > > >> > >>>>>>>> up an ernest but still somewhat >> incomplete >> > > > > > prototype >> > > > > > >> of >> > > > > > >> > > > what >> > > > > > >> > > > > >> > >>>>>>>> this would >> > > > > > >> > > > > >> > >>>>> look >> > > > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously >> dumped >> > > into >> > > > > > Kafka >> > > > > > >> as >> > > > > > >> > > it >> > > > > > >> > > > > >> > >>>>>>>> required a >> > > > > > >> > > > > >> > >>>>>> few >> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is >> the >> > > code: >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > > > > > >> > >> > > > > > >> > > >> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org >> > > > > > >> > > > > >> > >>>>> /apache/kafka/clients/streaming >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I just >> > > > > liberally >> > > > > > >> > renamed >> > > > > > >> > > > > >> > >>>>>>>> everything >> > > > > > >> > > > > >> > >>>>> to >> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no >> regard >> > > for >> > > > > > >> > > > compatibility. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> To use this would be something like >> this: >> > > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties(); >> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers", >> > > > > "localhost:4242"); >> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new >> > > > > > >> > > > > >> > >>>>> StreamingConfig(props); >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", >> > > > > > >> > > > > >> > >>>>>>>> "test-topic-2"); >> > > > > > >> > > > > >> config.processor(ExampleStreamProcessor.class); >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new >> > > > > > >> > > > > >> > >>>>>>>> StringSerializer(), new >> > > StringDeserializer()); >> > > > > > >> > > > KafkaStreaming >> > > > > > >> > > > > >> > >>>>>> container = >> > > > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config); >> > container.run(); >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the >> > > > SamzaContainer; >> > > > > > >> > > > > StreamProcessor >> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the class >> names >> > > in >> > > > a >> > > > > > file >> > > > > > >> > and >> > > > > > >> > > > then >> > > > > > >> > > > > >> > >>>>>>>> having >> > > > > > >> > > > > >> > >>>>>> the >> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just >> > > > > instantiate >> > > > > > the >> > > > > > >> > > > > container >> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced over >> > > > however >> > > > > > many >> > > > > > >> > > > > instances >> > > > > > >> > > > > >> > >>>>>>>> of >> > > > > > >> > > > > >> > >>>>> this >> > > > > > >> > > > > >> > >>>>>>> are >> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an instance >> > dies, >> > > > new >> > > > > > >> tasks >> > > > > > >> > > are >> > > > > > >> > > > > >> added >> > > > > > >> > > > > >> > >>>>>>>> to >> > > > > > >> > > > > >> > >>>>> the >> > > > > > >> > > > > >> > >>>>>>>> existing containers without shutting >> them >> > > > down). >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for running >> > this >> > > > > stuff >> > > > > > in >> > > > > > >> > YARN >> > > > > > >> > > > via >> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS >> using >> > > some >> > > > > of >> > > > > > >> their >> > > > > > >> > > > tools >> > > > > > >> > > > > >> > >>>>>>>> but from the >> > > > > > >> > > > > >> > >>>>>> point >> > > > > > >> > > > > >> > >>>>>>> of >> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these stream >> > > > processing >> > > > > > jobs >> > > > > > >> > are >> > > > > > >> > > > > just >> > > > > > >> > > > > >> > >>>>>> stateless >> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and >> expand >> > and >> > > > > > contract >> > > > > > >> > at >> > > > > > >> > > > > will. >> > > > > > >> > > > > >> > >>>>>>>> There >> > > > > > >> > > > > >> > >>>>> is >> > > > > > >> > > > > >> > >>>>>>> no >> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details: >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> 1. It is only ~1300 lines of code, it >> > would >> > > > get >> > > > > > >> larger >> > > > > > >> > > if >> > > > > > >> > > > we >> > > > > > >> > > > > >> > >>>>>>>> productionized but not vastly larger. >> We >> > > > really >> > > > > > do >> > > > > > >> > get a >> > > > > > >> > > > ton >> > > > > > >> > > > > >> > >>>>>>>> of >> > > > > > >> > > > > >> > >>>>>>> leverage >> > > > > > >> > > > > >> > >>>>>>>> out of Kafka. >> > > > > > >> > > > > >> > >>>>>>>> 2. Partition management is fully >> > delegated >> > > to >> > > > > the >> > > > > > >> new >> > > > > > >> > > > > >> consumer. >> > > > > > >> > > > > >> > >>>>> This >> > > > > > >> > > > > >> > >>>>>>>> is nice since now any partition >> > management >> > > > > > strategy >> > > > > > >> > > > > available >> > > > > > >> > > > > >> > >>>>>>>> to >> > > > > > >> > > > > >> > >>>>>> Kafka >> > > > > > >> > > > > >> > >>>>>>>> consumer is also available to Samza >> (and >> > > vice >> > > > > > versa) >> > > > > > >> > and >> > > > > > >> > > > > with >> > > > > > >> > > > > >> > >>>>>>>> the >> > > > > > >> > > > > >> > >>>>>>> exact >> > > > > > >> > > > > >> > >>>>>>>> same configs. >> > > > > > >> > > > > >> > >>>>>>>> 3. It supports state as well as state >> > reuse >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is >> > thought >> > > > > > >> provoking. >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> -Jay >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris >> > > > > Riccomini < >> > > > > > >> > > > > >> > >>>>>> criccom...@apache.org> >> > > > > > >> > > > > >> > >>>>>>>> wrote: >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Hey all, >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza >> > > > > engineers >> > > > > > at >> > > > > > >> > > > LinkedIn >> > > > > > >> > > > > >> > >>>>>>>>> and >> > > > > > >> > > > > >> > >>>>>>> Confluent >> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few observations >> > and >> > > > > would >> > > > > > >> like >> > > > > > >> > to >> > > > > > >> > > > > >> > >>>>>>>>> propose >> > > > > > >> > > > > >> > >>>>> some >> > > > > > >> > > > > >> > >>>>>>>>> changes. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I >> want to >> > > > call >> > > > > > out >> > > > > > >> > about >> > > > > > >> > > > > >> > >>>>>>>>> Samza's >> > > > > > >> > > > > >> > >>>>>> design, >> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some changes. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic >> > > > deployment >> > > > > > >> system. >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable. >> > > > > > >> > > > > >> > >>>>>>>>> * Samza's >> SystemConsumer/SystemProducer >> > and >> > > > > > Kafka's >> > > > > > >> > > > consumer >> > > > > > >> > > > > >> > >>>>>>>>> APIs >> > > > > > >> > > > > >> > >>>>> are >> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same >> > problems. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are related, >> > but >> > > > I'll >> > > > > > >> > address >> > > > > > >> > > > them >> > > > > > >> > > > > >> in >> > > > > > >> > > > > >> > >>>>> order. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Deployment >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a >> > > > dynamic >> > > > > > >> > > deployment >> > > > > > >> > > > > >> > >>>>>>>>> scheduler >> > > > > > >> > > > > >> > >>>>>> such >> > > > > > >> > > > > >> > >>>>>>>>> as >> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially >> built >> > > > > Samza, >> > > > > > we >> > > > > > >> > bet >> > > > > > >> > > > that >> > > > > > >> > > > > >> > >>>>>>>>> there >> > > > > > >> > > > > >> > >>>>>> would >> > > > > > >> > > > > >> > >>>>>>>>> be >> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and >> we >> > > could >> > > > > > >> support >> > > > > > >> > > > them, >> > > > > > >> > > > > >> and >> > > > > > >> > > > > >> > >>>>>>>>> the >> > > > > > >> > > > > >> > >>>>>> rest >> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are >> many >> > > > > > >> variations. >> > > > > > >> > > > > >> > >>>>>>>>> Furthermore, >> > > > > > >> > > > > >> > >>>>>> many >> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just start >> their >> > > > > > processors >> > > > > > >> > like >> > > > > > >> > > > > normal >> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional >> > > > deployment >> > > > > > >> scripts >> > > > > > >> > > > such >> > > > > > >> > > > > as >> > > > > > >> > > > > >> > >>>>>>>>> Fabric, >> > > > > > >> > > > > >> > >>>>>> Chef, >> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment >> system >> > > on >> > > > > > users >> > > > > > >> > makes >> > > > > > >> > > > the >> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really painful >> for >> > > > first >> > > > > > time >> > > > > > >> > > > users. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement >> was >> > > also >> > > > a >> > > > > > bit >> > > > > > >> of >> > > > > > >> > a >> > > > > > >> > > > > >> > >>>>>>>>> mis-fire >> > > > > > >> > > > > >> > >>>>>> because >> > > > > > >> > > > > >> > >>>>>>>>> of >> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding between >> > the >> > > > > > nature of >> > > > > > >> > > batch >> > > > > > >> > > > > >> jobs >> > > > > > >> > > > > >> > >>>>>>>>> and >> > > > > > >> > > > > >> > >>>>>>> stream >> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made >> > > conscious >> > > > > > effort >> > > > > > >> to >> > > > > > >> > > > favor >> > > > > > >> > > > > >> > >>>>>>>>> the >> > > > > > >> > > > > >> > >>>>>> Hadoop >> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things, >> since >> > it >> > > > > worked >> > > > > > >> and >> > > > > > >> > > was >> > > > > > >> > > > > well >> > > > > > >> > > > > >> > >>>>>>> understood. >> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that >> batch >> > > jobs >> > > > > > have a >> > > > > > >> > > > definite >> > > > > > >> > > > > >> > >>>>>> beginning, >> > > > > > >> > > > > >> > >>>>>>>>> and >> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs don't >> > > > > (usually). >> > > > > > >> This >> > > > > > >> > > > leads >> > > > > > >> > > > > to >> > > > > > >> > > > > >> > >>>>>>>>> a >> > > > > > >> > > > > >> > >>>>> much >> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for stream >> > > > > processors. >> > > > > > >> You >> > > > > > >> > > > > >> basically >> > > > > > >> > > > > >> > >>>>>>>>> just >> > > > > > >> > > > > >> > >>>>>>> need >> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the >> processor, >> > and >> > > > > start >> > > > > > >> it. >> > > > > > >> > > The >> > > > > > >> > > > > way >> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no >> > > concept >> > > > > of >> > > > > > a >> > > > > > >> > > cluster >> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always >> > > > > > >> > > > > >> > >>>>>> add >> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with >> coupling >> > > > Samza >> > > > > > with >> > > > > > >> a >> > > > > > >> > > > > >> scheduler >> > > > > > >> > > > > >> > >>>>>>>>> is >> > > > > > >> > > > > >> > >>>>>> that >> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to >> handle >> > > > > > deployment. >> > > > > > >> > > This >> > > > > > >> > > > > >> pulls >> > > > > > >> > > > > >> > >>>>>>>>> in a >> > > > > > >> > > > > >> > >>>>>>> bunch >> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration >> > > distribution >> > > > > > (config >> > > > > > >> > > > > stream), >> > > > > > >> > > > > >> > >>>>>>>>> shell >> > > > > > >> > > > > >> > >>>>>>> scrips >> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging >> > (all >> > > > the >> > > > > > .tgz >> > > > > > >> > > > stuff), >> > > > > > >> > > > > >> etc. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic >> > > > deployment >> > > > > > was >> > > > > > >> to >> > > > > > >> > > > > support >> > > > > > >> > > > > >> > >>>>>>>>> data locality. If you want to have >> > > locality, >> > > > > you >> > > > > > >> need >> > > > > > >> > to >> > > > > > >> > > > put >> > > > > > >> > > > > >> > >>>>>>>>> your >> > > > > > >> > > > > >> > >>>>>> processors >> > > > > > >> > > > > >> > >>>>>>>>> close to the data they're processing. >> > Upon >> > > > > > further >> > > > > > >> > > > > >> > >>>>>>>>> investigation, >> > > > > > >> > > > > >> > >>>>>>> though, >> > > > > > >> > > > > >> > >>>>>>>>> this feature is not that beneficial. >> > There >> > > is >> > > > > > some >> > > > > > >> > good >> > > > > > >> > > > > >> > >>>>>>>>> discussion >> > > > > > >> > > > > >> > >>>>>> about >> > > > > > >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335. >> > Again, >> > > we >> > > > > > took >> > > > > > >> the >> > > > > > >> > > > > >> > >>>>>>>>> Map/Reduce >> > > > > > >> > > > > >> > >>>>>> path, >> > > > > > >> > > > > >> > >>>>>>>>> but >> > > > > > >> > > > > >> > >>>>>>>>> there are some fundamental differences >> > > > between >> > > > > > HDFS >> > > > > > >> > and >> > > > > > >> > > > > Kafka. >> > > > > > >> > > > > >> > >>>>>>>>> HDFS >> > > > > > >> > > > > >> > >>>>>> has >> > > > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions. >> This >> > > > leads >> > > > > to >> > > > > > >> less >> > > > > > >> > > > > >> > >>>>>>>>> optimization potential with stream >> > > processors >> > > > > on >> > > > > > top >> > > > > > >> > of >> > > > > > >> > > > > Kafka. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch. >> > > Samza >> > > > > > doesn't >> > > > > > >> > > have >> > > > > > >> > > > > any >> > > > > > >> > > > > >> > >>>>>>>>> built >> > > > > > >> > > > > >> > >>>>> in >> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it >> > depends >> > > on >> > > > > the >> > > > > > >> > > dynamic >> > > > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to handle >> > > > restarts >> > > > > > >> when a >> > > > > > >> > > > > >> > >>>>>>>>> processor dies. This has >> > > > > > >> > > > > >> > >>>>>>> made >> > > > > > >> > > > > >> > >>>>>>>>> it very difficult to write a >> standalone >> > > Samza >> > > > > > >> > container >> > > > > > >> > > > > >> > >>>> (SAMZA-516). >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Pluggability >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good, >> but I >> > > > think >> > > > > > that >> > > > > > >> > > we've >> > > > > > >> > > > > >> gone >> > > > > > >> > > > > >> > >>>>>>>>> too >> > > > > > >> > > > > >> > >>>>>> far >> > > > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has: >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config. >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics. >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems. >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems >> > > > (SystemConsumer, >> > > > > > >> > > > > SystemProducer, >> > > > > > >> > > > > >> > >>>> etc). >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes. >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines. >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about >> > every >> > > > > > >> component >> > > > > > >> > > > > >> > >>>>> (MessageChooser, >> > > > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper, >> > > ConfigRewriter, >> > > > > > etc). >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've >> > forgotten, >> > > as >> > > > > > well. >> > > > > > >> > Some >> > > > > > >> > > > of >> > > > > > >> > > > > >> > >>>>>>>>> these >> > > > > > >> > > > > >> > >>>>> are >> > > > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not to >> be. >> > > This >> > > > > all >> > > > > > >> comes >> > > > > > >> > > at >> > > > > > >> > > > a >> > > > > > >> > > > > >> cost: >> > > > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is making >> it >> > > > harder >> > > > > > for >> > > > > > >> > our >> > > > > > >> > > > > users >> > > > > > >> > > > > >> > >>>>>>>>> to >> > > > > > >> > > > > >> > >>>>> pick >> > > > > > >> > > > > >> > >>>>>> up >> > > > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It also >> > makes >> > > > it >> > > > > > >> > difficult >> > > > > > >> > > > for >> > > > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about what >> the >> > > > > > >> > > characteristics >> > > > > > >> > > > of >> > > > > > >> > > > > >> > >>>>>>>>> the container (since the >> characteristics >> > > > change >> > > > > > >> > > depending >> > > > > > >> > > > on >> > > > > > >> > > > > >> > >>>>>>>>> which plugins are use). >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most >> > > visible >> > > > > in >> > > > > > the >> > > > > > >> > > > System >> > > > > > >> > > > > >> APIs. >> > > > > > >> > > > > >> > >>>>> What >> > > > > > >> > > > > >> > >>>>>>>>> Samza really requires to be >> functional is >> > > > Kafka >> > > > > > as >> > > > > > >> its >> > > > > > >> > > > > >> > >>>>>>>>> transport >> > > > > > >> > > > > >> > >>>>>> layer. >> > > > > > >> > > > > >> > >>>>>>>>> But >> > > > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use >> cases >> > > into >> > > > > one >> > > > > > >> API: >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka. >> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports both >> of >> > > these >> > > > > use >> > > > > > >> > cases. >> > > > > > >> > > > The >> > > > > > >> > > > > >> > >>>>>>>>> problem >> > > > > > >> > > > > >> > >>>>>> is, >> > > > > > >> > > > > >> > >>>>>>>>> we >> > > > > > >> > > > > >> > >>>>>>>>> actually want different features for >> each >> > > use >> > > > > > case. >> > > > > > >> By >> > > > > > >> > > > > >> papering >> > > > > > >> > > > > >> > >>>>>>>>> over >> > > > > > >> > > > > >> > >>>>>>> these >> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a single >> > API, >> > > > > we've >> > > > > > >> > > > introduced >> > > > > > >> > > > > a >> > > > > > >> > > > > >> > >>>>>>>>> ton of >> > > > > > >> > > > > >> > >>>>>>> leaky >> > > > > > >> > > > > >> > >>>>>>>>> abstractions. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in >> (2) >> > > is >> > > > to >> > > > > > have >> > > > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for >> > offsets >> > > > > (like >> > > > > > >> > Kafka). >> > > > > > >> > > > > This >> > > > > > >> > > > > >> > >>>>>>>>> would be at odds >> > > > > > >> > > > > >> > >>>>> with >> > > > > > >> > > > > >> > >>>>>>> (1), >> > > > > > >> > > > > >> > >>>>>>>>> though, since different systems have >> > > > different >> > > > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors. >> > > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the >> mailing >> > > list >> > > > > and >> > > > > > >> the >> > > > > > >> > > SQL >> > > > > > >> > > > > >> JIRAs >> > > > > > >> > > > > >> > >>>>> about >> > > > > > >> > > > > >> > >>>>>>> the >> > > > > > >> > > > > >> > >>>>>>>>> need for this. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for >> > > replayability. >> > > > > > Kafka >> > > > > > >> > > allows >> > > > > > >> > > > us >> > > > > > >> > > > > >> to >> > > > > > >> > > > > >> > >>>>> rewind >> > > > > > >> > > > > >> > >>>>>>>>> when >> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other systems >> > > don't. >> > > > In >> > > > > > some >> > > > > > >> > > > cases, >> > > > > > >> > > > > >> > >>>>>>>>> systems >> > > > > > >> > > > > >> > >>>>>>> return >> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g. >> > > > > > >> WikipediaSystemConsumer) >> > > > > > >> > > > > because >> > > > > > >> > > > > >> > >>>>>>>>> they >> > > > > > >> > > > > >> > >>>>>> have >> > > > > > >> > > > > >> > >>>>>>> no >> > > > > > >> > > > > >> > >>>>>>>>> offsets. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka >> > > > supports >> > > > > > >> > > > > partitioning, >> > > > > > >> > > > > >> > >>>>>>>>> but >> > > > > > >> > > > > >> > >>>>> many >> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by >> having a >> > > > single >> > > > > > >> > > partition >> > > > > > >> > > > > for >> > > > > > >> > > > > >> > >>>>>>>>> those systems. Still, other systems >> model >> > > > > > >> partitioning >> > > > > > >> > > > > >> > >>>> differently (e.g. >> > > > > > >> > > > > >> > >>>>>>>>> Kinesis). >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a >> mess. >> > > > > > Creating >> > > > > > >> > > streams >> > > > > > >> > > > > in >> > > > > > >> > > > > >> a >> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost >> impossible. >> > > As >> > > > is >> > > > > > >> > modeling >> > > > > > >> > > > > >> > >>>>>>>>> metadata >> > > > > > >> > > > > >> > >>>>> for >> > > > > > >> > > > > >> > >>>>>>> the >> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor, >> partitions, >> > > > > location, >> > > > > > >> > etc). >> > > > > > >> > > > The >> > > > > > >> > > > > >> > >>>>>>>>> list >> > > > > > >> > > > > >> > >>>>> goes >> > > > > > >> > > > > >> > >>>>>>> on. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Duplicate work >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing >> Samza, >> > > > > Kafka's >> > > > > > >> > > consumer >> > > > > > >> > > > > and >> > > > > > >> > > > > >> > >>>>> producer >> > > > > > >> > > > > >> > >>>>>>>>> APIs >> > > > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On >> the >> > > > > > >> > consumer-side, >> > > > > > >> > > > you >> > > > > > >> > > > > >> > >>>>>>>>> had two >> > > > > > >> > > > > >> > >>>>>>>>> options: use the high level consumer, >> or >> > > the >> > > > > > simple >> > > > > > >> > > > > consumer. >> > > > > > >> > > > > >> > >>>>>>>>> The >> > > > > > >> > > > > >> > >>>>>>> problem >> > > > > > >> > > > > >> > >>>>>>>>> with the high-level consumer was that >> it >> > > > > > controlled >> > > > > > >> > your >> > > > > > >> > > > > >> > >>>>>>>>> offsets, partition assignments, and >> the >> > > order >> > > > > in >> > > > > > >> which >> > > > > > >> > > you >> > > > > > >> > > > > >> > >>>>>>>>> received messages. The >> > > > > > >> > > > > >> > >>>>> problem >> > > > > > >> > > > > >> > >>>>>>>>> with >> > > > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not >> > > simple. >> > > > > It's >> > > > > > >> > basic. >> > > > > > >> > > > You >> > > > > > >> > > > > >> > >>>>>>>>> end up >> > > > > > >> > > > > >> > >>>>>>> having >> > > > > > >> > > > > >> > >>>>>>>>> to handle a lot of really low-level >> stuff >> > > > that >> > > > > > you >> > > > > > >> > > > > shouldn't. >> > > > > > >> > > > > >> > >>>>>>>>> We >> > > > > > >> > > > > >> > >>>>>> spent a >> > > > > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's >> > > > KafkaSystemConsumer >> > > > > > very >> > > > > > >> > > > robust. >> > > > > > >> > > > > >> It >> > > > > > >> > > > > >> > >>>>>>>>> also allows us to support some cool >> > > features: >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and >> > > > > > prioritization. >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition >> assignment >> > > to >> > > > > > support >> > > > > > >> > > > joins, >> > > > > > >> > > > > >> > >>>>>>>>> global >> > > > > > >> > > > > >> > >>>>>> state >> > > > > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)), etc. >> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset >> > checkpointing. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is >> > that >> > > > > these >> > > > > > >> > > features >> > > > > > >> > > > > >> > >>>>>>>>> should >> > > > > > >> > > > > >> > >>>>>>> actually >> > > > > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers >> > (not >> > > > just >> > > > > > >> Samza >> > > > > > >> > > > stream >> > > > > > >> > > > > >> > >>>>>> processors) >> > > > > > >> > > > > >> > >>>>>>>>> end up wanting to do things like joins >> > and >> > > > > > partition >> > > > > > >> > > > > >> > >>>>>>>>> assignment. The >> > > > > > >> > > > > >> > >>>>>>> Kafka >> > > > > > >> > > > > >> > >>>>>>>>> community has come to the same >> > conclusion. >> > > > > > They're >> > > > > > >> > > adding >> > > > > > >> > > > a >> > > > > > >> > > > > >> ton >> > > > > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka >> consumer >> > > > > > >> > > implementation. >> > > > > > >> > > > > To a >> > > > > > >> > > > > >> > >>>>>>>>> large extent, >> > > > > > >> > > > > >> > >>>>> it's >> > > > > > >> > > > > >> > >>>>>>>>> duplicate work to what we've already >> done >> > > in >> > > > > > Samza. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking >> a >> > > very >> > > > > > similar >> > > > > > >> > > > > approach >> > > > > > >> > > > > >> > >>>>>>>>> to >> > > > > > >> > > > > >> > >>>>>> Samza's >> > > > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation >> for >> > > > > > handling >> > > > > > >> > > offset >> > > > > > >> > > > > >> > >>>>>> checkpointing. >> > > > > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset >> management >> > > > > feature >> > > > > > >> > stores >> > > > > > >> > > > > >> offset >> > > > > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows >> you to >> > > > fetch >> > > > > > them >> > > > > > >> > > from >> > > > > > >> > > > > the >> > > > > > >> > > > > >> > >>>>>>>>> broker. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste, >> since >> > we >> > > > > could >> > > > > > >> have >> > > > > > >> > > > shared >> > > > > > >> > > > > >> > >>>>>>>>> the >> > > > > > >> > > > > >> > >>>>> work >> > > > > > >> > > > > >> > >>>>>> if >> > > > > > >> > > > > >> > >>>>>>>>> it >> > > > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the >> get-go. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Vision >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather >> radical >> > > > > > proposal. >> > > > > > >> > Samza >> > > > > > >> > > > is >> > > > > > >> > > > > >> > >>>>> relatively >> > > > > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to >> say >> > > that >> > > > > > we're >> > > > > > >> > > near a >> > > > > > >> > > > > 1.0 >> > > > > > >> > > > > >> > >>>>>> release. >> > > > > > >> > > > > >> > >>>>>>>>> I'd >> > > > > > >> > > > > >> > >>>>>>>>> like to propose that we take what >> we've >> > > > > learned, >> > > > > > and >> > > > > > >> > > begin >> > > > > > >> > > > > >> > >>>>>>>>> thinking >> > > > > > >> > > > > >> > >>>>>>> about >> > > > > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we >> change if >> > > we >> > > > > were >> > > > > > >> > > starting >> > > > > > >> > > > > >> from >> > > > > > >> > > > > >> > >>>>>> scratch? >> > > > > > >> > > > > >> > >>>>>>>>> My >> > > > > > >> > > > > >> > >>>>>>>>> proposal is to: >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only* >> way >> > to >> > > > run >> > > > > > Samza >> > > > > > >> > > > > >> > >>>>>>>>> processors, and eliminate all direct >> > > > > dependences >> > > > > > on >> > > > > > >> > > YARN, >> > > > > > >> > > > > >> Mesos, >> > > > > > >> > > > > >> > >>>> etc. >> > > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support >> only >> > > > Kafka >> > > > > > as >> > > > > > >> the >> > > > > > >> > > > > stream >> > > > > > >> > > > > >> > >>>>>> processing >> > > > > > >> > > > > >> > >>>>>>>>> layer. >> > > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging, >> > > > > > >> serialization, >> > > > > > >> > > and >> > > > > > >> > > > > >> > >>>>>>>>> config >> > > > > > >> > > > > >> > >>>>>>> systems, >> > > > > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that >> I >> > > > > outlined >> > > > > > >> > above. >> > > > > > >> > > It >> > > > > > >> > > > > >> > >>>>>>>>> should >> > > > > > >> > > > > >> > >>>>> also >> > > > > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty >> > > > dramatically. >> > > > > > >> > > Supporting >> > > > > > >> > > > > >> only >> > > > > > >> > > > > >> > >>>>>>>>> a standalone container will allow >> Samza >> > to >> > > be >> > > > > > >> executed >> > > > > > >> > > on >> > > > > > >> > > > > YARN >> > > > > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using >> > > > Marathon/Aurora), >> > > > > or >> > > > > > >> most >> > > > > > >> > > > other >> > > > > > >> > > > > >> > >>>>>>>>> in-house >> > > > > > >> > > > > >> > >>>>>>> deployment >> > > > > > >> > > > > >> > >>>>>>>>> systems. This should make life a lot >> > easier >> > > > for >> > > > > > new >> > > > > > >> > > users. >> > > > > > >> > > > > >> > >>>>>>>>> Imagine >> > > > > > >> > > > > >> > >>>>>>> having >> > > > > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN. >> > The >> > > > drop >> > > > > > in >> > > > > > >> > > mailing >> > > > > > >> > > > > >> list >> > > > > > >> > > > > >> > >>>>>> traffic >> > > > > > >> > > > > >> > >>>>>>>>> will be pretty dramatic. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long >> overdue to >> > > me. >> > > > > The >> > > > > > >> > > reality >> > > > > > >> > > > > is, >> > > > > > >> > > > > >> > >>>>> everyone >> > > > > > >> > > > > >> > >>>>>>>>> that >> > > > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with >> Kafka. >> > We >> > > > > > basically >> > > > > > >> > > > require >> > > > > > >> > > > > >> it >> > > > > > >> > > > > >> > >>>>>> already >> > > > > > >> > > > > >> > >>>>>>> in >> > > > > > >> > > > > >> > >>>>>>>>> order for most features to work. Those >> > that >> > > > are >> > > > > > >> using >> > > > > > >> > > > other >> > > > > > >> > > > > >> > >>>>>>>>> systems >> > > > > > >> > > > > >> > >>>>>> are >> > > > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into >> Kafka >> > > (1), >> > > > > and >> > > > > > >> then >> > > > > > >> > > > they >> > > > > > >> > > > > do >> > > > > > >> > > > > >> > >>>>>>>>> the processing on top. There is >> already >> > > > > > discussion ( >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > > > > > >> > >> > > > > > >> > > >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851 >> > > > > > >> > > > > >> > >>>>> 767 >> > > > > > >> > > > > >> > >>>>>>>>> ) >> > > > > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka >> > > > extremely >> > > > > > >> easy. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with >> > Kafka, >> > > > we >> > > > > > can >> > > > > > >> > > > leverage >> > > > > > >> > > > > a >> > > > > > >> > > > > >> > >>>>>>>>> ton of >> > > > > > >> > > > > >> > >>>>>>> their >> > > > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to >> maintain >> > > our >> > > > > own >> > > > > > >> > config, >> > > > > > >> > > > > >> > >>>>>>>>> metrics, >> > > > > > >> > > > > >> > >>>>> etc. >> > > > > > >> > > > > >> > >>>>>>> We >> > > > > > >> > > > > >> > >>>>>>>>> can all share the same libraries, and >> > make >> > > > them >> > > > > > >> > better. >> > > > > > >> > > > This >> > > > > > >> > > > > >> > >>>>>>>>> will >> > > > > > >> > > > > >> > >>>>> also >> > > > > > >> > > > > >> > >>>>>>>>> allow us to share the >> consumer/producer >> > > APIs, >> > > > > and >> > > > > > >> will >> > > > > > >> > > let >> > > > > > >> > > > > us >> > > > > > >> > > > > >> > >>>>> leverage >> > > > > > >> > > > > >> > >>>>>>>>> their offset management and partition >> > > > > management, >> > > > > > >> > rather >> > > > > > >> > > > > than >> > > > > > >> > > > > >> > >>>>>>>>> having >> > > > > > >> > > > > >> > >>>>>> our >> > > > > > >> > > > > >> > >>>>>>>>> own. All of the coordinator stream >> code >> > > would >> > > > > go >> > > > > > >> away, >> > > > > > >> > > as >> > > > > > >> > > > > >> would >> > > > > > >> > > > > >> > >>>>>>>>> most >> > > > > > >> > > > > >> > >>>>>> of >> > > > > > >> > > > > >> > >>>>>>>>> the >> > > > > > >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably >> have >> > to >> > > > push >> > > > > > some >> > > > > > >> > > > > partition >> > > > > > >> > > > > >> > >>>>>>> management >> > > > > > >> > > > > >> > >>>>>>>>> features into the Kafka broker, but >> > they're >> > > > > > already >> > > > > > >> > > moving >> > > > > > >> > > > > in >> > > > > > >> > > > > >> > >>>>>>>>> that direction with the new consumer >> API. >> > > The >> > > > > > >> features >> > > > > > >> > > we >> > > > > > >> > > > > have >> > > > > > >> > > > > >> > >>>>>>>>> for >> > > > > > >> > > > > >> > >>>>>> partition >> > > > > > >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza, and >> > seem >> > > > > like >> > > > > > >> they >> > > > > > >> > > > should >> > > > > > >> > > > > >> be >> > > > > > >> > > > > >> > >>>>>>>>> in >> > > > > > >> > > > > >> > >>>>>> Kafka >> > > > > > >> > > > > >> > >>>>>>>>> anyway. There will always be some >> niche >> > > > usages >> > > > > > which >> > > > > > >> > > will >> > > > > > >> > > > > >> > >>>>>>>>> require >> > > > > > >> > > > > >> > >>>>>> extra >> > > > > > >> > > > > >> > >>>>>>>>> care and hence full control over >> > partition >> > > > > > >> assignments >> > > > > > >> > > > much >> > > > > > >> > > > > >> > >>>>>>>>> like the >> > > > > > >> > > > > >> > >>>>>>> Kafka >> > > > > > >> > > > > >> > >>>>>>>>> low level consumer api. These would >> > > continue >> > > > to >> > > > > > be >> > > > > > >> > > > > supported. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> These items will be good for the Samza >> > > > > community. >> > > > > > >> > > They'll >> > > > > > >> > > > > make >> > > > > > >> > > > > >> > >>>>>>>>> Samza easier to use, and make it >> easier >> > for >> > > > > > >> developers >> > > > > > >> > > to >> > > > > > >> > > > > add >> > > > > > >> > > > > >> > >>>>>>>>> new features. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large (and >> > > > somewhat >> > > > > > >> > backwards >> > > > > > >> > > > > >> > >>>>> incompatible >> > > > > > >> > > > > >> > >>>>>>>>> change). If we choose to go this >> route, >> > > it's >> > > > > > >> important >> > > > > > >> > > > that >> > > > > > >> > > > > we >> > > > > > >> > > > > >> > >>>>> openly >> > > > > > >> > > > > >> > >>>>>>>>> communicate how we're going to >> provide a >> > > > > > migration >> > > > > > >> > path >> > > > > > >> > > > from >> > > > > > >> > > > > >> > >>>>>>>>> the >> > > > > > >> > > > > >> > >>>>>>> existing >> > > > > > >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make >> > > incompatible >> > > > > > >> > changes). >> > > > > > >> > > I >> > > > > > >> > > > > >> think >> > > > > > >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need to >> > > provide a >> > > > > > >> wrapper >> > > > > > >> > to >> > > > > > >> > > > > allow >> > > > > > >> > > > > >> > >>>>>>>>> existing StreamTask implementations to >> > > > continue >> > > > > > >> > running >> > > > > > >> > > on >> > > > > > >> > > > > the >> > > > > > >> > > > > >> > >>>> new container. >> > > > > > >> > > > > >> > >>>>>>> It's >> > > > > > >> > > > > >> > >>>>>>>>> also important that we openly >> communicate >> > > > about >> > > > > > >> > timing, >> > > > > > >> > > > and >> > > > > > >> > > > > >> > >>>>>>>>> stages >> > > > > > >> > > > > >> > >>>>> of >> > > > > > >> > > > > >> > >>>>>>> the >> > > > > > >> > > > > >> > >>>>>>>>> migration. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure you >> > have >> > > > > > opinions. >> > > > > > >> > :) >> > > > > > >> > > > > Please >> > > > > > >> > > > > >> > >>>>>>>>> send >> > > > > > >> > > > > >> > >>>>>> your >> > > > > > >> > > > > >> > >>>>>>>>> thoughts and feedback. >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>>> Cheers, >> > > > > > >> > > > > >> > >>>>>>>>> Chris >> > > > > > >> > > > > >> > >>>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>>> >> > > > > > >> > > > > >> > >>>>>>> >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>>> -- >> > > > > > >> > > > > >> > >>>>>> -- Guozhang >> > > > > > >> > > > > >> > >>>>>> >> > > > > > >> > > > > >> > >>>>> >> > > > > > >> > > > > >> > >>>> >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> > >> > > > > > >> > > > > >> >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > >> > > > > > >> > > > >> > > > > > >> > > >> > > > > > >> > >> > > > > > >> >> > > > > > >> > > > > >> > > > >> > > >> > >> > >