Re: Thoughts and obesrvations on Samza

Yi Pan Sun, 12 Jul 2015 19:47:07 -0700

Just to make it explicitly clear what I am proposing, here is a version of
more detailed description:


The fourth option (in addition to what Jakob summarized) we are proposing
is:

- Recharter Samza to “stream processing as a service”

- The current Samza core (the basic transformation API w/ basic partition
and offset management build-in) will be moved to Kafka Streams (i.e. part
of Kafka) and supports “run-as-a-library”

- Deprecate the SystemConsumers and SystemProducers APIs and move them to
Copycat

- The current SQL development:

   * physical operators and a Trident-like stream API should stay in Kafka
Streams as libraries, enabling any standalone deployment to use the core
window/join functions

   * query parser/planner and execution on top of a distributed service
should stay in new Samza (i.e. “stream processing as a service”)

- Advanced features related to job scheduling/state management stays in new
Samza (i.e. “streaming processing as a service”)

   * Any advanced PartitionManager implementation that can be plugged into
Kafka Streams

   * Any auto-scaling, dynamic configuration via coordinator stream

   * Any advanced state management s.t. host-affinity etc.


Pros:

- W/ the current Samza core as Kafka Streams and move the ingestion to
Copycat, we achieved most of the goals in the initial proposal:

   * Tighter coupling w/ Kafka

   * Reuse Kafka’s build-in functionalities, such as offset manager, basic
partition distribution

   * Separation of ingestion vs transformation APIs

   * offload a lot of system-specific configuration to Kafka Streams and
Copycat (i.e. SystemFactory configure, serde configure, etc.)

   * remove YARN dependency and make standalone deployment easy. As
Guozhang mentioned, it would be really easy to start a process that
internally run Kafka Streams as library.

- By re-chartering Samza as “stream processing as a service”, we address
the concern regarding to

   * Pluggable partition management

   * Running in a distributed cluster to manage process lifecycle,
fault-tolerance, resource-allocation, etc.

   * More advanced features s.t. host-affinity, auto-scaling, and dynamic
configure changes, etc.


Regarding to the code and community organization, I think the following may
be the best:

Code:

- A Kafka sub-project Kafka Streams to hold samza-core, samza-kv-store, and
the physical operator layer as library in SQL: this would allow better
alignment w/ Kafka, in code, doc, and branding

- Retain the current Samza project just to keep

   * A pluggable explicit partition management in Kafka Streams client

   * Integration w/ cluster-management systems for advanced features:

      * host-affinity, auto-scaling,, dynamic configuration, etc.

   * It will fully depend on the Kafka Streams API and remove all support
for SystemConsumers/SystemProducers in the future

Community: (this is almost the same as what Chris proposed)

- Kafka Streams: the current Samza community should be supporting this
effort together with some Kafka members, since most of the code here will
be from samza-core, samza-kv-store, and samza-sql.

- new Samza: the current Samza community should continue serve the course
to support more advanced features to run Kafka Streams as a service.
Arguably, the new Samza framework may be used to run Copycat workers as
well, at least to manage Copycat worker’s lifecycle in a clustered
environment. Hence, it would stay as a general stream processing framework
that takes in any source and output to any destination, just the transport
system is fixed to Kafka.

On Sun, Jul 12, 2015 at 7:29 PM, Yi Pan <[email protected]> wrote:

> Hi, Chris,
>
> Thanks for sending out this concrete set of points here. I agree w/ all
> but have a slight different point view on 8).
>
> My view on this is: instead of sunset Samza as TLP, can we re-charter the
> scope of Samza to be the home for "running streaming process as a service"?
>
> My main motivation is from the following points from a long internal
> discussion in LinkedIn:
>
> - There is a clear ask for pluggable partition management, like we do in
> LinkedIn, and as Ben Kirwin has mentioned in
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3ccacux-d-yjx++2gnf_1laf10kyuvyamg7up_dt19v0znmmhb...@mail.gmail.com%3E
> - There are concerns on lack of support for running stream processing in a
> cluster: lifecycle management, resource allocation, fault tolerance, etc.
> - There is a question to how to support more advanced features s.t.
> host-affinity, auto-scaling, and dynamic configuration in Samza jobs, as
> raised by Martin here:
> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%[email protected]%3E
>
> We have use cases that need to address all the above three cases and most
> of the functions are all in the current Samza project, in some flavor. We
> are all supporting to merge the samza-core functionalities into Kafka
> Streams, but there is a question where we keep these functions in the
> future. One option is to start a new project that includes these functions
> that are closely related w/ "run stream-processing as-a-service", while
> another personally more attractive option is to re-charter Samza project
> just do "run stream processing as-a-service". We can avoid the overhead of
> re-starting another community for this project. Personally, I felt that
> here are the benefits we should be getting:
>
> 1. We have already agreed mostly that Kafka Streams API would allow some
> pluggable partition management functions. Hence, the advanced partition
> management can live out-side the new Kafka Streams core w/o affecting the
> run-as-a-library model in Kafka Streams.
> 2. The integration w/ cluster management system and advanced features
> listed above stays in the same project and allow existing users enjoy
> no-impact migration to Kafka Stream as the core. That also addresses Tim's
> question on "removing the support for YARN".
> 3. A separate project for stream-processing-as-a-service also allow the
> new Kafka Streams being independent to any cluster management and just
> focusing on stream process core functions, while leaving the functions that
> requires cluster-resource and state management to a separate layer.
>
> Please feel free to comment. Thanks!
>
> On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <[email protected]>
> wrote:
>
>> Hey all,
>>
>> I want to start by saying that I'm absolutely thrilled to be a part of
>> this
>> community. The amount of level-headed, thoughtful, educated discussion
>> that's gone on over the past ~10 days is overwhelming. Wonderful.
>>
>> It seems like discussion is waning a bit, and we've reached some
>> conclusions. There are several key emails in this threat, which I want to
>> call out:
>>
>> 1. Jakob's summary of the three potential ways forward.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
>> 2. Julian's call out that we should be focusing on community over code.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
>> 3. Martin's summary about the benefits of merging communities.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
>> 4. Jakob's comments about the distinction between community and code
>> paths.
>>
>>
>> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E
>>
>> I agree with the comments on all of these emails. I think Martin's summary
>> of his position aligns very closely with my own. To that end, I think we
>> should get concrete about what the proposal is, and call a vote on it.
>> Given that Jay, Martin, and I seem to be aligning fairly closely, I think
>> we should start with:
>>
>> 1. [community] Make Samza a subproject of Kafka.
>> 2. [community] Make all Samza PMC/committers committers of the subproject.
>> 3. [community] Migrate Samza's website/documentation into Kafka's.
>> 4. [code] Have the Samza community and the Kafka community start a
>> from-scratch reboot together in the new Kafka subproject. We can
>> borrow/copy &  paste significant chunks of code from Samza's code base.
>> 5. [code] The subproject would intentionally eliminate support for both
>> other streaming systems and all deployment systems.
>> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26
>> (copy cat)
>> 7. [code] Attempt to provide a bridge from the new subproject's processor
>> interface to our legacy StreamTask interface.
>> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka
>> subproject that has a fault-tolerant container with state management.
>>
>> It's likely that (6) and (7) won't be fully drop-in. Still, the closer we
>> can get, the better it's going to be for our existing community.
>>
>> One thing that I didn't touch on with (2) is whether any Samza PMC members
>> should be rolled into Kafka PMC membership as well (though, Jay and Jakob
>> are already PMC members on both). I think that Samza's community deserves
>> a
>> voice on the PMC, so I'd propose that we roll at least a few PMC members
>> into the Kafka PMC, but I don't have a strong framework for which people
>> to
>> pick.
>>
>> Before (8), I think that Samza's TLP can continue to commit bug fixes and
>> patches as it sees fit, provided that we openly communicate that we won't
>> necessarily migrate new features to the new subproject, and that the TLP
>> will be shut down after the migration to the Kafka subproject occurs.
>>
>> Jakob, I could use your guidance here about about how to achieve this from
>> an Apache process perspective (sorry).
>>
>> * Should I just call a vote on this proposal?
>> * Should it happen on dev or private?
>> * Do committers have binding votes, or just PMC?
>>
>> Having trouble finding much detail on the Apache wikis. :(
>>
>> Cheers,
>> Chris
>>
>> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <[email protected]> wrote:
>>
>> > Thanks, Jay. This argument persuaded me actually. :)
>> >
>> > Fang, Yan
>> > [email protected]
>> >
>> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <[email protected]> wrote:
>> >
>> > > Hey Yan,
>> > >
>> > > Yeah philosophically I think the argument is that you should capture
>> the
>> > > stream in Kafka independent of the transformation. This is obviously a
>> > > Kafka-centric view point.
>> > >
>> > > Advantages of this:
>> > > - In practice I think this is what e.g. Storm people often end up
>> doing
>> > > anyway. You usually need to throttle any access to a live serving
>> > database.
>> > > - Can have multiple subscribers and they get the same thing without
>> > > additional load on the source system.
>> > > - Applications can tap into the stream if need be by subscribing.
>> > > - You can debug your transformation by tailing the Kafka topic with
>> the
>> > > console consumer
>> > > - Can tee off the same data stream for batch analysis or Lambda arch
>> > style
>> > > re-processing
>> > >
>> > > The disadvantage is that it will use Kafka resources. But the idea is
>> > > eventually you will have multiple subscribers to any data source (at
>> > least
>> > > for monitoring) so you will end up there soon enough anyway.
>> > >
>> > > Down the road the technical benefit is that I think it gives us a good
>> > path
>> > > towards end-to-end exactly once semantics from source to destination.
>> > > Basically the connectors need to support idempotence when talking to
>> > Kafka
>> > > and we need the transactional write feature in Kafka to make the
>> > > transformation atomic. This is actually pretty doable if you separate
>> > > connector=>kafka problem from the generic transformations which are
>> > always
>> > > kafka=>kafka. However I think it is quite impossible to do in a
>> > all_things
>> > > => all_things environment. Today you can say "well the semantics of
>> the
>> > > Samza APIs depend on the connectors you use" but it is actually worse
>> > then
>> > > that because the semantics actually depend on the pairing of
>> > connectors--so
>> > > not only can you probably not get a usable "exactly once" guarantee
>> > > end-to-end it can actually be quite hard to reverse engineer what
>> > property
>> > > (if any) your end-to-end flow has if you have heterogenous systems.
>> > >
>> > > -Jay
>> > >
>> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <[email protected]>
>> wrote:
>> > >
>> > > > {quote}
>> > > > maintained in a separate repository and retaining the existing
>> > > > committership but sharing as much else as possible (website, etc)
>> > > > {quote}
>> > > >
>> > > > Overall, I agree on this idea. Now the question is more about "how
>> to
>> > do
>> > > > it".
>> > > >
>> > > > On the other hand, one thing I want to point out is that, if we
>> decide
>> > to
>> > > > go this way, how do we want to support
>> > > > otherSystem-transformation-otherSystem use case?
>> > > >
>> > > > Basically, there are four user groups here:
>> > > >
>> > > > 1. Kafka-transformation-Kafka
>> > > > 2. Kafka-transformation-otherSystem
>> > > > 3. otherSystem-transformation-Kafka
>> > > > 4. otherSystem-transformation-otherSystem
>> > > >
>> > > > For group 1, they can easily use the new Samza library to achieve.
>> For
>> > > > group 2 and 3, they can use copyCat -> transformation -> Kafka or
>> > Kafka->
>> > > > transformation -> copyCat.
>> > > >
>> > > > The problem is for group 4. Do we want to abandon this or still
>> support
>> > > it?
>> > > > Of course, this use case can be achieved by using copyCat ->
>> > > transformation
>> > > > -> Kafka -> transformation -> copyCat, the thing is how we persuade
>> > them
>> > > to
>> > > > do this long chain. If yes, it will also be a win for Kafka too. Or
>> if
>> > > > there is no one in this community actually doing this so far, maybe
>> ok
>> > to
>> > > > not support the group 4 directly.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Fang, Yan
>> > > > [email protected]
>> > > >
>> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <[email protected]>
>> wrote:
>> > > >
>> > > > > Yeah I agree with this summary. I think there are kind of two
>> > questions
>> > > > > here:
>> > > > > 1. Technically does alignment/reliance on Kafka make sense
>> > > > > 2. Branding wise (naming, website, concepts, etc) does alignment
>> with
>> > > > Kafka
>> > > > > make sense
>> > > > >
>> > > > > Personally I do think both of these things would be really
>> valuable,
>> > > and
>> > > > > would dramatically alter the trajectory of the project.
>> > > > >
>> > > > > My preference would be to see if people can mostly agree on a
>> > direction
>> > > > > rather than splintering things off. From my point of view the
>> ideal
>> > > > outcome
>> > > > > of all the options discussed would be to make Samza a closely
>> aligned
>> > > > > subproject, maintained in a separate repository and retaining the
>> > > > existing
>> > > > > committership but sharing as much else as possible (website,
>> etc). No
>> > > > idea
>> > > > > about how these things work, Jacob, you probably know more.
>> > > > >
>> > > > > No discussion amongst the Kafka folks has happened on this, but
>> > likely
>> > > we
>> > > > > should figure out what the Samza community actually wants first.
>> > > > >
>> > > > > I admit that this is a fairly radical departure from how things
>> are.
>> > > > >
>> > > > > If that doesn't fly, I think, yeah we could leave Samza as it is
>> and
>> > do
>> > > > the
>> > > > > more radical reboot inside Kafka. From my point of view that does
>> > leave
>> > > > > things in a somewhat confusing state since now there are two
>> stream
>> > > > > processing systems more or less coupled to Kafka in large part
>> made
>> > by
>> > > > the
>> > > > > same people. But, arguably that might be a cleaner way to make the
>> > > > cut-over
>> > > > > and perhaps less risky for Samza community since if it works
>> people
>> > can
>> > > > > switch and if it doesn't nothing will have changed. Dunno, how do
>> > > people
>> > > > > feel about this?
>> > > > >
>> > > > > -Jay
>> > > > >
>> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <[email protected]>
>> > > wrote:
>> > > > >
>> > > > > > >  This leads me to thinking that merging projects and
>> communities
>> > > > might
>> > > > > > be a good idea: with the union of experience from both
>> communities,
>> > > we
>> > > > > will
>> > > > > > probably build a better system that is better for users.
>> > > > > > Is this what's being proposed though? Merging the projects seems
>> > like
>> > > > > > a consequence of at most one of the three directions under
>> > > discussion:
>> > > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka
>> for
>> > > > > > configuration, etc. (to a greater or lesser extent to be
>> > determined)
>> > > > > > but the Samza community would not automatically merge withe
>> Kafka
>> > > > > > community (the Phoenix/HBase example is a good one here).
>> > > > > > 2) Samza Reboot: The Samza community continues to exist with a
>> > > limited
>> > > > > > project scope, but similarly would not need to be part of the
>> Kafka
>> > > > > > community (ie given committership) to progress.  Here, maybe the
>> > > Samza
>> > > > > > team would become a subproject of Kafka (the Board frowns on
>> > > > > > subprojects at the moment, so I'm not sure if that's even
>> > feasible),
>> > > > > > but that would not be required.
>> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the
>> > Kafka
>> > > > > > team builds its own streaming library, possibly off of Jay's
>> > > > > > prototype, which has not direct lineage to the Samza team.
>> There's
>> > > no
>> > > > > > reason for the Kafka team to bring in the Samza team.
>> > > > > >
>> > > > > > Is the Kafka community on board with this?
>> > > > > >
>> > > > > > To be clear, all three options under discussion are interesting,
>> > > > > > technically valid and likely healthy directions for the project.
>> > > > > > Also, they are not mutually exclusive.  The Samza community
>> could
>> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka community
>> went
>> > > > > > forward with 'Hey Samza!'  My points above are directed
>> entirely at
>> > > > > > the community aspect of these choices.
>> > > > > > -Jakob
>> > > > > >
>> > > > > > On 10 July 2015 at 09:10, Roger Hoover <[email protected]>
>> > > wrote:
>> > > > > > > That's great.  Thanks, Jay.
>> > > > > > >
>> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <[email protected]>
>> > > wrote:
>> > > > > > >
>> > > > > > >> Yeah totally agree. I think you have this issue even today,
>> > right?
>> > > > > I.e.
>> > > > > > if
>> > > > > > >> you need to make a simple config change and you're running in
>> > YARN
>> > > > > today
>> > > > > > >> you end up bouncing the job which then rebuilds state. I
>> think
>> > the
>> > > > fix
>> > > > > > is
>> > > > > > >> exactly what you described which is to have a long timeout on
>> > > > > partition
>> > > > > > >> movement for stateful jobs so that if a job is just getting
>> > > bounced,
>> > > > > and
>> > > > > > >> the cluster manager (or admin) is smart enough to restart it
>> on
>> > > the
>> > > > > same
>> > > > > > >> host when possible, it can optimistically reuse any existing
>> > state
>> > > > it
>> > > > > > finds
>> > > > > > >> on disk (if it is valid).
>> > > > > > >>
>> > > > > > >> So in this model the charter of the CM is to place processes
>> as
>> > > > > > stickily as
>> > > > > > >> possible and to restart or re-place failed processes. The
>> > charter
>> > > of
>> > > > > the
>> > > > > > >> partition management system is to control the assignment of
>> work
>> > > to
>> > > > > > these
>> > > > > > >> processes. The nice thing about this is that the work
>> > assignment,
>> > > > > > timeouts,
>> > > > > > >> behavior, configs, and code will all be the same across all
>> > > cluster
>> > > > > > >> managers.
>> > > > > > >>
>> > > > > > >> So I think that prototype would actually give you exactly
>> what
>> > you
>> > > > > want
>> > > > > > >> today for any cluster manager (or manual placement + restart
>> > > script)
>> > > > > > that
>> > > > > > >> was sticky in terms of host placement since there is already
>> a
>> > > > > > configurable
>> > > > > > >> partition movement timeout and task-by-task state reuse with
>> a
>> > > check
>> > > > > on
>> > > > > > >> state validity.
>> > > > > > >>
>> > > > > > >> -Jay
>> > > > > > >>
>> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover <
>> > > > [email protected]
>> > > > > >
>> > > > > > >> wrote:
>> > > > > > >>
>> > > > > > >> > That would be great to let Kafka do as much heavy lifting
>> as
>> > > > > possible
>> > > > > > and
>> > > > > > >> > make it easier for other languages to implement Samza apis.
>> > > > > > >> >
>> > > > > > >> > One thing to watch out for is the interplay between Kafka's
>> > > group
>> > > > > > >> > management and the external scheduler/process manager's
>> fault
>> > > > > > tolerance.
>> > > > > > >> > If a container dies, the Kafka group membership protocol
>> will
>> > > try
>> > > > to
>> > > > > > >> assign
>> > > > > > >> > it's tasks to other containers while at the same time the
>> > > process
>> > > > > > manager
>> > > > > > >> > is trying to relaunch the container.  Without some
>> > consideration
>> > > > for
>> > > > > > this
>> > > > > > >> > (like a configurable amount of time to wait before Kafka
>> > alters
>> > > > the
>> > > > > > group
>> > > > > > >> > membership), there may be thrashing going on which is
>> > especially
>> > > > bad
>> > > > > > for
>> > > > > > >> > containers with large amounts of local state.
>> > > > > > >> >
>> > > > > > >> > Someone else pointed this out already but I thought it
>> might
>> > be
>> > > > > worth
>> > > > > > >> > calling out again.
>> > > > > > >> >
>> > > > > > >> > Cheers,
>> > > > > > >> >
>> > > > > > >> > Roger
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <
>> [email protected]>
>> > > > > wrote:
>> > > > > > >> >
>> > > > > > >> > > Hey Roger,
>> > > > > > >> > >
>> > > > > > >> > > I couldn't agree more. We spent a bunch of time talking
>> to
>> > > > people
>> > > > > > and
>> > > > > > >> > that
>> > > > > > >> > > is exactly the stuff we heard time and again. What makes
>> it
>> > > > hard,
>> > > > > of
>> > > > > > >> > > course, is that there is some tension between
>> compatibility
>> > > with
>> > > > > > what's
>> > > > > > >> > > there now and making things better for new users.
>> > > > > > >> > >
>> > > > > > >> > > I also strongly agree with the importance of
>> multi-language
>> > > > > > support. We
>> > > > > > >> > are
>> > > > > > >> > > talking now about Java, but for application development
>> use
>> > > > cases
>> > > > > > >> people
>> > > > > > >> > > want to work in whatever language they are using
>> elsewhere.
>> > I
>> > > > > think
>> > > > > > >> > moving
>> > > > > > >> > > to a model where Kafka itself does the group membership,
>> > > > lifecycle
>> > > > > > >> > control,
>> > > > > > >> > > and partition assignment has the advantage of putting all
>> > that
>> > > > > > complex
>> > > > > > >> > > stuff behind a clean api that the clients are already
>> going
>> > to
>> > > > be
>> > > > > > >> > > implementing for their consumer, so the added
>> functionality
>> > > for
>> > > > > > stream
>> > > > > > >> > > processing beyond a consumer becomes very minor.
>> > > > > > >> > >
>> > > > > > >> > > -Jay
>> > > > > > >> > >
>> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover <
>> > > > > > [email protected]>
>> > > > > > >> > > wrote:
>> > > > > > >> > >
>> > > > > > >> > > > Metamorphosis...nice. :)
>> > > > > > >> > > >
>> > > > > > >> > > > This has been a great discussion.  As a user of Samza
>> > who's
>> > > > > > recently
>> > > > > > >> > > > integrated it into a relatively large organization, I
>> just
>> > > > want
>> > > > > to
>> > > > > > >> add
>> > > > > > >> > > > support to a few points already made.
>> > > > > > >> > > >
>> > > > > > >> > > > The biggest hurdles to adoption of Samza as it
>> currently
>> > > > exists
>> > > > > > that
>> > > > > > >> > I've
>> > > > > > >> > > > experienced are:
>> > > > > > >> > > > 1) YARN - YARN is overly complex in many environments
>> > where
>> > > > > Puppet
>> > > > > > >> > would
>> > > > > > >> > > do
>> > > > > > >> > > > just fine but it was the only mechanism to get fault
>> > > > tolerance.
>> > > > > > >> > > > 2) Configuration - I think I like the idea of
>> configuring
>> > > most
>> > > > > of
>> > > > > > the
>> > > > > > >> > job
>> > > > > > >> > > > in code rather than config files.  In general, I think
>> the
>> > > > goal
>> > > > > > >> should
>> > > > > > >> > be
>> > > > > > >> > > > to make it harder to make mistakes, especially of the
>> kind
>> > > > where
>> > > > > > the
>> > > > > > >> > code
>> > > > > > >> > > > expects something and the config doesn't match.  The
>> > current
>> > > > > > config
>> > > > > > >> is
>> > > > > > >> > > > quite intricate and error-prone.  For example, the
>> > > application
>> > > > > > logic
>> > > > > > >> > may
>> > > > > > >> > > > depend on bootstrapping a topic but rather than
>> asserting
>> > > that
>> > > > > in
>> > > > > > the
>> > > > > > >> > > code,
>> > > > > > >> > > > you have to rely on getting the config right.  Likewise
>> > with
>> > > > > > serdes,
>> > > > > > >> > the
>> > > > > > >> > > > Java representations produced by various serdes (JSON,
>> > Avro,
>> > > > > etc.)
>> > > > > > >> are
>> > > > > > >> > > not
>> > > > > > >> > > > equivalent so you cannot just reconfigure a serde
>> without
>> > > > > changing
>> > > > > > >> the
>> > > > > > >> > > > code.   It would be nice for jobs to be able to assert
>> > what
>> > > > they
>> > > > > > >> expect
>> > > > > > >> > > > from their input topics in terms of partitioning.
>> This is
>> > > > > > getting a
>> > > > > > >> > > little
>> > > > > > >> > > > off topic but I was even thinking about creating a
>> "Samza
>> > > > config
>> > > > > > >> > linter"
>> > > > > > >> > > > that would sanity check a set of configs.  Especially
>> in
>> > > > > > >> organizations
>> > > > > > >> > > > where config is managed by a different team than the
>> > > > application
>> > > > > > >> > > developer,
>> > > > > > >> > > > it's very hard to get avoid config mistakes.
>> > > > > > >> > > > 3) Java/Scala centric - for many teams (especially
>> > > DevOps-type
>> > > > > > >> folks),
>> > > > > > >> > > the
>> > > > > > >> > > > pain of the Java toolchain (maven, slow builds, weak
>> > command
>> > > > > line
>> > > > > > >> > > support,
>> > > > > > >> > > > configuration over convention) really inhibits
>> > productivity.
>> > > > As
>> > > > > > more
>> > > > > > >> > and
>> > > > > > >> > > > more high-quality clients become available for Kafka, I
>> > hope
>> > > > > > they'll
>> > > > > > >> > > follow
>> > > > > > >> > > > Samza's model.  Not sure how much it affects the
>> proposals
>> > > in
>> > > > > this
>> > > > > > >> > thread
>> > > > > > >> > > > but please consider other languages in the ecosystem as
>> > > well.
>> > > > > > From
>> > > > > > >> > what
>> > > > > > >> > > > I've heard, Spark has more Python users than
>> Java/Scala.
>> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > >
>> > > > > > >> >
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
>> > > > > > >> > > > and are working on a Yeoman generator
>> > > > > > >> > > > https://github.com/Quantiply/generator-rico for
>> > > Jython/Samza
>> > > > > > >> projects
>> > > > > > >> > to
>> > > > > > >> > > > alleviate some of the pain)
>> > > > > > >> > > >
>> > > > > > >> > > > I also want to underscore Jay's point about improving
>> the
>> > > user
>> > > > > > >> > > experience.
>> > > > > > >> > > > That's a very important factor for adoption.  I think
>> the
>> > > goal
>> > > > > > should
>> > > > > > >> > be
>> > > > > > >> > > to
>> > > > > > >> > > > make Samza as easy to get started with as something
>> like
>> > > > > Logstash.
>> > > > > > >> > > > Logstash is vastly inferior in terms of capabilities to
>> > > Samza
>> > > > > but
>> > > > > > >> it's
>> > > > > > >> > > easy
>> > > > > > >> > > > to get started and that makes a big difference.
>> > > > > > >> > > >
>> > > > > > >> > > > Cheers,
>> > > > > > >> > > >
>> > > > > > >> > > > Roger
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > >
>> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De Francisci
>> > > > Morales <
>> > > > > > >> > > > [email protected]> wrote:
>> > > > > > >> > > >
>> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka
>> Metamorphosis
>> > > is
>> > > > a
>> > > > > > clear
>> > > > > > >> > > > winner
>> > > > > > >> > > > > :)
>> > > > > > >> > > > >
>> > > > > > >> > > > > --
>> > > > > > >> > > > > Gianmarco
>> > > > > > >> > > > >
>> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci
>> Morales
>> > <
>> > > > > > >> > > [email protected]
>> > > > > > >> > > > >
>> > > > > > >> > > > > wrote:
>> > > > > > >> > > > >
>> > > > > > >> > > > > > Hi,
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > @Martin, thanks for you comments.
>> > > > > > >> > > > > > Maybe I'm missing some important point, but I think
>> > > > coupling
>> > > > > > the
>> > > > > > >> > > > releases
>> > > > > > >> > > > > > is actually a *good* thing.
>> > > > > > >> > > > > > To make an example, would it be better if the MR
>> and
>> > > HDFS
>> > > > > > >> > components
>> > > > > > >> > > of
>> > > > > > >> > > > > > Hadoop had different release schedules?
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > Actually, keeping the discussion in a single place
>> > would
>> > > > > make
>> > > > > > >> > > agreeing
>> > > > > > >> > > > on
>> > > > > > >> > > > > > releases (and backwards compatibility) much
>> easier, as
>> > > > > > everybody
>> > > > > > >> > > would
>> > > > > > >> > > > be
>> > > > > > >> > > > > > responsible for the whole codebase.
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > That said, I like the idea of absorbing samza-core
>> as
>> > a
>> > > > > > >> > sub-project,
>> > > > > > >> > > > and
>> > > > > > >> > > > > > leave the fancy stuff separate.
>> > > > > > >> > > > > > It probably gives 90% of the benefits we have been
>> > > > > discussing
>> > > > > > >> here.
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > Cheers,
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > --
>> > > > > > >> > > > > > Gianmarco
>> > > > > > >> > > > > >
>> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <
>> > [email protected]
>> > > >
>> > > > > > wrote:
>> > > > > > >> > > > > >
>> > > > > > >> > > > > >> Hey Martin,
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> I agree coupling release schedules is a downside.
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> Definitely we can try to solve some of the
>> > integration
>> > > > > > problems
>> > > > > > >> in
>> > > > > > >> > > > > >> Confluent Platform or in other distributions. But
>> I
>> > > think
>> > > > > > this
>> > > > > > >> > ends
>> > > > > > >> > > up
>> > > > > > >> > > > > >> being really shallow. I guess I feel to really
>> get a
>> > > good
>> > > > > > user
>> > > > > > >> > > > > experience
>> > > > > > >> > > > > >> the two systems have to kind of feel like part of
>> the
>> > > > same
>> > > > > > thing
>> > > > > > >> > and
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> can't really add that in later--you can put both
>> in
>> > the
>> > > > > same
>> > > > > > >> > > > > downloadable
>> > > > > > >> > > > > >> tar file but it doesn't really give a very
>> cohesive
>> > > > > feeling.
>> > > > > > I
>> > > > > > >> > agree
>> > > > > > >> > > > > that
>> > > > > > >> > > > > >> ultimately any of the project stuff is as much
>> social
>> > > and
>> > > > > > naming
>> > > > > > >> > as
>> > > > > > >> > > > > >> anything else--theoretically two totally
>> independent
>> > > > > projects
>> > > > > > >> > could
>> > > > > > >> > > > work
>> > > > > > >> > > > > >> to
>> > > > > > >> > > > > >> tightly align. In practice this seems to be quite
>> > > > difficult
>> > > > > > >> > though.
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> For the frameworks--totally agree it would be
>> good to
>> > > > > > maintain
>> > > > > > >> the
>> > > > > > >> > > > > >> framework support with the project. In some cases
>> > there
>> > > > may
>> > > > > > not
>> > > > > > >> be
>> > > > > > >> > > too
>> > > > > > >> > > > > >> much
>> > > > > > >> > > > > >> there since the integration gets lighter but I
>> think
>> > > > > whatever
>> > > > > > >> > stubs
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> need should be included. So no I definitely wasn't
>> > > trying
>> > > > > to
>> > > > > > >> imply
>> > > > > > >> > > > > >> dropping
>> > > > > > >> > > > > >> support for these frameworks, just making the
>> > > integration
>> > > > > > >> lighter
>> > > > > > >> > by
>> > > > > > >> > > > > >> separating process management from partition
>> > > management.
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> You raise two good points we would have to figure
>> out
>> > > if
>> > > > we
>> > > > > > went
>> > > > > > >> > > down
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> alignment path:
>> > > > > > >> > > > > >> 1. With respect to the name, yeah I think the
>> first
>> > > > > question
>> > > > > > is
>> > > > > > >> > > > whether
>> > > > > > >> > > > > >> some "re-branding" would be worth it. If so then I
>> > > think
>> > > > we
>> > > > > > can
>> > > > > > >> > > have a
>> > > > > > >> > > > > big
>> > > > > > >> > > > > >> thread on the name. I'm definitely not set on
>> Kafka
>> > > > > > Streaming or
>> > > > > > >> > > Kafka
>> > > > > > >> > > > > >> Streams I was just using them to be kind of
>> > > > illustrative. I
>> > > > > > >> agree
>> > > > > > >> > > with
>> > > > > > >> > > > > >> your
>> > > > > > >> > > > > >> critique of these names, though I think people
>> would
>> > > get
>> > > > > the
>> > > > > > >> idea.
>> > > > > > >> > > > > >> 2. Yeah you also raise a good point about how to
>> > > "factor"
>> > > > > it.
>> > > > > > >> Here
>> > > > > > >> > > are
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> options I see (I could get enthusiastic about any
>> of
>> > > > them):
>> > > > > > >> > > > > >>    a. One repo for both Kafka and Samza
>> > > > > > >> > > > > >>    b. Two repos, retaining the current seperation
>> > > > > > >> > > > > >>    c. Two repos, the equivalent of samza-api and
>> > > > samza-core
>> > > > > > is
>> > > > > > >> > > > absorbed
>> > > > > > >> > > > > >> almost like a third client
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> Cheers,
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> -Jay
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin Kleppmann <
>> > > > > > >> > > > [email protected]>
>> > > > > > >> > > > > >> wrote:
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a few
>> > > follow-up
>> > > > > > >> > comments.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka or
>> > becoming
>> > > a
>> > > > > > >> > subproject:
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > reasons you mention are good. The risk I see is
>> > that
>> > > > > > release
>> > > > > > >> > > > schedules
>> > > > > > >> > > > > >> > become coupled to each other, which can slow
>> > everyone
>> > > > > down,
>> > > > > > >> and
>> > > > > > >> > > > large
>> > > > > > >> > > > > >> > projects with many contributors are harder to
>> > manage.
>> > > > > > (Jakob,
>> > > > > > >> > can
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> speak
>> > > > > > >> > > > > >> > from experience, having seen a wider range of
>> > Hadoop
>> > > > > > ecosystem
>> > > > > > >> > > > > >> projects?)
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > Some of the goals of a better unified developer
>> > > > > experience
>> > > > > > >> could
>> > > > > > >> > > > also
>> > > > > > >> > > > > be
>> > > > > > >> > > > > >> > solved by integrating Samza nicely into a Kafka
>> > > > > > distribution
>> > > > > > >> > (such
>> > > > > > >> > > > as
>> > > > > > >> > > > > >> > Confluent's). I'm not against merging projects
>> if
>> > we
>> > > > > decide
>> > > > > > >> > that's
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> way
>> > > > > > >> > > > > >> > to go, just pointing out the same goals can
>> perhaps
>> > > > also
>> > > > > be
>> > > > > > >> > > achieved
>> > > > > > >> > > > > in
>> > > > > > >> > > > > >> > other ways.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > - With regard to dropping the YARN dependency:
>> are
>> > > you
>> > > > > > >> proposing
>> > > > > > >> > > > that
>> > > > > > >> > > > > >> > Samza doesn't give any help to people wanting to
>> > run
>> > > on
>> > > > > > >> > > > > >> YARN/Mesos/AWS/etc?
>> > > > > > >> > > > > >> > So the docs would basically have a link to
>> Slider
>> > and
>> > > > > > nothing
>> > > > > > >> > > else?
>> > > > > > >> > > > Or
>> > > > > > >> > > > > >> > would we maintain integrations with a bunch of
>> > > popular
>> > > > > > >> > deployment
>> > > > > > >> > > > > >> methods
>> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts to
>> make
>> > > > Samza
>> > > > > > work
>> > > > > > >> > with
>> > > > > > >> > > > > >> Slider)?
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > I absolutely think it's a good idea to have the
>> > "as a
>> > > > > > library"
>> > > > > > >> > and
>> > > > > > >> > > > > "as a
>> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for
>> people
>> > who
>> > > > > want
>> > > > > > >> them,
>> > > > > > >> > > > but I
>> > > > > > >> > > > > >> > think there should also be a low-friction path
>> for
>> > > > common
>> > > > > > "as
>> > > > > > >> a
>> > > > > > >> > > > > service"
>> > > > > > >> > > > > >> > deployment methods, for which we probably need
>> to
>> > > > > maintain
>> > > > > > >> > > > > integrations.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems odd to
>> me,
>> > > > > because
>> > > > > > >> Kafka
>> > > > > > >> > > is
>> > > > > > >> > > > > all
>> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka
>> Transformers"
>> > > or
>> > > > > > "Kafka
>> > > > > > >> > > > Filters"
>> > > > > > >> > > > > >> > would be more apt?
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza
>> (stream
>> > > > > > >> transformation
>> > > > > > >> > > > with
>> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a
>> library"
>> > > bit)
>> > > > > > could
>> > > > > > >> > > become
>> > > > > > >> > > > > >> part of
>> > > > > > >> > > > > >> > Kafka, while higher-level tools such as
>> streaming
>> > SQL
>> > > > and
>> > > > > > >> > > > integrations
>> > > > > > >> > > > > >> with
>> > > > > > >> > > > > >> > deployment frameworks remain in a separate
>> project?
>> > > In
>> > > > > > other
>> > > > > > >> > > words,
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > would absorb the proven, stable core of Samza,
>> > which
>> > > > > would
>> > > > > > >> > become
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > "third Kafka client" mentioned early in this
>> > thread.
>> > > > The
>> > > > > > Samza
>> > > > > > >> > > > project
>> > > > > > >> > > > > >> > would then target that third Kafka client as its
>> > base
>> > > > > API,
>> > > > > > and
>> > > > > > >> > the
>> > > > > > >> > > > > >> project
>> > > > > > >> > > > > >> > would be freed up to explore more experimental
>> new
>> > > > > > horizons.
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > Martin
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
>> > > > [email protected]>
>> > > > > > >> wrote:
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > > Hey Martin,
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I actually
>> > don't
>> > > > > think
>> > > > > > it
>> > > > > > >> > ties
>> > > > > > >> > > > our
>> > > > > > >> > > > > >> > hands
>> > > > > > >> > > > > >> > > at all, all it does is refactor things. The
>> > > division
>> > > > of
>> > > > > > >> > > > > >> responsibility is
>> > > > > > >> > > > > >> > > that Samza core is responsible for task
>> > lifecycle,
>> > > > > state,
>> > > > > > >> and
>> > > > > > >> > > > > >> partition
>> > > > > > >> > > > > >> > > management (using the Kafka co-ordinator) but
>> it
>> > is
>> > > > NOT
>> > > > > > >> > > > responsible
>> > > > > > >> > > > > >> for
>> > > > > > >> > > > > >> > > packaging, configuration deployment or
>> execution
>> > of
>> > > > > > >> processes.
>> > > > > > >> > > The
>> > > > > > >> > > > > >> > problem
>> > > > > > >> > > > > >> > > of packaging and starting these processes is
>> > > > > > >> > > > > >> > > framework/environment-specific. This leaves
>> > > > individual
>> > > > > > >> > > frameworks
>> > > > > > >> > > > to
>> > > > > > >> > > > > >> be
>> > > > > > >> > > > > >> > as
>> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you can get
>> > > simple
>> > > > > > >> stateless
>> > > > > > >> > > > > >> support in
>> > > > > > >> > > > > >> > > YARN, Mesos, etc using their off-the-shelf app
>> > > > > framework
>> > > > > > >> > > (Slider,
>> > > > > > >> > > > > >> > Marathon,
>> > > > > > >> > > > > >> > > etc). These are well known by people and have
>> > nice
>> > > > UIs
>> > > > > > and a
>> > > > > > >> > lot
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > > flexibility. I don't think they have node
>> > affinity
>> > > > as a
>> > > > > > >> built
>> > > > > > >> > in
>> > > > > > >> > > > > >> option
>> > > > > > >> > > > > >> > > (though I could be wrong). So if we want that
>> we
>> > > can
>> > > > > > either
>> > > > > > >> > wait
>> > > > > > >> > > > for
>> > > > > > >> > > > > >> them
>> > > > > > >> > > > > >> > > to add it or do a custom framework to add that
>> > > > feature
>> > > > > > (as
>> > > > > > >> > now).
>> > > > > > >> > > > > >> > Obviously
>> > > > > > >> > > > > >> > > if you manage things with old-school ops tools
>> > > > > > >> > (puppet/chef/etc)
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> get
>> > > > > > >> > > > > >> > > locality easily. The nice thing, though, is
>> that
>> > > all
>> > > > > the
>> > > > > > >> samza
>> > > > > > >> > > > > >> "business
>> > > > > > >> > > > > >> > > logic" around partition management and fault
>> > > > tolerance
>> > > > > > is in
>> > > > > > >> > > Samza
>> > > > > > >> > > > > >> core
>> > > > > > >> > > > > >> > so
>> > > > > > >> > > > > >> > > it is shared across frameworks and the
>> framework
>> > > > > specific
>> > > > > > >> bit
>> > > > > > >> > is
>> > > > > > >> > > > > just
>> > > > > > >> > > > > >> > > whether it is smart enough to try to get the
>> same
>> > > > host
>> > > > > > when
>> > > > > > >> a
>> > > > > > >> > > job
>> > > > > > >> > > > is
>> > > > > > >> > > > > >> > > restarted.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > With respect to the Kafka-alignment, yeah I
>> think
>> > > the
>> > > > > > goal
>> > > > > > >> > would
>> > > > > > >> > > > be
>> > > > > > >> > > > > >> (a)
>> > > > > > >> > > > > >> > > actually get better alignment in user
>> experience,
>> > > and
>> > > > > (b)
>> > > > > > >> > > express
>> > > > > > >> > > > > >> this in
>> > > > > > >> > > > > >> > > the naming and project branding. Specifically:
>> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for the
>> > > > > > "transformation"
>> > > > > > >> api
>> > > > > > >> > > to
>> > > > > > >> > > > be
>> > > > > > >> > > > > >> > > discoverable in the main Kafka docs--i.e. be
>> able
>> > > to
>> > > > > > explain
>> > > > > > >> > > when
>> > > > > > >> > > > to
>> > > > > > >> > > > > >> use
>> > > > > > >> > > > > >> > > the consumer and when to use the stream
>> > processing
>> > > > > > >> > functionality
>> > > > > > >> > > > and
>> > > > > > >> > > > > >> lead
>> > > > > > >> > > > > >> > > people into that experience.
>> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza 1.4.2
>> (or
>> > > > > > whatever)
>> > > > > > >> > that
>> > > > > > >> > > > has
>> > > > > > >> > > > > >> both
>> > > > > > >> > > > > >> > > Kafka and the stream processing part and they
>> > > > actually
>> > > > > > work
>> > > > > > >> > > > > together.
>> > > > > > >> > > > > >> > > 3. Unify the programming experience so the
>> client
>> > > and
>> > > > > > Samza
>> > > > > > >> > api
>> > > > > > >> > > > > share
>> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > I think sub-projects keep separate committers
>> and
>> > > can
>> > > > > > have a
>> > > > > > >> > > > > separate
>> > > > > > >> > > > > >> > repo,
>> > > > > > >> > > > > >> > > but I'm actually not really sure (I can't
>> find a
>> > > > > > definition
>> > > > > > >> > of a
>> > > > > > >> > > > > >> > subproject
>> > > > > > >> > > > > >> > > in Apache).
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > Basically at a high-level you want the
>> experience
>> > > to
>> > > > > > "feel"
>> > > > > > >> > > like a
>> > > > > > >> > > > > >> single
>> > > > > > >> > > > > >> > > system, not to relatively independent things
>> that
>> > > are
>> > > > > > kind
>> > > > > > >> of
>> > > > > > >> > > > > >> awkwardly
>> > > > > > >> > > > > >> > > glued together.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > I think if we did that they having naming or
>> > > branding
>> > > > > > like
>> > > > > > >> > > "kafka
>> > > > > > >> > > > > >> > > streaming" or "kafka streams" or something
>> like
>> > > that
>> > > > > > would
>> > > > > > >> > > > actually
>> > > > > > >> > > > > >> do a
>> > > > > > >> > > > > >> > > good job of conveying what it is. I do that
>> this
>> > > > would
>> > > > > > help
>> > > > > > >> > > > adoption
>> > > > > > >> > > > > >> > quite
>> > > > > > >> > > > > >> > > a lot as it would correctly convey that using
>> > Kafka
>> > > > > > >> Streaming
>> > > > > > >> > > with
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > is
>> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka is
>> pretty
>> > > > > heavily
>> > > > > > >> > adopted
>> > > > > > >> > > > at
>> > > > > > >> > > > > >> this
>> > > > > > >> > > > > >> > > point.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > Fwiw we actually considered this model
>> originally
>> > > > when
>> > > > > > open
>> > > > > > >> > > > sourcing
>> > > > > > >> > > > > >> > Samza,
>> > > > > > >> > > > > >> > > however at that time Kafka was relatively
>> unknown
>> > > and
>> > > > > we
>> > > > > > >> > decided
>> > > > > > >> > > > not
>> > > > > > >> > > > > >> to
>> > > > > > >> > > > > >> > do
>> > > > > > >> > > > > >> > > it since we felt it would be limiting. From my
>> > > point
>> > > > of
>> > > > > > view
>> > > > > > >> > the
>> > > > > > >> > > > > three
>> > > > > > >> > > > > >> > > things have changed (1) Kafka is now really
>> > heavily
>> > > > > used
>> > > > > > for
>> > > > > > >> > > > stream
>> > > > > > >> > > > > >> > > processing, (2) we learned that abstracting
>> out
>> > the
>> > > > > > stream
>> > > > > > >> > well
>> > > > > > >> > > is
>> > > > > > >> > > > > >> > > basically impossible, (3) we learned it is
>> really
>> > > > hard
>> > > > > to
>> > > > > > >> keep
>> > > > > > >> > > the
>> > > > > > >> > > > > two
>> > > > > > >> > > > > >> > > things feeling like a single product.
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > -Jay
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin
>> Kleppmann
>> > <
>> > > > > > >> > > > > >> [email protected]>
>> > > > > > >> > > > > >> > > wrote:
>> > > > > > >> > > > > >> > >
>> > > > > > >> > > > > >> > >> Hi all,
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Lots of good thoughts here.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> I agree with the general philosophy of tying
>> > Samza
>> > > > > more
>> > > > > > >> > firmly
>> > > > > > >> > > to
>> > > > > > >> > > > > >> Kafka.
>> > > > > > >> > > > > >> > >> After I spent a while looking at integrating
>> > other
>> > > > > > message
>> > > > > > >> > > > brokers
>> > > > > > >> > > > > >> (e.g.
>> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to the
>> > > > conclusion
>> > > > > > that
>> > > > > > >> > > > > >> > SystemConsumer
>> > > > > > >> > > > > >> > >> tacitly assumes a model so much like Kafka's
>> > that
>> > > > > pretty
>> > > > > > >> much
>> > > > > > >> > > > > nobody
>> > > > > > >> > > > > >> but
>> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus is
>> > perhaps
>> > > an
>> > > > > > >> > exception,
>> > > > > > >> > > > but
>> > > > > > >> > > > > >> it
>> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.) Thus,
>> > > making
>> > > > > > Samza
>> > > > > > >> > > fully
>> > > > > > >> > > > > >> > dependent
>> > > > > > >> > > > > >> > >> on Kafka acknowledges that the
>> > system-independence
>> > > > was
>> > > > > > >> never
>> > > > > > >> > as
>> > > > > > >> > > > > real
>> > > > > > >> > > > > >> as
>> > > > > > >> > > > > >> > we
>> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of code
>> > reuse
>> > > > are
>> > > > > > >> real.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has
>> also
>> > > > always
>> > > > > > been
>> > > > > > >> > > > > >> appealing to
>> > > > > > >> > > > > >> > >> me, for various reasons already mentioned in
>> > this
>> > > > > > thread.
>> > > > > > >> > > > Although
>> > > > > > >> > > > > >> > making
>> > > > > > >> > > > > >> > >> Samza jobs deployable on anything
>> > > > (YARN/Mesos/AWS/etc)
>> > > > > > >> seems
>> > > > > > >> > > > > >> laudable,
>> > > > > > >> > > > > >> > I am
>> > > > > > >> > > > > >> > >> a little concerned that it will restrict us
>> to a
>> > > > > lowest
>> > > > > > >> > common
>> > > > > > >> > > > > >> > denominator.
>> > > > > > >> > > > > >> > >> For example, would host affinity (SAMZA-617)
>> > still
>> > > > be
>> > > > > > >> > possible?
>> > > > > > >> > > > For
>> > > > > > >> > > > > >> jobs
>> > > > > > >> > > > > >> > >> with large amounts of state, I think
>> SAMZA-617
>> > > would
>> > > > > be
>> > > > > > a
>> > > > > > >> big
>> > > > > > >> > > > boon,
>> > > > > > >> > > > > >> > since
>> > > > > > >> > > > > >> > >> restoring state off the changelog on every
>> > single
>> > > > > > restart
>> > > > > > >> is
>> > > > > > >> > > > > painful,
>> > > > > > >> > > > > >> > due
>> > > > > > >> > > > > >> > >> to long recovery times. It would be a shame
>> if
>> > the
>> > > > > > >> decoupling
>> > > > > > >> > > > from
>> > > > > > >> > > > > >> YARN
>> > > > > > >> > > > > >> > >> made host affinity impossible.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Jay, a question about the proposed API for
>> > > > > > instantiating a
>> > > > > > >> > job
>> > > > > > >> > > in
>> > > > > > >> > > > > >> code
>> > > > > > >> > > > > >> > >> (rather than a properties file): when
>> > submitting a
>> > > > job
>> > > > > > to a
>> > > > > > >> > > > > cluster,
>> > > > > > >> > > > > >> is
>> > > > > > >> > > > > >> > the
>> > > > > > >> > > > > >> > >> idea that the instantiation code runs on a
>> > client
>> > > > > > >> somewhere,
>> > > > > > >> > > > which
>> > > > > > >> > > > > >> then
>> > > > > > >> > > > > >> > >> pokes the necessary endpoints on
>> > > YARN/Mesos/AWS/etc?
>> > > > > Or
>> > > > > > >> does
>> > > > > > >> > > that
>> > > > > > >> > > > > >> code
>> > > > > > >> > > > > >> > run
>> > > > > > >> > > > > >> > >> on each container that is part of the job (in
>> > > which
>> > > > > > case,
>> > > > > > >> how
>> > > > > > >> > > > does
>> > > > > > >> > > > > >> the
>> > > > > > >> > > > > >> > job
>> > > > > > >> > > > > >> > >> submission to the cluster work)?
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel
>> right to
>> > > > make
>> > > > > a
>> > > > > > 1.0
>> > > > > > >> > > > release
>> > > > > > >> > > > > >> > with a
>> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete. So if
>> > this
>> > > > is
>> > > > > > going
>> > > > > > >> > to
>> > > > > > >> > > > > >> happen, I
>> > > > > > >> > > > > >> > >> think it would be more honest to stick with
>> 0.*
>> > > > > version
>> > > > > > >> > numbers
>> > > > > > >> > > > > until
>> > > > > > >> > > > > >> > the
>> > > > > > >> > > > > >> > >> library-ified Samza has been implemented, is
>> > > stable
>> > > > > and
>> > > > > > >> > widely
>> > > > > > >> > > > > used.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of
>> Kafka?
>> > > There
>> > > > > is
>> > > > > > >> > > precedent
>> > > > > > >> > > > > for
>> > > > > > >> > > > > >> > >> tight coupling between different Apache
>> projects
>> > > > (e.g.
>> > > > > > >> > Curator
>> > > > > > >> > > > and
>> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think
>> > > remaining
>> > > > > > >> separate
>> > > > > > >> > > > would
>> > > > > > >> > > > > >> be
>> > > > > > >> > > > > >> > ok.
>> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka,
>> there
>> > > is
>> > > > > > enough
>> > > > > > >> > > > > substance
>> > > > > > >> > > > > >> in
>> > > > > > >> > > > > >> > >> Samza that it warrants being a separate
>> project.
>> > > An
>> > > > > > >> argument
>> > > > > > >> > in
>> > > > > > >> > > > > >> favour
>> > > > > > >> > > > > >> > of
>> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a much
>> > > > stronger
>> > > > > > >> "brand
>> > > > > > >> > > > > >> presence"
>> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If
>> the
>> > > Kafka
>> > > > > > >> project
>> > > > > > >> > is
>> > > > > > >> > > > > >> willing
>> > > > > > >> > > > > >> > to
>> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of doing
>> > > > stateful
>> > > > > > >> stream
>> > > > > > >> > > > > >> > >> transformations, that would probably have
>> much
>> > the
>> > > > > same
>> > > > > > >> > effect
>> > > > > > >> > > as
>> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream
>> Processors"
>> > or
>> > > > > > suchlike.
>> > > > > > >> > > Close
>> > > > > > >> > > > > >> > >> collaboration between the two projects will
>> be
>> > > > needed
>> > > > > in
>> > > > > > >> any
>> > > > > > >> > > > case.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> From a project management perspective, I
>> guess
>> > the
>> > > > > "new
>> > > > > > >> > Samza"
>> > > > > > >> > > > > would
>> > > > > > >> > > > > >> > have
>> > > > > > >> > > > > >> > >> to be developed on a branch alongside ongoing
>> > > > > > maintenance
>> > > > > > >> of
>> > > > > > >> > > the
>> > > > > > >> > > > > >> current
>> > > > > > >> > > > > >> > >> line of development? I think it would be
>> > important
>> > > > to
>> > > > > > >> > continue
>> > > > > > >> > > > > >> > supporting
>> > > > > > >> > > > > >> > >> existing users, and provide a graceful
>> migration
>> > > > path
>> > > > > to
>> > > > > > >> the
>> > > > > > >> > > new
>> > > > > > >> > > > > >> > version.
>> > > > > > >> > > > > >> > >> Leaving the current versions unsupported and
>> > > forcing
>> > > > > > people
>> > > > > > >> > to
>> > > > > > >> > > > > >> rewrite
>> > > > > > >> > > > > >> > >> their jobs would send a bad signal.
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> Best,
>> > > > > > >> > > > > >> > >> Martin
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <
>> > > > [email protected]>
>> > > > > > >> wrote:
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >>> Hey Garry,
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy
>> to
>> > > chat
>> > > > > > more
>> > > > > > >> > about
>> > > > > > >> > > > > this
>> > > > > > >> > > > > >> if
>> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I
>> > started
>> > > > with
>> > > > > > the
>> > > > > > >> > idea
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> "what
>> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass
>> > ingestion
>> > > > > tool"
>> > > > > > but
>> > > > > > >> > > > > >> ultimately
>> > > > > > >> > > > > >> > we
>> > > > > > >> > > > > >> > >>> kind of came around to the idea that
>> ingestion
>> > > and
>> > > > > > >> > > > transformation
>> > > > > > >> > > > > >> had
>> > > > > > >> > > > > >> > >>> pretty different needs and coupling the two
>> > made
>> > > > > things
>> > > > > > >> > hard.
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> For what it's worth I think copycat (KIP-26)
>> > > > actually
>> > > > > > will
>> > > > > > >> > do
>> > > > > > >> > > > what
>> > > > > > >> > > > > >> you
>> > > > > > >> > > > > >> > >> are
>> > > > > > >> > > > > >> > >>> looking for.
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> With regard to your point about slider, I
>> don't
>> > > > > > >> necessarily
>> > > > > > >> > > > > >> disagree.
>> > > > > > >> > > > > >> > >> But I
>> > > > > > >> > > > > >> > >>> think getting good YARN support is quite
>> doable
>> > > > and I
>> > > > > > >> think
>> > > > > > >> > we
>> > > > > > >> > > > can
>> > > > > > >> > > > > >> make
>> > > > > > >> > > > > >> > >>> that work well. I think the issue this
>> proposal
>> > > > > solves
>> > > > > > is
>> > > > > > >> > that
>> > > > > > >> > > > > >> > >> technically
>> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple
>> cluster
>> > > > > > management
>> > > > > > >> > > systems
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > way
>> > > > > > >> > > > > >> > >>> things are now, you need to write an "app
>> > master"
>> > > > or
>> > > > > > >> > > "framework"
>> > > > > > >> > > > > for
>> > > > > > >> > > > > >> > each
>> > > > > > >> > > > > >> > >>> and they are all a little different so
>> testing
>> > is
>> > > > > > really
>> > > > > > >> > hard.
>> > > > > > >> > > > In
>> > > > > > >> > > > > >> the
>> > > > > > >> > > > > >> > >>> absence of this we have been stuck with just
>> > YARN
>> > > > > which
>> > > > > > >> has
>> > > > > > >> > > > > >> fantastic
>> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the org,
>> but
>> > > > zero
>> > > > > > >> > > penetration
>> > > > > > >> > > > > >> > >> elsewhere.
>> > > > > > >> > > > > >> > >>> Given the huge amount of work being put in
>> to
>> > > > slider,
>> > > > > > >> > > marathon,
>> > > > > > >> > > > > aws
>> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen related
>> > > > packaging
>> > > > > > >> > > > technologies
>> > > > > > >> > > > > >> > people
>> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various
>> > > > > cloud-specific
>> > > > > > >> > deploy
>> > > > > > >> > > > > >> tools,
>> > > > > > >> > > > > >> > >> etc)
>> > > > > > >> > > > > >> > >>> I really think it is important to get this
>> > right.
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> -Jay
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry
>> > Turkington
>> > > <
>> > > > > > >> > > > > >> > >>> [email protected]> wrote:
>> > > > > > >> > > > > >> > >>>
>> > > > > > >> > > > > >> > >>>> Hi all,
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> I think the question below re does Samza
>> > become
>> > > a
>> > > > > > >> > sub-project
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > >>>> highlights the broader point around
>> migration.
>> > > > Chris
>> > > > > > >> > mentions
>> > > > > > >> > > > > >> Samza's
>> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1 release
>> but
>> > I'm
>> > > > not
>> > > > > > sure
>> > > > > > >> > it
>> > > > > > >> > > > > feels
>> > > > > > >> > > > > >> > >> right to
>> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to
>> deprecate
>> > > > most
>> > > > > of
>> > > > > > >> it.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some guys
>> > who
>> > > > have
>> > > > > > >> > started
>> > > > > > >> > > > > >> working
>> > > > > > >> > > > > >> > >> with
>> > > > > > >> > > > > >> > >>>> Samza and building some new
>> > consumers/producers
>> > > > was
>> > > > > > next
>> > > > > > >> > up.
>> > > > > > >> > > > > Sounds
>> > > > > > >> > > > > >> > like
>> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to
>> go. I
>> > > need
>> > > > > to
>> > > > > > >> look
>> > > > > > >> > > into
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > KIP
>> > > > > > >> > > > > >> > >> in
>> > > > > > >> > > > > >> > >>>> more detail but for me the attractiveness
>> of
>> > > > adding
>> > > > > > new
>> > > > > > >> > Samza
>> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all they
>> > were
>> > > > > doing
>> > > > > > was
>> > > > > > >> > > > really
>> > > > > > >> > > > > >> > getting
>> > > > > > >> > > > > >> > >>>> data into and out of Kafka --  was to avoid
>> > > > having
>> > > > > to
>> > > > > > >> > worry
>> > > > > > >> > > > > about
>> > > > > > >> > > > > >> the
>> > > > > > >> > > > > >> > >>>> lifecycle management of external clients.
>> If
>> > > there
>> > > > > is
>> > > > > > a
>> > > > > > >> > > generic
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a new
>> > > > connector
>> > > > > > into
>> > > > > > >> > and
>> > > > > > >> > > > > have
>> > > > > > >> > > > > >> a
>> > > > > > >> > > > > >> > >> lot of
>> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and reliability
>> > done
>> > > > for
>> > > > > me
>> > > > > > >> then
>> > > > > > >> > > it
>> > > > > > >> > > > > >> gives
>> > > > > > >> > > > > >> > me
>> > > > > > >> > > > > >> > >> all
>> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers would.
>> If
>> > > not
>> > > > > > then it
>> > > > > > >> > > > > >> complicates
>> > > > > > >> > > > > >> > my
>> > > > > > >> > > > > >> > >>>> operational deployments.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Which is similar to my other question with
>> the
>> > > > > > proposal
>> > > > > > >> --
>> > > > > > >> > if
>> > > > > > >> > > > we
>> > > > > > >> > > > > >> > build a
>> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus the
>> > > > requisite
>> > > > > > >> shims
>> > > > > > >> > to
>> > > > > > >> > > > > >> > integrate
>> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former may
>> be a
>> > > lot
>> > > > > more
>> > > > > > >> work
>> > > > > > >> > > > than
>> > > > > > >> > > > > we
>> > > > > > >> > > > > >> > >> think.
>> > > > > > >> > > > > >> > >>>> We may make it much easier for a newcomer
>> to
>> > get
>> > > > > > >> something
>> > > > > > >> > > > > running
>> > > > > > >> > > > > >> but
>> > > > > > >> > > > > >> > >>>> having them step up and get a reliable
>> > > production
>> > > > > > >> > deployment
>> > > > > > >> > > > may
>> > > > > > >> > > > > >> still
>> > > > > > >> > > > > >> > >>>> dominate mailing list  traffic, if for
>> > different
>> > > > > > reasons
>> > > > > > >> > than
>> > > > > > >> > > > > >> today.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with
>> > > making
>> > > > > the
>> > > > > > >> Samza
>> > > > > > >> > > > > >> dependency
>> > > > > > >> > > > > >> > >> on
>> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely
>> see
>> > > the
>> > > > > > >> benefits
>> > > > > > >> > > in
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing
>> > > > > > >> > > > terminologies/abstractions
>> > > > > > >> > > > > >> that
>> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library
>> would
>> > > > likely
>> > > > > > be a
>> > > > > > >> > very
>> > > > > > >> > > > > nice
>> > > > > > >> > > > > >> > tool
>> > > > > > >> > > > > >> > >> to
>> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the
>> > > > concerns
>> > > > > > >> above
>> > > > > > >> > re
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > >>>> operational side.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Garry
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> -----Original Message-----
>> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales
>> [mailto:
>> > > > > > >> > [email protected]
>> > > > > > >> > > ]
>> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
>> > > > > > >> > > > > >> > >>>> To: [email protected]
>> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on
>> > Samza
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Very interesting thoughts.
>> > > > > > >> > > > > >> > >>>> From outside, I have always perceived Samza
>> > as a
>> > > > > > >> computing
>> > > > > > >> > > > layer
>> > > > > > >> > > > > >> over
>> > > > > > >> > > > > >> > >>>> Kafka.
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is
>> > > "should
>> > > > > > Samza
>> > > > > > >> be
>> > > > > > >> > a
>> > > > > > >> > > > > >> > sub-project
>> > > > > > >> > > > > >> > >>>> of Kafka then?"
>> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a
>> separate
>> > > > > project
>> > > > > > >> > with a
>> > > > > > >> > > > > >> separate
>> > > > > > >> > > > > >> > >>>> governance?
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> Cheers,
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> --
>> > > > > > >> > > > > >> > >>>> Gianmarco
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <
>> > > > > > [email protected]>
>> > > > > > >> > > > wrote:
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more
>> > > > tightly.
>> > > > > > >> > Because
>> > > > > > >> > > > > Samza
>> > > > > > >> > > > > >> de
>> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should
>> > leverage
>> > > > > what
>> > > > > > >> Kafka
>> > > > > > >> > > > has.
>> > > > > > >> > > > > At
>> > > > > > >> > > > > >> > the
>> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to reinvent
>> > what
>> > > > > Samza
>> > > > > > >> > > already
>> > > > > > >> > > > > >> has. I
>> > > > > > >> > > > > >> > >>>>> also like the idea of separating the
>> > ingestion
>> > > > and
>> > > > > > >> > > > > transformation.
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to
>> image
>> > > how
>> > > > > the
>> > > > > > >> Samza
>> > > > > > >> > > > will
>> > > > > > >> > > > > >> look
>> > > > > > >> > > > > >> > >>>> like.
>> > > > > > >> > > > > >> > >>>>> And I feel Chris and Jay have a little
>> > > difference
>> > > > > in
>> > > > > > >> terms
>> > > > > > >> > > of
>> > > > > > >> > > > > how
>> > > > > > >> > > > > >> > >>>>> Samza should look like.
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code
>> shows
>> > (A
>> > > > > > client of
>> > > > > > >> > > > Kakfa)
>> > > > > > >> > > > > ?
>> > > > > > >> > > > > >> And
>> > > > > > >> > > > > >> > >>>>> user's application code calls this client?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka
>> > (like
>> > > > > what
>> > > > > > the
>> > > > > > >> > > code
>> > > > > > >> > > > > >> shows),
>> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and
>> > > > > fault-tolerance?
>> > > > > > >> Are
>> > > > > > >> > > they
>> > > > > > >> > > > > >> taken
>> > > > > > >> > > > > >> > >>>>> care by the Kafka broker or other
>> mechanism,
>> > > such
>> > > > > as
>> > > > > > >> > "Samza
>> > > > > > >> > > > > >> worker"
>> > > > > > >> > > > > >> > >>>>> (just make up the name) ?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as
>> > > > auto-scaling,
>> > > > > > >> shared
>> > > > > > >> > > > > state,
>> > > > > > >> > > > > >> > >>>>> monitoring?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is this
>> > what
>> > > > > Chris
>> > > > > > >> > > > suggests?)
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from Kakfa
>> > and
>> > > > > > produce
>> > > > > > >> to
>> > > > > > >> > > it.
>> > > > > > >> > > > > >> Then it
>> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks like
>> > now,
>> > > > > > except it
>> > > > > > >> > > does
>> > > > > > >> > > > > not
>> > > > > > >> > > > > >> > rely
>> > > > > > >> > > > > >> > >>>>> on Yarn anymore.
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it
>> leverage
>> > > > Kafka's
>> > > > > > >> > metrics,
>> > > > > > >> > > > > logs,
>> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> Thanks,
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> Fang, Yan
>> > > > > > >> > > > > >> > >>>>> [email protected]
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang
>> > Wang <
>> > > > > > >> > > > > [email protected]
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> > >>>> wrote:
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>>> Read through the code example and it
>> looks
>> > > good
>> > > > to
>> > > > > > me.
>> > > > > > >> A
>> > > > > > >> > > few
>> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment:
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable
>> runnable
>> > > like:
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh
>> > > --config-factory=...
>> > > > > > >> > > > > >> > >>>> --config-path=file://...
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> And this proposal advocate for deploying
>> > Samza
>> > > > > more
>> > > > > > as
>> > > > > > >> > > > embedded
>> > > > > > >> > > > > >> > >>>>>> libraries in user application code
>> (ignoring
>> > > the
>> > > > > > >> > > terminology
>> > > > > > >> > > > > >> since
>> > > > > > >> > > > > >> > >>>>>> it is not the
>> > > > > > >> > > > > >> > >>>>> same
>> > > > > > >> > > > > >> > >>>>>> as the prototype code):
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> StreamTask task = new
>> MyStreamTask(configs);
>> > > > > Thread
>> > > > > > >> > thread
>> > > > > > >> > > =
>> > > > > > >> > > > > new
>> > > > > > >> > > > > >> > >>>>>> Thread(task); thread.start();
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> I think both of these deployment modes
>> are
>> > > > > important
>> > > > > > >> for
>> > > > > > >> > > > > >> different
>> > > > > > >> > > > > >> > >>>>>> types
>> > > > > > >> > > > > >> > >>>>> of
>> > > > > > >> > > > > >> > >>>>>> users. That said, I think making Samza
>> > purely
>> > > > > > >> standalone
>> > > > > > >> > is
>> > > > > > >> > > > > still
>> > > > > > >> > > > > >> > >>>>>> sufficient for either runnable or library
>> > > modes.
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> Guozhang
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay
>> Kreps
>> > <
>> > > > > > >> > > > [email protected]>
>> > > > > > >> > > > > >> > wrote:
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code
>> example,
>> > it
>> > > > was
>> > > > > > >> > supposed
>> > > > > > >> > > > to
>> > > > > > >> > > > > >> look
>> > > > > > >> > > > > >> > >>>>>>> like
>> > > > > > >> > > > > >> > >>>>>>> this:
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties();
>> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers",
>> > > > "localhost:4242");
>> > > > > > >> > > > > >> StreamingConfig
>> > > > > > >> > > > > >> > >>>>>>> config = new StreamingConfig(props);
>> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
>> > > > "test-topic-2");
>> > > > > > >> > > > > >> > >>>>>>>
>> > > config.processor(ExampleStreamProcessor.class);
>> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
>> > StringSerializer(),
>> > > > new
>> > > > > > >> > > > > >> > >>>>>>> StringDeserializer()); KafkaStreaming
>> > > > container =
>> > > > > > new
>> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming(config); container.run();
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>> -Jay
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay
>> > Kreps <
>> > > > > > >> > > > [email protected]
>> > > > > > >> > > > > >
>> > > > > > >> > > > > >> > >>>> wrote:
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Hey guys,
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations
>> Chris
>> > > and
>> > > > I
>> > > > > > were
>> > > > > > >> > > having
>> > > > > > >> > > > > >> > >>>>>>>> around
>> > > > > > >> > > > > >> > >>>>>>> whether
>> > > > > > >> > > > > >> > >>>>>>>> it would make sense to use Samza as a
>> kind
>> > > of
>> > > > > data
>> > > > > > >> > > > ingestion
>> > > > > > >> > > > > >> > >>>>> framework
>> > > > > > >> > > > > >> > >>>>>>> for
>> > > > > > >> > > > > >> > >>>>>>>> Kafka (which ultimately lead to KIP-26
>> > > > > "copycat").
>> > > > > > >> This
>> > > > > > >> > > > kind
>> > > > > > >> > > > > of
>> > > > > > >> > > > > >> > >>>>>> combined
>> > > > > > >> > > > > >> > >>>>>>>> with complaints around config and YARN
>> and
>> > > the
>> > > > > > >> > discussion
>> > > > > > >> > > > > >> around
>> > > > > > >> > > > > >> > >>>>>>>> how
>> > > > > > >> > > > > >> > >>>>> to
>> > > > > > >> > > > > >> > >>>>>>>> best do a standalone mode.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given
>> that
>> > > > Samza
>> > > > > > was
>> > > > > > >> > > > basically
>> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what if
>> > you
>> > > > just
>> > > > > > >> > embraced
>> > > > > > >> > > > > that
>> > > > > > >> > > > > >> > >>>>>>>> and turned it
>> > > > > > >> > > > > >> > >>>>>> into
>> > > > > > >> > > > > >> > >>>>>>>> something less like a heavyweight
>> > framework
>> > > > and
>> > > > > > more
>> > > > > > >> > > like a
>> > > > > > >> > > > > >> > >>>>>>>> third
>> > > > > > >> > > > > >> > >>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>> client--a kind of "producing consumer"
>> > with
>> > > > > state
>> > > > > > >> > > > management
>> > > > > > >> > > > > >> > >>>>>> facilities.
>> > > > > > >> > > > > >> > >>>>>>>> Basically a library. Instead of a
>> complex
>> > > > stream
>> > > > > > >> > > processing
>> > > > > > >> > > > > >> > >>>>>>>> framework
>> > > > > > >> > > > > >> > >>>>>>> this
>> > > > > > >> > > > > >> > >>>>>>>> would actually be a very simple thing,
>> not
>> > > > much
>> > > > > > more
>> > > > > > >> > > > > >> complicated
>> > > > > > >> > > > > >> > >>>>>>>> to
>> > > > > > >> > > > > >> > >>>>> use
>> > > > > > >> > > > > >> > >>>>>>> or
>> > > > > > >> > > > > >> > >>>>>>>> operate than a Kafka consumer. As Chris
>> > said
>> > > > we
>> > > > > > >> thought
>> > > > > > >> > > > about
>> > > > > > >> > > > > >> it
>> > > > > > >> > > > > >> > >>>>>>>> a
>> > > > > > >> > > > > >> > >>>>> lot
>> > > > > > >> > > > > >> > >>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>> what Samza (and the other stream
>> > processing
>> > > > > > systems
>> > > > > > >> > were
>> > > > > > >> > > > > doing)
>> > > > > > >> > > > > >> > >>>>> seemed
>> > > > > > >> > > > > >> > >>>>>>> like
>> > > > > > >> > > > > >> > >>>>>>>> kind of a hangover from MapReduce.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output
>> data
>> > to
>> > > > and
>> > > > > > from
>> > > > > > >> > the
>> > > > > > >> > > > > stream
>> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually looked
>> > into
>> > > > how
>> > > > > > that
>> > > > > > >> > > would
>> > > > > > >> > > > > >> > >>>>>>>> work,
>> > > > > > >> > > > > >> > >>>>> Samza
>> > > > > > >> > > > > >> > >>>>>>>> isn't really an ideal data ingestion
>> > > framework
>> > > > > > for a
>> > > > > > >> > > bunch
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > >>>>> reasons.
>> > > > > > >> > > > > >> > >>>>>> To
>> > > > > > >> > > > > >> > >>>>>>>> really do that right you need a pretty
>> > > > different
>> > > > > > >> > internal
>> > > > > > >> > > > > data
>> > > > > > >> > > > > >> > >>>>>>>> model
>> > > > > > >> > > > > >> > >>>>>> and
>> > > > > > >> > > > > >> > >>>>>>>> set of apis. So what if you split them
>> and
>> > > had
>> > > > > an
>> > > > > > api
>> > > > > > >> > for
>> > > > > > >> > > > > Kafka
>> > > > > > >> > > > > >> > >>>>>>>> ingress/egress (copycat AKA KIP-26)
>> and a
>> > > > > separate
>> > > > > > >> api
>> > > > > > >> > > for
>> > > > > > >> > > > > >> Kafka
>> > > > > > >> > > > > >> > >>>>>>>> transformation (Samza).
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> This would also allow really embracing
>> the
>> > > > same
>> > > > > > >> > > terminology
>> > > > > > >> > > > > and
>> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about the
>> > current
>> > > > > > state is
>> > > > > > >> > > that
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>>>>>> two
>> > > > > > >> > > > > >> > >>>>>>> systems
>> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on. Terminology
>> like
>> > > > > "stream"
>> > > > > > vs
>> > > > > > >> > > > "topic"
>> > > > > > >> > > > > >> and
>> > > > > > >> > > > > >> > >>>>>>> different
>> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems means you
>> > kind
>> > > > of
>> > > > > > have
>> > > > > > >> to
>> > > > > > >> > > > learn
>> > > > > > >> > > > > >> > >>>>>>>> Kafka's
>> > > > > > >> > > > > >> > >>>>>>> way,
>> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly different
>> way,
>> > > > then
>> > > > > > kind
>> > > > > > >> of
>> > > > > > >> > > > > >> > >>>>>>>> understand
>> > > > > > >> > > > > >> > >>>>> how
>> > > > > > >> > > > > >> > >>>>>>> they
>> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having walked
>> a
>> > few
>> > > > > > people
>> > > > > > >> > > through
>> > > > > > >> > > > > >> this
>> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks to
>> get.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot of
>> time
>> > on
>> > > > > > >> airplanes I
>> > > > > > >> > > > > hacked
>> > > > > > >> > > > > >> > >>>>>>>> up an ernest but still somewhat
>> incomplete
>> > > > > > prototype
>> > > > > > >> of
>> > > > > > >> > > > what
>> > > > > > >> > > > > >> > >>>>>>>> this would
>> > > > > > >> > > > > >> > >>>>> look
>> > > > > > >> > > > > >> > >>>>>>>> like. This is just unceremoniously
>> dumped
>> > > into
>> > > > > > Kafka
>> > > > > > >> as
>> > > > > > >> > > it
>> > > > > > >> > > > > >> > >>>>>>>> required a
>> > > > > > >> > > > > >> > >>>>>> few
>> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here is
>> the
>> > > code:
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > >
>> > > > > > >> >
>> > > > > >
>> > >
>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org
>> > > > > > >> > > > > >> > >>>>> /apache/kafka/clients/streaming
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype I just
>> > > > > liberally
>> > > > > > >> > renamed
>> > > > > > >> > > > > >> > >>>>>>>> everything
>> > > > > > >> > > > > >> > >>>>> to
>> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with no
>> regard
>> > > for
>> > > > > > >> > > > compatibility.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> To use this would be something like
>> this:
>> > > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
>> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers",
>> > > > > "localhost:4242");
>> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new
>> > > > > > >> > > > > >> > >>>>> StreamingConfig(props);
>> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1",
>> > > > > > >> > > > > >> > >>>>>>>> "test-topic-2");
>> > > > > > >> > > > > >> config.processor(ExampleStreamProcessor.class);
>> > > > > > >> > > > > >> > >>>>>>> config.serialization(new
>> > > > > > >> > > > > >> > >>>>>>>> StringSerializer(), new
>> > > StringDeserializer());
>> > > > > > >> > > > KafkaStreaming
>> > > > > > >> > > > > >> > >>>>>> container =
>> > > > > > >> > > > > >> > >>>>>>>> new KafkaStreaming(config);
>> > container.run();
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
>> > > > SamzaContainer;
>> > > > > > >> > > > > StreamProcessor
>> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the class
>> names
>> > > in
>> > > > a
>> > > > > > file
>> > > > > > >> > and
>> > > > > > >> > > > then
>> > > > > > >> > > > > >> > >>>>>>>> having
>> > > > > > >> > > > > >> > >>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you just
>> > > > > instantiate
>> > > > > > the
>> > > > > > >> > > > > container
>> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is balanced over
>> > > > however
>> > > > > > many
>> > > > > > >> > > > > instances
>> > > > > > >> > > > > >> > >>>>>>>> of
>> > > > > > >> > > > > >> > >>>>> this
>> > > > > > >> > > > > >> > >>>>>>> are
>> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an instance
>> > dies,
>> > > > new
>> > > > > > >> tasks
>> > > > > > >> > > are
>> > > > > > >> > > > > >> added
>> > > > > > >> > > > > >> > >>>>>>>> to
>> > > > > > >> > > > > >> > >>>>> the
>> > > > > > >> > > > > >> > >>>>>>>> existing containers without shutting
>> them
>> > > > down).
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for running
>> > this
>> > > > > stuff
>> > > > > > in
>> > > > > > >> > YARN
>> > > > > > >> > > > via
>> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and AWS
>> using
>> > > some
>> > > > > of
>> > > > > > >> their
>> > > > > > >> > > > tools
>> > > > > > >> > > > > >> > >>>>>>>> but from the
>> > > > > > >> > > > > >> > >>>>>> point
>> > > > > > >> > > > > >> > >>>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these stream
>> > > > processing
>> > > > > > jobs
>> > > > > > >> > are
>> > > > > > >> > > > > just
>> > > > > > >> > > > > >> > >>>>>> stateless
>> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and
>> expand
>> > and
>> > > > > > contract
>> > > > > > >> > at
>> > > > > > >> > > > > will.
>> > > > > > >> > > > > >> > >>>>>>>> There
>> > > > > > >> > > > > >> > >>>>> is
>> > > > > > >> > > > > >> > >>>>>>> no
>> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code, it
>> > would
>> > > > get
>> > > > > > >> larger
>> > > > > > >> > > if
>> > > > > > >> > > > we
>> > > > > > >> > > > > >> > >>>>>>>>  productionized but not vastly larger.
>> We
>> > > > really
>> > > > > > do
>> > > > > > >> > get a
>> > > > > > >> > > > ton
>> > > > > > >> > > > > >> > >>>>>>>> of
>> > > > > > >> > > > > >> > >>>>>>> leverage
>> > > > > > >> > > > > >> > >>>>>>>>  out of Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>  2. Partition management is fully
>> > delegated
>> > > to
>> > > > > the
>> > > > > > >> new
>> > > > > > >> > > > > >> consumer.
>> > > > > > >> > > > > >> > >>>>> This
>> > > > > > >> > > > > >> > >>>>>>>>  is nice since now any partition
>> > management
>> > > > > > strategy
>> > > > > > >> > > > > available
>> > > > > > >> > > > > >> > >>>>>>>> to
>> > > > > > >> > > > > >> > >>>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>>  consumer is also available to Samza
>> (and
>> > > vice
>> > > > > > versa)
>> > > > > > >> > and
>> > > > > > >> > > > > with
>> > > > > > >> > > > > >> > >>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>> exact
>> > > > > > >> > > > > >> > >>>>>>>>  same configs.
>> > > > > > >> > > > > >> > >>>>>>>>  3. It supports state as well as state
>> > reuse
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it is
>> > thought
>> > > > > > >> provoking.
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> -Jay
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM, Chris
>> > > > > Riccomini <
>> > > > > > >> > > > > >> > >>>>>> [email protected]>
>> > > > > > >> > > > > >> > >>>>>>>> wrote:
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Hey all,
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with Samza
>> > > > > engineers
>> > > > > > at
>> > > > > > >> > > > LinkedIn
>> > > > > > >> > > > > >> > >>>>>>>>> and
>> > > > > > >> > > > > >> > >>>>>>> Confluent
>> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few observations
>> > and
>> > > > > would
>> > > > > > >> like
>> > > > > > >> > to
>> > > > > > >> > > > > >> > >>>>>>>>> propose
>> > > > > > >> > > > > >> > >>>>> some
>> > > > > > >> > > > > >> > >>>>>>>>> changes.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that I
>> want to
>> > > > call
>> > > > > > out
>> > > > > > >> > about
>> > > > > > >> > > > > >> > >>>>>>>>> Samza's
>> > > > > > >> > > > > >> > >>>>>> design,
>> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some changes.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic
>> > > > deployment
>> > > > > > >> system.
>> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
>> > > > > > >> > > > > >> > >>>>>>>>> * Samza's
>> SystemConsumer/SystemProducer
>> > and
>> > > > > > Kafka's
>> > > > > > >> > > > consumer
>> > > > > > >> > > > > >> > >>>>>>>>> APIs
>> > > > > > >> > > > > >> > >>>>> are
>> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same
>> > problems.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are related,
>> > but
>> > > > I'll
>> > > > > > >> > address
>> > > > > > >> > > > them
>> > > > > > >> > > > > >> in
>> > > > > > >> > > > > >> > >>>>> order.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Deployment
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the use of a
>> > > > dynamic
>> > > > > > >> > > deployment
>> > > > > > >> > > > > >> > >>>>>>>>> scheduler
>> > > > > > >> > > > > >> > >>>>>> such
>> > > > > > >> > > > > >> > >>>>>>>>> as
>> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we initially
>> built
>> > > > > Samza,
>> > > > > > we
>> > > > > > >> > bet
>> > > > > > >> > > > that
>> > > > > > >> > > > > >> > >>>>>>>>> there
>> > > > > > >> > > > > >> > >>>>>> would
>> > > > > > >> > > > > >> > >>>>>>>>> be
>> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area, and
>> we
>> > > could
>> > > > > > >> support
>> > > > > > >> > > > them,
>> > > > > > >> > > > > >> and
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>> rest
>> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there are
>> many
>> > > > > > >> variations.
>> > > > > > >> > > > > >> > >>>>>>>>> Furthermore,
>> > > > > > >> > > > > >> > >>>>>> many
>> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just start
>> their
>> > > > > > processors
>> > > > > > >> > like
>> > > > > > >> > > > > normal
>> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use traditional
>> > > > deployment
>> > > > > > >> scripts
>> > > > > > >> > > > such
>> > > > > > >> > > > > as
>> > > > > > >> > > > > >> > >>>>>>>>> Fabric,
>> > > > > > >> > > > > >> > >>>>>> Chef,
>> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a deployment
>> system
>> > > on
>> > > > > > users
>> > > > > > >> > makes
>> > > > > > >> > > > the
>> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really painful
>> for
>> > > > first
>> > > > > > time
>> > > > > > >> > > > users.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a requirement
>> was
>> > > also
>> > > > a
>> > > > > > bit
>> > > > > > >> of
>> > > > > > >> > a
>> > > > > > >> > > > > >> > >>>>>>>>> mis-fire
>> > > > > > >> > > > > >> > >>>>>> because
>> > > > > > >> > > > > >> > >>>>>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding between
>> > the
>> > > > > > nature of
>> > > > > > >> > > batch
>> > > > > > >> > > > > >> jobs
>> > > > > > >> > > > > >> > >>>>>>>>> and
>> > > > > > >> > > > > >> > >>>>>>> stream
>> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we made
>> > > conscious
>> > > > > > effort
>> > > > > > >> to
>> > > > > > >> > > > favor
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>> Hadoop
>> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing things,
>> since
>> > it
>> > > > > worked
>> > > > > > >> and
>> > > > > > >> > > was
>> > > > > > >> > > > > well
>> > > > > > >> > > > > >> > >>>>>>> understood.
>> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was that
>> batch
>> > > jobs
>> > > > > > have a
>> > > > > > >> > > > definite
>> > > > > > >> > > > > >> > >>>>>> beginning,
>> > > > > > >> > > > > >> > >>>>>>>>> and
>> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs don't
>> > > > > (usually).
>> > > > > > >> This
>> > > > > > >> > > > leads
>> > > > > > >> > > > > to
>> > > > > > >> > > > > >> > >>>>>>>>> a
>> > > > > > >> > > > > >> > >>>>> much
>> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for stream
>> > > > > processors.
>> > > > > > >> You
>> > > > > > >> > > > > >> basically
>> > > > > > >> > > > > >> > >>>>>>>>> just
>> > > > > > >> > > > > >> > >>>>>>> need
>> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the
>> processor,
>> > and
>> > > > > start
>> > > > > > >> it.
>> > > > > > >> > > The
>> > > > > > >> > > > > way
>> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn, there's no
>> > > concept
>> > > > > of
>> > > > > > a
>> > > > > > >> > > cluster
>> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always
>> > > > > > >> > > > > >> > >>>>>> add
>> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with
>> coupling
>> > > > Samza
>> > > > > > with
>> > > > > > >> a
>> > > > > > >> > > > > >> scheduler
>> > > > > > >> > > > > >> > >>>>>>>>> is
>> > > > > > >> > > > > >> > >>>>>> that
>> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has to
>> handle
>> > > > > > deployment.
>> > > > > > >> > > This
>> > > > > > >> > > > > >> pulls
>> > > > > > >> > > > > >> > >>>>>>>>> in a
>> > > > > > >> > > > > >> > >>>>>>> bunch
>> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration
>> > > distribution
>> > > > > > (config
>> > > > > > >> > > > > stream),
>> > > > > > >> > > > > >> > >>>>>>>>> shell
>> > > > > > >> > > > > >> > >>>>>>> scrips
>> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner), packaging
>> > (all
>> > > > the
>> > > > > > .tgz
>> > > > > > >> > > > stuff),
>> > > > > > >> > > > > >> etc.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring dynamic
>> > > > deployment
>> > > > > > was
>> > > > > > >> to
>> > > > > > >> > > > > support
>> > > > > > >> > > > > >> > >>>>>>>>> data locality. If you want to have
>> > > locality,
>> > > > > you
>> > > > > > >> need
>> > > > > > >> > to
>> > > > > > >> > > > put
>> > > > > > >> > > > > >> > >>>>>>>>> your
>> > > > > > >> > > > > >> > >>>>>> processors
>> > > > > > >> > > > > >> > >>>>>>>>> close to the data they're processing.
>> > Upon
>> > > > > > further
>> > > > > > >> > > > > >> > >>>>>>>>> investigation,
>> > > > > > >> > > > > >> > >>>>>>> though,
>> > > > > > >> > > > > >> > >>>>>>>>> this feature is not that beneficial.
>> > There
>> > > is
>> > > > > > some
>> > > > > > >> > good
>> > > > > > >> > > > > >> > >>>>>>>>> discussion
>> > > > > > >> > > > > >> > >>>>>> about
>> > > > > > >> > > > > >> > >>>>>>>>> some problems with it on SAMZA-335.
>> > Again,
>> > > we
>> > > > > > took
>> > > > > > >> the
>> > > > > > >> > > > > >> > >>>>>>>>> Map/Reduce
>> > > > > > >> > > > > >> > >>>>>> path,
>> > > > > > >> > > > > >> > >>>>>>>>> but
>> > > > > > >> > > > > >> > >>>>>>>>> there are some fundamental differences
>> > > > between
>> > > > > > HDFS
>> > > > > > >> > and
>> > > > > > >> > > > > Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>> HDFS
>> > > > > > >> > > > > >> > >>>>>> has
>> > > > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has partitions.
>> This
>> > > > leads
>> > > > > to
>> > > > > > >> less
>> > > > > > >> > > > > >> > >>>>>>>>> optimization potential with stream
>> > > processors
>> > > > > on
>> > > > > > top
>> > > > > > >> > of
>> > > > > > >> > > > > Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a crutch.
>> > > Samza
>> > > > > > doesn't
>> > > > > > >> > > have
>> > > > > > >> > > > > any
>> > > > > > >> > > > > >> > >>>>>>>>> built
>> > > > > > >> > > > > >> > >>>>> in
>> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead, it
>> > depends
>> > > on
>> > > > > the
>> > > > > > >> > > dynamic
>> > > > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to handle
>> > > > restarts
>> > > > > > >> when a
>> > > > > > >> > > > > >> > >>>>>>>>> processor dies. This has
>> > > > > > >> > > > > >> > >>>>>>> made
>> > > > > > >> > > > > >> > >>>>>>>>> it very difficult to write a
>> standalone
>> > > Samza
>> > > > > > >> > container
>> > > > > > >> > > > > >> > >>>> (SAMZA-516).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Pluggability
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is good,
>> but I
>> > > > think
>> > > > > > that
>> > > > > > >> > > we've
>> > > > > > >> > > > > >> gone
>> > > > > > >> > > > > >> > >>>>>>>>> too
>> > > > > > >> > > > > >> > >>>>>> far
>> > > > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems
>> > > > (SystemConsumer,
>> > > > > > >> > > > > SystemProducer,
>> > > > > > >> > > > > >> > >>>> etc).
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
>> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about
>> > every
>> > > > > > >> component
>> > > > > > >> > > > > >> > >>>>> (MessageChooser,
>> > > > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper,
>> > > ConfigRewriter,
>> > > > > > etc).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've
>> > forgotten,
>> > > as
>> > > > > > well.
>> > > > > > >> > Some
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > >>>>>>>>> these
>> > > > > > >> > > > > >> > >>>>> are
>> > > > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not to
>> be.
>> > > This
>> > > > > all
>> > > > > > >> comes
>> > > > > > >> > > at
>> > > > > > >> > > > a
>> > > > > > >> > > > > >> cost:
>> > > > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is making
>> it
>> > > > harder
>> > > > > > for
>> > > > > > >> > our
>> > > > > > >> > > > > users
>> > > > > > >> > > > > >> > >>>>>>>>> to
>> > > > > > >> > > > > >> > >>>>> pick
>> > > > > > >> > > > > >> > >>>>>> up
>> > > > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It also
>> > makes
>> > > > it
>> > > > > > >> > difficult
>> > > > > > >> > > > for
>> > > > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about what
>> the
>> > > > > > >> > > characteristics
>> > > > > > >> > > > of
>> > > > > > >> > > > > >> > >>>>>>>>> the container (since the
>> characteristics
>> > > > change
>> > > > > > >> > > depending
>> > > > > > >> > > > on
>> > > > > > >> > > > > >> > >>>>>>>>> which plugins are use).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are most
>> > > visible
>> > > > > in
>> > > > > > the
>> > > > > > >> > > > System
>> > > > > > >> > > > > >> APIs.
>> > > > > > >> > > > > >> > >>>>> What
>> > > > > > >> > > > > >> > >>>>>>>>> Samza really requires to be
>> functional is
>> > > > Kafka
>> > > > > > as
>> > > > > > >> its
>> > > > > > >> > > > > >> > >>>>>>>>> transport
>> > > > > > >> > > > > >> > >>>>>> layer.
>> > > > > > >> > > > > >> > >>>>>>>>> But
>> > > > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated use
>> cases
>> > > into
>> > > > > one
>> > > > > > >> API:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports both
>> of
>> > > these
>> > > > > use
>> > > > > > >> > cases.
>> > > > > > >> > > > The
>> > > > > > >> > > > > >> > >>>>>>>>> problem
>> > > > > > >> > > > > >> > >>>>>> is,
>> > > > > > >> > > > > >> > >>>>>>>>> we
>> > > > > > >> > > > > >> > >>>>>>>>> actually want different features for
>> each
>> > > use
>> > > > > > case.
>> > > > > > >> By
>> > > > > > >> > > > > >> papering
>> > > > > > >> > > > > >> > >>>>>>>>> over
>> > > > > > >> > > > > >> > >>>>>>> these
>> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a single
>> > API,
>> > > > > we've
>> > > > > > >> > > > introduced
>> > > > > > >> > > > > a
>> > > > > > >> > > > > >> > >>>>>>>>> ton of
>> > > > > > >> > > > > >> > >>>>>>> leaky
>> > > > > > >> > > > > >> > >>>>>>>>> abstractions.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really like in
>> (2)
>> > > is
>> > > > to
>> > > > > > have
>> > > > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs for
>> > offsets
>> > > > > (like
>> > > > > > >> > Kafka).
>> > > > > > >> > > > > This
>> > > > > > >> > > > > >> > >>>>>>>>> would be at odds
>> > > > > > >> > > > > >> > >>>>> with
>> > > > > > >> > > > > >> > >>>>>>> (1),
>> > > > > > >> > > > > >> > >>>>>>>>> though, since different systems have
>> > > > different
>> > > > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
>> > > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the
>> mailing
>> > > list
>> > > > > and
>> > > > > > >> the
>> > > > > > >> > > SQL
>> > > > > > >> > > > > >> JIRAs
>> > > > > > >> > > > > >> > >>>>> about
>> > > > > > >> > > > > >> > >>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>>> need for this.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for
>> > > replayability.
>> > > > > > Kafka
>> > > > > > >> > > allows
>> > > > > > >> > > > us
>> > > > > > >> > > > > >> to
>> > > > > > >> > > > > >> > >>>>> rewind
>> > > > > > >> > > > > >> > >>>>>>>>> when
>> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other systems
>> > > don't.
>> > > > In
>> > > > > > some
>> > > > > > >> > > > cases,
>> > > > > > >> > > > > >> > >>>>>>>>> systems
>> > > > > > >> > > > > >> > >>>>>>> return
>> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g.
>> > > > > > >> WikipediaSystemConsumer)
>> > > > > > >> > > > > because
>> > > > > > >> > > > > >> > >>>>>>>>> they
>> > > > > > >> > > > > >> > >>>>>> have
>> > > > > > >> > > > > >> > >>>>>>> no
>> > > > > > >> > > > > >> > >>>>>>>>> offsets.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example. Kafka
>> > > > supports
>> > > > > > >> > > > > partitioning,
>> > > > > > >> > > > > >> > >>>>>>>>> but
>> > > > > > >> > > > > >> > >>>>> many
>> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by
>> having a
>> > > > single
>> > > > > > >> > > partition
>> > > > > > >> > > > > for
>> > > > > > >> > > > > >> > >>>>>>>>> those systems. Still, other systems
>> model
>> > > > > > >> partitioning
>> > > > > > >> > > > > >> > >>>> differently (e.g.
>> > > > > > >> > > > > >> > >>>>>>>>> Kinesis).
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is also a
>> mess.
>> > > > > > Creating
>> > > > > > >> > > streams
>> > > > > > >> > > > > in
>> > > > > > >> > > > > >> a
>> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost
>> impossible.
>> > > As
>> > > > is
>> > > > > > >> > modeling
>> > > > > > >> > > > > >> > >>>>>>>>> metadata
>> > > > > > >> > > > > >> > >>>>> for
>> > > > > > >> > > > > >> > >>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor,
>> partitions,
>> > > > > location,
>> > > > > > >> > etc).
>> > > > > > >> > > > The
>> > > > > > >> > > > > >> > >>>>>>>>> list
>> > > > > > >> > > > > >> > >>>>> goes
>> > > > > > >> > > > > >> > >>>>>>> on.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Duplicate work
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> At the time that we began writing
>> Samza,
>> > > > > Kafka's
>> > > > > > >> > > consumer
>> > > > > > >> > > > > and
>> > > > > > >> > > > > >> > >>>>> producer
>> > > > > > >> > > > > >> > >>>>>>>>> APIs
>> > > > > > >> > > > > >> > >>>>>>>>> had a relatively weak feature set. On
>> the
>> > > > > > >> > consumer-side,
>> > > > > > >> > > > you
>> > > > > > >> > > > > >> > >>>>>>>>> had two
>> > > > > > >> > > > > >> > >>>>>>>>> options: use the high level consumer,
>> or
>> > > the
>> > > > > > simple
>> > > > > > >> > > > > consumer.
>> > > > > > >> > > > > >> > >>>>>>>>> The
>> > > > > > >> > > > > >> > >>>>>>> problem
>> > > > > > >> > > > > >> > >>>>>>>>> with the high-level consumer was that
>> it
>> > > > > > controlled
>> > > > > > >> > your
>> > > > > > >> > > > > >> > >>>>>>>>> offsets, partition assignments, and
>> the
>> > > order
>> > > > > in
>> > > > > > >> which
>> > > > > > >> > > you
>> > > > > > >> > > > > >> > >>>>>>>>> received messages. The
>> > > > > > >> > > > > >> > >>>>> problem
>> > > > > > >> > > > > >> > >>>>>>>>> with
>> > > > > > >> > > > > >> > >>>>>>>>> the simple consumer is that it's not
>> > > simple.
>> > > > > It's
>> > > > > > >> > basic.
>> > > > > > >> > > > You
>> > > > > > >> > > > > >> > >>>>>>>>> end up
>> > > > > > >> > > > > >> > >>>>>>> having
>> > > > > > >> > > > > >> > >>>>>>>>> to handle a lot of really low-level
>> stuff
>> > > > that
>> > > > > > you
>> > > > > > >> > > > > shouldn't.
>> > > > > > >> > > > > >> > >>>>>>>>> We
>> > > > > > >> > > > > >> > >>>>>> spent a
>> > > > > > >> > > > > >> > >>>>>>>>> lot of time to make Samza's
>> > > > KafkaSystemConsumer
>> > > > > > very
>> > > > > > >> > > > robust.
>> > > > > > >> > > > > >> It
>> > > > > > >> > > > > >> > >>>>>>>>> also allows us to support some cool
>> > > features:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> * Per-partition message ordering and
>> > > > > > prioritization.
>> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over partition
>> assignment
>> > > to
>> > > > > > support
>> > > > > > >> > > > joins,
>> > > > > > >> > > > > >> > >>>>>>>>> global
>> > > > > > >> > > > > >> > >>>>>> state
>> > > > > > >> > > > > >> > >>>>>>>>> (if we want to implement it :)), etc.
>> > > > > > >> > > > > >> > >>>>>>>>> * Tight control over offset
>> > checkpointing.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> What we didn't realize at the time is
>> > that
>> > > > > these
>> > > > > > >> > > features
>> > > > > > >> > > > > >> > >>>>>>>>> should
>> > > > > > >> > > > > >> > >>>>>>> actually
>> > > > > > >> > > > > >> > >>>>>>>>> be in Kafka. A lot of Kafka consumers
>> > (not
>> > > > just
>> > > > > > >> Samza
>> > > > > > >> > > > stream
>> > > > > > >> > > > > >> > >>>>>> processors)
>> > > > > > >> > > > > >> > >>>>>>>>> end up wanting to do things like joins
>> > and
>> > > > > > partition
>> > > > > > >> > > > > >> > >>>>>>>>> assignment. The
>> > > > > > >> > > > > >> > >>>>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>>> community has come to the same
>> > conclusion.
>> > > > > > They're
>> > > > > > >> > > adding
>> > > > > > >> > > > a
>> > > > > > >> > > > > >> ton
>> > > > > > >> > > > > >> > >>>>>>>>> of upgrades into their new Kafka
>> consumer
>> > > > > > >> > > implementation.
>> > > > > > >> > > > > To a
>> > > > > > >> > > > > >> > >>>>>>>>> large extent,
>> > > > > > >> > > > > >> > >>>>> it's
>> > > > > > >> > > > > >> > >>>>>>>>> duplicate work to what we've already
>> done
>> > > in
>> > > > > > Samza.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> On top of this, Kafka ended up taking
>> a
>> > > very
>> > > > > > similar
>> > > > > > >> > > > > approach
>> > > > > > >> > > > > >> > >>>>>>>>> to
>> > > > > > >> > > > > >> > >>>>>> Samza's
>> > > > > > >> > > > > >> > >>>>>>>>> KafkaCheckpointManager implementation
>> for
>> > > > > > handling
>> > > > > > >> > > offset
>> > > > > > >> > > > > >> > >>>>>> checkpointing.
>> > > > > > >> > > > > >> > >>>>>>>>> Like Samza, Kafka's new offset
>> management
>> > > > > feature
>> > > > > > >> > stores
>> > > > > > >> > > > > >> offset
>> > > > > > >> > > > > >> > >>>>>>>>> checkpoints in a topic, and allows
>> you to
>> > > > fetch
>> > > > > > them
>> > > > > > >> > > from
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>>>>>>> broker.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> A lot of this seems like a waste,
>> since
>> > we
>> > > > > could
>> > > > > > >> have
>> > > > > > >> > > > shared
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>> work
>> > > > > > >> > > > > >> > >>>>>> if
>> > > > > > >> > > > > >> > >>>>>>>>> it
>> > > > > > >> > > > > >> > >>>>>>>>> had been done in Kafka from the
>> get-go.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Vision
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> All of this leads me to a rather
>> radical
>> > > > > > proposal.
>> > > > > > >> > Samza
>> > > > > > >> > > > is
>> > > > > > >> > > > > >> > >>>>> relatively
>> > > > > > >> > > > > >> > >>>>>>>>> stable at this point. I'd venture to
>> say
>> > > that
>> > > > > > we're
>> > > > > > >> > > near a
>> > > > > > >> > > > > 1.0
>> > > > > > >> > > > > >> > >>>>>> release.
>> > > > > > >> > > > > >> > >>>>>>>>> I'd
>> > > > > > >> > > > > >> > >>>>>>>>> like to propose that we take what
>> we've
>> > > > > learned,
>> > > > > > and
>> > > > > > >> > > begin
>> > > > > > >> > > > > >> > >>>>>>>>> thinking
>> > > > > > >> > > > > >> > >>>>>>> about
>> > > > > > >> > > > > >> > >>>>>>>>> Samza beyond 1.0. What would we
>> change if
>> > > we
>> > > > > were
>> > > > > > >> > > starting
>> > > > > > >> > > > > >> from
>> > > > > > >> > > > > >> > >>>>>> scratch?
>> > > > > > >> > > > > >> > >>>>>>>>> My
>> > > > > > >> > > > > >> > >>>>>>>>> proposal is to:
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> 1. Make Samza standalone the *only*
>> way
>> > to
>> > > > run
>> > > > > > Samza
>> > > > > > >> > > > > >> > >>>>>>>>> processors, and eliminate all direct
>> > > > > dependences
>> > > > > > on
>> > > > > > >> > > YARN,
>> > > > > > >> > > > > >> Mesos,
>> > > > > > >> > > > > >> > >>>> etc.
>> > > > > > >> > > > > >> > >>>>>>>>> 2. Make a definitive call to support
>> only
>> > > > Kafka
>> > > > > > as
>> > > > > > >> the
>> > > > > > >> > > > > stream
>> > > > > > >> > > > > >> > >>>>>> processing
>> > > > > > >> > > > > >> > >>>>>>>>> layer.
>> > > > > > >> > > > > >> > >>>>>>>>> 3. Eliminate Samza's metrics, logging,
>> > > > > > >> serialization,
>> > > > > > >> > > and
>> > > > > > >> > > > > >> > >>>>>>>>> config
>> > > > > > >> > > > > >> > >>>>>>> systems,
>> > > > > > >> > > > > >> > >>>>>>>>> and simply use Kafka's instead.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> This would fix all of the issues that
>> I
>> > > > > outlined
>> > > > > > >> > above.
>> > > > > > >> > > It
>> > > > > > >> > > > > >> > >>>>>>>>> should
>> > > > > > >> > > > > >> > >>>>> also
>> > > > > > >> > > > > >> > >>>>>>>>> shrink the Samza code base pretty
>> > > > dramatically.
>> > > > > > >> > > Supporting
>> > > > > > >> > > > > >> only
>> > > > > > >> > > > > >> > >>>>>>>>> a standalone container will allow
>> Samza
>> > to
>> > > be
>> > > > > > >> executed
>> > > > > > >> > > on
>> > > > > > >> > > > > YARN
>> > > > > > >> > > > > >> > >>>>>>>>> (using Slider), Mesos (using
>> > > > Marathon/Aurora),
>> > > > > or
>> > > > > > >> most
>> > > > > > >> > > > other
>> > > > > > >> > > > > >> > >>>>>>>>> in-house
>> > > > > > >> > > > > >> > >>>>>>> deployment
>> > > > > > >> > > > > >> > >>>>>>>>> systems. This should make life a lot
>> > easier
>> > > > for
>> > > > > > new
>> > > > > > >> > > users.
>> > > > > > >> > > > > >> > >>>>>>>>> Imagine
>> > > > > > >> > > > > >> > >>>>>>> having
>> > > > > > >> > > > > >> > >>>>>>>>> the hello-samza tutorial without YARN.
>> > The
>> > > > drop
>> > > > > > in
>> > > > > > >> > > mailing
>> > > > > > >> > > > > >> list
>> > > > > > >> > > > > >> > >>>>>> traffic
>> > > > > > >> > > > > >> > >>>>>>>>> will be pretty dramatic.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Coupling with Kafka seems long
>> overdue to
>> > > me.
>> > > > > The
>> > > > > > >> > > reality
>> > > > > > >> > > > > is,
>> > > > > > >> > > > > >> > >>>>> everyone
>> > > > > > >> > > > > >> > >>>>>>>>> that
>> > > > > > >> > > > > >> > >>>>>>>>> I'm aware of is using Samza with
>> Kafka.
>> > We
>> > > > > > basically
>> > > > > > >> > > > require
>> > > > > > >> > > > > >> it
>> > > > > > >> > > > > >> > >>>>>> already
>> > > > > > >> > > > > >> > >>>>>>> in
>> > > > > > >> > > > > >> > >>>>>>>>> order for most features to work. Those
>> > that
>> > > > are
>> > > > > > >> using
>> > > > > > >> > > > other
>> > > > > > >> > > > > >> > >>>>>>>>> systems
>> > > > > > >> > > > > >> > >>>>>> are
>> > > > > > >> > > > > >> > >>>>>>>>> generally using it for ingest into
>> Kafka
>> > > (1),
>> > > > > and
>> > > > > > >> then
>> > > > > > >> > > > they
>> > > > > > >> > > > > do
>> > > > > > >> > > > > >> > >>>>>>>>> the processing on top. There is
>> already
>> > > > > > discussion (
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > >
>> > > > > > >> >
>> > > > > >
>> > >
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851
>> > > > > > >> > > > > >> > >>>>> 767
>> > > > > > >> > > > > >> > >>>>>>>>> )
>> > > > > > >> > > > > >> > >>>>>>>>> in Kafka to make ingesting into Kafka
>> > > > extremely
>> > > > > > >> easy.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Once we make the call to couple with
>> > Kafka,
>> > > > we
>> > > > > > can
>> > > > > > >> > > > leverage
>> > > > > > >> > > > > a
>> > > > > > >> > > > > >> > >>>>>>>>> ton of
>> > > > > > >> > > > > >> > >>>>>>> their
>> > > > > > >> > > > > >> > >>>>>>>>> ecosystem. We no longer have to
>> maintain
>> > > our
>> > > > > own
>> > > > > > >> > config,
>> > > > > > >> > > > > >> > >>>>>>>>> metrics,
>> > > > > > >> > > > > >> > >>>>> etc.
>> > > > > > >> > > > > >> > >>>>>>> We
>> > > > > > >> > > > > >> > >>>>>>>>> can all share the same libraries, and
>> > make
>> > > > them
>> > > > > > >> > better.
>> > > > > > >> > > > This
>> > > > > > >> > > > > >> > >>>>>>>>> will
>> > > > > > >> > > > > >> > >>>>> also
>> > > > > > >> > > > > >> > >>>>>>>>> allow us to share the
>> consumer/producer
>> > > APIs,
>> > > > > and
>> > > > > > >> will
>> > > > > > >> > > let
>> > > > > > >> > > > > us
>> > > > > > >> > > > > >> > >>>>> leverage
>> > > > > > >> > > > > >> > >>>>>>>>> their offset management and partition
>> > > > > management,
>> > > > > > >> > rather
>> > > > > > >> > > > > than
>> > > > > > >> > > > > >> > >>>>>>>>> having
>> > > > > > >> > > > > >> > >>>>>> our
>> > > > > > >> > > > > >> > >>>>>>>>> own. All of the coordinator stream
>> code
>> > > would
>> > > > > go
>> > > > > > >> away,
>> > > > > > >> > > as
>> > > > > > >> > > > > >> would
>> > > > > > >> > > > > >> > >>>>>>>>> most
>> > > > > > >> > > > > >> > >>>>>> of
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>>> YARN AppMaster code. We'd probably
>> have
>> > to
>> > > > push
>> > > > > > some
>> > > > > > >> > > > > partition
>> > > > > > >> > > > > >> > >>>>>>> management
>> > > > > > >> > > > > >> > >>>>>>>>> features into the Kafka broker, but
>> > they're
>> > > > > > already
>> > > > > > >> > > moving
>> > > > > > >> > > > > in
>> > > > > > >> > > > > >> > >>>>>>>>> that direction with the new consumer
>> API.
>> > > The
>> > > > > > >> features
>> > > > > > >> > > we
>> > > > > > >> > > > > have
>> > > > > > >> > > > > >> > >>>>>>>>> for
>> > > > > > >> > > > > >> > >>>>>> partition
>> > > > > > >> > > > > >> > >>>>>>>>> assignment aren't unique to Samza, and
>> > seem
>> > > > > like
>> > > > > > >> they
>> > > > > > >> > > > should
>> > > > > > >> > > > > >> be
>> > > > > > >> > > > > >> > >>>>>>>>> in
>> > > > > > >> > > > > >> > >>>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>>> anyway. There will always be some
>> niche
>> > > > usages
>> > > > > > which
>> > > > > > >> > > will
>> > > > > > >> > > > > >> > >>>>>>>>> require
>> > > > > > >> > > > > >> > >>>>>> extra
>> > > > > > >> > > > > >> > >>>>>>>>> care and hence full control over
>> > partition
>> > > > > > >> assignments
>> > > > > > >> > > > much
>> > > > > > >> > > > > >> > >>>>>>>>> like the
>> > > > > > >> > > > > >> > >>>>>>> Kafka
>> > > > > > >> > > > > >> > >>>>>>>>> low level consumer api. These would
>> > > continue
>> > > > to
>> > > > > > be
>> > > > > > >> > > > > supported.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> These items will be good for the Samza
>> > > > > community.
>> > > > > > >> > > They'll
>> > > > > > >> > > > > make
>> > > > > > >> > > > > >> > >>>>>>>>> Samza easier to use, and make it
>> easier
>> > for
>> > > > > > >> developers
>> > > > > > >> > > to
>> > > > > > >> > > > > add
>> > > > > > >> > > > > >> > >>>>>>>>> new features.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Obviously this is a fairly large (and
>> > > > somewhat
>> > > > > > >> > backwards
>> > > > > > >> > > > > >> > >>>>> incompatible
>> > > > > > >> > > > > >> > >>>>>>>>> change). If we choose to go this
>> route,
>> > > it's
>> > > > > > >> important
>> > > > > > >> > > > that
>> > > > > > >> > > > > we
>> > > > > > >> > > > > >> > >>>>> openly
>> > > > > > >> > > > > >> > >>>>>>>>> communicate how we're going to
>> provide a
>> > > > > > migration
>> > > > > > >> > path
>> > > > > > >> > > > from
>> > > > > > >> > > > > >> > >>>>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>> existing
>> > > > > > >> > > > > >> > >>>>>>>>> APIs to the new ones (if we make
>> > > incompatible
>> > > > > > >> > changes).
>> > > > > > >> > > I
>> > > > > > >> > > > > >> think
>> > > > > > >> > > > > >> > >>>>>>>>> at a minimum, we'd probably need to
>> > > provide a
>> > > > > > >> wrapper
>> > > > > > >> > to
>> > > > > > >> > > > > allow
>> > > > > > >> > > > > >> > >>>>>>>>> existing StreamTask implementations to
>> > > > continue
>> > > > > > >> > running
>> > > > > > >> > > on
>> > > > > > >> > > > > the
>> > > > > > >> > > > > >> > >>>> new container.
>> > > > > > >> > > > > >> > >>>>>>> It's
>> > > > > > >> > > > > >> > >>>>>>>>> also important that we openly
>> communicate
>> > > > about
>> > > > > > >> > timing,
>> > > > > > >> > > > and
>> > > > > > >> > > > > >> > >>>>>>>>> stages
>> > > > > > >> > > > > >> > >>>>> of
>> > > > > > >> > > > > >> > >>>>>>> the
>> > > > > > >> > > > > >> > >>>>>>>>> migration.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> If you made it this far, I'm sure you
>> > have
>> > > > > > opinions.
>> > > > > > >> > :)
>> > > > > > >> > > > > Please
>> > > > > > >> > > > > >> > >>>>>>>>> send
>> > > > > > >> > > > > >> > >>>>>> your
>> > > > > > >> > > > > >> > >>>>>>>>> thoughts and feedback.
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>> Cheers,
>> > > > > > >> > > > > >> > >>>>>>>>> Chris
>> > > > > > >> > > > > >> > >>>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>>
>> > > > > > >> > > > > >> > >>>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>> --
>> > > > > > >> > > > > >> > >>>>>> -- Guozhang
>> > > > > > >> > > > > >> > >>>>>>
>> > > > > > >> > > > > >> > >>>>>
>> > > > > > >> > > > > >> > >>>>
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> > >>
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >> >
>> > > > > > >> > > > > >>
>> > > > > > >> > > > > >
>> > > > > > >> > > > > >
>> > > > > > >> > > > >
>> > > > > > >> > > >
>> > > > > > >> > >
>> > > > > > >> >
>> > > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Thoughts and obesrvations on Samza

Reply via email to