Re: Thoughts and observations on Samza

Yi Pan Mon, 13 Jul 2015 15:33:14 -0700

Hi, Garry,

Just want to chime in to share our experience at LinkedIn. At LinkedIn, we
have a lot of aggregation/transformation stream processing jobs that fall
into the "transformation" category. That is also our motivation for
developing the SQL layer on top of streams: to provide an easy programming
model for data transformation on streams. Ingestion from a wide range of
sources and egress to some serving tier are important, but I would argue
that without the "transformation" in between, stream processing does not add
much value.


Just my 2 cents.

On Mon, Jul 13, 2015 at 12:56 PM, Garry Turkington <
g.turking...@improvedigital.com> wrote:

> Hi,
>
> I'm also supportive of Jay's option 5. There is a risk that the "transformer
> API" -- I'd have preferred Metamorphosis but it's too hard to type! --
> takes on a life of its own and we end up with two very different things. But
> given how good the Kafka community has been at introducing new producer and
> consumer clients and giving very clear guidance on when they are production
> ready, this is a danger I believe can be managed. It'd also be excellent to
> get some working code to beat around the notions of stream processing atop
> a system with transactional messages.
>
> On the question of whether to keep or deprecate SystemConsumer/Producer, I
> believe we need to get a better understanding over the next while of just
> what the Samza community is looking for in such connectivity. For my own use
> cases I have been looking to add additional implementations, primarily to
> use Samza as the data ingress and egress component around Kafka. Writing
> external clients that require their own reliability and scalability
> management gets old real fast, and pushing this into a simple Samza job that
> reads from system X and pushes into Kafka (or vice versa) was the obvious
> choice for me in the current model. For this type of usage, though, copycat
> is likely much superior (obviously this needs to be proven), and the question
> then is whether most Samza users look to the system implementations to also
> act as a front-end into Kafka, or whether significant usage is indeed
> intended to have the alternative systems as the primary message source. That
> understanding will, I think, give much clarity on just what value the
> abstraction overhead of the current model brings.
>
> Garry
>
> -----Original Message-----
> From: Yan Fang [mailto:yanfang...@gmail.com]
> Sent: 13 July 2015 19:58
> To: dev@samza.apache.org
> Subject: Re: Thoughts and observations on Samza
>
> I am leaning toward Jay's fifth approach. It is not radical and gives us some
> time to see the outcome.
>
> In addition, I would suggest:
>
> 1) Keep the SystemConsumer/SystemProducer API. The current
> SystemConsumer/SystemProducer API satisfies the usage (judging from Jordan's
> and even Garry's feedback) and is not so broken that we want to deprecate it.
> Though there are some issues in implementing Kinesis, they are not
> unfixable. Nothing should prevent Samza, as a stream processing system, from
> supporting other systems. In addition, there are already some systems
> existing besides Kafka: ElasticSearch (committed to master), HDFS
> (patch available), S3 (from the mailing list), Kinesis (being developed in
> another repository), ActiveMQ (in two months). We may want to see how those
> go before we "kill" them.
>
> 2) Have some Samza devs involved in Kafka's "transformer" client API.
> This would not only make future integration (if any) much easier, because
> they have knowledge of both systems, but would also be good for Kafka's
> community, because Samza devs have stream processing experience that
> Kafka devs may lack.
>
> 3) Samza's partition management system may still support other systems.
> Though the partition management logic in samza-kafka will be moved to
> Kafka, it's still useful for other systems that do not have a partition
> management layer.
>
> 4) Start sharing the docs/websites and using the same terminology (though
> I do not know exactly how to do this :). This will reduce future
> confusion and does not hurt Samza's independence.
>
> In my opinion, Samza, as a standalone project, can still rely (and already
> does rely) heavily on Kafka, and be even more tuned for Kafka-specific usage.
> Kafka can also embed Samza in its documentation; I do not see anything
> preventing this.
>
> Thanks,
>
> Fang, Yan
> yanfang...@gmail.com
>
> On Mon, Jul 13, 2015 at 11:25 AM, Jordan Shaw <jor...@pubnub.com> wrote:
>
> > Jay,
> > I think doing this iteratively in smaller chunks is a better way to go as
> > new issues arise. As Navina said, Kafka is a "stream system" and Samza is a
> > "stream processor" and those two ideas should be mutually exclusive.
> >
> > -Jordan
> >
> > On Mon, Jul 13, 2015 at 10:06 AM, Jay Kreps <jay.kr...@gmail.com> wrote:
> >
> > > Hmm, thought about this more. Maybe this is just too much too quick.
> > > Overall I think there is some enthusiasm for the proposal but it's not
> > > really unanimous enough to make any kind of change this big cleanly. The
> > > board doesn't really like the merging stuff, users are concerned about
> > > compatibility, I didn't feel there was unanimous agreement on dropping
> > > SystemConsumer, etc. Even if this is the right end state to get to,
> > > probably trying to push all this through at once isn't the right way to
> > > do it.
> > >
> > > So let me propose a kind of fifth (?) option which I think is less
> > > dramatic and lets things happen gradually. I think this is kind of like
> > > combining the first part of Yi's proposal and Jakob's third option,
> > > leaving the rest to be figured out incrementally:
> > >
> > > Option 5: We continue the prototype I shared and propose that as a kind
> > > of "transformer" client API in Kafka. This isn't really a full-fledged
> > > stream processing layer, more like a souped-up consumer api for munging
> > > topics. This would let us figure out some of the technical bits, how to
> > > do this on Kafka's group management features, how to integrate the txn
> > > feature to do the exactly-once stuff in these transformations, and get
> > > all this stuff solid. This api would have valid uses in its own right,
> > > especially when your transformation will be embedded inside an existing
> > > service or application which isn't possible with Samza (or other existing
> > > systems that I know of).
> > >
> > > Independently we can iterate on some of the ideas of the original
> > > proposal individually and figure out how (if at all) to make use of this
> > > functionality. This can be done bit-by-bit:
> > > - Could be that the existing StreamTask API ends up wrapping this
> > > - Could end up exposed directly in Samza as Yi proposed
> > > - Could be that just the lower-level group-management stuff gets used,
> > > and in this case it could be either just for standalone mode, or always
> > > - Could be that it stays as-is
> > >
> > > The advantage of this is it is lower risk...we basically don't have to
> > > make 12 major decisions all at once that kind of hinge on what amounts to
> > > a pretty aggressive rewrite. The disadvantage of this is it is a bit more
> > > confusing as all this is getting figured out.
> > >
> > > As with some of the other stuff, this would require a further
> > > discussion in the Kafka community if people do like this approach.
> > >
> > > Thoughts?
> > >
> > > -Jay
> > >
> > >
> > >
> > >
> > > On Sun, Jul 12, 2015 at 10:52 PM, Jay Kreps <jay.kr...@gmail.com> wrote:
> > >
> > > > Hey Chris,
> > > >
> > > > Yeah, I'm obviously in favor of this.
> > > >
> > > > The sub-project approach seems the ideal way to take a graceful step in
> > > > this direction, so I will ping the board folks and see why they are
> > > > discouraged; it would be good to understand that. If we go that route we
> > > > would need to do a similar discussion in the Kafka list (but makes sense
> > > > to figure out first if it is what Samza wants).
> > > >
> > > > Irrespective of how it's implemented, though, to me the important things
> > > > are the following:
> > > > 1. Unify the website, config, naming, docs, metrics, etc--basically fix
> > > > the product experience so the "stream" and the "processing" feel like a
> > > > single user experience and brand. This seems minor but I think is a
> > > > really big deal.
> > > > 2. Make "standalone" mode a first class citizen and have a real technical
> > > > plan to be able to support cluster managers other than YARN.
> > > > 3. Make the config and out-of-the-box experience more usable
> > > >
> > > > I think that prototype gives a practical example of how 1-3 could be done
> > > > and we should pursue it. This is a pretty radical change, so I wouldn't be
> > > > shocked if people didn't want to take a step like that.
> > > >
> > > > Maybe it would make sense to see if people are on board with that general
> > > > idea, and then try to get some advice on sub-projects in parallel and nail
> > > > down those details?
> > > >
> > > > -Jay
> > > >
> > > > On Sun, Jul 12, 2015 at 5:54 PM, Chris Riccomini <criccom...@apache.org>
> > > > wrote:
> > > >
> > > >> Hey all,
> > > >>
> > > >> I want to start by saying that I'm absolutely thrilled to be a part of
> > > >> this community. The amount of level-headed, thoughtful, educated
> > > >> discussion that's gone on over the past ~10 days is overwhelming.
> > > >> Wonderful.
> > > >>
> > > >> It seems like discussion is waning a bit, and we've reached some
> > > >> conclusions. There are several key emails in this thread, which I want to
> > > >> call out:
> > > >>
> > > >> 1. Jakob's summary of the three potential ways forward.
> > > >>
> > > >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVu-hxdBfyQ4qm3LDC55cUQbPdmbe4zGzTOOatYF1Pz43A%40mail.gmail.com%3E
> > > >>
> > > >> 2. Julian's call out that we should be focusing on community over code.
> > > >>
> > > >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCAPSgeESZ_7bVFbwN%2Bzqi5MH%3D4CWu9MZUSanKg0-1woMqt55Fvg%40mail.gmail.com%3E
> > > >>
> > > >> 3. Martin's summary about the benefits of merging communities.
> > > >>
> > > >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CBFB866B6-D9D8-4578-93C0-FFAEB1DF00FC%40kleppmann.com%3E
> > > >>
> > > >> 4. Jakob's comments about the distinction between community and code
> > > >> paths.
> > > >>
> > > >> http://mail-archives.apache.org/mod_mbox/samza-dev/201507.mbox/%3CCADiKvVtWPjHLLDsmxvz9KggVA5DfBi-nUvfqB6QdA-du%2B_a9Ng%40mail.gmail.com%3E
> > > >>
> > > >> I agree with the comments on all of these emails. I think Martin's
> > > >> summary of his position aligns very closely with my own. To that end, I
> > > >> think we should get concrete about what the proposal is, and call a vote
> > > >> on it. Given that Jay, Martin, and I seem to be aligning fairly closely,
> > > >> I think we should start with:
> > > >>
> > > >> 1. [community] Make Samza a subproject of Kafka.
> > > >> 2. [community] Make all Samza PMC/committers committers of the
> > > >> subproject.
> > > >> 3. [community] Migrate Samza's website/documentation into Kafka's.
> > > >> 4. [code] Have the Samza community and the Kafka community start a
> > > >> from-scratch reboot together in the new Kafka subproject. We can
> > > >> borrow/copy & paste significant chunks of code from Samza's code base.
> > > >> 5. [code] The subproject would intentionally eliminate support for both
> > > >> other streaming systems and all deployment systems.
> > > >> 6. [code] Attempt to provide a bridge from our SystemConsumer to KIP-26
> > > >> (copy cat)
> > > >> 7. [code] Attempt to provide a bridge from the new subproject's processor
> > > >> interface to our legacy StreamTask interface.
> > > >> 8. [code/community] Sunset Samza as a TLP when we have a working Kafka
> > > >> subproject that has a fault-tolerant container with state management.
> > > >>
> > > >> It's likely that (6) and (7) won't be fully drop-in. Still, the closer we
> > > >> can get, the better it's going to be for our existing community.
> > > >>
> > > >> One thing that I didn't touch on with (2) is whether any Samza PMC
> > > >> members should be rolled into Kafka PMC membership as well (though, Jay
> > > >> and Jakob are already PMC members on both). I think that Samza's
> > > >> community deserves a voice on the PMC, so I'd propose that we roll at
> > > >> least a few PMC members into the Kafka PMC, but I don't have a strong
> > > >> framework for which people to pick.
> > > >>
> > > >> Before (8), I think that Samza's TLP can continue to commit bug fixes and
> > > >> patches as it sees fit, provided that we openly communicate that we won't
> > > >> necessarily migrate new features to the new subproject, and that the TLP
> > > >> will be shut down after the migration to the Kafka subproject occurs.
> > > >>
> > > >> Jakob, I could use your guidance here about how to achieve this from
> > > >> an Apache process perspective (sorry).
> > > >>
> > > >> * Should I just call a vote on this proposal?
> > > >> * Should it happen on dev or private?
> > > >> * Do committers have binding votes, or just PMC?
> > > >>
> > > >> Having trouble finding much detail on the Apache wikis. :(
> > > >>
> > > >> Cheers,
> > > >> Chris
> > > >>
> > > >> On Fri, Jul 10, 2015 at 2:38 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > > >>
> > > >> > Thanks, Jay. This argument persuaded me actually. :)
> > > >> >
> > > >> > Fang, Yan
> > > >> > yanfang...@gmail.com
> > > >> >
> > > >> > On Fri, Jul 10, 2015 at 2:33 PM, Jay Kreps <j...@confluent.io> wrote:
> > > >> >
> > > >> > > Hey Yan,
> > > >> > >
> > > >> > > Yeah philosophically I think the argument is that you should capture
> > > >> > > the stream in Kafka independent of the transformation. This is
> > > >> > > obviously a Kafka-centric view point.
> > > >> > >
> > > >> > > Advantages of this:
> > > >> > > - In practice I think this is what e.g. Storm people often end up
> > > >> > > doing anyway. You usually need to throttle any access to a live
> > > >> > > serving database.
> > > >> > > - Can have multiple subscribers and they get the same thing without
> > > >> > > additional load on the source system.
> > > >> > > - Applications can tap into the stream if need be by subscribing.
> > > >> > > - You can debug your transformation by tailing the Kafka topic with
> > > >> > > the console consumer
> > > >> > > - Can tee off the same data stream for batch analysis or Lambda arch
> > > >> > > style re-processing
> > > >> > >
> > > >> > > The disadvantage is that it will use Kafka resources. But the idea is
> > > >> > > eventually you will have multiple subscribers to any data source (at
> > > >> > > least for monitoring) so you will end up there soon enough anyway.
> > > >> > >
> > > >> > > Down the road the technical benefit is that I think it gives us a
> > > >> > > good path towards end-to-end exactly once semantics from source to
> > > >> > > destination. Basically the connectors need to support idempotence
> > > >> > > when talking to Kafka and we need the transactional write feature in
> > > >> > > Kafka to make the transformation atomic. This is actually pretty
> > > >> > > doable if you separate the connector=>kafka problem from the generic
> > > >> > > transformations which are always kafka=>kafka. However I think it is
> > > >> > > quite impossible to do in an all_things => all_things environment.
> > > >> > > Today you can say "well the semantics of the Samza APIs depend on the
> > > >> > > connectors you use" but it is actually worse than that because the
> > > >> > > semantics actually depend on the pairing of connectors--so not only
> > > >> > > can you probably not get a usable "exactly once" guarantee end-to-end
> > > >> > > it can actually be quite hard to reverse engineer what property (if
> > > >> > > any) your end-to-end flow has if you have heterogeneous systems.
> > > >> > >
> > > >> > > -Jay
> > > >> > >
> > > >> > > On Fri, Jul 10, 2015 at 2:00 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > > >> > >
> > > >> > > > {quote}
> > > >> > > > maintained in a separate repository and retaining the existing
> > > >> > > > committership but sharing as much else as possible (website, etc)
> > > >> > > > {quote}
> > > >> > > >
> > > >> > > > Overall, I agree on this idea. Now the question is more about "how
> > > >> > > > to do it".
> > > >> > > >
> > > >> > > > On the other hand, one thing I want to point out is that, if we
> > > >> > > > decide to go this way, how do we want to support the
> > > >> > > > otherSystem-transformation-otherSystem use case?
> > > >> > > >
> > > >> > > > Basically, there are four user groups here:
> > > >> > > >
> > > >> > > > 1. Kafka-transformation-Kafka
> > > >> > > > 2. Kafka-transformation-otherSystem
> > > >> > > > 3. otherSystem-transformation-Kafka
> > > >> > > > 4. otherSystem-transformation-otherSystem
> > > >> > > >
> > > >> > > > For group 1, they can easily use the new Samza library to achieve
> > > >> > > > this. For group 2 and 3, they can use copyCat -> transformation ->
> > > >> > > > Kafka or Kafka -> transformation -> copyCat.
> > > >> > > >
> > > >> > > > The problem is for group 4. Do we want to abandon this or still
> > > >> > > > support it? Of course, this use case can be achieved by using
> > > >> > > > copyCat -> transformation -> Kafka -> transformation -> copyCat; the
> > > >> > > > thing is how we persuade them to do this long chain. If yes, it will
> > > >> > > > be a win for Kafka too. Or if there is no one in this community
> > > >> > > > actually doing this so far, maybe it is ok to not support group 4
> > > >> > > > directly.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Fang, Yan
> > > >> > > > yanfang...@gmail.com
> > > >> > > >
> > > >> > > > On Fri, Jul 10, 2015 at 12:58 PM, Jay Kreps <j...@confluent.io> wrote:
> > > >> > > >
> > > >> > > > > Yeah I agree with this summary. I think there are kind of two
> > > >> > > > > questions here:
> > > >> > > > > 1. Technically does alignment/reliance on Kafka make sense
> > > >> > > > > 2. Branding wise (naming, website, concepts, etc) does alignment
> > > >> > > > > with Kafka make sense
> > > >> > > > >
> > > >> > > > > Personally I do think both of these things would be really
> > > >> > > > > valuable, and would dramatically alter the trajectory of the
> > > >> > > > > project.
> > > >> > > > >
> > > >> > > > > My preference would be to see if people can mostly agree on a
> > > >> > > > > direction rather than splintering things off. From my point of view
> > > >> > > > > the ideal outcome of all the options discussed would be to make
> > > >> > > > > Samza a closely aligned subproject, maintained in a separate
> > > >> > > > > repository and retaining the existing committership but sharing as
> > > >> > > > > much else as possible (website, etc). No idea about how these
> > > >> > > > > things work, Jakob, you probably know more.
> > > >> > > > >
> > > >> > > > > No discussion amongst the Kafka folks has happened on this, but
> > > >> > > > > likely we should figure out what the Samza community actually wants
> > > >> > > > > first.
> > > >> > > > >
> > > >> > > > > I admit that this is a fairly radical departure from how things
> > > >> > > > > are.
> > > >> > > > >
> > > >> > > > > If that doesn't fly, I think, yeah we could leave Samza as it is
> > > >> > > > > and do the more radical reboot inside Kafka. From my point of view
> > > >> > > > > that does leave things in a somewhat confusing state since now
> > > >> > > > > there are two stream processing systems more or less coupled to
> > > >> > > > > Kafka in large part made by the same people. But, arguably that
> > > >> > > > > might be a cleaner way to make the cut-over and perhaps less risky
> > > >> > > > > for Samza community since if it works people can switch and if it
> > > >> > > > > doesn't nothing will have changed. Dunno, how do people feel about
> > > >> > > > > this?
> > > >> > > > >
> > > >> > > > > -Jay
> > > >> > > > >
> > > >> > > > > On Fri, Jul 10, 2015 at 11:49 AM, Jakob Homan <jgho...@gmail.com>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > >  This leads me to thinking that merging projects and
> > > >> > > > > > > communities might be a good idea: with the union of experience
> > > >> > > > > > > from both communities, we will probably build a better system
> > > >> > > > > > > that is better for users.
> > > >> > > > > > Is this what's being proposed though? Merging the projects seems
> > > >> > > > > > like a consequence of at most one of the three directions under
> > > >> > > > > > discussion:
> > > >> > > > > > 1) Samza 2.0: The Samza community relies more heavily on Kafka
> > > >> > > > > > for configuration, etc. (to a greater or lesser extent to be
> > > >> > > > > > determined) but the Samza community would not automatically merge
> > > >> > > > > > with the Kafka community (the Phoenix/HBase example is a good one
> > > >> > > > > > here).
> > > >> > > > > > 2) Samza Reboot: The Samza community continues to exist with a
> > > >> > > > > > limited project scope, but similarly would not need to be part of
> > > >> > > > > > the Kafka community (ie given committership) to progress.  Here,
> > > >> > > > > > maybe the Samza team would become a subproject of Kafka (the
> > > >> > > > > > Board frowns on subprojects at the moment, so I'm not sure if
> > > >> > > > > > that's even feasible), but that would not be required.
> > > >> > > > > > 3) Hey Samza! FYI, Kafka does streaming now: In this option the
> > > >> > > > > > Kafka team builds its own streaming library, possibly off of
> > > >> > > > > > Jay's prototype, which has no direct lineage to the Samza team.
> > > >> > > > > > There's no reason for the Kafka team to bring in the Samza team.
> > > >> > > > > >
> > > >> > > > > > Is the Kafka community on board with this?
> > > >> > > > > >
> > > >> > > > > > To be clear, all three options under discussion are interesting,
> > > >> > > > > > technically valid and likely healthy directions for the project.
> > > >> > > > > > Also, they are not mutually exclusive.  The Samza community could
> > > >> > > > > > decide to pursue, say, 'Samza 2.0', while the Kafka community
> > > >> > > > > > went forward with 'Hey Samza!'  My points above are directed
> > > >> > > > > > entirely at the community aspect of these choices.
> > > >> > > > > > -Jakob
> > > >> > > > > >
> > > >> > > > > > On 10 July 2015 at 09:10, Roger Hoover <roger.hoo...@gmail.com>
> > > >> > > > > > wrote:
> > > >> > > > > > > That's great.  Thanks, Jay.
> > > >> > > > > > >
> > > >> > > > > > > On Fri, Jul 10, 2015 at 8:46 AM, Jay Kreps <j...@confluent.io>
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > >> Yeah totally agree. I think you have this issue even today,
> > > >> > > > > > >> right? I.e. if you need to make a simple config change and
> > > >> > > > > > >> you're running in YARN today you end up bouncing the job which
> > > >> > > > > > >> then rebuilds state. I think the fix is exactly what you
> > > >> > > > > > >> described which is to have a long timeout on partition movement
> > > >> > > > > > >> for stateful jobs so that if a job is just getting bounced, and
> > > >> > > > > > >> the cluster manager (or admin) is smart enough to restart it on
> > > >> > > > > > >> the same host when possible, it can optimistically reuse any
> > > >> > > > > > >> existing state it finds on disk (if it is valid).
> > > >> > > > > > >>
> > > >> > > > > > >> So in this model the charter of the CM is to place processes as
> > > >> > > > > > >> stickily as possible and to restart or re-place failed
> > > >> > > > > > >> processes. The charter of the partition management system is to
> > > >> > > > > > >> control the assignment of work to these processes. The nice
> > > >> > > > > > >> thing about this is that the work assignment, timeouts,
> > > >> > > > > > >> behavior, configs, and code will all be the same across all
> > > >> > > > > > >> cluster managers.
> > > >> > > > > > >>
> > > >> > > > > > >> So I think that prototype would actually give you exactly what
> > > >> > > > > > >> you want today for any cluster manager (or manual placement +
> > > >> > > > > > >> restart script) that was sticky in terms of host placement
> > > >> > > > > > >> since there is already a configurable partition movement
> > > >> > > > > > >> timeout and task-by-task state reuse with a check on state
> > > >> > > > > > >> validity.
> > > >> > > > > > >>
> > > >> > > > > > >> -Jay
> > > >> > > > > > >>
> > > >> > > > > > >> On Fri, Jul 10, 2015 at 8:34 AM, Roger Hoover
> > > >> > > > > > >> <roger.hoo...@gmail.com> wrote:
> > > >> > > > > > >>
> > > >> > > > > > >> > That would be great to let Kafka do as much heavy lifting as
> > > >> > > > > > >> > possible and make it easier for other languages to implement
> > > >> > > > > > >> > Samza apis.
> > > >> > > > > > >> >
> > > >> > > > > > >> > One thing to watch out for is the interplay between Kafka's
> > > >> > > > > > >> > group management and the external scheduler/process manager's
> > > >> > > > > > >> > fault tolerance. If a container dies, the Kafka group
> > > >> > > > > > >> > membership protocol will try to assign its tasks to other
> > > >> > > > > > >> > containers while at the same time the process manager is
> > > >> > > > > > >> > trying to relaunch the container.  Without some consideration
> > > >> > > > > > >> > for this (like a configurable amount of time to wait before
> > > >> > > > > > >> > Kafka alters the group membership), there may be thrashing
> > > >> > > > > > >> > going on which is especially bad for containers with large
> > > >> > > > > > >> > amounts of local state.
> > > >> > > > > > >> >
> > > >> > > > > > >> > Someone else pointed this out already but I thought it might
> > > >> > > > > > >> > be worth calling out again.
> > > >> > > > > > >> >
> > > >> > > > > > >> > Cheers,
> > > >> > > > > > >> >
> > > >> > > > > > >> > Roger
> > > >> > > > > > >> >
> > > >> > > > > > >> >
> > > >> > > > > > >> > On Tue, Jul 7, 2015 at 11:35 AM, Jay Kreps <j...@confluent.io>
> > > >> > > > > > >> > wrote:
> > > >> > > > > > >> >
> > > >> > > > > > >> > > Hey Roger,
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > I couldn't agree more. We spent a bunch of time talking to
> > > >> > > > > > >> > > people and that is exactly the stuff we heard time and
> > > >> > > > > > >> > > again. What makes it hard, of course, is that there is some
> > > >> > > > > > >> > > tension between compatibility with what's there now and
> > > >> > > > > > >> > > making things better for new users.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > I also strongly agree with the importance of multi-language
> > > >> > > > > > >> > > support. We are talking now about Java, but for application
> > > >> > > > > > >> > > development use cases people want to work in whatever
> > > >> > > > > > >> > > language they are using elsewhere. I think moving to a
> > > >> > > > > > >> > > model where Kafka itself does the group membership,
> > > >> > > > > > >> > > lifecycle control, and partition assignment has the
> > > >> > > > > > >> > > advantage of putting all that complex stuff behind a clean
> > > >> > > > > > >> > > api that the clients are already going to be implementing
> > > >> > > > > > >> > > for their consumer, so the added functionality for stream
> > > >> > > > > > >> > > processing beyond a consumer becomes very minor.
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > -Jay
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > On Tue, Jul 7, 2015 at 10:49 AM, Roger Hoover
> > > >> > > > > > >> > > <roger.hoo...@gmail.com> wrote:
> > > >> > > > > > >> > >
> > > >> > > > > > >> > > > Metamorphosis...nice. :)
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > This has been a great discussion.  As a user of Samza
> > > >> > > > > > >> > > > who's recently integrated it into a relatively large
> > > >> > > > > > >> > > > organization, I just want to add support to a few points
> > > >> > > > > > >> > > > already made.
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > The biggest hurdles to adoption of Samza as it currently
> > > >> > > > > > >> > > > exists that I've experienced are:
> > > >> > > > > > >> > > > 1) YARN - YARN is overly complex in many environments
> > > >> > > > > > >> > > > where Puppet would do just fine but it was the only
> > > >> > > > > > >> > > > mechanism to get fault tolerance.
> > > >> > > > > > >> > > > 2) Configuration - I think I like the idea of
> > > >> configuring
> > > >> > > most
> > > >> > > > > of
> > > >> > > > > > the
> > > >> > > > > > >> > job
> > > >> > > > > > >> > > > in code rather than config files.  In general, I
> > > think
> > > >> the
> > > >> > > > goal
> > > >> > > > > > >> should
> > > >> > > > > > >> > be
> > > >> > > > > > >> > > > to make it harder to make mistakes, especially of
> > the
> > > >> kind
> > > >> > > > where
> > > >> > > > > > the
> > > >> > > > > > >> > code
> > > >> > > > > > >> > > > expects something and the config doesn't match.
> > The
> > > >> > current
> > > >> > > > > > config
> > > >> > > > > > >> is
> > > >> > > > > > >> > > > quite intricate and error-prone.  For example,
> the
> > > >> > > application
> > > >> > > > > > logic
> > > >> > > > > > >> > may
> > > >> > > > > > >> > > > depend on bootstrapping a topic but rather than
> > > >> asserting
> > > >> > > that
> > > >> > > > > in
> > > >> > > > > > the
> > > >> > > > > > >> > > code,
> > > >> > > > > > >> > > > you have to rely on getting the config right.
> > > Likewise
> > > >> > with
> > > >> > > > > > serdes,
> > > >> > > > > > >> > the
> > > >> > > > > > >> > > > Java representations produced by various serdes
> > > (JSON,
> > > >> > Avro,
> > > >> > > > > etc.)
> > > >> > > > > > >> are
> > > >> > > > > > >> > > not
> > > >> > > > > > >> > > > equivalent so you cannot just reconfigure a serde
> > > >> without
> > > >> > > > > changing
> > > >> > > > > > >> the
> > > >> > > > > > >> > > > code.   It would be nice for jobs to be able to
> > > assert
> > > >> > what
> > > >> > > > they
> > > >> > > > > > >> expect
> > > >> > > > > > >> > > > from their input topics in terms of partitioning.
> > > >> This is
> > > >> > > > > > getting a
> > > >> > > > > > >> > > little
> > > >> > > > > > >> > > > off topic but I was even thinking about creating
> a
> > > >> "Samza
> > > >> > > > config
> > > >> > > > > > >> > linter"
> > > >> > > > > > >> > > > that would sanity check a set of configs.
> > Especially
> > > >> in
> > > >> > > > > > >> organizations
> > > >> > > > > > >> > > > where config is managed by a different team than
> > the
> > > >> > > > application
> > > >> > > > > > >> > > developer,
> > > >> > > > > > >> > > > it's very hard to avoid config mistakes.
> > > >> > > > > > >> > > > 3) Java/Scala centric - for many teams
> (especially
> > > >> > > DevOps-type
> > > >> > > > > > >> folks),
> > > >> > > > > > >> > > the
> > > >> > > > > > >> > > > pain of the Java toolchain (maven, slow builds,
> > weak
> > > >> > command
> > > >> > > > > line
> > > >> > > > > > >> > > support,
> > > >> > > > > > >> > > > configuration over convention) really inhibits
> > > >> > productivity.
> > > >> > > > As
> > > >> > > > > > more
> > > >> > > > > > >> > and
> > > >> > > > > > >> > > > more high-quality clients become available for
> > > Kafka, I
> > > >> > hope
> > > >> > > > > > they'll
> > > >> > > > > > >> > > follow
> > > >> > > > > > >> > > > Samza's model.  Not sure how much it affects the
> > > >> proposals
> > > >> > > in
> > > >> > > > > this
> > > >> > > > > > >> > thread
> > > >> > > > > > >> > > > but please consider other languages in the
> > ecosystem
> > > as
> > > >> > > well.
> > > >> > > > > > From
> > > >> > > > > > >> > what
> > > >> > > > > > >> > > > I've heard, Spark has more Python users than
> > > >> Java/Scala.
> > > >> > > > > > >> > > > (FYI, we added a Jython wrapper for the Samza API
> > > >> > > > > > >> > > > https://github.com/Quantiply/rico/tree/master/jython/src/main/java/com/quantiply/samza
> > > >> > > > > > >> > > > and are working on a Yeoman generator
> > > >> > > > > > >> > > > https://github.com/Quantiply/generator-rico for
> > > >> > > Jython/Samza
> > > >> > > > > > >> projects
> > > >> > > > > > >> > to
> > > >> > > > > > >> > > > alleviate some of the pain)
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > I also want to underscore Jay's point about
> > improving
> > > >> the
> > > >> > > user
> > > >> > > > > > >> > > experience.
> > > >> > > > > > >> > > > That's a very important factor for adoption.  I
> > think
> > > >> the
> > > >> > > goal
> > > >> > > > > > should
> > > >> > > > > > >> > be
> > > >> > > > > > >> > > to
> > > >> > > > > > >> > > > make Samza as easy to get started with as
> something
> > > >> like
> > > >> > > > > Logstash.
> > > >> > > > > > >> > > > Logstash is vastly inferior in terms of
> > capabilities
> > > to
> > > >> > > Samza
> > > >> > > > > but
> > > >> > > > > > >> it's
> > > >> > > > > > >> > > easy
> > > >> > > > > > >> > > > to get started and that makes a big difference.
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > Cheers,
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > Roger
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > On Tue, Jul 7, 2015 at 3:29 AM, Gianmarco De
> > > Francisci
> > > >> > > > Morales <
> > > >> > > > > > >> > > > g...@apache.org> wrote:
> > > >> > > > > > >> > > >
> > > >> > > > > > >> > > > > Forgot to add. On the naming issues, Kafka
> > > >> Metamorphosis
> > > >> > > is
> > > >> > > > a
> > > >> > > > > > clear
> > > >> > > > > > >> > > > winner
> > > >> > > > > > >> > > > > :)
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > --
> > > >> > > > > > >> > > > > Gianmarco
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > On 7 July 2015 at 13:26, Gianmarco De Francisci
> > > >> Morales
> > > >> > <
> > > >> > > > > > >> > > g...@apache.org
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > wrote:
> > > >> > > > > > >> > > > >
> > > >> > > > > > >> > > > > > Hi,
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > @Martin, thanks for your comments.
> > > >> > > > > > >> > > > > > Maybe I'm missing some important point, but I
> > > think
> > > >> > > > coupling
> > > >> > > > > > the
> > > >> > > > > > >> > > > releases
> > > >> > > > > > >> > > > > > is actually a *good* thing.
> > > >> > > > > > >> > > > > > To give an example, would it be better if the
> > MR
> > > >> and
> > > >> > > HDFS
> > > >> > > > > > >> > components
> > > >> > > > > > >> > > of
> > > >> > > > > > >> > > > > > Hadoop had different release schedules?
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > Actually, keeping the discussion in a single
> > > place
> > > >> > would
> > > >> > > > > make
> > > >> > > > > > >> > > agreeing
> > > >> > > > > > >> > > > on
> > > >> > > > > > >> > > > > > releases (and backwards compatibility) much
> > > >> easier, as
> > > >> > > > > > everybody
> > > >> > > > > > >> > > would
> > > >> > > > > > >> > > > be
> > > >> > > > > > >> > > > > > responsible for the whole codebase.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > That said, I like the idea of absorbing
> > > samza-core
> > > >> as
> > > >> > a
> > > >> > > > > > >> > sub-project,
> > > >> > > > > > >> > > > and
> > > >> > > > > > >> > > > > > leave the fancy stuff separate.
> > > >> > > > > > >> > > > > > It probably gives 90% of the benefits we have
> > > been
> > > >> > > > > discussing
> > > >> > > > > > >> here.
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > Cheers,
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > --
> > > >> > > > > > >> > > > > > Gianmarco
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > > On 7 July 2015 at 02:30, Jay Kreps <
> > > >> > jay.kr...@gmail.com
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > > >> > > > > >
> > > >> > > > > > >> > > > > >> Hey Martin,
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> I agree coupling release schedules is a
> > > downside.
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> Definitely we can try to solve some of the
> > > >> > integration
> > > >> > > > > > problems
> > > >> > > > > > >> in
> > > >> > > > > > >> > > > > >> Confluent Platform or in other
> distributions.
> > > But
> > > >> I
> > > >> > > think
> > > >> > > > > > this
> > > >> > > > > > >> > ends
> > > >> > > > > > >> > > up
> > > >> > > > > > >> > > > > >> being really shallow. I guess I feel to
> really
> > > >> get a
> > > >> > > good
> > > >> > > > > > user
> > > >> > > > > > >> > > > > experience
> > > >> > > > > > >> > > > > >> the two systems have to kind of feel like
> part
> > > of
> > > >> the
> > > >> > > > same
> > > >> > > > > > thing
> > > >> > > > > > >> > and
> > > >> > > > > > >> > > > you
> > > >> > > > > > >> > > > > >> can't really add that in later--you can put
> > both
> > > >> in
> > > >> > the
> > > >> > > > > same
> > > >> > > > > > >> > > > > downloadable
> > > >> > > > > > >> > > > > >> tar file but it doesn't really give a very
> > > >> cohesive
> > > >> > > > > feeling.
> > > >> > > > > > I
> > > >> > > > > > >> > agree
> > > >> > > > > > >> > > > > that
> > > >> > > > > > >> > > > > >> ultimately any of the project stuff is as
> much
> > > >> social
> > > >> > > and
> > > >> > > > > > naming
> > > >> > > > > > >> > as
> > > >> > > > > > >> > > > > >> anything else--theoretically two totally
> > > >> independent
> > > >> > > > > projects
> > > >> > > > > > >> > could
> > > >> > > > > > >> > > > work
> > > >> > > > > > >> > > > > >> to
> > > >> > > > > > >> > > > > >> tightly align. In practice this seems to be
> > > quite
> > > >> > > > difficult
> > > >> > > > > > >> > though.
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> For the frameworks--totally agree it would
> be
> > > >> good to
> > > >> > > > > > maintain
> > > >> > > > > > >> the
> > > >> > > > > > >> > > > > >> framework support with the project. In some
> > > cases
> > > >> > there
> > > >> > > > may
> > > >> > > > > > not
> > > >> > > > > > >> be
> > > >> > > > > > >> > > too
> > > >> > > > > > >> > > > > >> much
> > > >> > > > > > >> > > > > >> there since the integration gets lighter
> but I
> > > >> think
> > > >> > > > > whatever
> > > >> > > > > > >> > stubs
> > > >> > > > > > >> > > > you
> > > >> > > > > > >> > > > > >> need should be included. So no I definitely
> > > wasn't
> > > >> > > trying
> > > >> > > > > to
> > > >> > > > > > >> imply
> > > >> > > > > > >> > > > > >> dropping
> > > >> > > > > > >> > > > > >> support for these frameworks, just making
> the
> > > >> > > integration
> > > >> > > > > > >> lighter
> > > >> > > > > > >> > by
> > > >> > > > > > >> > > > > >> separating process management from partition
> > > >> > > management.
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> You raise two good points we would have to
> > > figure
> > > >> out
> > > >> > > if
> > > >> > > > we
> > > >> > > > > > went
> > > >> > > > > > >> > > down
> > > >> > > > > > >> > > > > the
> > > >> > > > > > >> > > > > >> alignment path:
> > > >> > > > > > >> > > > > >> 1. With respect to the name, yeah I think
> the
> > > >> first
> > > >> > > > > question
> > > >> > > > > > is
> > > >> > > > > > >> > > > whether
> > > >> > > > > > >> > > > > >> some "re-branding" would be worth it. If so
> > > then I
> > > >> > > think
> > > >> > > > we
> > > >> > > > > > can
> > > >> > > > > > >> > > have a
> > > >> > > > > > >> > > > > big
> > > >> > > > > > >> > > > > >> thread on the name. I'm definitely not set
> on
> > > >> Kafka
> > > >> > > > > > Streaming or
> > > >> > > > > > >> > > Kafka
> > > >> > > > > > >> > > > > >> Streams I was just using them to be kind of
> > > >> > > > illustrative. I
> > > >> > > > > > >> agree
> > > >> > > > > > >> > > with
> > > >> > > > > > >> > > > > >> your
> > > >> > > > > > >> > > > > >> critique of these names, though I think
> people
> > > >> would
> > > >> > > get
> > > >> > > > > the
> > > >> > > > > > >> idea.
> > > >> > > > > > >> > > > > >> 2. Yeah you also raise a good point about
> how
> > to
> > > >> > > "factor"
> > > >> > > > > it.
> > > >> > > > > > >> Here
> > > >> > > > > > >> > > are
> > > >> > > > > > >> > > > > the
> > > >> > > > > > >> > > > > >> options I see (I could get enthusiastic
> about
> > > any
> > > >> of
> > > >> > > > them):
> > > >> > > > > > >> > > > > >>    a. One repo for both Kafka and Samza
> > > >> > > > > > >> > > > > >>    b. Two repos, retaining the current
> > > separation
> > > >> > > > > > >> > > > > >>    c. Two repos, the equivalent of samza-api
> > and
> > > >> > > > samza-core
> > > >> > > > > > is
> > > >> > > > > > >> > > > absorbed
> > > >> > > > > > >> > > > > >> almost like a third client
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> Cheers,
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> -Jay
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> On Mon, Jul 6, 2015 at 1:18 PM, Martin
> > > Kleppmann <
> > > >> > > > > > >> > > > mar...@kleppmann.com>
> > > >> > > > > > >> > > > > >> wrote:
> > > >> > > > > > >> > > > > >>
> > > >> > > > > > >> > > > > >> > Ok, thanks for the clarifications. Just a
> > few
> > > >> > > follow-up
> > > >> > > > > > >> > comments.
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > - I see the appeal of merging with Kafka
> or
> > > >> > becoming
> > > >> > > a
> > > >> > > > > > >> > subproject:
> > > >> > > > > > >> > > > the
> > > >> > > > > > >> > > > > >> > reasons you mention are good. The risk I
> see
> > > is
> > > >> > that
> > > >> > > > > > release
> > > >> > > > > > >> > > > schedules
> > > >> > > > > > >> > > > > >> > become coupled to each other, which can
> slow
> > > >> > everyone
> > > >> > > > > down,
> > > >> > > > > > >> and
> > > >> > > > > > >> > > > large
> > > >> > > > > > >> > > > > >> > projects with many contributors are harder
> > to
> > > >> > manage.
> > > >> > > > > > (Jakob,
> > > >> > > > > > >> > can
> > > >> > > > > > >> > > > you
> > > >> > > > > > >> > > > > >> speak
> > > >> > > > > > >> > > > > >> > from experience, having seen a wider range
> > of
> > > >> > Hadoop
> > > >> > > > > > ecosystem
> > > >> > > > > > >> > > > > >> projects?)
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > Some of the goals of a better unified
> > > developer
> > > >> > > > > experience
> > > >> > > > > > >> could
> > > >> > > > > > >> > > > also
> > > >> > > > > > >> > > > > be
> > > >> > > > > > >> > > > > >> > solved by integrating Samza nicely into a
> > > Kafka
> > > >> > > > > > distribution
> > > >> > > > > > >> > (such
> > > >> > > > > > >> > > > as
> > > >> > > > > > >> > > > > >> > Confluent's). I'm not against merging
> > projects
> > > >> if
> > > >> > we
> > > >> > > > > decide
> > > >> > > > > > >> > that's
> > > >> > > > > > >> > > > the
> > > >> > > > > > >> > > > > >> way
> > > >> > > > > > >> > > > > >> > to go, just pointing out the same goals
> can
> > > >> perhaps
> > > >> > > > also
> > > >> > > > > be
> > > >> > > > > > >> > > achieved
> > > >> > > > > > >> > > > > in
> > > >> > > > > > >> > > > > >> > other ways.
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > - With regard to dropping the YARN
> > dependency:
> > > >> are
> > > >> > > you
> > > >> > > > > > >> proposing
> > > >> > > > > > >> > > > that
> > > >> > > > > > >> > > > > >> > Samza doesn't give any help to people
> > wanting
> > > to
> > > >> > run
> > > >> > > on
> > > >> > > > > > >> > > > > >> YARN/Mesos/AWS/etc?
> > > >> > > > > > >> > > > > >> > So the docs would basically have a link to
> > > >> Slider
> > > >> > and
> > > >> > > > > > nothing
> > > >> > > > > > >> > > else?
> > > >> > > > > > >> > > > Or
> > > >> > > > > > >> > > > > >> > would we maintain integrations with a
> bunch
> > of
> > > >> > > popular
> > > >> > > > > > >> > deployment
> > > >> > > > > > >> > > > > >> methods
> > > >> > > > > > >> > > > > >> > (e.g. the necessary glue and shell scripts
> > to
> > > >> make
> > > >> > > > Samza
> > > >> > > > > > work
> > > >> > > > > > >> > with
> > > >> > > > > > >> > > > > >> Slider)?
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > I absolutely think it's a good idea to
> have
> > > the
> > > >> > "as a
> > > >> > > > > > library"
> > > >> > > > > > >> > and
> > > >> > > > > > >> > > > > "as a
> > > >> > > > > > >> > > > > >> > process" (using Yi's taxonomy) options for
> > > >> people
> > > >> > who
> > > >> > > > > want
> > > >> > > > > > >> them,
> > > >> > > > > > >> > > > but I
> > > >> > > > > > >> > > > > >> > think there should also be a low-friction
> > path
> > > >> for
> > > >> > > > common
> > > >> > > > > > "as
> > > >> > > > > > >> a
> > > >> > > > > > >> > > > > service"
> > > >> > > > > > >> > > > > >> > deployment methods, for which we probably
> > need
> > > >> to
> > > >> > > > > maintain
> > > >> > > > > > >> > > > > integrations.
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > - Project naming: "Kafka Streams" seems
> odd
> > to
> > > >> me,
> > > >> > > > > because
> > > >> > > > > > >> Kafka
> > > >> > > > > > >> > > is
> > > >> > > > > > >> > > > > all
> > > >> > > > > > >> > > > > >> > about streams already. Perhaps "Kafka
> > > >> Transformers"
> > > >> > > or
> > > >> > > > > > "Kafka
> > > >> > > > > > >> > > > Filters"
> > > >> > > > > > >> > > > > >> > would be more apt?
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > One suggestion: perhaps the core of Samza
> > > >> (stream
> > > >> > > > > > >> transformation
> > > >> > > > > > >> > > > with
> > > >> > > > > > >> > > > > >> > state management -- i.e. the "Samza as a
> > > >> library"
> > > >> > > bit)
> > > >> > > > > > could
> > > >> > > > > > >> > > become
> > > >> > > > > > >> > > > > >> part of
> > > >> > > > > > >> > > > > >> > Kafka, while higher-level tools such as
> > > >> streaming
> > > >> > SQL
> > > >> > > > and
> > > >> > > > > > >> > > > integrations
> > > >> > > > > > >> > > > > >> with
> > > >> > > > > > >> > > > > >> > deployment frameworks remain in a separate
> > > >> project?
> > > >> > > In
> > > >> > > > > > other
> > > >> > > > > > >> > > words,
> > > >> > > > > > >> > > > > >> Kafka
> > > >> > > > > > >> > > > > >> > would absorb the proven, stable core of
> > Samza,
> > > >> > which
> > > >> > > > > would
> > > >> > > > > > >> > become
> > > >> > > > > > >> > > > the
> > > >> > > > > > >> > > > > >> > "third Kafka client" mentioned early in
> this
> > > >> > thread.
> > > >> > > > The
> > > >> > > > > > Samza
> > > >> > > > > > >> > > > project
> > > >> > > > > > >> > > > > >> > would then target that third Kafka client
> as
> > > its
> > > >> > base
> > > >> > > > > API,
> > > >> > > > > > and
> > > >> > > > > > >> > the
> > > >> > > > > > >> > > > > >> project
> > > >> > > > > > >> > > > > >> > would be freed up to explore more
> > experimental
> > > >> new
> > > >> > > > > > horizons.
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > Martin
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > On 6 Jul 2015, at 18:51, Jay Kreps <
> > > >> > > > jay.kr...@gmail.com>
> > > >> > > > > > >> wrote:
> > > >> > > > > > >> > > > > >> >
> > > >> > > > > > >> > > > > >> > > Hey Martin,
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > For the YARN/Mesos/etc decoupling I
> > actually
> > > >> > don't
> > > >> > > > > think
> > > >> > > > > > it
> > > >> > > > > > >> > ties
> > > >> > > > > > >> > > > our
> > > >> > > > > > >> > > > > >> > hands
> > > >> > > > > > >> > > > > >> > > at all, all it does is refactor things.
> > The
> > > >> > > division
> > > >> > > > of
> > > >> > > > > > >> > > > > >> responsibility is
> > > >> > > > > > >> > > > > >> > > that Samza core is responsible for task
> > > >> > lifecycle,
> > > >> > > > > state,
> > > >> > > > > > >> and
> > > >> > > > > > >> > > > > >> partition
> > > >> > > > > > >> > > > > >> > > management (using the Kafka
> co-ordinator)
> > > but
> > > >> it
> > > >> > is
> > > >> > > > NOT
> > > >> > > > > > >> > > > responsible
> > > >> > > > > > >> > > > > >> for
> > > >> > > > > > >> > > > > >> > > packaging, configuration deployment or
> > > >> execution
> > > >> > of
> > > >> > > > > > >> processes.
> > > >> > > > > > >> > > The
> > > >> > > > > > >> > > > > >> > problem
> > > >> > > > > > >> > > > > >> > > of packaging and starting these
> processes
> > is
> > > >> > > > > > >> > > > > >> > > framework/environment-specific. This
> > leaves
> > > >> > > > individual
> > > >> > > > > > >> > > frameworks
> > > >> > > > > > >> > > > to
> > > >> > > > > > >> > > > > >> be
> > > >> > > > > > >> > > > > >> > as
> > > >> > > > > > >> > > > > >> > > fancy or vanilla as they like. So you
> can
> > > get
> > > >> > > simple
> > > >> > > > > > >> stateless
> > > >> > > > > > >> > > > > >> support in
> > > >> > > > > > >> > > > > >> > > YARN, Mesos, etc using their
> off-the-shelf
> > > app
> > > >> > > > > framework
> > > >> > > > > > >> > > (Slider,
> > > >> > > > > > >> > > > > >> > Marathon,
> > > >> > > > > > >> > > > > >> > > etc). These are well known by people and
> > > have
> > > >> > nice
> > > >> > > > UIs
> > > >> > > > > > and a
> > > >> > > > > > >> > lot
> > > >> > > > > > >> > > > of
> > > >> > > > > > >> > > > > >> > > flexibility. I don't think they have
> node
> > > >> > affinity
> > > >> > > > as a
> > > >> > > > > > >> built
> > > >> > > > > > >> > in
> > > >> > > > > > >> > > > > >> option
> > > >> > > > > > >> > > > > >> > > (though I could be wrong). So if we want
> > > that
> > > >> we
> > > >> > > can
> > > >> > > > > > either
> > > >> > > > > > >> > wait
> > > >> > > > > > >> > > > for
> > > >> > > > > > >> > > > > >> them
> > > >> > > > > > >> > > > > >> > > to add it or do a custom framework to
> add
> > > that
> > > >> > > > feature
> > > >> > > > > > (as
> > > >> > > > > > >> > now).
> > > >> > > > > > >> > > > > >> > Obviously
> > > >> > > > > > >> > > > > >> > > if you manage things with old-school ops
> > > tools
> > > >> > > > > > >> > (puppet/chef/etc)
> > > >> > > > > > >> > > > you
> > > >> > > > > > >> > > > > >> get
> > > >> > > > > > >> > > > > >> > > locality easily. The nice thing, though,
> > is
> > > >> that
> > > >> > > all
> > > >> > > > > the
> > > >> > > > > > >> samza
> > > >> > > > > > >> > > > > >> "business
> > > >> > > > > > >> > > > > >> > > logic" around partition management and
> > fault
> > > >> > > > tolerance
> > > >> > > > > > is in
> > > >> > > > > > >> > > Samza
> > > >> > > > > > >> > > > > >> core
> > > >> > > > > > >> > > > > >> > so
> > > >> > > > > > >> > > > > >> > > it is shared across frameworks and the
> > > >> framework
> > > >> > > > > specific
> > > >> > > > > > >> bit
> > > >> > > > > > >> > is
> > > >> > > > > > >> > > > > just
> > > >> > > > > > >> > > > > >> > > whether it is smart enough to try to get
> > the
> > > >> same
> > > >> > > > host
> > > >> > > > > > when
> > > >> > > > > > >> a
> > > >> > > > > > >> > > job
> > > >> > > > > > >> > > > is
> > > >> > > > > > >> > > > > >> > > restarted.
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > With respect to the Kafka-alignment,
> yeah
> > I
> > > >> think
> > > >> > > the
> > > >> > > > > > goal
> > > >> > > > > > >> > would
> > > >> > > > > > >> > > > be
> > > >> > > > > > >> > > > > >> (a)
> > > >> > > > > > >> > > > > >> > > actually get better alignment in user
> > > >> experience,
> > > >> > > and
> > > >> > > > > (b)
> > > >> > > > > > >> > > express
> > > >> > > > > > >> > > > > >> this in
> > > >> > > > > > >> > > > > >> > > the naming and project branding.
> > > Specifically:
> > > >> > > > > > >> > > > > >> > > 1. Website/docs, it would be nice for
> the
> > > >> > > > > > "transformation"
> > > >> > > > > > >> api
> > > >> > > > > > >> > > to
> > > >> > > > > > >> > > > be
> > > >> > > > > > >> > > > > >> > > discoverable in the main Kafka
> docs--i.e.
> > be
> > > >> able
> > > >> > > to
> > > >> > > > > > explain
> > > >> > > > > > >> > > when
> > > >> > > > > > >> > > > to
> > > >> > > > > > >> > > > > >> use
> > > >> > > > > > >> > > > > >> > > the consumer and when to use the stream
> > > >> > processing
> > > >> > > > > > >> > functionality
> > > >> > > > > > >> > > > and
> > > >> > > > > > >> > > > > >> lead
> > > >> > > > > > >> > > > > >> > > people into that experience.
> > > >> > > > > > >> > > > > >> > > 2. Align releases so if you get Kafkza
> > 1.4.2
> > > >> (or
> > > >> > > > > > whatever)
> > > >> > > > > > >> > that
> > > >> > > > > > >> > > > has
> > > >> > > > > > >> > > > > >> both
> > > >> > > > > > >> > > > > >> > > Kafka and the stream processing part and
> > > they
> > > >> > > > actually
> > > >> > > > > > work
> > > >> > > > > > >> > > > > together.
> > > >> > > > > > >> > > > > >> > > 3. Unify the programming experience so
> the
> > > >> client
> > > >> > > and
> > > >> > > > > > Samza
> > > >> > > > > > >> > api
> > > >> > > > > > >> > > > > share
> > > >> > > > > > >> > > > > >> > > config/monitoring/naming/packaging/etc.
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > I think sub-projects keep separate
> > > committers
> > > >> and
> > > >> > > can
> > > >> > > > > > have a
> > > >> > > > > > >> > > > > separate
> > > >> > > > > > >> > > > > >> > repo,
> > > >> > > > > > >> > > > > >> > > but I'm actually not really sure (I
> can't
> > > >> find a
> > > >> > > > > > definition
> > > >> > > > > > >> > of a
> > > >> > > > > > >> > > > > >> > subproject
> > > >> > > > > > >> > > > > >> > > in Apache).
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > Basically at a high-level you want the
> > > >> experience
> > > >> > > to
> > > >> > > > > > "feel"
> > > >> > > > > > >> > > like a
> > > >> > > > > > >> > > > > >> single
> > > >> > > > > > >> > > > > >> > > system, not two relatively independent
> > things
> > > >> that
> > > >> > > are
> > > >> > > > > > kind
> > > >> > > > > > >> of
> > > >> > > > > > >> > > > > >> awkwardly
> > > >> > > > > > >> > > > > >> > > glued together.
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > I think if we did that then having
> naming
> > or
> > > >> > > branding
> > > >> > > > > > like
> > > >> > > > > > >> > > "kafka
> > > >> > > > > > >> > > > > >> > > streaming" or "kafka streams" or
> something
> > > >> like
> > > >> > > that
> > > >> > > > > > would
> > > >> > > > > > >> > > > actually
> > > >> > > > > > >> > > > > >> do a
> > > >> > > > > > >> > > > > >> > > good job of conveying what it is. I do think
> > that
> > > >> this
> > > >> > > > would
> > > >> > > > > > help
> > > >> > > > > > >> > > > adoption
> > > >> > > > > > >> > > > > >> > quite
> > > >> > > > > > >> > > > > >> > > a lot as it would correctly convey that
> > > using
> > > >> > Kafka
> > > >> > > > > > >> Streaming
> > > >> > > > > > >> > > with
> > > >> > > > > > >> > > > > >> Kafka
> > > >> > > > > > >> > > > > >> > is
> > > >> > > > > > >> > > > > >> > > a fairly seamless experience and Kafka
> is
> > > >> pretty
> > > >> > > > > heavily
> > > >> > > > > > >> > adopted
> > > >> > > > > > >> > > > at
> > > >> > > > > > >> > > > > >> this
> > > >> > > > > > >> > > > > >> > > point.
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > Fwiw we actually considered this model
> > > >> originally
> > > >> > > > when
> > > >> > > > > > open
> > > >> > > > > > >> > > > sourcing
> > > >> > > > > > >> > > > > >> > Samza,
> > > >> > > > > > >> > > > > >> > > however at that time Kafka was
> relatively
> > > >> unknown
> > > >> > > and
> > > >> > > > > we
> > > >> > > > > > >> > decided
> > > >> > > > > > >> > > > not
> > > >> > > > > > >> > > > > >> to
> > > >> > > > > > >> > > > > >> > do
> > > >> > > > > > >> > > > > >> > > it since we felt it would be limiting.
> > From
> > > my
> > > >> > > point
> > > >> > > > of
> > > >> > > > > > view
> > > >> > > > > > >> > the
> > > >> > > > > > >> > > > > three
> > > >> > > > > > >> > > > > >> > > things have changed (1) Kafka is now
> > really
> > > >> > heavily
> > > >> > > > > used
> > > >> > > > > > for
> > > >> > > > > > >> > > > stream
> > > >> > > > > > >> > > > > >> > > processing, (2) we learned that
> > abstracting
> > > >> out
> > > >> > the
> > > >> > > > > > stream
> > > >> > > > > > >> > well
> > > >> > > > > > >> > > is
> > > >> > > > > > >> > > > > >> > > basically impossible, (3) we learned it
> is
> > > >> really
> > > >> > > > hard
> > > >> > > > > to
> > > >> > > > > > >> keep
> > > >> > > > > > >> > > the
> > > >> > > > > > >> > > > > two
> > > >> > > > > > >> > > > > >> > > things feeling like a single product.
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > -Jay
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > > On Mon, Jul 6, 2015 at 3:37 AM, Martin
> > > >> Kleppmann
> > > >> > <
> > > >> > > > > > >> > > > > >> mar...@kleppmann.com>
> > > >> > > > > > >> > > > > >> > > wrote:
> > > >> > > > > > >> > > > > >> > >
> > > >> > > > > > >> > > > > >> > >> Hi all,
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> Lots of good thoughts here.
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> I agree with the general philosophy of
> > > tying
> > > >> > Samza
> > > >> > > > > more
> > > >> > > > > > >> > firmly
> > > >> > > > > > >> > > to
> > > >> > > > > > >> > > > > >> Kafka.
> > > >> > > > > > >> > > > > >> > >> After I spent a while looking at
> > > integrating
> > > >> > other
> > > >> > > > > > message
> > > >> > > > > > >> > > > brokers
> > > >> > > > > > >> > > > > >> (e.g.
> > > >> > > > > > >> > > > > >> > >> Kinesis) with SystemConsumer, I came to
> > the
> > > >> > > > conclusion
> > > >> > > > > > that
> > > >> > > > > > >> > > > > >> > SystemConsumer
> > > >> > > > > > >> > > > > >> > >> tacitly assumes a model so much like
> > > Kafka's
> > > >> > that
> > > >> > > > > pretty
> > > >> > > > > > >> much
> > > >> > > > > > >> > > > > nobody
> > > >> > > > > > >> > > > > >> but
> > > >> > > > > > >> > > > > >> > >> Kafka actually implements it. (Databus
> is
> > > >> > perhaps
> > > >> > > an
> > > >> > > > > > >> > exception,
> > > >> > > > > > >> > > > but
> > > >> > > > > > >> > > > > >> it
> > > >> > > > > > >> > > > > >> > >> isn't widely used outside of LinkedIn.)
> > > Thus,
> > > >> > > making
> > > >> > > > > > Samza
> > > >> > > > > > >> > > fully
> > > >> > > > > > >> > > > > >> > dependent
> > > >> > > > > > >> > > > > >> > >> on Kafka acknowledges that the
> > > >> > system-independence
> > > >> > > > was
> > > >> > > > > > >> never
> > > >> > > > > > >> > as
> > > >> > > > > > >> > > > > real
> > > >> > > > > > >> > > > > >> as
> > > >> > > > > > >> > > > > >> > we
> > > >> > > > > > >> > > > > >> > >> perhaps made it out to be. The gains of
> > > code
> > > >> > reuse
> > > >> > > > are
> > > >> > > > > > >> real.
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> The idea of decoupling Samza from YARN has also always been appealing to
> > > >> > > > > > >> > > > > >> > >> me, for various reasons already mentioned in this thread. Although making
> > > >> > > > > > >> > > > > >> > >> Samza jobs deployable on anything (YARN/Mesos/AWS/etc) seems laudable, I am
> > > >> > > > > > >> > > > > >> > >> a little concerned that it will restrict us to a lowest common denominator.
> > > >> > > > > > >> > > > > >> > >> For example, would host affinity (SAMZA-617) still be possible? For jobs
> > > >> > > > > > >> > > > > >> > >> with large amounts of state, I think SAMZA-617 would be a big boon, since
> > > >> > > > > > >> > > > > >> > >> restoring state off the changelog on every single restart is painful, due
> > > >> > > > > > >> > > > > >> > >> to long recovery times. It would be a shame if the decoupling from YARN
> > > >> > > > > > >> > > > > >> > >> made host affinity impossible.
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> Jay, a question about the proposed API for instantiating a job in code
> > > >> > > > > > >> > > > > >> > >> (rather than a properties file): when submitting a job to a cluster, is the
> > > >> > > > > > >> > > > > >> > >> idea that the instantiation code runs on a client somewhere, which then
> > > >> > > > > > >> > > > > >> > >> pokes the necessary endpoints on YARN/Mesos/AWS/etc? Or does that code run
> > > >> > > > > > >> > > > > >> > >> on each container that is part of the job (in which case, how does the job
> > > >> > > > > > >> > > > > >> > >> submission to the cluster work)?
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> I agree with Garry that it doesn't feel right to make a 1.0 release with a
> > > >> > > > > > >> > > > > >> > >> plan for it to be immediately obsolete. So if this is going to happen, I
> > > >> > > > > > >> > > > > >> > >> think it would be more honest to stick with 0.* version numbers until the
> > > >> > > > > > >> > > > > >> > >> library-ified Samza has been implemented, is stable and widely used.
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> Should the new Samza be a subproject of Kafka? There is precedent for
> > > >> > > > > > >> > > > > >> > >> tight coupling between different Apache projects (e.g. Curator and
> > > >> > > > > > >> > > > > >> > >> Zookeeper, or Slider and YARN), so I think remaining separate would be ok.
> > > >> > > > > > >> > > > > >> > >> Even if Samza is fully dependent on Kafka, there is enough substance in
> > > >> > > > > > >> > > > > >> > >> Samza that it warrants being a separate project. An argument in favour of
> > > >> > > > > > >> > > > > >> > >> merging would be if we think Kafka has a much stronger "brand presence"
> > > >> > > > > > >> > > > > >> > >> than Samza; I'm ambivalent on that one. If the Kafka project is willing to
> > > >> > > > > > >> > > > > >> > >> endorse Samza as the "official" way of doing stateful stream
> > > >> > > > > > >> > > > > >> > >> transformations, that would probably have much the same effect as
> > > >> > > > > > >> > > > > >> > >> re-branding Samza as "Kafka Stream Processors" or suchlike. Close
> > > >> > > > > > >> > > > > >> > >> collaboration between the two projects will be needed in any case.
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> From a project management perspective, I guess the "new Samza" would have
> > > >> > > > > > >> > > > > >> > >> to be developed on a branch alongside ongoing maintenance of the current
> > > >> > > > > > >> > > > > >> > >> line of development? I think it would be important to continue supporting
> > > >> > > > > > >> > > > > >> > >> existing users, and provide a graceful migration path to the new version.
> > > >> > > > > > >> > > > > >> > >> Leaving the current versions unsupported and forcing people to rewrite
> > > >> > > > > > >> > > > > >> > >> their jobs would send a bad signal.
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> Best,
> > > >> > > > > > >> > > > > >> > >> Martin
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >> On 2 Jul 2015, at 16:59, Jay Kreps <j...@confluent.io> wrote:
> > > >> > > > > > >> > > > > >> > >>
> > > >> > > > > > >> > > > > >> > >>> Hey Garry,
> > > >> > > > > > >> > > > > >> > >>>
> > > >> > > > > > >> > > > > >> > >>> Yeah that's super frustrating. I'd be happy to chat more about this if
> > > >> > > > > > >> > > > > >> > >>> you'd be interested. I think Chris and I started with the idea of "what
> > > >> > > > > > >> > > > > >> > >>> would it take to make Samza a kick-ass ingestion tool" but ultimately we
> > > >> > > > > > >> > > > > >> > >>> kind of came around to the idea that ingestion and transformation had
> > > >> > > > > > >> > > > > >> > >>> pretty different needs and coupling the two made things hard.
> > > >> > > > > > >> > > > > >> > >>>
> > > >> > > > > > >> > > > > >> > >>> For what it's worth I think copycat (KIP-26) actually will do what you are
> > > >> > > > > > >> > > > > >> > >>> looking for.
> > > >> > > > > > >> > > > > >> > >>>
> > > >> > > > > > >> > > > > >> > >>> With regard to your point about slider, I don't necessarily disagree. But I
> > > >> > > > > > >> > > > > >> > >>> think getting good YARN support is quite doable and I think we can make
> > > >> > > > > > >> > > > > >> > >>> that work well. I think the issue this proposal solves is that technically
> > > >> > > > > > >> > > > > >> > >>> it is pretty hard to support multiple cluster management systems the way
> > > >> > > > > > >> > > > > >> > >>> things are now, you need to write an "app master" or "framework" for each
> > > >> > > > > > >> > > > > >> > >>> and they are all a little different so testing is really hard. In the
> > > >> > > > > > >> > > > > >> > >>> absence of this we have been stuck with just YARN which has fantastic
> > > >> > > > > > >> > > > > >> > >>> penetration in the Hadoopy part of the org, but zero penetration elsewhere.
> > > >> > > > > > >> > > > > >> > >>> Given the huge amount of work being put in to slider, marathon, aws
> > > >> > > > > > >> > > > > >> > >>> tooling, not to mention the umpteen related packaging technologies people
> > > >> > > > > > >> > > > > >> > >>> want to use (Docker, Kubernetes, various cloud-specific deploy tools, etc)
> > > >> > > > > > >> > > > > >> > >>> I really think it is important to get this right.
> > > >> > > > > > >> > > > > >> > >>>
> > > >> > > > > > >> > > > > >> > >>> -Jay
> > > >> > > > > > >> > > > > >> > >>>
> > > >> > > > > > >> > > > > >> > >>> On Thu, Jul 2, 2015 at 4:17 AM, Garry Turkington <
> > > >> > > > > > >> > > > > >> > >>> g.turking...@improvedigital.com> wrote:
> > > >> > > > > > >> > > > > >> > >>>
> > > >> > > > > > >> > > > > >> > >>>> Hi all,
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> I think the question below re does Samza become a sub-project of Kafka
> > > >> > > > > > >> > > > > >> > >>>> highlights the broader point around migration. Chris mentions Samza's
> > > >> > > > > > >> > > > > >> > >>>> maturity is heading towards a v1 release but I'm not sure it feels right to
> > > >> > > > > > >> > > > > >> > >>>> launch a v1 then immediately plan to deprecate most of it.
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> From a selfish perspective I have some guys who have started working with
> > > >> > > > > > >> > > > > >> > >>>> Samza and building some new consumers/producers was next up. Sounds like
> > > >> > > > > > >> > > > > >> > >>>> that is absolutely not the direction to go. I need to look into the KIP in
> > > >> > > > > > >> > > > > >> > >>>> more detail but for me the attractiveness of adding new Samza
> > > >> > > > > > >> > > > > >> > >>>> consumer/producers -- even if yes all they were doing was really getting
> > > >> > > > > > >> > > > > >> > >>>> data into and out of Kafka -- was to avoid having to worry about the
> > > >> > > > > > >> > > > > >> > >>>> lifecycle management of external clients. If there is a generic Kafka
> > > >> > > > > > >> > > > > >> > >>>> ingress/egress layer that I can plug a new connector into and have a lot of
> > > >> > > > > > >> > > > > >> > >>>> the heavy lifting re scale and reliability done for me then it gives me all
> > > >> > > > > > >> > > > > >> > >>>> the pushing new consumers/producers would. If not then it complicates my
> > > >> > > > > > >> > > > > >> > >>>> operational deployments.
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> Which is similar to my other question with the proposal -- if we build a
> > > >> > > > > > >> > > > > >> > >>>> fully available/stand-alone Samza plus the requisite shims to integrate
> > > >> > > > > > >> > > > > >> > >>>> with Slider etc I suspect the former may be a lot more work than we think.
> > > >> > > > > > >> > > > > >> > >>>> We may make it much easier for a newcomer to get something running but
> > > >> > > > > > >> > > > > >> > >>>> having them step up and get a reliable production deployment may still
> > > >> > > > > > >> > > > > >> > >>>> dominate mailing list traffic, if for different reasons than today.
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> Don't get me wrong -- I'm comfortable with making the Samza dependency on
> > > >> > > > > > >> > > > > >> > >>>> Kafka much more explicit and I absolutely see the benefits in the
> > > >> > > > > > >> > > > > >> > >>>> reduction of duplication and clashing terminologies/abstractions that
> > > >> > > > > > >> > > > > >> > >>>> Chris/Jay describe. Samza as a library would likely be a very nice tool to
> > > >> > > > > > >> > > > > >> > >>>> add to the Kafka ecosystem. I just have the concerns above re the
> > > >> > > > > > >> > > > > >> > >>>> operational side.
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> Garry
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> -----Original Message-----
> > > >> > > > > > >> > > > > >> > >>>> From: Gianmarco De Francisci Morales [mailto:g...@apache.org]
> > > >> > > > > > >> > > > > >> > >>>> Sent: 02 July 2015 12:56
> > > >> > > > > > >> > > > > >> > >>>> To: dev@samza.apache.org
> > > >> > > > > > >> > > > > >> > >>>> Subject: Re: Thoughts and obesrvations on Samza
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> Very interesting thoughts.
> > > >> > > > > > >> > > > > >> > >>>> From outside, I have always perceived Samza as a computing layer over
> > > >> > > > > > >> > > > > >> > >>>> Kafka.
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> The question, maybe a bit provocative, is "should Samza be a sub-project
> > > >> > > > > > >> > > > > >> > >>>> of Kafka then?"
> > > >> > > > > > >> > > > > >> > >>>> Or does it make sense to keep it as a separate project with a separate
> > > >> > > > > > >> > > > > >> > >>>> governance?
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> Cheers,
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> --
> > > >> > > > > > >> > > > > >> > >>>> Gianmarco
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>> On 2 July 2015 at 08:59, Yan Fang <yanfang...@gmail.com> wrote:
> > > >> > > > > > >> > > > > >> > >>>>
> > > >> > > > > > >> > > > > >> > >>>>> Overall, I agree to couple with Kafka more tightly. Because Samza de
> > > >> > > > > > >> > > > > >> > >>>>> facto is based on Kafka, and it should leverage what Kafka has. At the
> > > >> > > > > > >> > > > > >> > >>>>> same time, Kafka does not need to reinvent what Samza already has. I
> > > >> > > > > > >> > > > > >> > >>>>> also like the idea of separating the ingestion and transformation.
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> But it is a little difficult for me to imagine what Samza will look
> > > >> > > > > > >> > > > > >> > >>>>> like. And I feel Chris and Jay have a little difference in terms of
> > > >> > > > > > >> > > > > >> > >>>>> what Samza should look like.
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> *** Will it look like what Jay's code shows (a client of Kafka)? And
> > > >> > > > > > >> > > > > >> > >>>>> user's application code calls this client?
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> 1. If we make Samza be a library of Kafka (like what the code shows),
> > > >> > > > > > >> > > > > >> > >>>>> how do we implement auto-balance and fault-tolerance? Are they taken
> > > >> > > > > > >> > > > > >> > >>>>> care of by the Kafka broker or other mechanism, such as "Samza worker"
> > > >> > > > > > >> > > > > >> > >>>>> (just making up the name)?
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> 2. What about other features, such as auto-scaling, shared state,
> > > >> > > > > > >> > > > > >> > >>>>> monitoring?
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> *** If we have Samza standalone, (is this what Chris suggests?)
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> 1. we still need to ingest data from Kafka and produce to it. Then it
> > > >> > > > > > >> > > > > >> > >>>>> becomes the same as what Samza looks like now, except it does not rely
> > > >> > > > > > >> > > > > >> > >>>>> on Yarn anymore.
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> 2. if it is standalone, how can it leverage Kafka's metrics, logs,
> > > >> > > > > > >> > > > > >> > >>>>> etc? Use Kafka code as the dependency?
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> Thanks,
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> Fang, Yan
> > > >> > > > > > >> > > > > >> > >>>>> yanfang...@gmail.com
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>> On Wed, Jul 1, 2015 at 5:46 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > >> > > > > > >> > > > > >> > >>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> Read through the code example and it looks good to me. A few
> > > >> > > > > > >> > > > > >> > >>>>>> thoughts regarding deployment:
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> Today Samza deploys as executable runnable like:
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> deploy/samza/bin/run-job.sh --config-factory=... --config-path=file://...
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> And this proposal advocates for deploying Samza more as embedded
> > > >> > > > > > >> > > > > >> > >>>>>> libraries in user application code (ignoring the terminology since
> > > >> > > > > > >> > > > > >> > >>>>>> it is not the same as the prototype code):
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> StreamTask task = new MyStreamTask(configs);
> > > >> > > > > > >> > > > > >> > >>>>>> Thread thread = new Thread(task);
> > > >> > > > > > >> > > > > >> > >>>>>> thread.start();
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> I think both of these deployment modes are important for different
> > > >> > > > > > >> > > > > >> > >>>>>> types of users. That said, I think making Samza purely standalone is
> > > >> > > > > > >> > > > > >> > >>>>>> still sufficient for either runnable or library modes.
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> Guozhang
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>> On Tue, Jun 30, 2015 at 11:33 PM, Jay Kreps <j...@confluent.io> wrote:
> > > >> > > > > > >> > > > > >> > >>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>> Looks like gmail mangled the code example, it was supposed to look
> > > >> > > > > > >> > > > > >> > >>>>>>> like this:
> > > >> > > > > > >> > > > > >> > >>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>> Properties props = new Properties();
> > > >> > > > > > >> > > > > >> > >>>>>>> props.put("bootstrap.servers", "localhost:4242");
> > > >> > > > > > >> > > > > >> > >>>>>>> StreamingConfig config = new StreamingConfig(props);
> > > >> > > > > > >> > > > > >> > >>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> > > >> > > > > > >> > > > > >> > >>>>>>> config.processor(ExampleStreamProcessor.class);
> > > >> > > > > > >> > > > > >> > >>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> > > >> > > > > > >> > > > > >> > >>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> > > >> > > > > > >> > > > > >> > >>>>>>> container.run();
> > > >> > > > > > >> > > > > >> > >>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>> -Jay
> > > >> > > > > > >> > > > > >> > >>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>> On Tue, Jun 30, 2015 at 11:32 PM, Jay Kreps <j...@confluent.io> wrote:
> > > >> > > > > > >> > > > > >> > >>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> Hey guys,
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> This came out of some conversations Chris and I were having around
> > > >> > > > > > >> > > > > >> > >>>>>>>> whether it would make sense to use Samza as a kind of data ingestion
> > > >> > > > > > >> > > > > >> > >>>>>>>> framework for Kafka (which ultimately led to KIP-26 "copycat"). This
> > > >> > > > > > >> > > > > >> > >>>>>>>> kind of combined with complaints around config and YARN and the
> > > >> > > > > > >> > > > > >> > >>>>>>>> discussion around how to best do a standalone mode.
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> So the thought experiment was, given that Samza was basically
> > > >> > > > > > >> > > > > >> > >>>>>>>> already totally Kafka specific, what if you just embraced that and
> > > >> > > > > > >> > > > > >> > >>>>>>>> turned it into something less like a heavyweight framework and more
> > > >> > > > > > >> > > > > >> > >>>>>>>> like a third Kafka client--a kind of "producing consumer" with state
> > > >> > > > > > >> > > > > >> > >>>>>>>> management facilities. Basically a library. Instead of a complex
> > > >> > > > > > >> > > > > >> > >>>>>>>> stream processing framework this would actually be a very simple
> > > >> > > > > > >> > > > > >> > >>>>>>>> thing, not much more complicated to use or operate than a Kafka
> > > >> > > > > > >> > > > > >> > >>>>>>>> consumer. As Chris said we thought about it a lot of what Samza (and
> > > >> > > > > > >> > > > > >> > >>>>>>>> the other stream processing systems were doing) seemed like kind of a
> > > >> > > > > > >> > > > > >> > >>>>>>>> hangover from MapReduce.
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> Of course you need to ingest/output data to and from the stream
> > > >> > > > > > >> > > > > >> > >>>>>>>> processing. But when we actually looked into how that would work,
> > > >> > > > > > >> > > > > >> > >>>>>>>> Samza isn't really an ideal data ingestion framework for a bunch of
> > > >> > > > > > >> > > > > >> > >>>>>>>> reasons. To really do that right you need a pretty different internal
> > > >> > > > > > >> > > > > >> > >>>>>>>> data model and set of apis. So what if you split them and had an api
> > > >> > > > > > >> > > > > >> > >>>>>>>> for Kafka ingress/egress (copycat AKA KIP-26) and a separate api for
> > > >> > > > > > >> > > > > >> > >>>>>>>> Kafka transformation (Samza).
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> This would also allow really
> > > embracing
> > > >> the
> > > >> > > > same
> > > >> > > > > > >> > > terminology
> > > >> > > > > > >> > > > > and
> > > >> > > > > > >> > > > > >> > >>>>>>>> conventions. One complaint about
> > the
> > > >> > current
> > > >> > > > > > state is
> > > >> > > > > > >> > > that
> > > >> > > > > > >> > > > > the
> > > >> > > > > > >> > > > > >> > >>>>>>>> two
> > > >> > > > > > >> > > > > >> > >>>>>>> systems
> > > >> > > > > > >> > > > > >> > >>>>>>>> kind of feel bolted on.
> Terminology
> > > >> like
> > > >> > > > > "stream"
> > > >> > > > > > vs
> > > >> > > > > > >> > > > "topic"
> > > >> > > > > > >> > > > > >> and
> > > >> > > > > > >> > > > > >> > >>>>>>> different
> > > >> > > > > > >> > > > > >> > >>>>>>>> config and monitoring systems
> means
> > > you
> > > >> > kind
> > > >> > > > of
> > > >> > > > > > have
> > > >> > > > > > >> to
> > > >> > > > > > >> > > > learn
> > > >> > > > > > >> > > > > >> > >>>>>>>> Kafka's
> > > >> > > > > > >> > > > > >> > >>>>>>> way,
> > > >> > > > > > >> > > > > >> > >>>>>>>> then learn Samza's slightly
> > different
> > > >> way,
> > > >> > > > then
> > > >> > > > > > kind
> > > >> > > > > > >> of
> > > >> > > > > > >> > > > > >> > >>>>>>>> understand
> > > >> > > > > > >> > > > > >> > >>>>> how
> > > >> > > > > > >> > > > > >> > >>>>>>> they
> > > >> > > > > > >> > > > > >> > >>>>>>>> map to each other, which having
> > > walked
> > > >> a
> > > >> > few
> > > >> > > > > > people
> > > >> > > > > > >> > > through
> > > >> > > > > > >> > > > > >> this
> > > >> > > > > > >> > > > > >> > >>>>>>>> is surprisingly tricky for folks
> to
> > > >> get.
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> Since I have been spending a lot
> of
> > > >> time
> > > >> > on
> > > >> > > > > > >> airplanes I
> > > >> > > > > > >> > > > > hacked
> > > >> > > > > > >> > > > > >> > >>>>>>>> up an earnest but still somewhat
> > > >> incomplete
> > > >> > > > > > prototype
> > > >> > > > > > >> of
> > > >> > > > > > >> > > > what
> > > >> > > > > > >> > > > > >> > >>>>>>>> this would
> > > >> > > > > > >> > > > > >> > >>>>> look
> > > >> > > > > > >> > > > > >> > >>>>>>>> like. This is just
> unceremoniously
> > > >> dumped
> > > >> > > into
> > > >> > > > > > Kafka
> > > >> > > > > > >> as
> > > >> > > > > > >> > > it
> > > >> > > > > > >> > > > > >> > >>>>>>>> required a
> > > >> > > > > > >> > > > > >> > >>>>>> few
> > > >> > > > > > >> > > > > >> > >>>>>>>> changes to the new consumer. Here
> > is
> > > >> the
> > > >> > > code:
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> https://github.com/jkreps/kafka/tree/streams/clients/src/main/java/org/apache/kafka/clients/streaming
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> For the purpose of the prototype
> I
> > > just
> > > >> > > > > liberally
> > > >> > > > > > >> > renamed
> > > >> > > > > > >> > > > > >> > >>>>>>>> everything
> > > >> > > > > > >> > > > > >> > >>>>> to
> > > >> > > > > > >> > > > > >> > >>>>>>>> try to align it with Kafka with
> no
> > > >> regard
> > > >> > > for
> > > >> > > > > > >> > > > compatibility.
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> To use this would be something
> like
> > > >> this:
> > > >> > > > > > >> > > > > >> > >>>>>>>> Properties props = new Properties();
> > > >> > > > > > >> > > > > >> > >>>>>>>> props.put("bootstrap.servers", "localhost:4242");
> > > >> > > > > > >> > > > > >> > >>>>>>>> StreamingConfig config = new StreamingConfig(props);
> > > >> > > > > > >> > > > > >> > >>>>>>>> config.subscribe("test-topic-1", "test-topic-2");
> > > >> > > > > > >> > > > > >> > >>>>>>>> config.processor(ExampleStreamProcessor.class);
> > > >> > > > > > >> > > > > >> > >>>>>>>> config.serialization(new StringSerializer(), new StringDeserializer());
> > > >> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming container = new KafkaStreaming(config);
> > > >> > > > > > >> > > > > >> > >>>>>>>> container.run();
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> KafkaStreaming is basically the
> > > >> > > > SamzaContainer;
> > > >> > > > > > >> > > > > StreamProcessor
> > > >> > > > > > >> > > > > >> > >>>>>>>> is basically StreamTask.
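
To make the StreamProcessor/StreamTask analogy concrete, here is a minimal sketch of what a per-message processor and the loop that drives it could look like. The prototype's real interfaces are not shown in this thread, so everything below is an assumption: SketchProcessor, UppercaseProcessor, and run() are hypothetical names, and the in-memory list only stands in for records that the container would pull from Kafka.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

class ProcessorSketch {
    // Hypothetical stand-in for the prototype's StreamProcessor: one callback
    // per input record, which may emit zero or more output records.
    interface SketchProcessor<K, V> {
        void process(K key, V value, BiConsumer<K, V> output);
    }

    // Example user code: upper-cases every value and passes the key through.
    static final class UppercaseProcessor implements SketchProcessor<String, String> {
        @Override
        public void process(String key, String value, BiConsumer<String, String> output) {
            output.accept(key, value.toUpperCase());
        }
    }

    // Hypothetical stand-in for the container loop: feeds each record to the
    // processor and collects whatever it emits as "key=value" strings.
    static List<String> run(SketchProcessor<String, String> processor,
                            List<Map.Entry<String, String>> records) {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, String> record : records) {
            processor.process(record.getKey(), record.getValue(),
                    (k, v) -> emitted.add(k + "=" + v));
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> records =
                List.of(Map.entry("k1", "hello"), Map.entry("k2", "samza"));
        System.out.println(run(new UppercaseProcessor(), records));
    }
}
```

The point of the sketch is only the division of labor: user code implements one callback, and the container owns the consume/dispatch loop.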
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> So rather than putting all the
> > class
> > > >> names
> > > >> > > in
> > > >> > > > a
> > > >> > > > > > file
> > > >> > > > > > >> > and
> > > >> > > > > > >> > > > then
> > > >> > > > > > >> > > > > >> > >>>>>>>> having
> > > >> > > > > > >> > > > > >> > >>>>>> the
> > > >> > > > > > >> > > > > >> > >>>>>>>> job assembled by reflection, you
> > just
> > > >> > > > > instantiate
> > > >> > > > > > the
> > > >> > > > > > >> > > > > container
> > > >> > > > > > >> > > > > >> > >>>>>>>> programmatically. Work is
> balanced
> > > over
> > > >> > > > however
> > > >> > > > > > many
> > > >> > > > > > >> > > > > instances
> > > >> > > > > > >> > > > > >> > >>>>>>>> of
> > > >> > > > > > >> > > > > >> > >>>>> this
> > > >> > > > > > >> > > > > >> > >>>>>>> are
> > > >> > > > > > >> > > > > >> > >>>>>>>> alive at any time (i.e. if an
> > > instance
> > > >> > dies,
> > > >> > > > new
> > > >> > > > > > >> tasks
> > > >> > > > > > >> > > are
> > > >> > > > > > >> > > > > >> added
> > > >> > > > > > >> > > > > >> > >>>>>>>> to
> > > >> > > > > > >> > > > > >> > >>>>> the
> > > >> > > > > > >> > > > > >> > >>>>>>>> existing containers without
> > shutting
> > > >> them
> > > >> > > > down).
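
As a rough illustration of "work is balanced over however many instances are alive", the sketch below recomputes a partition assignment from the set of live instances. This is not the prototype's algorithm (the mail says partition management is delegated to the new consumer); round-robin is just an assumed strategy, and RebalanceSketch and assign() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class RebalanceSketch {
    // Assigns partitions 0..numPartitions-1 round-robin over the live
    // instance ids, taken in sorted order so the result is deterministic.
    static Map<String, List<Integer>> assign(Collection<String> liveInstances,
                                             int numPartitions) {
        List<String> sorted = new ArrayList<>(new TreeSet<>(liveInstances));
        Map<String, List<Integer>> assignment = new TreeMap<>();
        for (String id : sorted) {
            assignment.put(id, new ArrayList<>());
        }
        if (sorted.isEmpty()) {
            return assignment; // nobody alive, nothing to assign
        }
        for (int p = 0; p < numPartitions; p++) {
            assignment.get(sorted.get(p % sorted.size())).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Three instances alive, six partitions.
        System.out.println(assign(List.of("a", "b", "c"), 6));
        // Instance "b" dies: its partitions are picked up by the survivors.
        System.out.println(assign(List.of("a", "c"), 6));
    }
}
```

Note that plain round-robin may also move partitions between surviving instances; a production strategy would likely try to minimize that movement.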
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> We would provide some glue for
> > > running
> > > >> > this
> > > >> > > > > stuff
> > > >> > > > > > in
> > > >> > > > > > >> > YARN
> > > >> > > > > > >> > > > via
> > > >> > > > > > >> > > > > >> > >>>>>>>> Slider, Mesos via Marathon, and
> AWS
> > > >> using
> > > >> > > some
> > > >> > > > > of
> > > >> > > > > > >> their
> > > >> > > > > > >> > > > tools
> > > >> > > > > > >> > > > > >> > >>>>>>>> but from the
> > > >> > > > > > >> > > > > >> > >>>>>> point
> > > >> > > > > > >> > > > > >> > >>>>>>> of
> > > >> > > > > > >> > > > > >> > >>>>>>>> view of these frameworks these
> > stream
> > > >> > > > processing
> > > >> > > > > > jobs
> > > >> > > > > > >> > are
> > > >> > > > > > >> > > > > just
> > > >> > > > > > >> > > > > >> > >>>>>> stateless
> > > >> > > > > > >> > > > > >> > >>>>>>>> services that can come and go and
> > > >> expand
> > > >> > and
> > > >> > > > > > contract
> > > >> > > > > > >> > at
> > > >> > > > > > >> > > > > will.
> > > >> > > > > > >> > > > > >> > >>>>>>>> There
> > > >> > > > > > >> > > > > >> > >>>>> is
> > > >> > > > > > >> > > > > >> > >>>>>>> no
> > > >> > > > > > >> > > > > >> > >>>>>>>> more custom scheduler.
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> Here are some relevant details:
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>  1. It is only ~1300 lines of code, it would get larger if we
> > > >> > > > > > >> > > > > >> > >>>>>>>>  productionized but not vastly larger. We really do get a ton of
> > > >> > > > > > >> > > > > >> > >>>>>>>>  leverage out of Kafka.
> > > >> > > > > > >> > > > > >> > >>>>>>>>  2. Partition management is fully delegated to the new consumer. This
> > > >> > > > > > >> > > > > >> > >>>>>>>>  is nice since now any partition management strategy available to
> > > >> > > > > > >> > > > > >> > >>>>>>>>  Kafka consumer is also available to Samza (and vice versa) and with
> > > >> > > > > > >> > > > > >> > >>>>>>>>  the exact same configs.
> > > >> > > > > > >> > > > > >> > >>>>>>>>  3. It supports state as well as state reuse
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> Anyhow take a look, hopefully it
> is
> > > >> > thought
> > > >> > > > > > >> provoking.
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> -Jay
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>> On Tue, Jun 30, 2015 at 6:55 PM,
> > > Chris
> > > >> > > > > Riccomini <
> > > >> > > > > > >> > > > > >> > >>>>>> criccom...@apache.org>
> > > >> > > > > > >> > > > > >> > >>>>>>>> wrote:
> > > >> > > > > > >> > > > > >> > >>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Hey all,
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> I have had some discussions with
> > > Samza
> > > >> > > > > engineers
> > > >> > > > > > at
> > > >> > > > > > >> > > > LinkedIn
> > > >> > > > > > >> > > > > >> > >>>>>>>>> and
> > > >> > > > > > >> > > > > >> > >>>>>>> Confluent
> > > >> > > > > > >> > > > > >> > >>>>>>>>> and we came up with a few
> > > observations
> > > >> > and
> > > >> > > > > would
> > > >> > > > > > >> like
> > > >> > > > > > >> > to
> > > >> > > > > > >> > > > > >> > >>>>>>>>> propose
> > > >> > > > > > >> > > > > >> > >>>>> some
> > > >> > > > > > >> > > > > >> > >>>>>>>>> changes.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> We've observed some things that
> I
> > > >> want to
> > > >> > > > call
> > > >> > > > > > out
> > > >> > > > > > >> > about
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Samza's
> > > >> > > > > > >> > > > > >> > >>>>>> design,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> and I'd like to propose some
> > > changes.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is dependent upon a dynamic deployment system.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza is too pluggable.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Samza's SystemConsumer/SystemProducer and Kafka's consumer APIs are
> > > >> > > > > > >> > > > > >> > >>>>>>>>> trying to solve a lot of the same problems.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> All three of these issues are
> > > related,
> > > >> > but
> > > >> > > > I'll
> > > >> > > > > > >> > address
> > > >> > > > > > >> > > > them
> > > >> > > > > > >> > > > > >> in
> > > >> > > > > > >> > > > > >> > >>>>> order.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Deployment
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Samza strongly depends on the
> use
> > > of a
> > > >> > > > dynamic
> > > >> > > > > > >> > > deployment
> > > >> > > > > > >> > > > > >> > >>>>>>>>> scheduler
> > > >> > > > > > >> > > > > >> > >>>>>> such
> > > >> > > > > > >> > > > > >> > >>>>>>>>> as
> > > >> > > > > > >> > > > > >> > >>>>>>>>> YARN, Mesos, etc. When we
> > initially
> > > >> built
> > > >> > > > > Samza,
> > > >> > > > > > we
> > > >> > > > > > >> > bet
> > > >> > > > > > >> > > > that
> > > >> > > > > > >> > > > > >> > >>>>>>>>> there
> > > >> > > > > > >> > > > > >> > >>>>>> would
> > > >> > > > > > >> > > > > >> > >>>>>>>>> be
> > > >> > > > > > >> > > > > >> > >>>>>>>>> one or two winners in this area,
> > and
> > > >> we
> > > >> > > could
> > > >> > > > > > >> support
> > > >> > > > > > >> > > > them,
> > > >> > > > > > >> > > > > >> and
> > > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > > >> > > > > >> > >>>>>> rest
> > > >> > > > > > >> > > > > >> > >>>>>>>>> would go away. In reality, there
> > are
> > > >> many
> > > >> > > > > > >> variations.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Furthermore,
> > > >> > > > > > >> > > > > >> > >>>>>> many
> > > >> > > > > > >> > > > > >> > >>>>>>>>> people still prefer to just
> start
> > > >> their
> > > >> > > > > > processors
> > > >> > > > > > >> > like
> > > >> > > > > > >> > > > > normal
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Java processes, and use
> > traditional
> > > >> > > > deployment
> > > >> > > > > > >> scripts
> > > >> > > > > > >> > > > such
> > > >> > > > > > >> > > > > as
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Fabric,
> > > >> > > > > > >> > > > > >> > >>>>>> Chef,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Ansible, etc. Forcing a
> deployment
> > > >> system
> > > >> > > on
> > > >> > > > > > users
> > > >> > > > > > >> > makes
> > > >> > > > > > >> > > > the
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Samza start-up process really
> > > painful
> > > >> for
> > > >> > > > first
> > > >> > > > > > time
> > > >> > > > > > >> > > > users.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Dynamic deployment as a
> > requirement
> > > >> was
> > > >> > > also
> > > >> > > > a
> > > >> > > > > > bit
> > > >> > > > > > >> of
> > > >> > > > > > >> > a
> > > >> > > > > > >> > > > > >> > >>>>>>>>> mis-fire
> > > >> > > > > > >> > > > > >> > >>>>>> because
> > > >> > > > > > >> > > > > >> > >>>>>>>>> of
> > > >> > > > > > >> > > > > >> > >>>>>>>>> a fundamental misunderstanding
> > > between
> > > >> > the
> > > >> > > > > > nature of
> > > >> > > > > > >> > > batch
> > > >> > > > > > >> > > > > >> jobs
> > > >> > > > > > >> > > > > >> > >>>>>>>>> and
> > > >> > > > > > >> > > > > >> > >>>>>>> stream
> > > >> > > > > > >> > > > > >> > >>>>>>>>> processing jobs. Early on, we
> made a
> > > >> > > conscious
> > > >> > > > > > effort
> > > >> > > > > > >> to
> > > >> > > > > > >> > > > favor
> > > >> > > > > > >> > > > > >> > >>>>>>>>> the
> > > >> > > > > > >> > > > > >> > >>>>>> Hadoop
> > > >> > > > > > >> > > > > >> > >>>>>>>>> (Map/Reduce) way of doing
> things,
> > > >> since
> > > >> > it
> > > >> > > > > worked
> > > >> > > > > > >> and
> > > >> > > > > > >> > > was
> > > >> > > > > > >> > > > > well
> > > >> > > > > > >> > > > > >> > >>>>>>> understood.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> One thing that we missed was
> that
> > > >> batch
> > > >> > > jobs
> > > >> > > > > > have a
> > > >> > > > > > >> > > > definite
> > > >> > > > > > >> > > > > >> > >>>>>> beginning,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> and
> > > >> > > > > > >> > > > > >> > >>>>>>>>> end, and stream processing jobs
> > > don't
> > > >> > > > > (usually).
> > > >> > > > > > >> This
> > > >> > > > > > >> > > > leads
> > > >> > > > > > >> > > > > to
> > > >> > > > > > >> > > > > >> > >>>>>>>>> a
> > > >> > > > > > >> > > > > >> > >>>>> much
> > > >> > > > > > >> > > > > >> > >>>>>>>>> simpler scheduling problem for
> > > stream
> > > >> > > > > processors.
> > > >> > > > > > >> You
> > > >> > > > > > >> > > > > >> basically
> > > >> > > > > > >> > > > > >> > >>>>>>>>> just
> > > >> > > > > > >> > > > > >> > >>>>>>> need
> > > >> > > > > > >> > > > > >> > >>>>>>>>> to find a place to start the
> > > >> processor,
> > > >> > and
> > > >> > > > > start
> > > >> > > > > > >> it.
> > > >> > > > > > >> > > The
> > > >> > > > > > >> > > > > way
> > > >> > > > > > >> > > > > >> > >>>>>>>>> we run grids, at LinkedIn,
> there's
> > > no
> > > >> > > concept
> > > >> > > > > of
> > > >> > > > > > a
> > > >> > > > > > >> > > cluster
> > > >> > > > > > >> > > > > >> > >>>>>>>>> being "full". We always
> > > >> > > > > > >> > > > > >> > >>>>>> add
> > > >> > > > > > >> > > > > >> > >>>>>>>>> more machines. The problem with
> > > >> coupling
> > > >> > > > Samza
> > > >> > > > > > with
> > > >> > > > > > >> a
> > > >> > > > > > >> > > > > >> scheduler
> > > >> > > > > > >> > > > > >> > >>>>>>>>> is
> > > >> > > > > > >> > > > > >> > >>>>>> that
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Samza (as a framework) now has
> to
> > > >> handle
> > > >> > > > > > deployment.
> > > >> > > > > > >> > > This
> > > >> > > > > > >> > > > > >> pulls
> > > >> > > > > > >> > > > > >> > >>>>>>>>> in a
> > > >> > > > > > >> > > > > >> > >>>>>>> bunch
> > > >> > > > > > >> > > > > >> > >>>>>>>>> of things such as configuration
> > > >> > > distribution
> > > >> > > > > > (config
> > > >> > > > > > >> > > > > stream),
> > > >> > > > > > >> > > > > >> > >>>>>>>>> shell
> > > >> > > > > > >> > > > > >> > >>>>>>> scripts
> > > >> > > > > > >> > > > > >> > >>>>>>>>> (bin/run-job.sh, JobRunner),
> > > packaging
> > > >> > (all
> > > >> > > > the
> > > >> > > > > > .tgz
> > > >> > > > > > >> > > > stuff),
> > > >> > > > > > >> > > > > >> etc.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Another reason for requiring
> > dynamic
> > > >> > > > deployment
> > > >> > > > > > was
> > > >> > > > > > >> to
> > > >> > > > > > >> > > > > support
> > > >> > > > > > >> > > > > >> > >>>>>>>>> data locality. If you want to
> have
> > > >> > > locality,
> > > >> > > > > you
> > > >> > > > > > >> need
> > > >> > > > > > >> > to
> > > >> > > > > > >> > > > put
> > > >> > > > > > >> > > > > >> > >>>>>>>>> your
> > > >> > > > > > >> > > > > >> > >>>>>> processors
> > > >> > > > > > >> > > > > >> > >>>>>>>>> close to the data they're
> > > processing.
> > > >> > Upon
> > > >> > > > > > further
> > > >> > > > > > >> > > > > >> > >>>>>>>>> investigation,
> > > >> > > > > > >> > > > > >> > >>>>>>> though,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> this feature is not that
> > beneficial.
> > > >> > There
> > > >> > > is
> > > >> > > > > > some
> > > >> > > > > > >> > good
> > > >> > > > > > >> > > > > >> > >>>>>>>>> discussion
> > > >> > > > > > >> > > > > >> > >>>>>> about
> > > >> > > > > > >> > > > > >> > >>>>>>>>> some problems with it on
> > SAMZA-335.
> > > >> > Again,
> > > >> > > we
> > > >> > > > > > took
> > > >> > > > > > >> the
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Map/Reduce
> > > >> > > > > > >> > > > > >> > >>>>>> path,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> but
> > > >> > > > > > >> > > > > >> > >>>>>>>>> there are some fundamental
> > > differences
> > > >> > > > between
> > > >> > > > > > HDFS
> > > >> > > > > > >> > and
> > > >> > > > > > >> > > > > Kafka.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> HDFS
> > > >> > > > > > >> > > > > >> > >>>>>> has
> > > >> > > > > > >> > > > > >> > >>>>>>>>> blocks, while Kafka has
> > partitions.
> > > >> This
> > > >> > > > leads
> > > >> > > > > to
> > > >> > > > > > >> less
> > > >> > > > > > >> > > > > >> > >>>>>>>>> optimization potential with
> stream
> > > >> > > processors
> > > >> > > > > on
> > > >> > > > > > top
> > > >> > > > > > >> > of
> > > >> > > > > > >> > > > > Kafka.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> This feature is also used as a
> > > crutch.
> > > >> > > Samza
> > > >> > > > > > doesn't
> > > >> > > > > > >> > > have
> > > >> > > > > > >> > > > > any
> > > >> > > > > > >> > > > > >> > >>>>>>>>> built
> > > >> > > > > > >> > > > > >> > >>>>> in
> > > >> > > > > > >> > > > > >> > >>>>>>>>> fault-tolerance logic. Instead,
> it
> > > >> > depends
> > > >> > > on
> > > >> > > > > the
> > > >> > > > > > >> > > dynamic
> > > >> > > > > > >> > > > > >> > >>>>>>>>> deployment scheduling system to
> > > handle
> > > >> > > > restarts
> > > >> > > > > > >> when a
> > > >> > > > > > >> > > > > >> > >>>>>>>>> processor dies. This has
> > > >> > > > > > >> > > > > >> > >>>>>>> made
> > > >> > > > > > >> > > > > >> > >>>>>>>>> it very difficult to write a
> > > >> standalone
> > > >> > > Samza
> > > >> > > > > > >> > container
> > > >> > > > > > >> > > > > >> > >>>> (SAMZA-516).
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Pluggability
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> In some cases pluggability is
> > good,
> > > >> but I
> > > >> > > > think
> > > >> > > > > > that
> > > >> > > > > > >> > > we've
> > > >> > > > > > >> > > > > >> gone
> > > >> > > > > > >> > > > > >> > >>>>>>>>> too
> > > >> > > > > > >> > > > > >> > >>>>>> far
> > > >> > > > > > >> > > > > >> > >>>>>>>>> with it. Currently, Samza has:
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable config.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable metrics.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable deployment systems.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable streaming systems (SystemConsumer, SystemProducer, etc).
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable serdes.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable storage engines.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> * Pluggable strategies for just about every component (MessageChooser,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> SystemStreamPartitionGrouper, ConfigRewriter, etc).
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> There's probably more that I've
> > > >> > forgotten,
> > > >> > > as
> > > >> > > > > > well.
> > > >> > > > > > >> > Some
> > > >> > > > > > >> > > > of
> > > >> > > > > > >> > > > > >> > >>>>>>>>> these
> > > >> > > > > > >> > > > > >> > >>>>> are
> > > >> > > > > > >> > > > > >> > >>>>>>>>> useful, but some have proven not
> > to
> > > >> be.
> > > >> > > This
> > > >> > > > > all
> > > >> > > > > > >> comes
> > > >> > > > > > >> > > at
> > > >> > > > > > >> > > > a
> > > >> > > > > > >> > > > > >> cost:
> > > >> > > > > > >> > > > > >> > >>>>>>>>> complexity. This complexity is
> > > making
> > > >> it
> > > >> > > > harder
> > > >> > > > > > for
> > > >> > > > > > >> > our
> > > >> > > > > > >> > > > > users
> > > >> > > > > > >> > > > > >> > >>>>>>>>> to
> > > >> > > > > > >> > > > > >> > >>>>> pick
> > > >> > > > > > >> > > > > >> > >>>>>> up
> > > >> > > > > > >> > > > > >> > >>>>>>>>> and use Samza out of the box. It
> > > also
> > > >> > makes
> > > >> > > > it
> > > >> > > > > > >> > difficult
> > > >> > > > > > >> > > > for
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Samza developers to reason about the characteristics of
> > > >> > > > > > >> > > > > >> > >>>>>>>>> the container (since the characteristics change depending on
> > > >> > > > > > >> > > > > >> > >>>>>>>>> which plugins are used).
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> The issues with pluggability are
> > > most
> > > >> > > visible
> > > >> > > > > in
> > > >> > > > > > the
> > > >> > > > > > >> > > > System
> > > >> > > > > > >> > > > > >> APIs.
> > > >> > > > > > >> > > > > >> > >>>>> What
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Samza really requires to be
> > > >> functional is
> > > >> > > > Kafka
> > > >> > > > > > as
> > > >> > > > > > >> its
> > > >> > > > > > >> > > > > >> > >>>>>>>>> transport
> > > >> > > > > > >> > > > > >> > >>>>>> layer.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> But
> > > >> > > > > > >> > > > > >> > >>>>>>>>> we've conflated two unrelated
> use
> > > >> cases
> > > >> > > into
> > > >> > > > > one
> > > >> > > > > > >> API:
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> 1. Get data into/out of Kafka.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> 2. Process the data in Kafka.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> The current System API supports
> > both
> > > >> of
> > > >> > > these
> > > >> > > > > use
> > > >> > > > > > >> > cases.
> > > >> > > > > > >> > > > The
> > > >> > > > > > >> > > > > >> > >>>>>>>>> problem
> > > >> > > > > > >> > > > > >> > >>>>>> is,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> we
> > > >> > > > > > >> > > > > >> > >>>>>>>>> actually want different features
> > for
> > > >> each
> > > >> > > use
> > > >> > > > > > case.
> > > >> > > > > > >> By
> > > >> > > > > > >> > > > > >> papering
> > > >> > > > > > >> > > > > >> > >>>>>>>>> over
> > > >> > > > > > >> > > > > >> > >>>>>>> these
> > > >> > > > > > >> > > > > >> > >>>>>>>>> two use cases, and providing a
> > > single
> > > >> > API,
> > > >> > > > > we've
> > > >> > > > > > >> > > > introduced
> > > >> > > > > > >> > > > > a
> > > >> > > > > > >> > > > > >> > >>>>>>>>> ton of
> > > >> > > > > > >> > > > > >> > >>>>>>> leaky
> > > >> > > > > > >> > > > > >> > >>>>>>>>> abstractions.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> For example, what we'd really
> like
> > > in
> > > >> (2)
> > > >> > > is
> > > >> > > > to
> > > >> > > > > > have
> > > >> > > > > > >> > > > > >> > >>>>>>>>> monotonically increasing longs
> for
> > > >> > offsets
> > > >> > > > > (like
> > > >> > > > > > >> > Kafka).
> > > >> > > > > > >> > > > > This
> > > >> > > > > > >> > > > > >> > >>>>>>>>> would be at odds
> > > >> > > > > > >> > > > > >> > >>>>> with
> > > >> > > > > > >> > > > > >> > >>>>>>> (1),
> > > >> > > > > > >> > > > > >> > >>>>>>>>> though, since different systems
> > have
> > > >> > > > different
> > > >> > > > > > >> > > > > >> > >>>>>>> SCNs/Offsets/UUIDs/vectors.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> There was discussion both on the
> > > >> mailing
> > > >> > > list
> > > >> > > > > and
> > > >> > > > > > >> the
> > > >> > > > > > >> > > SQL
> > > >> > > > > > >> > > > > >> JIRAs
> > > >> > > > > > >> > > > > >> > >>>>> about
> > > >> > > > > > >> > > > > >> > >>>>>>> the
> > > >> > > > > > >> > > > > >> > >>>>>>>>> need for this.
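
A small sketch of why long offsets matter for the processing use case, under the assumption that a system-agnostic API can only treat positions as opaque strings. OffsetSketch and its methods are hypothetical names for illustration, not part of Samza or Kafka.

```java
import java.util.OptionalLong;

class OffsetSketch {
    // Kafka-style offsets: monotonically increasing longs, so "where do I
    // resume?" is plain arithmetic on the last processed offset.
    static long nextOffset(long lastProcessed) {
        return lastProcessed + 1;
    }

    // With long offsets, lag is a subtraction. logEndOffset is taken to be
    // the offset the next appended message will receive.
    static long lag(long logEndOffset, long lastProcessed) {
        return logEndOffset - nextOffset(lastProcessed);
    }

    // System-agnostic positions (SCNs, UUIDs, vectors) modeled as opaque
    // strings: without system-specific knowledge there is no ordering or
    // distance, so the generic layer cannot compute a lag at all.
    static OptionalLong lagOpaque(String newestPosition, String lastProcessedPosition) {
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        System.out.println(lag(100L, 41L));
        System.out.println(lagOpaque("uuid-9f2c", "uuid-17ab").isPresent());
    }
}
```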
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> The same thing holds true for
> > > >> > > replayability.
> > > >> > > > > > Kafka
> > > >> > > > > > >> > > allows
> > > >> > > > > > >> > > > us
> > > >> > > > > > >> > > > > >> to
> > > >> > > > > > >> > > > > >> > >>>>> rewind
> > > >> > > > > > >> > > > > >> > >>>>>>>>> when
> > > >> > > > > > >> > > > > >> > >>>>>>>>> we have a failure. Many other
> > > systems
> > > >> > > don't.
> > > >> > > > In
> > > >> > > > > > some
> > > >> > > > > > >> > > > cases,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> systems
> > > >> > > > > > >> > > > > >> > >>>>>>> return
> > > >> > > > > > >> > > > > >> > >>>>>>>>> null for their offsets (e.g.
> > > >> > > > > > >> WikipediaSystemConsumer)
> > > >> > > > > > >> > > > > because
> > > >> > > > > > >> > > > > >> > >>>>>>>>> they
> > > >> > > > > > >> > > > > >> > >>>>>> have
> > > >> > > > > > >> > > > > >> > >>>>>>> no
> > > >> > > > > > >> > > > > >> > >>>>>>>>> offsets.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Partitioning is another example.
> > > Kafka
> > > >> > > > supports
> > > >> > > > > > >> > > > > partitioning,
> > > >> > > > > > >> > > > > >> > >>>>>>>>> but
> > > >> > > > > > >> > > > > >> > >>>>> many
> > > >> > > > > > >> > > > > >> > >>>>>>>>> systems don't. We model this by
> > > >> having a
> > > >> > > > single
> > > >> > > > > > >> > > partition
> > > >> > > > > > >> > > > > for
> > > >> > > > > > >> > > > > >> > >>>>>>>>> those systems. Still, other
> > systems
> > > >> model
> > > >> > > > > > >> partitioning
> > > >> > > > > > >> > > > > >> > >>>> differently (e.g.
> > > >> > > > > > >> > > > > >> > >>>>>>>>> Kinesis).
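
The last two points (no offsets, no partitions) can be sketched together: an adapter that forces an unpartitioned, non-replayable feed into a partition/offset shaped envelope. Envelope and adapt() are hypothetical names and only loosely modeled on the idea described above; the real System API types are not reproduced here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class SinglePartitionSketch {
    // Hypothetical envelope with the fields a Kafka-shaped API expects.
    static final class Envelope {
        final int partition;
        final String offset; // null when the source has no notion of offsets
        final String message;

        Envelope(int partition, String offset, String message) {
            this.partition = partition;
            this.offset = offset;
            this.message = message;
        }
    }

    // Wraps a plain feed: every message lands in the single partition 0 and
    // carries a null offset, so nothing downstream can rewind or checkpoint.
    static List<Envelope> adapt(Iterator<String> feed) {
        List<Envelope> envelopes = new ArrayList<>();
        while (feed.hasNext()) {
            envelopes.add(new Envelope(0, null, feed.next()));
        }
        return envelopes;
    }

    public static void main(String[] args) {
        for (Envelope e : adapt(List.of("edit-1", "edit-2").iterator())) {
            System.out.println(e.partition + " " + e.offset + " " + e.message);
        }
    }
}
```

The envelope type-checks fine, but the partition and offset fields carry no real information, which is the leaky abstraction being described.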
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> > > >> > > > > > >> > > > > >> > >>>>>>>>> The SystemAdmin interface is
> also
> > a
> > > >> mess.
> > > >> > > > > > Creating
> > > >> > > > > > >> > > streams
> > > >> > > > > > >> > > > > in
> > > >> > > > > > >> > > > > >> a
> > > >> > > > > > >> > > > > >> > >>>>>>>>> system-agnostic way is almost
> > > >> impossible.
> > > >> > > As
> > > >> > > > is
> > > >> > > > > > >> > modeling
> > > >> > > > > > >> > > > > >> > >>>>>>>>> metadata
> > > >> > > > > > >> > > > > >> > >>>>> for
> > > >> > > > > > >> > > > > >> > >>>>>>> the
> > > >> > > > > > >> > > > > >> > >>>>>>>>> system (replication factor,
> > > >> partitions,
> > > >> > > > > location,
> > > >> > > > > > >> > etc).
> > > >> > > > > > >> > > > The
> > > >> > > > > > >> > > > > >> > >>>>>>>>> list
> > > >> > > > > > >> > > > > >> > >>>>> goes
> > > >> > > > > > >> > > > > >> > >>>>>>> on.
> > > >> > > > > > >> > > > > >> > >>>>>>>>>
> Duplicate work
>
> At the time that we began writing Samza, Kafka's consumer and producer
> APIs had a relatively weak feature set. On the consumer-side, you had
> two options: use the high level consumer, or the simple consumer. The
> problem with the high-level consumer was that it controlled your
> offsets, partition assignments, and the order in which you received
> messages. The problem with the simple consumer is that it's not simple.
> It's basic. You end up having to handle a lot of really low-level stuff
> that you shouldn't. We spent a lot of time to make Samza's
> KafkaSystemConsumer very robust. It also allows us to support some cool
> features:
>
> * Per-partition message ordering and prioritization.
> * Tight control over partition assignment to support joins, global
>   state (if we want to implement it :)), etc.
> * Tight control over offset checkpointing.
>
> What we didn't realize at the time is that these features should
> actually be in Kafka. A lot of Kafka consumers (not just Samza stream
> processors) end up wanting to do things like joins and partition
> assignment. The Kafka community has come to the same conclusion.
> They're adding a ton of upgrades into their new Kafka consumer
> implementation. To a large extent, it's duplicate work to what we've
> already done in Samza.
>
> On top of this, Kafka ended up taking a very similar approach to
> Samza's KafkaCheckpointManager implementation for handling offset
> checkpointing. Like Samza, Kafka's new offset management feature stores
> offset checkpoints in a topic, and allows you to fetch them from the
> broker.
>
> A lot of this seems like a waste, since we could have shared the work
> if it had been done in Kafka from the get-go.
>
> Vision
>
> All of this leads me to a rather radical proposal. Samza is relatively
> stable at this point. I'd venture to say that we're near a 1.0 release.
> I'd like to propose that we take what we've learned, and begin thinking
> about Samza beyond 1.0. What would we change if we were starting from
> scratch? My proposal is to:
>
> 1. Make Samza standalone the *only* way to run Samza processors, and
>    eliminate all direct dependences on YARN, Mesos, etc.
> 2. Make a definitive call to support only Kafka as the stream
>    processing layer.
> 3. Eliminate Samza's metrics, logging, serialization, and config
>    systems, and simply use Kafka's instead.
>
> This would fix all of the issues that I outlined above. It should also
> shrink the Samza code base pretty dramatically. Supporting only a
> standalone container will allow Samza to be executed on YARN (using
> Slider), Mesos (using Marathon/Aurora), or most other in-house
> deployment systems. This should make life a lot easier for new users.
> Imagine having the hello-samza tutorial without YARN. The drop in
> mailing list traffic will be pretty dramatic.
>
> Coupling with Kafka seems long overdue to me. The reality is, everyone
> that I'm aware of is using Samza with Kafka. We basically require it
> already in order for most features to work. Those that are using other
> systems are generally using it for ingest into Kafka (1), and then they
> do the processing on top. There is already discussion (
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> ) in Kafka to make ingesting into Kafka extremely easy.
>
> Once we make the call to couple with Kafka, we can leverage a ton of
> their ecosystem. We no longer have to maintain our own config, metrics,
> etc. We can all share the same libraries, and make them better. This
> will
> ...
>
> [Message clipped]
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Jordan Shaw
> > Full Stack Software Engineer
> > PubNub Inc
> > 1045 17th St
> > San Francisco, CA 94107
> >
>
