Hi Jason,

You make some good points.

First, I agree that we could have false positives which result in an
OutOfOrderSequence exception for topics with low retention where messages
are removed intentionally. I 100% agree that this should be tightened up.

One solution is to return the log start offset in each produce response and
also introduce 'UnknownProducer' error code when the broker receives a
non-zero sequence number but has no record of the producer. These two
pieces of information would enable to us to solve the false positive
problem: If the producer receives an 'UknownProducer' error and the log
start offset is in front of the last acknowledged offset, then it is a
false positive.

Whether we should explore solutions to this problem before making this the
default is a good question. I will look into solutions and try to estimate
the level of effort to fix it. If we can make this in the 1.0.0 release,
then that would be ideal.

If it can't make the 1.0.0 release (feature freeze is only a month away, so
it will be tight), we should recall that the false positive would only
occur with fairly inactive producers writing to topics with low retention,
so in practice we won't commonly hit this. Additionally, this is already
something people may hit if they enable idempotence. From this point of
view, it may make sense to make the leap and ship Kafka with stronger
default semantics in the 1.0.0 release and fix forward.

It seems that the community so far is in agreement that making idempotence
the default is generally a good idea.

The question now is around timing. Is it too soon? Should we let the
feature bake for one more release cycle, fix known holes, etc., and then
enable it in the 1.1.0 release? At this point I am leaning toward waiting
till 1.1.0 if the false positive problem is not solved in the 1.0.0
timeframe: I would like the feature to be as solid as possible so that
people build trust around the guarantees.

What does the rest of the community think?

Thanks,
Apurva



On Thu, Aug 17, 2017 at 11:40 AM, Jason Gustafson <ja...@confluent.io>
wrote:

> Some additional points for discussion:
>
> 1. The idempotent producer is still a new feature and I doubt it has gotten
> much use yet (in particular since it depends on a message format upgrade).
> Do we feel it has reached a level of stability where we are comfortable
> making it the default?
>
> 2. I'm still a little uncomfortable with our handling of
> OutOfOrderSequence. I'm honestly not sure what a user should do to handle
> this error. If they care about delivery, should they "rewind" to some
> previous point and start again? One of the problems is that this error may
> be a false positive if the last sequence from the producer was removed from
> the log to enforce the retention policy. It would be nice if we could
> tighten this up so that we only send OutOfOrderSequence for actual data
> loss, and in that case, if we can tell the user (say) which offset was last
> successfully written so that they know what was lost. We discussed a few
> ideas offline to address this, but do you think fixing it should be a
> prerequisite for making idempotence the default?
>
> 3. Also, in the current handling, when we receive OutOfOrderSequence, there
> is no way for the user to retry and preserve order because we do not fail
> all of the queued batches for the same partition. My understanding is that
> you are looking to change this behavior in your patch to support more
> in-flight requests. Is that correct?
>
> Overall I'm definitely supportive of making idempotence the default
> eventually, but I think it might be a tad premature now.
>
> Thanks,
> Jason
>
> On Wed, Aug 16, 2017 at 8:58 PM, Apurva Mehta <apu...@confluent.io> wrote:
>
> > Thanks for the followup Becket. It sounds we are on agreement on the
> scope
> > of this KIP, and the discussion has definitely clarified a lot of the
> > subtle points.
> >
> > Apurva
> >
> > On Tue, Aug 15, 2017 at 10:49 PM, Becket Qin <becket....@gmail.com>
> wrote:
> >
> > > Hi Apurva,
> > >
> > > Thanks for the clarification of the definition. The definitions are
> clear
> > > and helpful.
> > >
> > > It seems the scope of this KIP is just about the producer side
> > > configuration change, but not attempting to achieve the exactly once
> > > semantic with all default settings out of the box. The broker still
> needs
> > > to be configured appropriately to achieve the exactly once semantic. If
> > so,
> > > the current proposal sounds reasonable to me. Apologies if I
> > misunderstood
> > > the goal of this KIP.
> > >
> > > Regarding the max.in.flight.requests.per.connection, I don't think we
> > have
> > > to support infinite number of in flight requests. But admittedly there
> > are
> > > use cases that people would want to have reasonably high in flight
> > > requests. Given that we need to make code changes to support
> idempotence
> > > and in.flight.request > 1, it would be nice to see if we can cover
> those
> > > use cases instead of doing that later. We can discuss this in a
> separate
> > > thread.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > > On Tue, Aug 15, 2017 at 1:46 PM, Guozhang Wang <wangg...@gmail.com>
> > wrote:
> > >
> > > > Hi Jay,
> > > >
> > > > I chatted with Apurva offline, and we think the key of the discussion
> > is
> > > > that, as summarized in the updated KIP wiki, whether we should
> consider
> > > > replication as a necessary condition of at-least-once, and of course
> > also
> > > > exactly-once. Originally I think replication is not a necessary
> > condition
> > > > for at-least-once, since the scope of failures that we should be
> > covering
> > > > is different in my definition; if we claim that "even for
> > at-least-once,
> > > > you should have replication factor larger than 2, let alone
> > exactly-once"
> > > > then I agree that having acks=all on the client side should also be a
> > > > necessary condition for at-least-once, and for exactly-once as well.
> > Then
> > > > this KIP would be just providing what is necessary but not sufficient
> > > > conditions, from client-side configs to achieve EOS, while you also
> > need
> > > > the broker-side configs together to really support it.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Tue, Aug 15, 2017 at 1:15 PM, Jay Kreps <j...@confluent.io> wrote:
> > > >
> > > > > Hey Guozhang,
> > > > >
> > > > > I think the argument is that with acks=1 the message could be lost
> > and
> > > > > hence you aren't guaranteeing exactly once delivery.
> > > > >
> > > > > -Jay
> > > > >
> > > > > On Mon, Aug 14, 2017 at 1:36 PM, Guozhang Wang <wangg...@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Just want to clarify that regarding 1), I'm fine with changing it
> > to
> > > > > `all`
> > > > > > but just wanted to argue it is not necessarily correlate with the
> > > > > > exactly-once semantics, but rather on persistence v.s.
> availability
> > > > > > trade-offs, so I'd like to discuss them separately.
> > > > > >
> > > > > > Regarding 2), one minor concern I had is that the enforcement is
> on
> > > the
> > > > > > client side while the parts it affects is on the broker side.
> I.e.
> > > the
> > > > > > broker code would assume at most 5 in.flight when idempotent is
> > > turned
> > > > > on,
> > > > > > but this is not enforced at the broker but relying at the client
> > > side's
> > > > > > sanity. So other implementations of the client that may not obey
> > this
> > > > may
> > > > > > likely break the broker code. If we do enforce this we'd better
> > > enforce
> > > > > it
> > > > > > at the broker side. Also, I'm wondering if we have considered the
> > > > > approach
> > > > > > for brokers to read the logs in order to get the starting offset
> > when
> > > > it
> > > > > > does not about it in its snapshot, that whether it is worthwhile
> if
> > > we
> > > > > > assume that such issues are very rare to happen?
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Aug 14, 2017 at 11:01 AM, Apurva Mehta <
> > apu...@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I just want to summarize where we are in this discussion
> > > > > > >
> > > > > > > There are two major points of contention: should we have acks=1
> > or
> > > > > > acsk=all
> > > > > > > by default? and how to cap max.in.flight.requests.per.
> > connection?
> > > > > > >
> > > > > > > 1) acks=1 vs acks=all1
> > > > > > >
> > > > > > > Here are the tradeoffs of each:
> > > > > > >
> > > > > > > If you have replication-factor=N, your data is resilient N-1 to
> > > disk
> > > > > > > failures. For N>1, here is the tradeoff between acks=1 and
> > > acks=all.
> > > > > > >
> > > > > > > With proposed defaults and acks=all, the stock Kafka producer
> and
> > > the
> > > > > > > default broker settings would guarantee that ack'd messages
> would
> > > be
> > > > in
> > > > > > the
> > > > > > > log exactly once.
> > > > > > >
> > > > > > > With the proposed defaults and acks=1, the stock Kafka producer
> > and
> > > > the
> > > > > > > default broker settings would guarantee that 'retained ack'd
> > > messages
> > > > > > would
> > > > > > > be in the log exactly once. But all ack'd messages may not be
> > > > > retained'.
> > > > > > >
> > > > > > > If you leave replication-factor=1, acks=1 and acks=all have
> > > identical
> > > > > > > semantics and performance, but you are resilient to 0 disk
> > > failures.
> > > > > > >
> > > > > > > I think the measured cost (again the performance details are in
> > the
> > > > > wiki)
> > > > > > > of acks=all is well worth the much clearer semantics. What does
> > the
> > > > > rest
> > > > > > of
> > > > > > > the community think?
> > > > > > >
> > > > > > > 2) capping max.in.flight at 5 when idempotence is enabled.
> > > > > > >
> > > > > > > We need to limit the max.in.flight for the broker to
> de-duplicate
> > > > > > messages
> > > > > > > properly. The limitation would only apply when idempotence is
> > > > enabled.
> > > > > > The
> > > > > > > shared numbers show that when the client-broker latency is low,
> > > there
> > > > > is
> > > > > > no
> > > > > > > performance gain for max.inflight > 2.
> > > > > > >
> > > > > > > Further, it is highly debatable that max.in.flight=500 is
> > > > significantly
> > > > > > > better than max.in.flight=5  for a really high latency
> > > client-broker
> > > > > > link,
> > > > > > > and so far there are no hard numbers one way or another.
> However,
> > > > > > assuming
> > > > > > > that max.in.flight=500 is significantly better than
> > max.inflight=5
> > > in
> > > > > > some
> > > > > > > niche use case, the user would have to sacrifice idempotence
> for
> > > > > > > throughput. In this extreme corner case, I think it is an
> > > acceptable
> > > > > > > tradeoff.
> > > > > > >
> > > > > > > What does the community think?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Apurva
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>

Reply via email to