Some additional points for discussion:

1. The idempotent producer is still a new feature and I doubt it has gotten
much use yet (in particular since it depends on a message format upgrade).
Do we feel it has reached a level of stability where we are comfortable
making it the default?

2. I'm still a little uncomfortable with our handling of
OutOfOrderSequence. I'm honestly not sure what a user should do to handle
this error. If they care about delivery, should they "rewind" to some
previous point and start again? One of the problems is that this error may
be a false positive if the last sequence from the producer was removed from
the log to enforce the retention policy. It would be nice if we could
tighten this up so that we only send OutOfOrderSequence for actual data
loss and, in that case, tell the user (say) which offset was last
successfully written so that they know what was lost. We discussed a few
ideas offline to address this, but do you think fixing it should be a
prerequisite for making idempotence the default?

3. Also, in the current handling, when we receive OutOfOrderSequence, there
is no way for the user to retry and preserve order because we do not fail
all of the queued batches for the same partition. My understanding is that
you are looking to change this behavior in your patch to support more
in-flight requests. Is that correct?
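
To make points 2 and 3 concrete, here is a toy Python model of per-producer
sequence checking. It is only a sketch under stated assumptions, not the
broker's actual code: ToyPartition, expire_producer, and drain are
hypothetical names, the broker is assumed to accept only sequence 0 from a
producer it has no state for, and every older sequence is treated as a
duplicate. It shows how retention can turn into a false-positive
OutOfOrderSequence, and what failing all queued batches for the partition
might look like.

```python
from dataclasses import dataclass, field


class OutOfOrderSequence(Exception):
    """The batch's sequence is not the one the toy broker expects."""


@dataclass
class ToyPartition:
    # producer_id -> last sequence appended (hypothetical broker-side state)
    last_seq: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def append(self, producer_id, seq, payload):
        last = self.last_seq.get(producer_id)
        if last is None:
            # Assumption: with no state, only a fresh sequence 0 is accepted.
            if seq != 0:
                raise OutOfOrderSequence(f"no state for producer, got {seq}")
        elif seq <= last:
            return "duplicate"  # already in the log, ack without appending
        elif seq != last + 1:
            raise OutOfOrderSequence(f"expected {last + 1}, got {seq}")
        self.log.append((producer_id, seq, payload))
        self.last_seq[producer_id] = seq
        return "appended"

    def expire_producer(self, producer_id):
        # Models retention removing the producer's last sequence from the log.
        self.last_seq.pop(producer_id, None)


def drain(partition, producer_id, queued):
    """Send queued (seq, payload) batches in order; on OutOfOrderSequence,
    fail the offending batch and everything queued behind it."""
    acked, failed = [], []
    for i, (seq, payload) in enumerate(queued):
        try:
            partition.append(producer_id, seq, payload)
            acked.append(seq)
        except OutOfOrderSequence:
            failed = [s for s, _ in queued[i:]]
            break
    return acked, failed
```

In this model, appending sequences 0 and 1, calling expire_producer, and then
draining sequences 2 and 3 fails both batches even though nothing was lost.
That is the false positive described in point 2. Because drain fails the whole
tail of the queue, the caller gets the failed batches back in order.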

Overall I'm definitely supportive of making idempotence the default
eventually, but I think it might be a tad premature now.
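
For reference, the producer settings being debated can be written down as a
plain dictionary. The keys are the standard producer config names; the
check_producer_config helper and its messages are hypothetical, and the cap of
5 is the value proposed in this thread rather than anything enforced by this
snippet's caller.

```python
MAX_IN_FLIGHT_WITH_IDEMPOTENCE = 5  # cap proposed in this thread

# The defaults under discussion, as a plain dict (no client library needed).
proposed_defaults = {
    "enable.idempotence": True,
    "acks": "all",
    "max.in.flight.requests.per.connection": 5,
}


def check_producer_config(cfg):
    """Return a list of problems with cfg under the proposed rules.

    Hypothetical helper: keys that are absent are simply not checked.
    """
    problems = []
    if not cfg.get("enable.idempotence", False):
        return problems  # the cap only applies when idempotence is enabled
    in_flight = cfg.get("max.in.flight.requests.per.connection")
    if in_flight is not None and in_flight > MAX_IN_FLIGHT_WITH_IDEMPOTENCE:
        problems.append("max.in.flight.requests.per.connection exceeds 5")
    if cfg.get("acks") not in (None, "all"):
        problems.append("acks is not 'all'; ack'd messages may not be retained")
    return problems
```

With these rules, proposed_defaults passes cleanly, while a config that sets
max.in.flight.requests.per.connection to 500 is flagged only when
enable.idempotence is true.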

Thanks,
Jason

On Wed, Aug 16, 2017 at 8:58 PM, Apurva Mehta <apu...@confluent.io> wrote:

> Thanks for the followup Becket. It sounds like we are in agreement on the
> scope of this KIP, and the discussion has definitely clarified a lot of the
> subtle points.
>
> Apurva
>
> On Tue, Aug 15, 2017 at 10:49 PM, Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Apurva,
> >
> > Thanks for the clarification of the definition. The definitions are clear
> > and helpful.
> >
> > It seems the scope of this KIP is just the producer-side configuration
> > change, not an attempt to achieve exactly-once semantics with all default
> > settings out of the box. The broker still needs to be configured
> > appropriately to achieve exactly-once semantics. If so, the current
> > proposal sounds reasonable to me. Apologies if I misunderstood the goal
> > of this KIP.
> >
> > Regarding max.in.flight.requests.per.connection, I don't think we have to
> > support an infinite number of in-flight requests. But admittedly there
> > are use cases where people would want a reasonably high number of
> > in-flight requests. Given that we need to make code changes to support
> > idempotence with max.in.flight.requests.per.connection > 1, it would be
> > nice to see if we can cover those use cases now instead of doing that
> > later. We can discuss this in a separate thread.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > > On Tue, Aug 15, 2017 at 1:46 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hi Jay,
> > >
> > > > I chatted with Apurva offline, and we think the key question, as
> > > > summarized in the updated KIP wiki, is whether we should consider
> > > > replication a necessary condition for at-least-once, and of course
> > > > also for exactly-once. Originally I thought replication was not a
> > > > necessary condition for at-least-once, since the scope of failures
> > > > that we should be covering is different in my definition. If we claim
> > > > that "even for at-least-once, you should have replication factor
> > > > larger than 2, let alone exactly-once", then I agree that having
> > > > acks=all on the client side should also be a necessary condition for
> > > > at-least-once, and for exactly-once as well. Then this KIP would just
> > > > be providing necessary but not sufficient conditions, from
> > > > client-side configs, to achieve EOS, while you also need the
> > > > broker-side configs to really support it.
> > >
> > > Guozhang
> > >
> > >
> > > On Tue, Aug 15, 2017 at 1:15 PM, Jay Kreps <j...@confluent.io> wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > > I think the argument is that with acks=1 the message could be lost
> > > > > and hence you aren't guaranteeing exactly once delivery.
> > > >
> > > > -Jay
> > > >
> > > > > On Mon, Aug 14, 2017 at 1:36 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > >
> > > > > > Just want to clarify that regarding 1), I'm fine with changing it
> > > > > > to `all`, but I wanted to argue that it does not necessarily
> > > > > > correlate with exactly-once semantics, but rather with persistence
> > > > > > vs. availability trade-offs, so I'd like to discuss them
> > > > > > separately.
> > > > >
> > > > > > Regarding 2), one minor concern I had is that the enforcement is
> > > > > > on the client side while the part it affects is on the broker
> > > > > > side. I.e. the broker code would assume at most 5 in-flight
> > > > > > requests when idempotence is turned on, but this is not enforced
> > > > > > at the broker; it relies on the client side's sanity. So other
> > > > > > client implementations that do not obey this may break the broker
> > > > > > code. If we do enforce this, we'd better enforce it on the broker
> > > > > > side. Also, I'm wondering whether we have considered having
> > > > > > brokers read the logs to get the starting offset when a broker
> > > > > > does not know about it in its snapshot, and whether that is
> > > > > > worthwhile if we assume that such issues are very rare?
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Aug 14, 2017 at 11:01 AM, Apurva Mehta <apu...@confluent.io> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > I just want to summarize where we are in this discussion.
> > > > > > >
> > > > > > > There are two major points of contention: should we have acks=1
> > > > > > > or acks=all by default? And how should we cap
> > > > > > > max.in.flight.requests.per.connection?
> > > > > > >
> > > > > > > 1) acks=1 vs acks=all
> > > > > >
> > > > > > Here are the tradeoffs of each:
> > > > > >
> > > > > > > If you have replication-factor=N, your data is resilient to N-1
> > > > > > > disk failures. For N>1, here is the tradeoff between acks=1 and
> > > > > > > acks=all.
> > > > > > >
> > > > > > > With the proposed defaults and acks=all, the stock Kafka
> > > > > > > producer and the default broker settings would guarantee that
> > > > > > > ack'd messages would be in the log exactly once.
> > > > > > >
> > > > > > > With the proposed defaults and acks=1, the stock Kafka producer
> > > > > > > and the default broker settings would guarantee that "retained
> > > > > > > ack'd messages would be in the log exactly once, but some ack'd
> > > > > > > messages may not be retained".
> > > > > > >
> > > > > > > If you leave replication-factor=1, acks=1 and acks=all have
> > > > > > > identical semantics and performance, but you are resilient to 0
> > > > > > > disk failures.
> > > > > >
> > > > > > > I think the measured cost of acks=all (again, the performance
> > > > > > > details are in the wiki) is well worth the much clearer
> > > > > > > semantics. What does the rest of the community think?
> > > > > >
> > > > > > 2) capping max.in.flight at 5 when idempotence is enabled.
> > > > > >
> > > > > > > We need to limit max.in.flight for the broker to de-duplicate
> > > > > > > messages properly. The limitation would only apply when
> > > > > > > idempotence is enabled. The shared numbers show that when the
> > > > > > > client-broker latency is low, there is no performance gain for
> > > > > > > max.in.flight > 2.
> > > > > > >
> > > > > > > Further, it is highly debatable whether max.in.flight=500 is
> > > > > > > significantly better than max.in.flight=5 for a really
> > > > > > > high-latency client-broker link, and so far there are no hard
> > > > > > > numbers one way or the other. However, assuming that
> > > > > > > max.in.flight=500 is significantly better than max.in.flight=5
> > > > > > > in some niche use case, the user would have to sacrifice
> > > > > > > idempotence for throughput. In this extreme corner case, I think
> > > > > > > it is an acceptable tradeoff.
> > > > > >
> > > > > > What does the community think?
> > > > > >
> > > > > > Thanks,
> > > > > > Apurva
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
