Thanks for the updates, Jason. I'm pretty satisfied with the overall
motivation and proposed solution; just a couple more comments.

1. Why does `StatesFilter` need to be of type string rather than a short
value? We could translate it and save space.

2. I'm wondering whether the requirement for Describe permission on
TransactionalId still works as we head towards
https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on
the consumer group id instead of defining a transactional id. At first
glance I think it should be OK, but I just want to raise the point.

3. Could --find-hanging support checking all brokers in the cluster, or
multiple brokers given as a list?

4. Similarly for transaction abortion, I guess there is a trade-off between
too specific and too general in the number of required arguments. However,
suppose I would like to wipe out all the transactions associated with a
given transactional id, or clean up *all* hanging transactions in the
cluster: do I need to write the script on my own? Maybe we could discuss a
bit whether we would like to support a more holistic API, or whether this
is good for now.
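
To make points 3 and 4 concrete, here is a rough sketch of the loop an
operator would otherwise have to script by hand: run the hanging-transaction
search against each broker in a list, then abort whatever comes back. The
names `TxnTool`, `findHanging`, `abort`, and `HangingTxn` are hypothetical
stand-ins for the --find-hanging and --abort commands; they are not taken
from the KIP or from any Kafka API, and identifying a transaction by topic,
partition, and start offset is an assumption based on the discussion below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class HangingTxnCleanup {

    /** One hanging transaction, identified by partition and start offset (assumed shape). */
    static final class HangingTxn {
        final String topic;
        final int partition;
        final long startOffset;

        HangingTxn(String topic, int partition, long startOffset) {
            this.topic = topic;
            this.partition = partition;
            this.startOffset = startOffset;
        }
    }

    /** Hypothetical stand-in for the command line tool; NOT an actual Kafka API. */
    interface TxnTool {
        // Plays the role of: --find-hanging --broker <brokerId>
        List<HangingTxn> findHanging(int brokerId);

        // Plays the role of: --abort --topic <topic> --partition <partition>
        void abort(HangingTxn txn);
    }

    /** Scans each listed broker in turn and aborts every transaction it reports. */
    static List<HangingTxn> abortAllHanging(TxnTool tool, List<Integer> brokerIds) {
        List<HangingTxn> aborted = new ArrayList<>();
        for (int brokerId : brokerIds) {
            for (HangingTxn txn : tool.findHanging(brokerId)) {
                tool.abort(txn);
                aborted.add(txn);
            }
        }
        return aborted;
    }

    public static void main(String[] args) {
        // In-memory fake: broker 1 reports two hanging transactions, broker 2 none.
        Map<Integer, List<HangingTxn>> hangingByBroker = new HashMap<>();
        hangingByBroker.put(1, Arrays.asList(
            new HangingTxn("foo", 0, 100L),
            new HangingTxn("foo", 3, 250L)));
        hangingByBroker.put(2, new ArrayList<>());

        TxnTool fake = new TxnTool() {
            @Override
            public List<HangingTxn> findHanging(int brokerId) {
                return hangingByBroker.getOrDefault(brokerId, new ArrayList<>());
            }

            @Override
            public void abort(HangingTxn txn) {
                System.out.println("abort " + txn.topic + "-" + txn.partition
                    + " @ " + txn.startOffset);
            }
        };

        List<HangingTxn> aborted = abortAllHanging(fake, Arrays.asList(1, 2));
        System.out.println("aborted " + aborted.size() + " transaction(s)");
    }
}
```

The cost of this loop grows with the number of brokers scanned, which is
presumably the concern behind scoping --find-hanging to a single broker; a
list of brokers would at least leave that choice to the operator.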


On Thu, Sep 10, 2020 at 7:53 AM Tom Bentley <tbent...@redhat.com> wrote:

> Sounds good to me, thanks!
>
> On Wed, Sep 9, 2020 at 5:30 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Tom,
> >
> > Yeah, that's fair. I will update the proposal. I was also thinking of
> > adding a separate column for duration, just to save users the trouble of
> > computing it.
> >
> > Thanks,
> > Jason
> >
> > On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tbent...@redhat.com> wrote:
> >
> > > Hi Jason,
> > >
> > > The KIP looks good to me, but I had one question. AFAIU the
> > > LastTimestamp column in the output of --describe-producers and
> > > --find-hanging is there so the users of the tool know the
> > > txnLastUpdateTimestamp of the TransactionMetadata and from that and
> > > the (max) timeout can infer something about the likelihood that this
> > > really is a stuck transaction. If that's the case then what is the
> > > benefit in displaying it as a ms offset from the unix epoch, rather
> > > than an actual date time?
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> > > On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wangg...@gmail.com>
> > > wrote:
> > >
> > > > Thanks Jason, I do not have more comments on the KIP then.
> > > >
> > > > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > > > > DescribeProducers response either?
> > > > >
> > > > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the
> > > > > schema. Fixed now!
> > > > >
> > > > > -Jason
> > > > >
> > > > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wangg...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <
> > ja...@confluent.io
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Guozhang,
> > > > > > >
> > > > > > > Thanks for the detailed comments. Responses inline:
> > > > > > >
> > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> old
> > > > > brokers,
> > > > > > > since without the additional field "Partitions" the tool needs
> to
> > > set
> > > > > the
> > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> > still
> > > > > doable
> > > > > > > but would require different call paths, and it's not clear
> > whether
> > > > > that's
> > > > > > > worth doing for old versions.
> > > > > > >
> > > > > > > That's a good question. What I had in mind was to write the
> > marker
> > > > > using
> > > > > > > the last coordinator epoch that was used by the respective
> > > > ProducerId.
> > > > > I
> > > > > > > realized that I left the coordinator epoch out of the
> > > > > `DescribeProducers`
> > > > > > > response, so I have updated the KIP to include it. It is
> possible
> > > > that
> > > > > > > there is no coordinator epoch associated with a given
> ProducerId
> > > > (e.g.
> > > > > if
> > > > > > > it is the first transaction from that producer), but in this
> case
> > > we
> > > > > can
> > > > > > > use 0.
> > > > > > >
> > > > > > > As for whether this is worth doing, I guess I would be more
> > > inclined
> > > > to
> > > > > > > leave it out if users had a reasonable alternative today to
> > address
> > > > > this
> > > > > > > problem.
> > > > > > >
> > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> to
> > > only
> > > > > > > leaders
> > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > really
> > > > > > > "ListTransactions to be sent to coordinators only"? From the
> > > workflow
> > > > > > > you've described, based on the results back from
> > DescribeProducers,
> > > > we
> > > > > > > should just immediately send ListTransactions to the
> > > > > > > corresponding coordinators based on the collected producer ids,
> > > > instead
> > > > > > of
> > > > > > > trying to send to any brokers right?
> > > > > > >
> > > > > > > I'm going to change `DescribeProducers` so that it can be
> handled
> > > by
> > > > > any
> > > > > > > replica of a topic partition. This was suggested by Lucas in
> > order
> > > to
> > > > > > allow
> > > > > > > this API to be used for replica consistency testing. As far as
> > > > > > > `ListTransactions`, I was treating this similarly to
> > `ListGroups`.
> > > > > > Although
> > > > > > > we know that the coordinators are the leaders of the
> > > > > __transaction_state
> > > > > > > partitions, this is more of an implementation detail. From an
> API
> > > > > > > perspective, we say that any broker could be a transaction
> > > > coordinator.
> > > > > > >
> > > > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > > > permission
> > > > > > > on
> > > > > > > the associated topic sufficient to allow any users to get all
> > > > producer
> > > > > > > information writing to the specific topic-partitions including
> > last
> > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > sensitive?
> > > > > > > Should we require "ClusterAction" to only allow operators only?
> > > > > > >
> > > > > > > That's a fair point. Do you think `Read` permission would be
> > > > > reasonable?
> > > > > > > This is all information that could be obtained by reading the
> > > topic.
> > > > > > >
> > > > > > > Yeah that makes sense.
> > > > > >
> > > > > >
> > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > included
> > > in
> > > > > the
> > > > > > > DescribeTransaction response schema? Otherwise the user would
> not
> > > get
> > > > > it
> > > > > > in
> > > > > > > the following WriteTxnMarker request.
> > > > > > >
> > > > > > > The `DescribeTransaction` API is sent to the transaction
> > > coordinator,
> > > > > > which
> > > > > > > does not know the start offset of a transaction in each topic
> > > > > partition.
> > > > > > > That is why we need `DescribeProducers`.
> > > > > > >
> > > > > >
> > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > DescribeProducers
> > > > > > response either?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > > 5. It is a bit easier for readers to highlight the added
> fields
> > > in
> > > > > the
> > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> > > adding
> > > > > > > "Partitions" with the starting offset, right?). Also as for its
> > > > > response
> > > > > > it
> > > > > > > seems we do not make any schema changes except adding one more
> > > > > potential
> > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case
> > we
> > > > can
> > > > > > just
> > > > > > > state that explicitly.
> > > > > > >
> > > > > > > I highlighted the new field in the request. For the response,
> the
> > > KIP
> > > > > > > states the following: "There are no changes to the response
> > schema,
> > > > but
> > > > > > it
> > > > > > > will be bumped. Note that we are also enabling flexible version
> > > > > support."
> > > > > > >
> > > > > > > > 6. It is not clear to me for the overloaded function that the
> > > > > following
> > > > > > > option classes are not specified, what should be the default
> > > options?
> > > > > > > ...
> > > > > > >
> > > > > > > I was just trying to stick with existing conventions, but I
> will
> > > add
> > > > > some
> > > > > > > more detail here. I think we should probably still include
> > > > > > > `AbortTransactionOptions`. The `Options` classes are how users
> > > > override
> > > > > > > timeouts.
> > > > > > >
> > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > presume
> > > we
> > > > > > would
> > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > >
> > > > > > > I think it should be required as a reasonable way to limit the
> > > scope
> > > > of
> > > > > > the
> > > > > > > search. This is meant to be guided by metrics after all. If we
> do
> > > not
> > > > > > limit
> > > > > > > the scope to a single broker, then the behavior might get worse
> > as
> > > > the
> > > > > > > cluster grows. I will clarify this.
> > > > > > >
> > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > feature
> > > > in
> > > > > > the
> > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > intentional?
> > > > > > > Personally I feel exposing a "--list-producers" may be useful
> > too:
> > > if
> > > > > we
> > > > > > > believe the user has the right ACL, it is legitimate to return
> > the
> > > > > > producer
> > > > > > > information to her anyways. But that is debatable in the meta
> > point
> > > > 3)
> > > > > > > above.
> > > > > > >
> > > > > > > Yeah, I was planning to add this to support the use case that
> > Lucas
> > > > > > > mentioned. There is some awkwardness since it is a little
> > difficult
> > > > to
> > > > > > > convey different sources of information through the same
> > command. I
> > > > > guess
> > > > > > > we can do `--list producers` and `--list transactions` and
> > explain
> > > in
> > > > > the
> > > > > > > documentation. Maybe that is good enough.
> > > > > > >
> > > > > > > > 7.3 "Describing Transactions": we should also explain how
> that
> > > > would
> > > > > be
> > > > > > > executed, e.g. at least we should clarify that we would first
> > find
> > > > the
> > > > > > > coordinator based on the transactional.id and hence users do
> not
> > > > need
> > > > > to
> > > > > > > specify one.
> > > > > > >
> > > > > > > Sure, makes sense.
> > > > > > >
> > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > "--broker"
> > > > > > > node
> > > > > > > as a required option? Otherwise we would not know which broker
> to
> > > > send
> > > > > > to.
> > > > > > >
> > > > > > > The --topic and --partition arguments are required, so the
> target
> > > is
> > > > > > always
> > > > > > > the leader of that partition.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > > > > bob.barr...@confluent.io
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > Thanks for this KIP, I think this will be a huge operational
> > > > > > improvement
> > > > > > > > and overall it looks great to me.
> > > > > > > >
> > > > > > > > I'm not sure how much value the MaxActiveTransactionDuration
> > > metric
> > > > > > adds,
> > > > > > > > given that we have the --find-hanging option in the tool. As
> > you
> > > > > > mention,
> > > > > > > > instances of these transactions are expected to be rare, and
> a
> > > > > > > > partition-level metric, which can generate a lot of data,
> seems
> > > > very
> > > > > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > > > > kafka-transactions
> > > > > > > > --find-hanging on the relevant broker" is a reasonable
> process
> > > for
> > > > > > > cluster
> > > > > > > > operators to follow.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bob
> > > > > > > >
> > > > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <
> > > wangg...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Thanks for the written KIP. I think this is going to be a
> > very
> > > > > useful
> > > > > > > > tool
> > > > > > > > > for operational improvements since with eos in its current
> > > stage,
> > > > > we
> > > > > > > > cannot
> > > > > > > > > confidently assert that we are bug-free, and even in the
> > future
> > > > > when
> > > > > > we
> > > > > > > > are
> > > > > > > > > confident this is still going to be leveraged by older
> > > versioned
> > > > > > > brokers.
> > > > > > > > > Regarding the solution, I've also debated myself whether
> > Kafka
> > > > > should
> > > > > > > > > "self-heal" automatically when detected in such situations,
> > or
> > > > > should
> > > > > > > we
> > > > > > > > > instead build into ecosystem tooling to let operators do
> it.
> > > And
> > > > > I've
> > > > > > > > also
> > > > > > > > > convinced myself that the latter should be a better
> solution
> > to
> > > > > keep
> > > > > > > > Kafka
> > > > > > > > > software itself simpler.
> > > > > > > > >
> > > > > > > > > Regarding the KIP itself, I have a few meta comments below:
> > > > > > > > >
> > > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> > old
> > > > > > brokers,
> > > > > > > > > since without the additional field "Partitions" the tool
> > needs
> > > to
> > > > > set
> > > > > > > the
> > > > > > > > > coordinator epoch correctly instead of "-1"? Arguably
> that's
> > > > still
> > > > > > > doable
> > > > > > > > > but would require different call paths, and it's not clear
> > > > whether
> > > > > > > that's
> > > > > > > > > worth doing for old versions.
> > > > > > > > >
> > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> > to
> > > > only
> > > > > > > > leaders
> > > > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > > > really
> > > > > > > > > "ListTransactions to be sent to coordinators only"? From
> the
> > > > > workflow
> > > > > > > > > you've described, based on the results back from
> > > > DescribeProducers,
> > > > > > we
> > > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > > corresponding coordinators based on the collected producer
> > ids,
> > > > > > instead
> > > > > > > > of
> > > > > > > > > trying to send to any brokers right?
> > > > > > > > >
> > > > > > > > > Also I'm a bit concerned if "ListTransactions" could
> > > potentially
> > > > > > return
> > > > > > > > too
> > > > > > > > > much data with "StateFilters" set to all states, including
> > > > > completed
> > > > > > > > ones.
> > > > > > > > > Do we expect users ever want to know transactions that are
> > not
> > > > > > pending?
> > > > > > > > On
> > > > > > > > > the other hand, maybe we can just require users to specify
> > the
> > > > > > "pids[]"
> > > > > > > > in
> > > > > > > > > this request too to further filter those un-interested
> > > > > transactions.
> > > > > > > This
> > > > > > > > > also works well with the workflow: we know exactly from
> > > > > > > > "DescribeProducers"
> > > > > > > > > which pids are we diagnosing right now, so in the follow-up
> > > > > > > > > "ListTransactions" we should also only care for those
> > > partitions
> > > > > > only.
> > > > > > > > >
> > > > > > > > > 3. One thing I'm a bit hesitant about is that, is
> `Describe`
> > > > > > permission
> > > > > > > > on
> > > > > > > > > the associated topic sufficient to allow any users to get
> all
> > > > > > producer
> > > > > > > > > information writing to the specific topic-partitions
> > including
> > > > last
> > > > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > > > sensitive?
> > > > > > > > > Should we require "ClusterAction" to only allow operators
> > only?
> > > > > > > > >
> > > > > > > > > Below are more detailed comments:
> > > > > > > > >
> > > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > > included
> > > > in
> > > > > > the
> > > > > > > > > DescribeTransaction response schema? Otherwise the user
> would
> > > not
> > > > > get
> > > > > > > it
> > > > > > > > in
> > > > > > > > > the following WriteTxnMarker request.
> > > > > > > > >
> > > > > > > > > 5. It is a bit easier for readers to highlight the added
> > fields
> > > > in
> > > > > > the
> > > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> > only
> > > > > adding
> > > > > > > > > "Partitions" with the starting offset, right?). Also as for
> > its
> > > > > > > response
> > > > > > > > it
> > > > > > > > > seems we do not make any schema changes except adding one
> > more
> > > > > > > potential
> > > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the
> > case
> > > > we
> > > > > > can
> > > > > > > > just
> > > > > > > > > state that explicitly.
> > > > > > > > >
> > > > > > > > > 6. It is not clear to me for the overloaded function that
> the
> > > > > > following
> > > > > > > > > option classes are not specified, what should be the
> default
> > > > > options?
> > > > > > > > >
> > > > > > > > > * ListTransactionsOptions
> > > > > > > > > * DescribeTransactionsOptions
> > > > > > > > > * DescribeProducersOptions
> > > > > > > > >
> > > > > > > > > Also, it seems AbortTransactionOptions would just be empty?
> > If
> > > > yes
> > > > > do
> > > > > > > we
> > > > > > > > > really need this option class for now?
> > > > > > > > >
> > > > > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > > presume
> > > > we
> > > > > > > would
> > > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > > feature
> > > > > in
> > > > > > > the
> > > > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > > > intentional?
> > > > > > > > > Personally I feel exposing a "--list-producers" may be
> useful
> > > > too:
> > > > > if
> > > > > > > we
> > > > > > > > > believe the user has the right ACL, it is legitimate to
> > return
> > > > the
> > > > > > > > producer
> > > > > > > > > information to her anyways. But that is debatable in the
> meta
> > > > point
> > > > > > 3)
> > > > > > > > > above.
> > > > > > > > > 7.3 "Describing Transactions": we should also explain how
> > that
> > > > > would
> > > > > > be
> > > > > > > > > executed, e.g. at least we should clarify that we would
> first
> > > > find
> > > > > > the
> > > > > > > > > coordinator based on the transactional.id and hence users
> do
> > > not
> > > > > > need
> > > > > > > to
> > > > > > > > > specify one.
> > > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > > "--broker"
> > > > > > > > node
> > > > > > > > > as a required option? Otherwise we would not know which
> > broker
> > > to
> > > > > > send
> > > > > > > > to.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, nice written one, thanks Jason.
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > > > > lu...@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > >> Would it be worth returning
> > > transactional.id.expiration.ms
> > > > in
> > > > > > the
> > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > >
> > > > > > > > > > > That's an interesting thought as well. Are you trying
> to
> > > > avoid
> > > > > > the
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > specify it through the command line? The tool could also
> > > query
> > > > > the
> > > > > > > > value
> > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > >
> > > > > > > > > > Basically. I'm not sure how useful this will be in
> > practice,
> > > > > though
> > > > > > > it
> > > > > > > > > > might help when debugging.
> > > > > > > > > >
> > > > > > > > > > Lucas
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > > > > ja...@confluent.io
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Lucas,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the comments. Responses below:
> > > > > > > > > > >
> > > > > > > > > > > > Given that it's possible for replica producer states
> to
> > > > > diverge
> > > > > > > > from
> > > > > > > > > > each
> > > > > > > > > > > other, it would be very useful if
> > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > and
> > > > > > > > > > > tooling is able to query all partition replicas for
> their
> > > > > > producers
> > > > > > > > > > >
> > > > > > > > > > > Yes, it makes sense to me to let DescribeProducers work
> > on
> > > > both
> > > > > > > > > followers
> > > > > > > > > > > and leaders. In fact, I'm encouraged that there are use
> > > cases
> > > > > for
> > > > > > > > this
> > > > > > > > > > work
> > > > > > > > > > > other than detecting hanging transactions. That was
> > indeed
> > > > the
> > > > > > > hope,
> > > > > > > > > but
> > > > > > > > > > I
> > > > > > > > > > > didn't have anything specific in mind. I will update
> the
> > > > > > proposal.
> > > > > > > > > > >
> > > > > > > > > > > > Would it be worth returning
> > > transactional.id.expiration.ms
> > > > > in
> > > > > > > the
> > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > >
> > > > > > > > > > > That's an interesting thought as well. Are you trying
> to
> > > > avoid
> > > > > > the
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > specify it through the command line? The tool could
> also
> > > > query
> > > > > > the
> > > > > > > > > value
> > > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > > > > lu...@confluent.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jason,
> > > > > > > > > > > >
> > > > > > > > > > > > This looks like a very useful tool, thanks for
> writing
> > it
> > > > up.
> > > > > > > > > > > >
> > > > > > > > > > > > Given that it's possible for replica producer states
> to
> > > > > diverge
> > > > > > > > from
> > > > > > > > > > each
> > > > > > > > > > > > other, it would be very useful if
> > > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > > and
> > > > > > > > > > > > tooling is able to query all partition replicas for
> > their
> > > > > > > > producers.
> > > > > > > > > > One
> > > > > > > > > > > > way I can see this being used immediately is in
> kafka's
> > > > > system
> > > > > > > > tests,
> > > > > > > > > > > > especially the ones that inject failures. At the end
> of
> > > the
> > > > > > test
> > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > > query all replicas and make sure that their states
> have
> > > not
> > > > > > > > > diverged. I
> > > > > > > > > > > can
> > > > > > > > > > > > also see it being useful when debugging production
> > > clusters
> > > > > > too.
> > > > > > > > > > > >
> > > > > > > > > > > > Would it be worth returning
> > > transactional.id.expiration.ms
> > > > > in
> > > > > > > the
> > > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > >
> > > > > > > > > > > > Lucas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > > > > rndg...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks,
> > Jason!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ron
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > > > > ja...@confluent.io>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Ron,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We do not typically backport new APIs to older
> > > > versions.
> > > > > I
> > > > > > > > think
> > > > > > > > > we
> > > > > > > > > > > can
> > > > > > > > > > > > > > however make the --abort command compatible with
> > > older
> > > > > > > > versions.
> > > > > > > > > It
> > > > > > > > > > > > would
> > > > > > > > > > > > > > require a user to do some analysis on their own
> to
> > > > > > identify a
> > > > > > > > > > hanging
> > > > > > > > > > > > > > transaction, but then they can use the tool from
> a
> > > new
> > > > > > > release
> > > > > > > > to
> > > > > > > > > > > > > recover.
> > > > > > > > > > > > > > For example, users could detect a hanging
> > transaction
> > > > > > through
> > > > > > > > the
> > > > > > > > > > > > > existing
> > > > > > > > > > > > > > "LastStableOffsetLag" metric and then collect the
> > > > needed
> > > > > > > > > > information
> > > > > > > > > > > > > from a
> > > > > > > > > > > > > > dump of the log (or producer snapshot). It's more
> > > work,
> > > > > but
> > > > > > > at
> > > > > > > > > > least
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > > > > > > rndg...@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written
> > KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Will the implementation be backported to prior
> > > Kafka
> > > > > > > > versions?
> > > > > > > > > > The
> > > > > > > > > > > > > > reason
> > > > > > > > > > > > > > > I ask is because if it is not backported and
> > > similar
> > > > > > > > > > functionality
> > > > > > > > > > > is
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > otherwise made available for older versions,
> then
> > > the
> > > > > > only
> > > > > > > > > > recourse
> > > > > > > > > > > > > > (aside
> > > > > > > > > > > > > > > from deleting and recreating the topic as you
> > > pointed
> > > > > > out)
> > > > > > > > may
> > > > > > > > > be
> > > > > > > > > > > to
> > > > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up
> > getting
> > > > > this
> > > > > > > > > > > > > functionality).
> > > > > > > > > > > > > > > Such an upgrade may not be desirable,
> especially
> > if
> > > > the
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > > > > > intermediate versions is considerable. I
> > understand
> > > > the
> > > > > > > > mantra
> > > > > > > > > of
> > > > > > > > > > > > > "never
> > > > > > > > > > > > > > > fall too many versions behind" but the reality
> of
> > > it
> > > > is
> > > > > > > that
> > > > > > > > it
> > > > > > > > > > > isn't
> > > > > > > > > > > > > > > always the case.  Even if the version is
> > relatively
> > > > > > recent,
> > > > > > > > an
> > > > > > > > > > > > upgrade
> > > > > > > > > > > > > > may
> > > > > > > > > > > > > > > still not be possible for some time, and a
> > quicker
> > > > > > > resolution
> > > > > > > > > may
> > > > > > > > > > > be
> > > > > > > > > > > > > > > necessary.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason
> Gustafson <
> > > > > > > > > > > ja...@confluent.io>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I've added a proposal to handle the problem
> of
> > > > > hanging
> > > > > > > > > > > > transactions:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > > > > .
> > > > > > > > > > > > > > > > In theory, this should never happen. In
> > practice,
> > > > we
> > > > > > have
> > > > > > > > hit
> > > > > > > > > > one
> > > > > > > > > > > > bug
> > > > > > > > > > > > > > > where
> > > > > > > > > > > > > > > > it was possible and there are few good
> options
> > > > today
> > > > > to
> > > > > > > > > > recover.
> > > > > > > > > > > > > Take a
> > > > > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>
