Hi everyone,

I would like to bump this transactional producer bug again: https://issues.apache.org/jira/browse/KAFKA-14053
It already has an open PR: https://github.com/apache/kafka/pull/12392

There is a not-so-frequent, but - in my opinion - critical bug in transactional producers. If the bug occurs, it corrupts the partition and blocks read_committed consumers.

Can someone please take a look? I see two possible solutions and have implemented one of them. I'm willing to work on the second approach as well, but I need some input on the PR.

Thanks in advance,
Daniel

On 2022-09-13 at 18:46, Chris Egerton wrote:
Hi Colt,

You can certainly review PRs, but you cannot merge them. Reviews are still
valuable as they help lighten the workload on committers and, to some
degree, provide a signal of how high-priority a change is to the community.

Cheers,

Chris

On Sun, Sep 11, 2022 at 12:19 PM Colt McNealy <c...@littlehorse.io> wrote:

Hi all—

I'm not a committer so I can't review this PR (or is that not true?).
However, I'd like to bump this as well. I believe that I've encountered
this bug during chaos testing with the transactional producer. I can
sometimes produce this error when killing a broker during a long-running
transaction, which causes a batch to encounter a delivery timeout as
described in the Jira. I have observed some inconsistencies with
the consumer offset being advanced prematurely (i.e. perhaps after the
delivery of the EndTxnRequest).

Daniel, thank you for the PR.

Cheers,
Colt McNealy
*Founder, LittleHorse.io*

On Fri, Sep 9, 2022 at 9:54 AM Dániel Urbán <urb.dani...@gmail.com> wrote:

Hi all,

I would like to bump this and bring some attention to the issue.
This is a nasty bug in the transactional producer; it would be nice if I could get some feedback on the PR: https://github.com/apache/kafka/pull/12392

Thanks in advance,
Daniel

Viktor Somogyi-Vass <viktor.somo...@cloudera.com.invalid> wrote (on Mon, Jul 25, 2022, 15:28):

Hi Luke & Artem,

We prepared the fix. Would you please help us find a committer to review it so this issue can get resolved?

Thanks,
Viktor

On Fri, Jul 8, 2022 at 12:57 PM Dániel Urbán <urb.dani...@gmail.com> wrote:

Submitted a PR with the fix: https://github.com/apache/kafka/pull/12392
In the PR I tried keeping the producer in a usable state after the forced bump. I understand that it might not be the cleanest solution, but the only other option I know of is to transition into a fatal state, meaning that the producer has to be recreated after a delivery timeout. I think that is still fine compared to the out-of-order messages.

Looking forward to your reviews,
Daniel

Dániel Urbán <urb.dani...@gmail.com> wrote (on Thu, Jul 7, 2022, 12:04):

Thanks for the feedback, I created https://issues.apache.org/jira/browse/KAFKA-14053 and started working on a PR.

Luke, for the workaround, we used the transaction admin tool released in 3.0 to "abort" these hanging batches manually.
Naturally, the cluster health should be stabilized. This issue popped up most frequently around times when some partitions went into a few-minute window of unavailability. The infinite retries on the producer side caused a situation where the last retry was still in flight, but the delivery timeout was triggered on the client side. We reduced the retries and increased the delivery timeout to avoid such situations (see the sketch below). Still, the issue can occur in other scenarios, for example a client queueing up many batches in the producer buffer, causing those batches to spend most of the delivery timeout window in the client memory.
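
For illustration, a rough sketch of the kind of producer settings involved; the exact values here are made up, not the ones we run with:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TxnProducerSettings {
    static Properties txnProducerProps() {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "example-txn-id"); // placeholder
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Fewer retries, so a failing batch cannot keep a retry in flight right up to the deadline...
        p.put(ProducerConfig.RETRIES_CONFIG, 5);
        // ...and a larger delivery timeout, so a retried-but-eventually-successful append
        // is less likely to be declared failed on the client while still in flight.
        p.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 300_000);
        p.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        return p;
    }
}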

Thanks,
Daniel

Luke Chen <show...@gmail.com> wrote (on Thu, Jul 7, 2022, 5:13):
Hi Daniel,

Thanks for reporting the issue, and the investigation.
I'm curious: what's your workaround for this issue?

I agree with Artem, it makes sense. Please file a bug in JIRA.
And looking forward to your PR! :)

Thank you.
Luke

On Thu, Jul 7, 2022 at 3:07 AM Artem Livshits <alivsh...@confluent.io.invalid> wrote:

Hi Daniel,

What you say makes sense. Could you file a bug and put this info there so that it's easier to track?

-Artem

On Wed, Jul 6, 2022 at 8:34 AM Dániel Urbán <urb.dani...@gmail.com> wrote:
Hello everyone,

I've been investigating some transaction-related issues in a very problematic cluster. Besides finding some interesting issues, I had some ideas about how transactional producer behavior could be improved.
My suggestion in short is: when the transactional producer encounters an error which doesn't necessarily mean that the in-flight request was processed (for example a client-side timeout), the producer should not send an EndTxnRequest on abort, but instead it should bump the producer epoch.

Here is the longer description of the issue I found, and how I came to this suggestion:

First, the description of the issue. When I say that the cluster is "very problematic", I mean all kinds of different issues, be it infra (disks and network) or throughput (high-volume producers without fine tuning).
In this cluster, Kafka transactions are widely used by many producers. And in this cluster, partitions get "stuck" frequently (a few times every week).
The exact meaning of a partition being "stuck" is this:

On the client side (a minimal sketch of this flow follows the list):
1. A transactional producer sends X batches to a partition in a single transaction
2. Out of the X batches, the last few get sent, but are timed out thanks to the delivery timeout config
3. producer.flush() is unblocked due to all batches being "finished"
4. Based on the errors reported in the producer.send() callback, producer.abortTransaction() is called
5. Then producer.close() is also invoked with a 5s timeout (this application does not reuse the producer instances optimally)
6. The transactional.id of the producer is never reused (it was randomly generated)
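
An illustrative sketch of that client-side flow; the topic name, record count and timeouts are placeholders, not the actual application code:

import java.time.Duration;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class StuckPartitionClientFlow {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Step 6: randomly generated transactional.id, never reused.
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "txn-" + UUID.randomUUID());
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        AtomicBoolean sendFailed = new AtomicBoolean(false);
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        producer.beginTransaction();
        // Step 1: X batches to a single partition in one transaction.
        for (int i = 0; i < 100_000; i++) {
            producer.send(new ProducerRecord<>("my-topic", 0, "key", "value-" + i),
                    (metadata, exception) -> {
                        // Step 2: the last few batches may expire with a delivery timeout.
                        if (exception != null) {
                            sendFailed.set(true);
                        }
                    });
        }
        // Step 3: flush() unblocks once every batch is either acked or expired.
        producer.flush();
        if (sendFailed.get()) {
            // Step 4: abort based on the callback errors; a timed-out batch may still be in flight.
            producer.abortTransaction();
        } else {
            producer.commitTransaction();
        }
        // Step 5: close with a 5s timeout.
        producer.close(Duration.ofSeconds(5));
    }
}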

On the partition leader side (what appears in the log segment of the partition):
1. The batches sent by the producer are all appended to the log
2. But the ABORT marker of the transaction was appended before the last 1 or 2 batches of the transaction

On the transaction coordinator side (what appears in the transaction state partition):
The transactional.id is present with the Empty state.

These happenings result in the following:
1. The partition leader handles the first batch after the ABORT marker as the first message of a new transaction of the same producer id + epoch. (The LSO is blocked at this point.)
2. The transaction coordinator is not aware of any in-progress transaction of the producer, thus it never aborts the transaction, not even after transaction.timeout.ms passes.

This is happening with Kafka 2.5 running in the cluster, with producer versions ranging between 2.0 and 2.6.
I scanned through a lot of tickets, and I believe that this issue is not specific to these versions and could happen with the newest versions as well. If I'm mistaken, some pointers would be appreciated.

Assuming that the issue could occur with any version, I believe it boils down to one oversight on the client side:
When a request fails without a definitive response (e.g. a delivery timeout), the client cannot assume that the request is "finished" and simply abort the transaction. If the request is still in flight, and the EndTxnRequest and then the WriteTxnMarkerRequest get sent and processed before it, the contract is violated by the client.
This could be avoided by providing more information to the partition leader. Right now, a new transactional batch signals the start of a new transaction, and there is no way for the partition leader to decide whether the batch is an out-of-order message.
In a naive and wasteful protocol, we could have a unique transaction id added to each batch and marker, meaning that the leader would be capable of refusing batches which arrive after the control marker of the transaction. But instead of changing the log format and the protocol, we can achieve the same by bumping the producer epoch.

Bumping the epoch has a similar effect to "changing the
transaction
id" -
the in-progress transaction will be aborted with a bumped
producer
epoch,
telling the partition leader about the producer epoch change.
From
this
point on, any batches sent with the old epoch will be refused
by
the
leader
due to the fencing mechanism. It doesn't really matter how
many
batches
will get appended to the log, and how many will be refused -
this
is
an
aborted transaction - but the out-of-order message cannot
occur,
and
cannot
block the LSO infinitely.
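
As a heavily simplified, purely illustrative sketch of the fencing idea (the real broker-side validation is not structured like this):

import java.util.HashMap;
import java.util.Map;

final class ProducerEpochFenceCheck {
    private final Map<Long, Short> latestEpochByProducerId = new HashMap<>();

    // Returns true if the batch must be refused because a newer epoch has already been
    // observed for this producer id (e.g. after the abort-with-epoch-bump described above).
    boolean isFenced(long producerId, short batchEpoch) {
        Short latest = latestEpochByProducerId.get(producerId);
        if (latest != null && batchEpoch < latest) {
            return true; // stale batch from before the bump
        }
        latestEpochByProducerId.put(producerId, batchEpoch);
        return false;
    }
}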

My suggestion is that the TransactionManager inside the producer should keep track of what type of errors were encountered by the batches of the transaction, and categorize them along the lines of "definitely completed" and "might not be completed". When the transaction goes into an abortable state and there is at least one batch that "might not be completed", the EndTxnRequest should be skipped and an epoch bump should be sent instead.
As for what type of error counts as "might not be completed", I can only think of client-side timeouts.
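
A hypothetical sketch of the bookkeeping I have in mind; the names do not correspond to actual client classes, this only illustrates the decision:

import org.apache.kafka.common.errors.TimeoutException;

final class TxnBatchOutcomeTracker {
    enum Outcome { DEFINITELY_COMPLETED, MIGHT_NOT_BE_COMPLETED }

    private boolean sawIndeterminateBatch = false;

    // Classify a batch-level error: a client-side (delivery) timeout means the broker
    // may still append the batch later, so the batch "might not be completed".
    static Outcome classify(Exception error) {
        return (error instanceof TimeoutException)
                ? Outcome.MIGHT_NOT_BE_COMPLETED
                : Outcome.DEFINITELY_COMPLETED;
    }

    // Record the outcome of every batch of the current transaction.
    void record(Outcome outcome) {
        if (outcome == Outcome.MIGHT_NOT_BE_COMPLETED) {
            sawIndeterminateBatch = true;
        }
    }

    // On abort: if every batch definitely completed, sending the regular EndTxnRequest(abort)
    // is safe; otherwise skip it and bump the producer epoch so late appends get fenced.
    boolean shouldBumpEpochInsteadOfEndTxn() {
        return sawIndeterminateBatch;
    }
}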

I believe this is a relatively small change (it only affects the client lib), but it helps in avoiding some corrupt states in Kafka transactions.
Looking forward to your input. If it seems like a sane idea, I'll go ahead and submit a PR for it as well.

Thanks in advance,
Daniel

