Travis Bischel created KAFKA-18957:
--------------------------------------

             Summary: Impossible to recover hanging transactions in Kafka 3.9 
in KRaft mode using EOS
                 Key: KAFKA-18957
                 URL: https://issues.apache.org/jira/browse/KAFKA-18957
             Project: Kafka
          Issue Type: Bug
          Components: kraft
    Affects Versions: 3.8.1, 3.9.0
            Reporter: Travis Bischel
         Attachments: docker-compose.yml, failures.tar.gz, log4j.properties, 
script

h1. The problem

I am able to semi-reliably reproduce hanging, unrecoverable transactions on 
Kafka 3.8 and 3.9 using the bitnami container in KRaft mode, with cluster sizes 
of both 1 and 3. In the failure scenario, a transaction enters and permanently 
hangs in either PrepareCommit or PrepareAbort. To fix the hanging transaction, 
I have tried to:
 * Wait until after the transaction timeout. Even after waiting hours just to 
see, the transaction remained hung.
 * Send an InitProducerID request with the LastProducerID and LastProducerEpoch 
fields set (see the sketch after this list). I receive the 
CONCURRENT_TRANSACTIONS error code back permanently, i.e. even after retrying 
this request for minutes.
 * I have tried to initialize a _new_ transaction – that is, send 
InitProducerID with LastProducerID=-1 and LastProducerEpoch=-1. Again, I 
permanently receive CONCURRENT_TRANSACTIONS.
 * I have tried to use kafka-transactions.sh to forcefully abort the 
transaction. Nothing happens; the transaction still shows as PrepareCommit or 
PrepareAbort.
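
For reference, the epoch-bump attempt in the second bullet looks roughly like 
the following minimal sketch using franz-go's kmsg package. This is not the 
actual test code: the broker address, transactional ID, and producer ID/epoch 
values are placeholders, and the kmsg field names are written from memory.

{code:go}
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/twmb/franz-go/pkg/kerr"
	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

func main() {
	// Placeholder broker; the attached docker-compose setups expose their own ports.
	cl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092"))
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	txnID := "my-txn-id" // placeholder for the hung transactional ID

	req := kmsg.NewPtrInitProducerIDRequest()
	req.TransactionalID = &txnID
	req.TransactionTimeoutMillis = 60_000
	// Set the last-known producer ID/epoch to ask the coordinator to bump the
	// epoch and abort whatever is left open; use -1/-1 to start a fresh producer.
	req.ProducerID = 12345 // placeholder: the hung producer's ID
	req.ProducerEpoch = 4  // placeholder: the hung producer's epoch

	for {
		resp, err := req.RequestWith(context.Background(), cl)
		if err != nil {
			panic(err)
		}
		switch kErr := kerr.ErrorForCode(resp.ErrorCode); kErr {
		case nil:
			fmt.Printf("recovered: id=%d epoch=%d\n", resp.ProducerID, resp.ProducerEpoch)
			return
		case kerr.ConcurrentTransactions:
			// Retriable in theory; in the failing runs this never clears.
			time.Sleep(250 * time.Millisecond)
		default:
			panic(kErr)
		}
	}
}
{code}

Both this bumped-epoch variant and the -1/-1 (new producer) variant loop on 
CONCURRENT_TRANSACTIONS indefinitely in the failing runs.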

I first encountered this error while trying to add support for KIP-848 to my 
franz-go client. I initially thought the bug was in that branch, but to 
double-check, I looped my integration tests on the main branch against Kafka 
3.9 and periodically hit the same permanent-hang failure. This is a new 
failure; my tests have reliably passed against Kafka for years. I *do* 
encounter this failure with eager consumer rebalancers, but it is less common: 
where the cooperative-sticky test suite may fail after three or four runs, the 
range test suite (same tests, just using a different balancer) fails after ten 
or so.
h1. Description of the test suite

franz-go has an integration test that
 * Starts one transactional producer that produces 500K records to t1 (the name 
in the tests is random), ending and beginning transactions every 10K records 
([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29])
 * Starts a chain of EOS consumer/producers. The first level (c1) consumes from 
t1 and produces to t2, the second level (c2) consumes from t2 and produces to 
t3, and the third level (c3) consumes from t3 and produces to t4 (a sketch of 
one such link appears after this list).
 * Starts a second set of EOS consumer/producers. One tries to join c1, but 
aborts the first batch consumed and then leaves the group. Another also joins 
c1 and aborts and leaves after two successful commits. The third joins c2 and 
also aborts and leaves after two successful commits.
 * After another 5 seconds, a second consumer that survives until the end joins 
each of c1, c2, and c3 (to be clear: one new consumer joins c1, another new 
consumer joins c2, a third new consumer joins c3).
 * Another consumer joins c2 and aborts on the first transaction and leaves the 
group. Another consumer joins c1 and aborts after one successful transaction 
and leaves the group.
 * The test waits. All groups stop processing once they have consumed 500K 
records _and then_ time out after 100ms of waiting for more records.
 * Finally, the test verifies a bunch of things about what was produced & seen. 
For example, all records must be seen in order, they cannot be seen twice, and 
they must be modified at each step in the chain appropriately.
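
To make the chain concrete, one link of it (consume from t1 in group c1 and 
transactionally produce to t2) can be sketched with franz-go's 
GroupTransactSession roughly as below. This is illustrative only, with 
placeholder broker, group, topic, and transactional ID names; the real setup is 
in the linked helpers_test.go and txn_test.go.

{code:go}
package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// One link in the EOS chain: consume from t1 in group c1 and, inside a
// transaction, produce the (modified) records to t2 while committing the
// consumed offsets atomically with them.
func main() {
	sess, err := kgo.NewGroupTransactSession(
		kgo.SeedBrokers("localhost:9092"), // placeholder
		kgo.TransactionalID("eos-c1-0"),   // placeholder
		kgo.ConsumerGroup("c1"),
		kgo.ConsumeTopics("t1"),
		kgo.FetchIsolationLevel(kgo.ReadCommitted()),
		kgo.RequireStableFetchOffsets(), // stable fetch offsets (KIP-447); assumed equivalent of KGO_TEST_STABLE_FETCH=true
	)
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	ctx := context.Background()
	for {
		fetches := sess.PollFetches(ctx)
		if fetches.IsClientClosed() {
			return
		}

		if err := sess.Begin(); err != nil {
			panic(err)
		}

		fetches.EachRecord(func(r *kgo.Record) {
			// Forward the record to the next topic in the chain; the real test
			// also modifies the value at each step.
			sess.Produce(ctx, &kgo.Record{Topic: "t2", Key: r.Key, Value: r.Value}, nil)
		})

		// End flushes what was produced, adds the consumed offsets to the
		// transaction, and then issues EndTxn to commit (or abort).
		if _, err := sess.End(ctx, kgo.TryCommit); err != nil {
			panic(err)
		}
	}
}
{code}

The joiners in the test that abort follow the same shape but end with 
kgo.TryAbort before leaving the group.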

The chaining setup can be seen 
[here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513]. 
The helpers_test.go and txn_test.go files are the two to read to see everything 
that is happening.

The point is, the test is _somewhat_ rigorous in trying to trigger a few odd 
edge conditions.
h1. Logs

I have captured both client and server logs to show that the client is sending 
requests in the proper order and that the failure is happening entirely broker 
side, not client side (i.e., if my understanding is correct, I am not sending 
requests improperly in a way that would cause this).

I have attached client and container logs in the .tar.gz attachment below, with 
a README in each directory giving a high level overview of what we're seeing in 
each directory. I'm including the README for the single broker setup here:
 * client_15_logs: the isolated logs for the client in question that entered a 
hanging state. You can see EndTxn sent on l290 and a successful reply received. 
We then send ListTransactions to verify the client has exited the Prepare state 
(a rough sketch of this verification loop follows this list). We loop issuing 
ListTransactions for >10s, until we stop trying to verify on l400. We then try 
to force an epoch bump via InitProducerID to fix it, but that permanently loops 
receiving a CONCURRENT_TRANSACTIONS error. This continues for the rest of the 
file.

 * client_raw_logs: these are raw logs for all clients combined in the 
TestTxn/cooperative-sticky output.

 * container_logs: these are raw logs for the container, the output of `docker 
container logs`

 * container_logs_for_client: this is container_logs but with all lines that do 
not mention the hung transactional ID deleted. The hung transactional ID is 
314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89, we 
can see __consumer_offsets being brought into the transaction, and then you see 
an ADD_OFFSETS_TO_TXN request being finished on l92 and then a 
TXN_OFFSET_COMMIT request being finished on l93. On l94, the broker prepares to 
transition from Ongoing to PrepareCommit. On l97, the broker prepares to 
transition from PrepareCommit to CompleteCommit, and on l98 the broker replies 
to the client's END_TXN request with a success. However, immediately after 
that, you can see LIST_TRANSACTIONS being looped with state filters 
PrepareCommit and PrepareAbort, and you can see the transaction never 
transitions out of PrepareCommit. LIST_TRANSACTIONS loops for a while, then you 
see INIT_PRODUCER_ID loop for a long time, permanently receiving 
CONCURRENT_TRANSACTIONS.
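
For context, the client-side ListTransactions verification loop described above 
looks roughly like this sketch. It uses franz-go's kmsg package directly rather 
than the exact code in the branch linked below; the broker address, 
transactional ID, and timing values are placeholders, and the generated kmsg 
field names are written from memory.

{code:go}
// Sketch of the post-EndTxn verification loop: after EndTxn returns
// successfully, poll ListTransactions until our transactional ID is no longer
// in PrepareCommit/PrepareAbort (in the failing runs, it never leaves).
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

func stillPreparing(ctx context.Context, cl *kgo.Client, txnID string) (bool, error) {
	req := kmsg.NewPtrListTransactionsRequest()
	req.StateFilters = []string{"PrepareCommit", "PrepareAbort"}
	resp, err := req.RequestWith(ctx, cl)
	if err != nil {
		return false, err
	}
	for _, state := range resp.TransactionStates {
		if state.TransactionalID == txnID {
			return true, nil // still stuck in a Prepare state
		}
	}
	return false, nil
}

func main() {
	cl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092")) // placeholder broker
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	ctx := context.Background()
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		preparing, err := stillPreparing(ctx, cl, "my-txn-id") // placeholder ID
		if err != nil {
			panic(err)
		}
		if !preparing {
			fmt.Println("transaction left the Prepare state")
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
	fmt.Println("transaction still in PrepareCommit/PrepareAbort after 10s; hung")
}
{code}

In the failing runs this loop never observes the transactional ID leaving 
PrepareCommit/PrepareAbort, which matches what container_logs_for_client shows.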

h1. Files
 * I have attached the docker-compose.yml I used for both a single broker setup 
and a three broker setup. The single broker setup is commented out; the setups 
can be easily swapped.
 * I've attached the log4j.properties that I used to opt into debug logging on 
the request logger as well as the transaction logger.
 * I've attached the script I used to run the tests in a loop. The script 
contains my local directory paths but should be easy enough to modify. 
Essentially, it takes down the cluster and restarts it to have a clean slate of 
logs, changes into my kgo test directory, and then runs the test, piping client 
log output to `logs`. *NOTE* The following environment variables must be set: 
`KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and, when running with a 
single broker, `KGO_TEST_RF=1`. The test setup does not always hang, but for me 
it hangs quite frequently (every few runs).
 * I've attached a .tar.gz file containing two directories, one containing 
client & broker logs for a single broker setup, another containing broker only 
logs for the three broker setup.

h1. franz-go branch

The ListTransactions verification is not in franz-go main; I have added it to a 
branch and do not plan to merge it into main. If you are trying to run this 
setup
locally, the code is here: 
[https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8]

I can also change things on my side to help debug what's going on here.


