Travis Bischel created KAFKA-18957:
--------------------------------------

             Summary: Impossible to recover hanging transactions in Kafka 3.9 in KRaft mode using EOS
                 Key: KAFKA-18957
                 URL: https://issues.apache.org/jira/browse/KAFKA-18957
             Project: Kafka
          Issue Type: Bug
          Components: kraft
    Affects Versions: 3.8.1, 3.9.0
            Reporter: Travis Bischel
         Attachments: docker-compose.yml, failures.tar.gz, log4j.properties, script
h1. The problem

I am able to semi-reliably reproduce hanging, unrecoverable transactions on Kafka 3.8 and 3.9 using the bitnami container in KRaft mode, with a cluster size of both 1 and 3. In the failure scenario, a transaction enters and then permanently hangs in either PrepareCommit or PrepareAbort.

To fix the hanging transaction, I have tried to:
* Wait until after the transactional timeout. Even after waiting hours just to see, the transaction stayed hung.
* Send an InitProducerID request with the LastProducerID and LastProducerEpoch fields set. I receive the CONCURRENT_TRANSACTIONS error code back, permanently (i.e. even after retrying this request for minutes).
* Initialize a _new_ transaction – that is, send InitProducerID with LastProducerID=-1 and LastProducerEpoch=-1. Again, I permanently receive CONCURRENT_TRANSACTIONS.
* Use kafka-transactions.sh to forcefully abort the transaction. Nothing happens; the transaction still shows as PrepareCommit or PrepareAbort.

I first encountered this error while trying to add support for KIP-848 to my franz-go client. I initially thought it was an error with that branch, but to double check, I looped my integration tests on the main branch against Kafka 3.9 and periodically encountered the same permanent-hang failure. This is a new failure; my tests have reliably passed against Kafka for years.

I *do* encounter this failure with eager consumer rebalancers as well, but it is less common. Where the cooperative-sticky test suite may fail after three or four runs, the range test suite (same tests, just using a different balancer) fails after ten or so.

h1. Description of the test suite

franz-go has an integration test that
* Starts one transactional producer that produces 500K records to t1 (the name in the tests is random), ending and beginning transactions every 10K records ([here|https://github.com/twmb/franz-go/blob/master/pkg/kgo/txn_test.go#L29]).
* Starts a chain of EOS consumer/producers. The first level (c1) consumes from t1 and produces to t2, the second level (c2) consumes from t2 and produces to t3, and the third level (c3) consumes from t3 and produces to t4.
* Starts a second set of EOS consumer/producers. One tries to join c1, but aborts the first batch consumed and then leaves the group. Another also joins c1 and aborts and leaves after two successful commits. The third joins c2 and also aborts and leaves after two successful commits.
* After another 5 seconds, a second consumer that survives until the end joins each of c1, c2, and c3 (to be clear: one new consumer joins c1, another new consumer joins c2, and a third new consumer joins c3).
* Another consumer joins c2, aborts on the first transaction, and leaves the group. Another consumer joins c1, aborts after one successful transaction, and leaves the group.
* The test waits. All groups stop processing once they have consumed 500K records _and then_ time out after 100ms of waiting for more records.
* Finally, the test verifies a bunch of things about what was produced & seen. For example, all records must be seen in order, they cannot be seen twice, and they must be modified appropriately at each step in the chain.

The chaining setup can be seen [here|https://github.com/twmb/franz-go/blob/f30c518d6b727b9169a90b8c10e2127301822a3a/pkg/kgo/helpers_test.go#L482-L513]. The helpers_test.go and txn_test.go files are the two to look at to read everything that is happening. The point is, the test is _somewhat_ rigorous in trying to trigger a few odd edge conditions.
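To make the chain structure above concrete, here is a minimal sketch of what each level does, written against franz-go's GroupTransactSession helper. This is *not* the actual test code (that is in helpers_test.go linked above); the topic, group, and transactional ID names are placeholders, and the API usage is from memory.

{code:go}
package main

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// Sketch of one chain level: consume from t1 in group c1, mirror each record
// to t2, and commit the consumed offsets and produced records in a single
// transaction. All names here are placeholders.
func main() {
	sess, err := kgo.NewGroupTransactSession(
		kgo.SeedBrokers("localhost:9092"),
		kgo.TransactionalID("eos-c1"),
		kgo.ConsumerGroup("c1"),
		kgo.ConsumeTopics("t1"),
		kgo.FetchIsolationLevel(kgo.ReadCommitted()),
		kgo.RequireStableFetchOffsets(),
	)
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	ctx := context.Background()
	for {
		fetches := sess.PollFetches(ctx) // fetch error handling elided
		if fetches.IsClientClosed() {
			return
		}

		if err := sess.Begin(); err != nil {
			panic(err)
		}

		// Abort the transaction if any produce fails.
		promise := kgo.AbortingFirstErrPromise(sess.Client())
		fetches.EachRecord(func(r *kgo.Record) {
			sess.Produce(ctx, &kgo.Record{Topic: "t2", Value: r.Value}, promise.Promise())
		})

		// End flushes produces, commits the consumed offsets into the
		// transaction, and issues EndTxn. The hang in this report shows up
		// after EndTxn returns success but the coordinator never moves the
		// transaction out of PrepareCommit.
		if _, err := sess.End(ctx, kgo.TransactionEndTry(promise.Err() == nil)); err != nil {
			panic(err)
		}
	}
}
{code}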
h1. Logs

I have captured both client and server logs to show that the client is sending requests in the proper order and that the failure is happening entirely broker side, not client side (i.e., if my understanding is correct, I am not sending requests improperly in a way that would cause this). I have attached client and container logs in the .tar.gz attachment below, with a README in each directory giving a high-level overview of what we're seeing. I'm including the README for the single-broker setup here:
* client_15_logs: these are the isolated logs for the client in question that entered a hanging state. You can see EndTxn sent on l290, and we receive a successful reply. We then send ListTransactions to verify the transaction has exited the Prepare state. We loop issuing ListTransactions for >10s, until we stop trying to verify on l400. We then try to fix the transaction by bumping the epoch by one with InitProducerID, but that loops permanently, receiving a CONCURRENT_TRANSACTIONS error. This continues for the rest of the file.
* client_raw_logs: these are the raw logs for all clients combined in the TestTxn/cooperative-sticky output.
* container_logs: these are the raw logs for the container, the output of `docker container logs`.
* container_logs_for_client: this is container_logs with all lines that do not mention the hung transactional ID deleted. The hung transactional ID is 314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60. On l89, we can see __consumer_offsets being brought into the transaction; an ADD_OFFSETS_TO_TXN request finishes on l92 and a TXN_OFFSET_COMMIT request finishes on l93. On l94, the broker prepares to transition from Ongoing to PrepareCommit. On l97, the broker prepares to transition from PrepareCommit to CompleteCommit, and on l98 the broker replies to the client's END_TXN request with a success. However, immediately after that, you can see LIST_TRANSACTIONS being looped with state filters PrepareCommit or PrepareAbort, and the transaction never transitions out of PrepareCommit. LIST_TRANSACTIONS loops for a while, then you see INIT_PRODUCER_ID loop for a long time, permanently receiving CONCURRENT_TRANSACTIONS.
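To make the two client-side loops above concrete, here is a rough sketch of the ListTransactions verification and the epoch-bump InitProducerID recovery attempt, written against franz-go's low-level kmsg package. This is a sketch from memory, not the branch code: the producer ID, epoch, and timeout values are placeholders, and the real requests/responses are in the attached client logs.

{code:go}
package main

import (
	"context"
	"fmt"

	"github.com/twmb/franz-go/pkg/kerr"
	"github.com/twmb/franz-go/pkg/kgo"
	"github.com/twmb/franz-go/pkg/kmsg"
)

// Sketch of what the client loops on after EndTxn returns success: first a
// ListTransactions filtered to the Prepare* states (the hung transactional ID
// keeps appearing), then an InitProducerID carrying the last known producer
// ID and epoch (which keeps returning CONCURRENT_TRANSACTIONS).
func checkAndBump(ctx context.Context, cl *kgo.Client, txnID string, lastProducerID int64, lastEpoch int16) error {
	list := kmsg.NewPtrListTransactionsRequest()
	list.StateFilters = []string{"PrepareCommit", "PrepareAbort"}
	listResp, err := list.RequestWith(ctx, cl)
	if err != nil {
		return err
	}
	for _, state := range listResp.TransactionStates {
		if state.TransactionalID == txnID {
			fmt.Printf("transaction still in %s\n", state.TransactionState)
		}
	}

	// Epoch-bump recovery: re-init with the last known producer ID and epoch
	// so the coordinator can bump the epoch and abort anything hanging. In
	// this issue the broker answers CONCURRENT_TRANSACTIONS indefinitely.
	initReq := kmsg.NewPtrInitProducerIDRequest()
	initReq.TransactionalID = &txnID
	initReq.TransactionTimeoutMillis = 60_000 // placeholder timeout
	initReq.ProducerID = lastProducerID
	initReq.ProducerEpoch = lastEpoch
	initResp, err := initReq.RequestWith(ctx, cl)
	if err != nil {
		return err
	}
	return kerr.ErrorForCode(initResp.ErrorCode)
}

func main() {
	cl, err := kgo.NewClient(kgo.SeedBrokers("localhost:9092"))
	if err != nil {
		panic(err)
	}
	defer cl.Close()

	// The transactional ID is the hung one from the attached logs; the
	// producer ID and epoch here are placeholders.
	err = checkAndBump(context.Background(),
		cl, "314f54cdee329433ec598bd4584fd7f689b501fc29b3c172d9819be1d3d8ee60", 7, 0)
	fmt.Println(err)
}
{code}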
h1. Files
* I have attached the docker-compose.yml I used for both a single-broker setup and a three-broker setup. The single-broker setup is commented out; the setups can be easily swapped.
* I've attached the log4j.properties I used to opt into debug logging on the request logger as well as the transaction logger.
* I've attached the script I used to run the tests in a loop. The script contains my local directory paths but should be easy enough to modify. Essentially, it takes down the cluster and restarts it to have a clean slate of logs, changes into my kgo test directory, and then runs the test, piping client log output to `logs`. *NOTE* The following environment variables are required: `KGO_TEST_STABLE_FETCH=true`, `KGO_LOG_LEVEL=debug`, and, when running with a single broker, `KGO_TEST_RF=1`. The test setup does not always hang, but for me it hangs quite frequently (every few runs).
* I've attached a .tar.gz file containing two directories: one with client & broker logs for the single-broker setup, and one with broker-only logs for the three-broker setup.

h1. franz-go branch

The ListTransactions verification is not in franz-go main; I have added it to a branch and do not plan to merge it to main. If you are trying to run this setup locally, the code is here: [https://github.com/twmb/franz-go/pull/928/commits/474557f754df27abe0d417fdfb3b4857d135ded8]

I can also change things on my side to help debug what's going on here.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)