chia7712 commented on a change in pull request #9026: URL: https://github.com/apache/kafka/pull/9026#discussion_r455500775
########## File path: tests/kafkatest/tests/core/transactions_test.py ########## @@ -47,11 +47,15 @@ def __init__(self, test_context): self.num_output_partitions = 3 self.num_seed_messages = 100000 self.transaction_size = 750 - # The timeout of transaction should be lower than the timeout of verification. The transactional message sent by - # client may be not correctly completed in hard_bounce mode. The pending transaction (unstable offset) stored by - # broker obstructs TransactionMessageCopier from getting offset of partition which is used to calculate - # remaining messages after restarting. - self.transaction_timeout = 5000 + + # The transaction timeout should be lower than the progress timeout, but at + # least as high as the request timeout (which is 30s by default). When the + # client is hard-bounced, progress may depend on the previous transaction + # being aborted. When the broker is hard-bounced, we may have to wait as + # long as the request timeout to get a `Produce` response and we do not + # want the coordinator timing out the transaction. + self.transaction_timeout = 40000 + self.progress_timeout_sec = 60 Review comment: > I decided to hold off on this, but I can reconsider it if you think it's worthwhile. it is fine to me as your approach is more simple :) > What might be preferable is to provide a way to override the request timeout in TransactionalMessageCopier so that we can use lower values in all cases. the root cause I observed is different to https://issues.apache.org/jira/browse/KAFKA-9802. On my local, ```TransactionalMessageCopier``` fails due to ProducerFencedException which is caused by that broker increases the producer epoch when aborting transaction. ``` } catch (ProducerFencedException | OutOfOrderSequenceException e) { // We cannot recover from these errors, so just rethrow them and let the process fail throw e; } catch (KafkaException e) { producer.abortTransaction(); resetToLastCommittedPositions(consumer); } ``` Perhaps we should make ```TransactionalMessageCopier``` recoverable from transaction timeout before [KIP-558](https://cwiki.apache.org/confluence/display/KAFKA/KIP-588%3A+Allow+producers+to+recover+gracefully+from+transaction+timeouts) is addressed. BTW, [group_mode_transactions_test.py]( https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/core/group_mode_transactions_test.py#L256) has similar issue. Could you fix it also? Or we can apply your approach to fix ```group_mode_transactions_test.py``` in another PR. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org