thetumbled opened a new issue, #19200:
URL: https://github.com/apache/pulsar/issues/19200

   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) and found nothing similar.
   
   
   ### Version
   
   master
   
   ### Minimal reproduce step
   
   Start transactional producing and consuming, then restart the broker.
   
   ### What did you expect to see?
   
   A transaction with a configured `timeout` will be aborted when it expires.
   
   ### What did you see instead?
   
   The transaction lasted for a long time and was never aborted, which prevents the TB's `MaxReadPosition` from moving forward, so no new snapshot is taken. With an old snapshot, the TB has to read a large number of entries during recovery.
   In the worst case, topics were unavailable for 30 minutes.
   
   ### Anything else?
   
   Searching the broker log, I found:
   ```
   2023-01-12T10:45:21,841+0800 [transaction_coordinator_TransactionCoordinatorID(id=10)thread_factory-57-1] ERROR org.apache.pulsar.broker.TransactionMetadataStoreService - End transaction fail! TxnId : (10,326026), TxnAction : 1
   org.apache.pulsar.transaction.coordinator.exceptions.CoordinatorException$CoordinatorNotFoundException: Transaction coordinator with id 10 not found!
       at org.apache.pulsar.broker.TransactionMetadataStoreService.getTxnMeta(TransactionMetadataStoreService.java:336) ~[org.apache.pulsar-pulsar-broker-2.9.4.jar:2.9.4]
       at org.apache.pulsar.broker.TransactionMetadataStoreService.endTransaction(TransactionMetadataStoreService.java:384) ~[org.apache.pulsar-pulsar-broker-2.9.4.jar:2.9.4]
       at org.apache.pulsar.broker.TransactionMetadataStoreService.endTransaction(TransactionMetadataStoreService.java:363) ~[org.apache.pulsar-pulsar-broker-2.9.4.jar:2.9.4]
       at org.apache.pulsar.broker.transaction.recover.TransactionRecoverTrackerImpl.lambda$handleCommittingAndAbortingTransaction$1(TransactionRecoverTrackerImpl.java:134) ~[org.apache.pulsar-pulsar-broker-2.9.4.jar:2.9.4]
       at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_92]
       at org.apache.pulsar.broker.transaction.recover.TransactionRecoverTrackerImpl.handleCommittingAndAbortingTransaction(TransactionRecoverTrackerImpl.java:133) ~[org.apache.pulsar-pulsar-broker-2.9.4.jar:2.9.4]
       at org.apache.pulsar.transaction.coordinator.impl.MLTransactionMetadataStore$1.replayComplete(MLTransactionMetadataStore.java:131) ~[org.apache.pulsar-pulsar-transaction-coordinator-2.9.4.jar:2.9.4]
       at org.apache.pulsar.transaction.coordinator.impl.MLTransactionLogImpl$TransactionLogReplayer.start(MLTransactionLogImpl.java:293) ~[org.apache.pulsar-pulsar-transaction-coordinator-2.9.4.jar:2.9.4]
       at org.apache.pulsar.transaction.coordinator.impl.MLTransactionLogImpl.replayAsync(MLTransactionLogImpl.java:149) ~[org.apache.pulsar-pulsar-transaction-coordinator-2.9.4.jar:2.9.4]
       at org.apache.pulsar.transaction.coordinator.impl.MLTransactionMetadataStore.lambda$init$0(MLTransactionMetadataStore.java:118) ~[org.apache.pulsar-pulsar-transaction-coordinator-2.9.4.jar:2.9.4]
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_92]
       at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_92]
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_92]
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_92]
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_92]
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_92]
       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.77.Final.jar:4.1.77.Final]
       at java.lang.Thread.run(Thread.java:745) ~[?:1.8.0_92]
   ```
   
   And I queried slow transactions:
   ```
   bin/pulsar-admin transactions slow-transactions -t 1s | python3 calculateLowTransaction.py
   ```
   
   ```
   exceed timeout! last for 871.7668531457583 minutes
   openTimestamp:Wed Jan 11 20:18:13 2023
   "(10,326030)    TransactionMetadata(txnId=(10,326030), status=ABORTING, openTimestamp=1673439493157, timeoutAt=10000, producedPartitions={}, ackedPartitions={persistent://test/tb5/testTxn10-partition-9={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-3={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-6={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-2={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}})"
   exceed timeout! last for 871.8101231058439 minutes
   openTimestamp:Wed Jan 11 20:18:10 2023
   "(10,326026)    TransactionMetadata(txnId=(10,326026), status=ABORTING, openTimestamp=1673439490561, timeoutAt=10000, producedPartitions={}, ackedPartitions={persistent://test/tb5/testTxn10-partition-8={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-7={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-6={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-5={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}, persistent://test/tb5/testTxn10-partition-0={sub=TransactionInPendingAckStats(cumulativeAckPosition=null)}})"
   exceed timeout! last for 1011.1614414294561 minutes
   openTimestamp:Wed Jan 11 17:58:49 2023
   "(10,320436)    TransactionMetadata(txnId=(10,320436), status=COMMITTING, openTimestamp=1673431129482, timeoutAt=10000, producedPartitions={persistent://test/tb5/testTxn10-partition-1=TransactionInBufferStats(startPosition=null, aborted=false)}, ackedPartitions={})"
   read end, goodbye.
   ```
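   The filtering script piped into above is not shown; a minimal sketch of what such a script might do (the function name and the regex are my assumptions, based only on the `TransactionMetadata` lines printed above) is:
   ```python
   import re

   TIMEOUT_MS = 10_000  # the configured transaction timeout from this report

   def find_slow(lines, now_ms, timeout_ms=TIMEOUT_MS):
       """Return (txnId, elapsed_minutes) for transactions open past timeout_ms.

       Parses txnId and openTimestamp (epoch millis) out of the
       TransactionMetadata lines printed by
       `pulsar-admin transactions slow-transactions`.
       """
       slow = []
       for line in lines:
           m = re.search(r"txnId=\((\d+),(\d+)\).*?openTimestamp=(\d+)", line)
           if not m:
               continue
           open_ms = int(m.group(3))
           if now_ms - open_ms > timeout_ms:
               txn_id = f"({m.group(1)},{m.group(2)})"
               slow.append((txn_id, (now_ms - open_ms) / 60_000.0))
       return slow
   ```
   Feeding it the stats lines above with the current epoch-millis time would reproduce the "exceed timeout! last for ... minutes" output.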
   
   We can see that transaction `(10,326026)` has lasted for 871.8101231058439 minutes, which greatly exceeds the configured timeout of 10s. The reason is that when the broker restarts and replays the transaction log, it finds transactions such as `(10,326026)` in `ABORTING` status, so it puts them into `org.apache.pulsar.broker.transaction.recover.TransactionRecoverTrackerImpl#abortingTransactions`, and the method `org.apache.pulsar.broker.transaction.recover.TransactionRecoverTrackerImpl#handleCommittingAndAbortingTransaction` then tries to abort them.
   However, when it tries to abort these transactions, the corresponding TC has not yet been put into `org.apache.pulsar.broker.TransactionMetadataStoreService#stores`, so a `CoordinatorNotFoundException` is thrown.
   If the exception were retryable, a timeout task would be scheduled in `transactionOpRetryTimer`.
   <img width="598" alt="image" src="https://user-images.githubusercontent.com/52550727/211976272-26dafbf0-956f-450a-be2a-e1d4a33e1145.png">
   But `CoordinatorNotFoundException` is not retryable.
   <img width="723" alt="image" src="https://user-images.githubusercontent.com/52550727/211976416-38f188a7-3c90-4dda-9a76-462b7fdbda9f.png">
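   To make the failure mode concrete, here is a hypothetical model of the recovery path described above (the names `RETRYABLE_EXCEPTIONS`, `handle_aborting_transaction`, and `retry_queue` are illustrative, not the actual Pulsar code): because `CoordinatorNotFoundException` is not in the retryable set, the abort is dropped and the transaction stays `ABORTING` forever.
   ```python
   class CoordinatorNotFoundException(Exception):
       """Raised when the target TC is not yet registered in `stores`."""

   # Exception types the retry timer would re-schedule; note that
   # CoordinatorNotFoundException is deliberately absent, modelling the bug.
   RETRYABLE_EXCEPTIONS = ()

   def handle_aborting_transaction(stores, tc_id, txn_id, retry_queue):
       """Sketch of the recovery tracker ending one ABORTING transaction.

       Returns True if the abort reached a coordinator, False if it was lost.
       """
       try:
           if tc_id not in stores:
               raise CoordinatorNotFoundException(
                   f"Transaction coordinator with id {tc_id} not found!")
           stores[tc_id].abort(txn_id)
           return True
       except Exception as e:
           if isinstance(e, RETRYABLE_EXCEPTIONS):
               # analogue of scheduling a task in transactionOpRetryTimer
               retry_queue.append((tc_id, txn_id))
           # non-retryable: nothing is rescheduled, the abort is lost
           return False
   ```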
   
   
   
   ### Are you willing to submit a PR?
   
   - [X] I'm willing to submit a PR!

