kevinrr888 opened a new pull request, #5813: URL: https://github.com/apache/accumulo/pull/5813
This PR partially addresses #5787.

I have reached a dead end debugging this test. As far as I can tell, the test logic has no issues and the FATE logic has one potential concurrency issue (which I addressed in this PR), but the failure still occurs occasionally. From running jstack on the test process in a failure case, it appears that the thread is either getting stuck on the `workQueue.poll(100, MILLISECONDS)` call or repeatedly retrying it, neither of which should be possible given the shutdown logic. Here is the code:

```
while (fate.getKeepRunning().get() && !stop.get()) {
  FateId unreservedFateId = workQueue.poll(100, MILLISECONDS);
  ...
```

The jstack trace shows this throughout the time FATE is trying to shut down:

```
"accumulo.pool.manager.fate.user.commit_compaction.namespace_create.namespace_delete.namespace_rename.shutdown_tserver.system_split.system_merge.table_bulk_import2.table_cancel_compact.table_clone.table_compact-Worker-1" #57 daemon prio=5 os_prio=0 cpu=82600.00ms elapsed=86.73s tid=0x00007693e00058f0 nid=0x2304f runnable [0x00007694a5ef9000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@17.0.15/LinkedTransferQueue.java:652)
	at java.util.concurrent.LinkedTransferQueue.xfer(java.base@17.0.15/LinkedTransferQueue.java:616)
	at java.util.concurrent.LinkedTransferQueue.poll(java.base@17.0.15/LinkedTransferQueue.java:1294)
	at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.reserveFateTx(FateExecutor.java:349)
	at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.run(FateExecutor.java:378)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.15/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.15/ThreadPoolExecutor.java:635)
	at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
	at java.lang.Thread.run(java.base@17.0.15/Thread.java:840)
```

This doesn't make sense because:

1) When we shut down FATE, we first set keepRunning to false, so the while loop should terminate
2) The poll will return after, at most, 100ms

I have run out of ideas. This could use another set of eyes, if anyone has the time. I can explain anything regarding the test logic or the FATE logic, if needed.
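
For anyone who wants to check the two expectations above in isolation, here is a minimal standalone sketch of the same loop shape: a timed `poll` on a `LinkedTransferQueue` guarded by two `AtomicBoolean` flags. It is not the Accumulo code. The class name `PollLoopSketch`, the method `workerStopsAfterShutdown`, the `String` element type, and the 250ms warm-up delay are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TransferQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical standalone sketch; none of these names come from Accumulo.
public class PollLoopSketch {

  /**
   * Starts a worker that mimics the loop quoted above, flips the keepRunning
   * flag, and reports whether the worker thread exited within timeoutMillis.
   */
  public static boolean workerStopsAfterShutdown(long timeoutMillis) {
    AtomicBoolean keepRunning = new AtomicBoolean(true); // stands in for fate.getKeepRunning()
    AtomicBoolean stop = new AtomicBoolean(false);       // stands in for the per-runner stop flag
    TransferQueue<String> workQueue = new LinkedTransferQueue<>();

    Thread worker = new Thread(() -> {
      try {
        while (keepRunning.get() && !stop.get()) {
          // Timed poll: returns null if nothing is transferred within 100ms.
          String id = workQueue.poll(100, TimeUnit.MILLISECONDS);
          if (id != null) {
            // A real runner would reserve and execute the transaction here.
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt status and exit
      }
    }, "sketch-worker");
    worker.setDaemon(true);
    worker.start();

    try {
      Thread.sleep(250);        // let the worker go through a few empty polls
      keepRunning.set(false);   // signal shutdown, as in expectation 1)
      worker.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !worker.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("worker stopped: " + workerStopsAfterShutdown(2000));
  }
}
```

In isolation the worker should exit shortly after the flag is cleared, which matches the reasoning in 1) and 2). The sketch does not reproduce the hang; it only gives a baseline to compare against when changing things like the number of worker threads or which flag is cleared.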