kevinrr888 opened a new pull request, #5813:
URL: https://github.com/apache/accumulo/pull/5813

   This PR partially addresses #5787
   
   I have reached a dead end with debugging this test.
   
   The test logic has no issues as far as I can tell, and the FATE logic has one potential concurrency issue (which I addressed in this PR), but the failure still occurs occasionally. From running jstack against the test process in a failure case, it appears that the thread is either getting stuck on the `workQueue.poll(100, MILLISECONDS)` call or repeatedly retrying it, neither of which should be possible given the shutdown logic. Here is the code:
   ```
   while (fate.getKeepRunning().get() && !stop.get()) {
       FateId unreservedFateId = workQueue.poll(100, MILLISECONDS);
       ...
   ```
   The jstack trace shows this throughout the time FATE is trying to shut down:
   ```
   
"accumulo.pool.manager.fate.user.commit_compaction.namespace_create.namespace_delete.namespace_rename.shutdown_tserver.system_split.system_merge.table_bulk_import2.table_cancel_compact.table_clone.table_compact-Worker-1"
 #57 daemon prio=5 os_prio=0 cpu=82600.00ms elapsed=86.73s tid=0x00007693e00058f0 nid=0x2304f runnable  [0x00007694a5ef9000]
      java.lang.Thread.State: RUNNABLE
           at java.util.concurrent.LinkedTransferQueue.awaitMatch(java.base@17.0.15/LinkedTransferQueue.java:652)
           at java.util.concurrent.LinkedTransferQueue.xfer(java.base@17.0.15/LinkedTransferQueue.java:616)
           at java.util.concurrent.LinkedTransferQueue.poll(java.base@17.0.15/LinkedTransferQueue.java:1294)
           at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.reserveFateTx(FateExecutor.java:349)
           at org.apache.accumulo.core.fate.FateExecutor$TransactionRunner.run(FateExecutor.java:378)
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.15/ThreadPoolExecutor.java:1136)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.15/ThreadPoolExecutor.java:635)
           at org.apache.accumulo.core.trace.TraceWrappedRunnable.run(TraceWrappedRunnable.java:52)
           at java.lang.Thread.run(java.base@17.0.15/Thread.java:840)
   ```
   This doesn't make sense because:
   1) When we shut down FATE, we first set keepRunning to false, so the while loop should terminate
   2) The poll will return after, at most, 100ms
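   
   For anyone who wants to reason about the expected behavior in isolation, here is a minimal sketch of the two points above. It is not the FateExecutor code: the class `PollLoopSketch`, the method `workerStopsAfterShutdown`, and the `String` element type are made up for illustration, and the real loop also checks a second `stop` flag that is omitted here. Under these assumptions, a worker polling an empty `LinkedTransferQueue` with a 100ms timeout should exit shortly after the flag is cleared.
   ```java
   import java.util.concurrent.LinkedTransferQueue;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.TransferQueue;
   import java.util.concurrent.atomic.AtomicBoolean;

   // Hypothetical stand-in for the worker loop; not the actual FateExecutor code.
   final class PollLoopSketch {

     /**
      * Starts one worker that polls an empty queue, clears the flag after
      * runMillis, and reports whether the worker exited within two seconds.
      */
     static boolean workerStopsAfterShutdown(long runMillis) {
       AtomicBoolean keepRunning = new AtomicBoolean(true);
       TransferQueue<String> workQueue = new LinkedTransferQueue<>();

       Thread worker = new Thread(() -> {
         while (keepRunning.get()) {
           try {
             // Expected to return null after roughly 100ms when nothing is queued.
             String item = workQueue.poll(100, TimeUnit.MILLISECONDS);
             if (item != null) {
               // The real code would reserve and run the transaction here.
             }
           } catch (InterruptedException e) {
             // Preserve the interrupt status and leave the loop.
             Thread.currentThread().interrupt();
             return;
           }
         }
       }, "poll-loop-sketch-worker");
       worker.setDaemon(true);
       worker.start();

       try {
         Thread.sleep(runMillis);
         keepRunning.set(false); // the "shutdown" step from point 1
         worker.join(2000);      // far longer than one 100ms poll (point 2)
       } catch (InterruptedException e) {
         Thread.currentThread().interrupt();
         return false;
       }
       return !worker.isAlive();
     }

     public static void main(String[] args) {
       System.out.println("worker stopped: " + workerStopsAfterShutdown(300));
     }
   }
   ```
   The sketch only captures what the loop is expected to do. It does not reproduce the failure, where the thread stays RUNNABLE inside `LinkedTransferQueue.awaitMatch` for the whole shutdown.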
   
   I have run out of ideas. This could use another set of eyes, if anyone has the time. I can explain anything regarding the test logic or the FATE logic, if needed.
   

