keith-turner commented on PR #5707:
URL: https://github.com/apache/accumulo/pull/5707#issuecomment-3029095785

   These are some notes on what I did to track this bug down.  I started by 
running `fate list` and kept seeing these compactions not moving.
   
   ```
   TABLE_BULK_IMPORT2 txid: 10ca761c03d9876d  status: SUBMITTED          locked: []              locking: [R:4]           op: PrepBulkImport  created: 2025-07-01T22:56:00.758Z
   TABLE_BULK_IMPORT2 txid: 67be5b2616e8163c  status: SUBMITTED          locked: []              locking: [R:4]           op: PrepBulkImport  created: 2025-07-01T22:56:03.157Z
   TABLE_COMPACT   txid: 2f2ff4546d9052e5  status: IN_PROGRESS        locked: [R:+default, R:4] locking: []              op: CompactionDriver created: 2025-07-01T22:55:29.987Z
   TABLE_BULK_IMPORT2 txid: 6102ab68e0ae9b54  status: SUBMITTED          locked: []              locking: [R:4]           op: PrepBulkImport  created: 2025-07-01T22:56:09.768Z
   TABLE_MERGE     txid: 188da098b244da57  status: SUBMITTED          locked: [R:+default]    locking: [W:4]           op: TableRangeOp    created: 2025-07-01T22:55:59.759Z
   TABLE_BULK_IMPORT2 txid: 545753cd7ac69fd2  status: SUBMITTED          locked: []              locking: [R:4]           op: PrepBulkImport  created: 2025-07-01T22:56:03.246Z
   TABLE_BULK_IMPORT2 txid: 1f12560b9db5834c  status: SUBMITTED          locked: []              locking: [R:4]           op: PrepBulkImport  created: 2025-07-01T22:56:01.417Z
   TABLE_COMPACT   txid: 5481217c586dcaf6  status: IN_PROGRESS        locked: [R:+default, R:4] locking: []              op: CompactionDriver created: 2025-07-01T22:55:22.614Z
   ```
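
In the listing above, the two compactions hold `R:4`, the merge is waiting on `W:4`, and the bulk imports are waiting on `R:4`. A minimal Python sketch for pulling that holder/waiter picture out of a longer listing follows. The field layout is assumed from the output above, and `parse_locks`, `parse_fate_list`, and `lock_queue` are hypothetical helper names, not part of Accumulo.

```python
import re

# Field layout assumed from the `fate list` output above.
FATE_ENTRY = re.compile(
    r"(?P<type>\S+)\s+txid:\s*(?P<txid>[0-9a-f]+)\s+"
    r"status:\s*(?P<status>\S+)\s+"
    r"locked:\s*\[(?P<locked>[^\]]*)\]\s+"
    r"locking:\s*\[(?P<locking>[^\]]*)\]\s+"
    r"op:\s*(?P<op>\S+)\s+created:\s*(?P<created>\S+)"
)


def parse_locks(text):
    """Turn 'R:+default, R:4' into [('R', '+default'), ('R', '4')]."""
    locks = []
    for item in text.split(","):
        item = item.strip()
        if item:
            mode, _, lock_id = item.partition(":")
            locks.append((mode, lock_id))
    return locks


def parse_fate_list(text):
    """Parse entries, including ones a mail client wrapped across lines."""
    flat = " ".join(text.split())  # collapse all whitespace and newlines
    entries = []
    for m in FATE_ENTRY.finditer(flat):
        entries.append({
            "txid": m.group("txid"),
            "type": m.group("type"),
            "status": m.group("status"),
            "locked": parse_locks(m.group("locked")),
            "locking": parse_locks(m.group("locking")),
        })
    return entries


def lock_queue(entries, lock_id):
    """Return (holders, waiters) for one lock id as lists of (txid, mode)."""
    holders = [(e["txid"], mode) for e in entries
               for mode, held in e["locked"] if held == lock_id]
    waiters = [(e["txid"], mode) for e in entries
               for mode, wanted in e["locking"] if wanted == lock_id]
    return holders, waiters


# Three entries copied from the listing above.
SAMPLE = """
TABLE_BULK_IMPORT2 txid: 10ca761c03d9876d  status: SUBMITTED
locked: []              locking: [R:4]           op: PrepBulkImport  created:
2025-07-01T22:56:00.758Z
TABLE_COMPACT   txid: 2f2ff4546d9052e5  status: IN_PROGRESS        locked:
[R:+default, R:4] locking: []              op: CompactionDriver created:
2025-07-01T22:55:29.987Z
TABLE_MERGE     txid: 188da098b244da57  status: SUBMITTED          locked:
[R:+default]    locking: [W:4]           op: TableRangeOp    created:
2025-07-01T22:55:59.759Z
"""

holders, waiters = lock_queue(parse_fate_list(SAMPLE), "4")
print("holding:", holders)
print("waiting:", waiters)
```

Grouping by lock id rather than by operation type makes it easier to see which transactions everything else is queued behind.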
   
   Enabled trace logging in the manager and got some info that indicated that 
both fate ops were waiting on a single tablet.
   
   ```
   2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE: FATE[2f2ff4546d9052e5] tablets compacted:33/34 servers contacted:1 expected id:49 compaction extent:4;r165f8;r0786c sleepTime:500
   2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE: FATE[5481217c586dcaf6] tablets compacted:56/57 servers contacted:1 expected id:48 compaction extent:4<;r003f8 sleepTime:500
   ```
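
To spot which fate ops are one tablet short without reading every trace line, something like the following Python sketch could be used. The message layout is assumed from the two lines above and may differ between Accumulo versions; `parse_progress` is a hypothetical helper name.

```python
import re

# Message layout assumed from the CompactionDriver TRACE lines above.
PROGRESS = re.compile(
    r"FATE\[(?P<fate>[0-9a-f]+)\]\s+"
    r"tablets compacted:(?P<done>\d+)/(?P<total>\d+)\s+"
    r"servers contacted:(?P<servers>\d+)\s+"
    r"expected id:(?P<expected>\d+)\s+"
    r"compaction extent:(?P<extent>\S+)"
)


def parse_progress(line):
    """Extract progress fields from one trace line, or None if no match."""
    m = PROGRESS.search(" ".join(line.split()))  # tolerate wrapped lines
    if m is None:
        return None
    done, total = int(m.group("done")), int(m.group("total"))
    return {
        "fate": m.group("fate"),
        "done": done,
        "total": total,
        "remaining": total - done,
        "expected_id": int(m.group("expected")),
        "extent": m.group("extent"),
    }


# Line copied from the trace output above.
LINE = ("2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE: "
        "FATE[2f2ff4546d9052e5] tablets compacted:33/34 servers contacted:1 "
        "expected id:49 compaction extent:4;r165f8;r0786c sleepTime:500")

info = parse_progress(LINE)
print(info["fate"], "waiting on", info["remaining"], "tablet(s),",
      "expected compact id", info["expected_id"])
```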
   
   Looked in the metadata table and found the tablet with a lower compact id 
that was in the range.
   
   ```
   4;r13e11 srv:compact []      50
   4;r16d06 srv:compact []      46
   4;r172ae srv:compact []      50
   4;r172da srv:compact []      50
   ```
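
The comparison done by eye above can be sketched in Python as below: given `srv:compact` entries from a metadata scan and the expected id from the trace log, list the tablets that are still behind. The line format is assumed from the scan output above, `lagging_tablets` is a hypothetical helper name, and the sketch does not check whether a tablet actually falls inside the compaction range.

```python
def lagging_tablets(scan_lines, expected_id):
    """Return (row, compact_id) for tablets whose compact id is below expected_id."""
    lagging = []
    for line in scan_lines:
        parts = line.split()
        # Expect: <row> srv:compact [] <compact id>
        if len(parts) < 4 or parts[1] != "srv:compact":
            continue
        row, compact_id = parts[0], int(parts[-1])
        if compact_id < expected_id:
            lagging.append((row, compact_id))
    return lagging


# Entries copied from the metadata scan above.
SCAN = [
    "4;r13e11 srv:compact []      50",
    "4;r16d06 srv:compact []      46",
    "4;r172ae srv:compact []      50",
    "4;r172da srv:compact []      50",
]

# 48 and 49 are the expected ids of the two stuck fate ops.
for expected in (48, 49):
    print(expected, lagging_tablets(SCAN, expected))
```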
   
   Enabled trace logging on the tablet server and saw the following that helped 
get in the neighborhood of the problem. Needed to filter on the tablet.
   
   ```
   2025-07-02T00:29:17,968 [compactions.CompactionService] TRACE: Did not submit compaction plan 4;r16d06;r13e11 id:default files:Files [allFiles=[[C00009ux.rf, 33437 252441], [A00009dw.rf, 3638005 7727960], [C00009uz.rf, 25527 168294], [I00009m2.rf, 25286 0], [I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0], [C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948 204357], [I00009p7.rf, 18369 0]], candidates=[[I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0], [A00009dw.rf, 3638005 7727960], [C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948 204357], [I00009m2.rf, 25286 0]], compacting=[], hints={}] plan:jobs: [CompactionJob [priority=18, executor=e.small, files=[[I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0], [A00009dw.rf, 3638005 7727960], [C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948 204357], [I00009m2.rf, 25286 0]], kind=USER]]
   ```
   
   Took a heap dump of the tablet server and ran the following OQL query to 
find the tablet object in it.  Got the dir name from the metadata table.  Used 
the dirName field because it's a string.  I would have liked to use the extent 
to find the tablet, but that is a byte array and I did not know how to query 
that in OQL.  It would be useful to figure that out.
   
   ```
   select t from org.apache.accumulo.tserver.tablet.Tablet t where t.dirName.toString()=="t-00009q0"
   ```
   
   After finding the tablet in the heap dump, I was able to find an ExternalJob 
that indirectly referenced the tablet.  This ExternalJob had a state of running 
and a null ecid.  Also, the ExternalJob was only referenced by 
CompactionService.submittedJob.  This was the information that led to this bug 
fix.
   
   Not 100% sure this change will fix this problem, but it seems like it will.  
Also not sure how to test this fix, as it's a race condition.  May try to see if 
this problem is reproducible w/o this fix.

