keith-turner commented on PR #5707: URL: https://github.com/apache/accumulo/pull/5707#issuecomment-3029095785
Here are some notes on what I did to track this bug down. Started by running `fate list` and kept seeing these compactions not moving.

```
TABLE_BULK_IMPORT2 txid: 10ca761c03d9876d status: SUBMITTED locked: [] locking: [R:4] op: PrepBulkImport created: 2025-07-01T22:56:00.758Z
TABLE_BULK_IMPORT2 txid: 67be5b2616e8163c status: SUBMITTED locked: [] locking: [R:4] op: PrepBulkImport created: 2025-07-01T22:56:03.157Z
TABLE_COMPACT txid: 2f2ff4546d9052e5 status: IN_PROGRESS locked: [R:+default, R:4] locking: [] op: CompactionDriver created: 2025-07-01T22:55:29.987Z
TABLE_BULK_IMPORT2 txid: 6102ab68e0ae9b54 status: SUBMITTED locked: [] locking: [R:4] op: PrepBulkImport created: 2025-07-01T22:56:09.768Z
TABLE_MERGE txid: 188da098b244da57 status: SUBMITTED locked: [R:+default] locking: [W:4] op: TableRangeOp created: 2025-07-01T22:55:59.759Z
TABLE_BULK_IMPORT2 txid: 545753cd7ac69fd2 status: SUBMITTED locked: [] locking: [R:4] op: PrepBulkImport created: 2025-07-01T22:56:03.246Z
TABLE_BULK_IMPORT2 txid: 1f12560b9db5834c status: SUBMITTED locked: [] locking: [R:4] op: PrepBulkImport created: 2025-07-01T22:56:01.417Z
TABLE_COMPACT txid: 5481217c586dcaf6 status: IN_PROGRESS locked: [R:+default, R:4] locking: [] op: CompactionDriver created: 2025-07-01T22:55:22.614Z
```

Enabled trace logging in the manager and got some info indicating that both compaction fate ops were each waiting on a single tablet.

```
2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE: FATE[2f2ff4546d9052e5] tablets compacted:33/34 servers contacted:1 expected id:49 compaction extent:4;r165f8;r0786c sleepTime:500
2025-07-01T23:28:17,267 [compact.CompactionDriver] TRACE: FATE[5481217c586dcaf6] tablets compacted:56/57 servers contacted:1 expected id:48 compaction extent:4<;r003f8 sleepTime:500
```

Looked in the metadata table and found the tablet in the range that had a lower compact id than expected.
```
4;r13e11 srv:compact [] 50
4;r16d06 srv:compact [] 46
4;r172ae srv:compact [] 50
4;r172da srv:compact [] 50
```

Enabled trace logging on the tablet server and saw the following, which helped get in the neighborhood of the problem. Needed to filter the log on the tablet.

```
2025-07-02T00:29:17,968 [compactions.CompactionService] TRACE: Did not submit compaction plan 4;r16d06;r13e11 id:default files:Files [allFiles=[[C00009ux.rf, 33437 252441], [A00009dw.rf, 3638005 7727960], [C00009uz.rf, 25527 168294], [I00009m2.rf, 25286 0], [I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0], [C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948 204357], [I00009p7.rf, 18369 0]], candidates=[[I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0], [A00009dw.rf, 3638005 7727960], [C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948 204357], [I00009m2.rf, 25286 0]], compacting=[], hints={}] plan:jobs: [CompactionJob [priority=18, executor=e.small, files=[[I00009m6.rf, 18179 0], [I00009lv.rf, 21772 0], [A00009dw.rf, 3638005 7727960], [C00009dz.rf, 18750 132363], [I00009ld.rf, 18232 0], [C00009qf.rf, 27948 204357], [I00009m2.rf, 25286 0]], kind=USER]]
```

Took a heap dump of the tablet server and ran the following OQL query to find the tablet object in the heap dump. Got the dir name from the metadata table. Used the dirName field because it's a string. Would have liked to use the extent to find the tablet, but that is a byte array and I did not know how to query that in OQL. Would be useful to figure that out.

```
select t from org.apache.accumulo.tserver.tablet.Tablet t where t.dirName.toString()=="t-00009q0"
```

After finding the tablet in the heap dump, was able to find an ExternalJob that indirectly referenced the tablet. This ExternalJob had a state of running and a null ecid. Also, the ExternalJob was only referenced by CompactionService.submittedJob. This was the information that led to this bug fix.
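
The metadata check earlier in these notes (finding a tablet inside the compaction range whose compact id is below the expected id) could be scripted roughly as follows. This is only a sketch under assumptions, not Accumulo code: the names `parse_compact_ids` and `lagging_tablets` are made up for illustration, the input is the text of a metadata scan in the format shown above, the scan is sorted so each row's end row is the next tablet's prev end row, and the first row of an excerpt is treated as having no lower bound.

```python
def parse_compact_ids(lines):
    """Parse 'ROW srv:compact [] ID' lines into (end_row, compact_id) pairs, in scan order."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] == "srv:compact":
            row = parts[0]
            # A row ending in '<' is the last tablet of the table and has no end row.
            end_row = None if row.endswith("<") else row.split(";", 1)[1]
            entries.append((end_row, int(parts[3])))
    return entries


def lagging_tablets(entries, expected_id, range_start, range_end):
    """Return tablets overlapping (range_start, range_end] whose compact id is below expected_id.

    None for range_start or range_end means the range is unbounded on that side.
    """
    lagging = []
    prev_end = None  # assumption: the first row of an excerpt has no known lower bound
    for end_row, compact_id in entries:
        begins_before_range_end = (
            range_end is None or prev_end is None or prev_end < range_end
        )
        ends_after_range_start = (
            range_start is None or end_row is None or end_row > range_start
        )
        if compact_id < expected_id and begins_before_range_end and ends_after_range_start:
            lagging.append((end_row, compact_id))
        prev_end = end_row
    return lagging


# The four metadata rows from the notes above.
scan = [
    "4;r13e11 srv:compact [] 50",
    "4;r16d06 srv:compact [] 46",
    "4;r172ae srv:compact [] 50",
    "4;r172da srv:compact [] 50",
]
entries = parse_compact_ids(scan)
```

Calling `lagging_tablets(entries, 49, "r0786c", "r165f8")` for FATE 2f2ff4546d9052e5 and `lagging_tablets(entries, 48, "r003f8", None)` for FATE 5481217c586dcaf6 both point at the tablet ending at r16d06 with compact id 46, matching what was found by hand.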
Not 100% sure this change will fix this problem, but it seems like it will. Also not sure how to test this fix, as it's a race condition. May try to see if this problem is reproducible w/o this fix.