[
https://issues.apache.org/jira/browse/KUDU-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941340#comment-15941340
]
Todd Lipcon commented on KUDU-1956:
-----------------------------------
Looking at the code, it seems like this has just been wrong forever.
Tablet::PickRowSetsToCompact does:
1. grab a snapshot of RowSetTree under component_lock
2. acquire compact_select_lock_
2.a. run the policy against the snapshot, which only selects rowsets which are
not currently locked for compaction
3. reacquire component_lock and look at the new RowSetTree, which might have
changed since the first one
3.a. assume that any selected rowsets are still in the tree, and lock them
(here's where we crash)
This will crash if we have the following interleaving:
{code}
T1                                    T2
grab rowsettree
                                      grab rowsettree
acquire compact_select_lock
run policy
take locks on selected rowsets
release compact_select_lock
run compaction
swap in new RowSetTree
unlock the old "removed" instances
                                      acquire compact_select_lock
                                      run policy (may pick a rowset already
                                        removed by the T1 compaction)
                                      reacquire component_lock
                                      those rowsets will be missing from the tree
                                      CRASH!
{code}
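To make the interleaving concrete, here is a minimal single-threaded replay of it. Everything in it (ToyTablet, Snapshot, SelectAndLock, FinishCompaction, ReplayInterleaving, the integer rowset ids) is a hypothetical stand-in, not Kudu's actual Tablet code, and the "policy" simply picks every rowset that is not locked. The non-stale branch, which takes the snapshot only after compact_select_lock is held, is one possible remedy shown for contrast and is not necessarily the fix that should go in.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <set>
#include <vector>

// Hypothetical stand-ins for illustration only; not Kudu's real classes.
using RowSetId = int;
using Tree = std::set<RowSetId>;

struct ToyTablet {
  std::mutex component_lock;         // guards `tree`
  std::mutex compact_select_lock;    // serializes rowset selection
  std::shared_ptr<const Tree> tree;  // current "RowSetTree"
  std::set<RowSetId> busy;           // rowsets locked for compaction

  // Step 1: copy the current tree pointer under component_lock.
  std::shared_ptr<const Tree> Snapshot() {
    std::lock_guard<std::mutex> l(component_lock);
    assert(tree != nullptr);
    return tree;
  }

  // Steps 2-3. If `snap` is null, the snapshot is taken only after
  // compact_select_lock is held. Returns false where the real code crashes.
  bool SelectAndLock(std::shared_ptr<const Tree> snap,
                     std::vector<RowSetId>* picked) {
    std::lock_guard<std::mutex> sel(compact_select_lock);
    if (!snap) snap = Snapshot();
    for (RowSetId id : *snap) {
      if (busy.count(id) == 0) picked->push_back(id);  // toy "policy"
    }
    std::lock_guard<std::mutex> l(component_lock);
    for (RowSetId id : *picked) {
      if (tree->count(id) == 0) return false;  // selected but no longer there
    }
    busy.insert(picked->begin(), picked->end());
    return true;
  }

  // Compaction done: swap in a new tree and unlock the removed inputs.
  void FinishCompaction(const std::vector<RowSetId>& inputs, RowSetId output) {
    std::lock_guard<std::mutex> l(component_lock);
    auto next = std::make_shared<Tree>(*tree);
    for (RowSetId id : inputs) {
      next->erase(id);
      busy.erase(id);
    }
    next->insert(output);
    tree = next;
  }
};

// Replays the T1/T2 schedule from a single thread, in the order shown above.
bool ReplayInterleaving(bool use_stale_snapshot) {
  ToyTablet t;
  t.tree = std::make_shared<Tree>(Tree{1, 2, 3});
  auto snap1 = t.Snapshot();  // T1: grab rowsettree
  auto snap2 = t.Snapshot();  // T2: grab rowsettree
  std::vector<RowSetId> p1, p2;
  if (!t.SelectAndLock(snap1, &p1)) return false;  // T1 selects 1, 2, 3
  t.FinishCompaction(p1, 4);                       // T1: tree becomes {4}
  if (!use_stale_snapshot) snap2.reset();          // re-snapshot under the lock
  return t.SelectAndLock(snap2, &p2);              // T2 selects
}
```

In this toy model, ReplayInterleaving(true) returns false because T2 selects rowsets 1, 2 and 3 from its stale snapshot after T1 has unlocked and removed them. ReplayInterleaving(false) returns true because T2 only ever sees the tree containing rowset 4.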
> Crash with "rowset selected for compaction but not available anymore"
> ---------------------------------------------------------------------
>
> Key: KUDU-1956
> URL: https://issues.apache.org/jira/browse/KUDU-1956
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 1.3.0
> Reporter: Todd Lipcon
> Priority: Critical
>
> I loaded 1 TB of data into a server with 8 maintenance manager (MM) threads
> configured, and a patch to make the MM thread wake up and do scheduling as
> soon as any prior op finished. After a day or two of runtime the tablet
> server (TS) crashed with:
> E0324 14:28:19.733708 5801 tablet.cc:1207] T
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset
> selected for compaction but not available anymore: RowSet(22755)
> E0324 14:28:19.733762 5801 tablet.cc:1207] T
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset
> selected for compaction but not available anymore: RowSet(24031)
> F0324 14:28:19.733777 5801 tablet.cc:1210] T
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Was
> unable to find all rowsets selected for compaction