[ 
https://issues.apache.org/jira/browse/KUDU-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15941340#comment-15941340
 ] 

Todd Lipcon commented on KUDU-1956:
-----------------------------------

Looking at the code, it seems like this has just been wrong forever. 
Tablet::PickRowSetsToCompact does:

1. grab a snapshot of the RowSetTree under component_lock
2. acquire compact_select_lock_
2.a. run the policy against the snapshot, which only selects rowsets that are 
not currently locked for compaction
3. reacquire component_lock and look at the current RowSetTree, which might have 
changed since the snapshot was taken
3.a. assume that any selected rowsets are still in the tree, and lock them 
(here's where we crash)

This will crash if we have the following interleaving:
{code}
T1                                 T2
grab RowSetTree snapshot
                                   grab RowSetTree snapshot
acquire compact_select_lock_
run policy
take locks on selected rowsets
release compact_select_lock_
run compaction
swap in new RowSetTree
unlock the old "removed" instances

                                   acquire compact_select_lock_
                                   run policy against the stale snapshot
                                     (may pick a rowset already removed
                                      by the T1 compaction)
                                   reacquire component_lock
                                   those rowsets will be missing from the tree
                                   CRASH!
{code}
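
To make the interleaving concrete, here is a rough single-threaded replay of it as a toy Python model. This is not the Kudu code: the class and method names (RowSet, Tablet, snapshot, run_policy, lock_selected, finish_compaction, replay_interleaving) are made up for illustration, the real locks are elided because the schedule is replayed in a fixed order, the "policy" simply picks every unlocked rowset, and the fatal log is modeled as an exception.

```python
class RowSet:
    """One rowset instance; object identity matters, not just the id."""
    def __init__(self, rowset_id):
        self.id = rowset_id
        self.locked_for_compaction = False


class MissingRowSets(Exception):
    """Stands in for the fatal 'unable to find all rowsets' log."""


class Tablet:
    def __init__(self, rowset_ids):
        # The current tree; replaced wholesale when a compaction finishes.
        self.tree = {i: RowSet(i) for i in rowset_ids}
        self.next_id = max(rowset_ids) + 1

    def snapshot(self):
        # Step 1: copy the tree (component_lock elided in this replay).
        return dict(self.tree)

    def run_policy(self, snap):
        # Step 2.a: toy policy, selects every rowset in the snapshot
        # that is not currently locked for compaction.
        return [rs for rs in snap.values() if not rs.locked_for_compaction]

    def lock_selected(self, picked):
        # Steps 3 and 3.a: look at the *current* tree and assume every
        # selected rowset is still there.
        missing = [rs.id for rs in picked if self.tree.get(rs.id) is not rs]
        if missing:
            raise MissingRowSets(missing)
        for rs in picked:
            rs.locked_for_compaction = True

    def finish_compaction(self, inputs):
        # Swap in a new tree without the inputs, then unlock the old
        # "removed" instances.
        merged = RowSet(self.next_id)
        self.next_id += 1
        new_tree = {i: rs for i, rs in self.tree.items() if rs not in inputs}
        new_tree[merged.id] = merged
        self.tree = new_tree
        for rs in inputs:
            rs.locked_for_compaction = False


def replay_interleaving():
    """Replay the T1/T2 schedule; return the ids T2 could not find."""
    tablet = Tablet([1, 2])
    t1_snap = tablet.snapshot()              # T1: grab snapshot
    t2_snap = tablet.snapshot()              # T2: grab snapshot (now stale)
    t1_picked = tablet.run_policy(t1_snap)   # T1: run policy
    tablet.lock_selected(t1_picked)          # T1: take locks
    tablet.finish_compaction(t1_picked)      # T1: compact, swap, unlock
    t2_picked = tablet.run_policy(t2_snap)   # T2: old instances look unlocked
    try:
        tablet.lock_selected(t2_picked)      # T2: not in the tree any more
    except MissingRowSets as e:
        return e.args[0]
    return []
```

In this model replay_interleaving() returns [1, 2]: T2's stale snapshot still holds the two old instances, and once T1 has unlocked them the policy sees nothing wrong with picking them again.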

> Crash with "rowset selected for compaction but not available anymore"
> ---------------------------------------------------------------------
>
>                 Key: KUDU-1956
>                 URL: https://issues.apache.org/jira/browse/KUDU-1956
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.3.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> I loaded 1 TB of data into a server with 8 maintenance manager (MM) threads 
> configured, and a patch to make the MM thread wake up and do scheduling as 
> soon as any prior op finished. After a day or two of runtime the tablet 
> server (TS) crashed with:
> E0324 14:28:19.733708  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(22755)
> E0324 14:28:19.733762  5801 tablet.cc:1207] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Rowset 
> selected for compaction but not available anymore: RowSet(24031)
> F0324 14:28:19.733777  5801 tablet.cc:1210] T 
> cf12905ff7d84fa0b4148aab292f0c40 P 40b48ecb131449c58df26f62ccc35538: Was 
> unable to find all rowsets selected for compaction



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
