[ 
https://issues.apache.org/jira/browse/KUDU-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Berkeley resolved KUDU-2807.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 1.10.0

f6f8bbf35aa33e668f9cc5ce9b1e80d202a7f736

> Possible crash when flush or compaction overlaps with another compaction
> ------------------------------------------------------------------------
>
>                 Key: KUDU-2807
>                 URL: https://issues.apache.org/jira/browse/KUDU-2807
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.9.0
>            Reporter: Adar Dembo
>            Assignee: Will Berkeley
>            Priority: Blocker
>             Fix For: 1.10.0
>
>         Attachments: kudu-tserver.INFO.gz
>
>
> Manuel Sopena reported a crash like this in Slack:
> {noformat}
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0429 07:26:56.918041 34043 tablet.cc:2268] Check failed: lock.owns_lock() 
> RowSet(24130) unable to lock compact_flush_lock
> {noformat}
> It's hard to say exactly what's going on without more logging, but after 
> looking at the code in more detail, I think the culprit is [this 
> commit|https://github.com/apache/kudu/commit/d3684a7b2add8f06b7189adb9ce9222b8ae1eff5],
>  new in Kudu 1.9.0. To understand why it's problematic, we first need to 
> understand the locking invariant in play:
> # A thread must acquire the tablet's compact_select_lock_ in order to select 
> rowsets to compact.
> # Because of #1, it's safe to assume that, if a thread successfully acquired 
> a rowset's compact_flush_lock_ in the act of selecting it for compaction, it 
> can release and reacquire the lock without contention. More precisely, it can 
> release the compact_flush_lock_, then try-lock it, and the try-lock is 
> guaranteed to succeed. All compacting MM ops use a CHECK to enforce this 
> invariant.
> With that in mind, here's the problem: at the time that the call to 
> {{RowSetInfo::ComputeCdfAndCollectOrdered}} is made from 
> {{Tablet::AtomicSwapRowSetsUnlocked}}, the tablet's compact_select_lock_ is 
> not held. {{ComputeCdfAndCollectOrdered}} calls 
> {{RowSet::IsAvailableForCompaction}}, which try-locks the per-rowset 
> compact_flush_lock_. As a result, it's possible for a racing MM operation to 
> also call {{IsAvailableForCompaction}}, successfully try-lock the 
> compact_flush_lock_, release it, try-lock it again (as per the invariant 
> above), fail, and crash in the aforementioned CHECK.
> I don't think this can result in corruption as we crash rather than allowing 
> the MM op to proceed. But it's a bad race and a bad crash, so we should fix 
> it. Possibly producing a 1.9.1 release in the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to