[ 
https://issues.apache.org/jira/browse/KUDU-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357872#comment-16357872
 ] 

Mike Percy commented on KUDU-2293:
----------------------------------

I think there are two problems that caused this issue:
 # In TabletCopyClient::Start(), we do not set the started_ flag until the end 
of the function, even though we may have started making changes on disk. That 
means a failure in the middle of Start() may never get cleaned up.
 # In TabletCopyClient::Abort(), we WARN_NOT_OK on several kinds of cleanup, 
including the cleanup that changes the state from COPYING to TOMBSTONED, so a 
failure there is only logged. We may also exit Abort() early due to a 
RETURN_NOT_OK, which skips the remaining cleanup.
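
To make the two problems concrete, here is a rough sketch of the shape a fix 
could take. This is not the actual Kudu code: Status, DataState, 
CopyClientSketch and the Step callbacks are hypothetical stand-ins, and the 
tombstone transition is modeled only in memory. The idea is to set the started 
flag before the first on-disk change, and to have Abort() run every cleanup 
step and report the first failure instead of warning or returning early.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a status type; this is not Kudu's Status class.
struct Status {
  bool ok;
  std::string msg;
  static Status OK() { return Status{true, ""}; }
  static Status Error(std::string m) { return Status{false, std::move(m)}; }
};

// Simplified in-memory tablet data states (illustrative only).
enum class DataState { kTombstoned, kCopying, kReady };

// Toy model of a copy client. All names here are assumptions.
class CopyClientSketch {
 public:
  using Step = std::function<Status()>;

  explicit CopyClientSketch(DataState initial)
      : state_(initial), started_(false) {}

  // Runs the setup steps in order. started_ is set before the first step,
  // so a failure partway through still leaves Abort() with work to do.
  Status Start(const std::vector<Step>& steps) {
    assert(!started_ && "Start() must not be called twice");
    started_ = true;
    state_ = DataState::kCopying;
    for (const Step& step : steps) {
      Status s = step();
      if (!s.ok) return s;  // The caller is expected to call Abort().
    }
    return Status::OK();
  }

  // Runs every cleanup step even if an earlier one fails, then moves the
  // in-memory state back to tombstoned. Returns the first error seen so the
  // caller can decide what to do, rather than only logging a warning.
  Status Abort(const std::vector<Step>& cleanup_steps) {
    if (!started_) return Status::OK();
    Status first_error = Status::OK();
    for (const Step& step : cleanup_steps) {
      Status s = step();
      if (!s.ok && first_error.ok) first_error = s;
    }
    // In a real system this transition is itself a metadata write that can
    // fail; the sketch only models the in-memory state.
    state_ = DataState::kTombstoned;
    started_ = false;
    return first_error;
  }

  DataState state() const { return state_; }
  bool started() const { return started_; }

 private:
  DataState state_;
  bool started_;
};
```

With this shape, a Start() that fails on its second step still reports 
started() as true, and a later Abort() runs all of its cleanup steps and 
leaves the state tombstoned even if one of them fails.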

Before we made disk errors non-fatal, this didn't matter much because we would 
generally just crash if any of these conditions occurred. Now things are more 
complex, and we need to be more careful about how we handle the in-memory 
tablet data state.

An easy fix would be to go through these two codepaths and crash whenever we 
hit any of these problems.

An ideal fix would involve discriminating between data-path errors and 
metadata-path errors, and only crashing when we have a metadata problem, while 
ensuring we unwind and tombstone ourselves if we run out of space on the data 
drive. This latter approach is much better because tombstoning ourselves 
actually helps address the disk-full problem.
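
The policy in that ideal fix could be sketched roughly as below. The names 
(ErrorPath, Recovery, CopyFailure, ChooseRecovery) are made up for 
illustration and do not exist in Kudu; how a failure actually gets classified 
as data-path or metadata-path is left out and would be the hard part.

```cpp
#include <cassert>
#include <string>

// Where a failure occurred (hypothetical classification).
enum class ErrorPath { kData, kMetadata };

// What the copy client should do about it.
enum class Recovery { kUnwindAndTombstone, kCrash };

// A failure annotated with the path it came from (illustrative only).
struct CopyFailure {
  ErrorPath path;
  std::string detail;
};

// Data-path errors (e.g. out of space on a data drive) are recoverable:
// unwind and tombstone the replica. Metadata-path errors leave us unsure of
// the on-disk state, so crashing is the conservative choice.
Recovery ChooseRecovery(const CopyFailure& failure) {
  switch (failure.path) {
    case ErrorPath::kData:
      return Recovery::kUnwindAndTombstone;
    case ErrorPath::kMetadata:
      return Recovery::kCrash;
  }
  assert(false && "unknown ErrorPath");
  return Recovery::kCrash;
}
```

Keeping the decision in one small function like this means both Start() and 
Abort() could consult the same policy instead of each deciding ad hoc whether 
to warn, return, or crash.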

> tserver crashes with 'Found tablet in TABLET_DATA_COPYING state during 
> StartTabletCopy()' message
> -------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2293
>                 URL: https://issues.apache.org/jira/browse/KUDU-2293
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.5.0, 1.6.0, 1.7.0
>            Reporter: Alexey Serbin
>            Priority: Major
>         Attachments: crash-at-tablet-copy-session-start.log
>
>
> When running out of disk space, the tablet server can crash while trying to 
> start a tablet copy over an already tombstoned replica.
> In essence, if {{DataDirManager::CreateDataDirGroup()}} returns an error due 
> to an out-of-disk-space condition while running 
> {{TabletCopyClient::Start()}}, the tablet server crashes with an error 
> message like the one below.  The relevant part of the log is attached.
> {noformat}
> F0208 05:35:22.152496  2721 ts_tablet_manager.cc:563] T 
> 5384471d823e46929029f9ff6ce212a3 P c713ac498df040caa897d3229214baa3: Tablet 
> Copy: Found tablet in TABLET_DATA_COPYING state during 
> StartTabletCopy(){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
