[
https://issues.apache.org/jira/browse/KUDU-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735757#comment-17735757
]
Zoltan Martonka commented on KUDU-3458:
---------------------------------------
Review:
[https://gerrit.cloudera.org/#/c/20067/]
> Continue loading other tablets even if metadata for some tablets failed to
> load
> -------------------------------------------------------------------------------
>
> Key: KUDU-3458
> URL: https://issues.apache.org/jira/browse/KUDU-3458
> Project: Kudu
> Issue Type: Improvement
> Components: tserver
> Reporter: Alexey Serbin
> Assignee: Zoltan Martonka
> Priority: Major
> Labels: scalability, supportability, troubleshooting
>
> kudu-tserver stops tablet bootstrapping if a single tablet's metadata fails
> to load (the kudu-tserver process exits on such an event, but with the caveat
> of KUDU-3419).
> The current behavior requires manual intervention. In most cases, the
> reason behind the failure to load tablet metadata is a corrupted metadata
> file. The usual suspect behind such corruption is a power failure, kernel
> panic, etc., where an open file isn't synced.
> In a cluster with many tablet servers and RF=3, if a majority of a tablet's
> replicas is present, a corrupted metadata file could be handled
> automatically if the tablet server continued bootstrapping the other
> tablet replicas and eventually registered with the Kudu masters. The
> system catalog would detect that the tablet is under-replicated because one
> replica isn't running, and would re-replicate it elsewhere, sending
> DELETE_TABLET for the tablet replica that has the corrupted metadata file.
> That would be similar to what happens if the consensus metadata for a tablet
> replica were corrupted.
> It's necessary to update the code in {{TSTabletManager}} and allow
> {{TSTabletManager::Init()}} to complete successfully in such a case, marking
> the corresponding tablet replicas as failed to load (similar to what's done
> in the case of a replica's consensus metadata).
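
The proposed change to the initialization loop can be illustrated with a minimal,
self-contained sketch. This is not Kudu's actual code: ReplicaState, LoadResult,
ReplicaInfo, MetadataLoader and InitTablets are hypothetical stand-ins invented
for illustration, and the real {{TSTabletManager::Init()}} works with Kudu's own
types. The sketch only shows the control flow the description asks for: a failed
metadata load marks that one replica as failed and the loop moves on to the next
tablet instead of aborting.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT Kudu's actual types.
enum class ReplicaState { kRunning, kFailed };

struct LoadResult {
  bool ok;
  std::string error;  // Empty when ok is true.
};

struct ReplicaInfo {
  ReplicaState state;
  std::string error;  // Reason for the failure, if any.
};

// Hypothetical loader: attempts to read the metadata of a single tablet.
using MetadataLoader = std::function<LoadResult(const std::string& tablet_id)>;

// Registers every tablet, whether or not its metadata could be loaded.
// A failure marks only that replica as failed; it does not abort the loop.
// Assumes the tablet IDs passed in are unique.
std::map<std::string, ReplicaInfo> InitTablets(
    const std::vector<std::string>& tablet_ids, const MetadataLoader& load) {
  std::map<std::string, ReplicaInfo> replicas;
  for (const auto& id : tablet_ids) {
    const LoadResult result = load(id);
    if (result.ok) {
      replicas[id] = {ReplicaState::kRunning, ""};
    } else {
      // Keep going: record the failure so it can be reported later.
      replicas[id] = {ReplicaState::kFailed, result.error};
    }
  }
  // Every tablet is accounted for, in one state or the other.
  assert(replicas.size() == tablet_ids.size());
  return replicas;
}
```

In this sketch the failed replica stays in the returned map rather than being
dropped, so that the server could still report it as not running. That mirrors
the idea in the description that the system catalog needs to see the replica as
failed in order to re-replicate the tablet elsewhere.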