Alexey Serbin created KUDU-3458:
-----------------------------------
Summary: Continue loading other tablets even if metadata for some
tablets failed to load
Key: KUDU-3458
URL: https://issues.apache.org/jira/browse/KUDU-3458
Project: Kudu
Issue Type: Improvement
Components: tserver
Reporter: Alexey Serbin
kudu-tserver stops tablet bootstrapping if a single tablet's metadata fails to
load (the kudu-tserver process exits on such an event, but with the caveat of
KUDU-3419).
The current behavior requires manual intervention. In most cases, the reason
behind the failure to load tablet metadata is a corrupted metadata file. The
suspected cause of such corruption is a power failure, kernel panic, etc.,
where an opened file isn't synced (?).
In a cluster with many tablet servers and RF=3, if a majority of a tablet's
replicas is present, such a situation with a corrupted file could be
addressed automatically if the tablet server continued bootstrapping the
other tablet replicas and eventually registered with the Kudu masters. The
system catalog would detect that the tablet is under-replicated because one
replica isn't running, and would re-replicate it elsewhere, sending
DELETE_TABLET for the tablet replica that has the corrupted metadata file.
That would be similar to what happens if the consensus metadata for a tablet
replica were corrupted.
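To illustrate the condition described above, here is a minimal sketch of the check, assuming a toy model of replica health. The names {{ReplicaHealth}}, {{TabletAssessment}} and {{AssessTablet}} are hypothetical and are not part of the Kudu code base; the sketch only shows why one failed replica out of three still leaves a majority to re-replicate from.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of per-replica health as the masters might see it.
// This is an illustration only, not a Kudu type.
enum class ReplicaHealth { kRunning, kFailedToLoad };

struct TabletAssessment {
  bool under_replicated;  // fewer running replicas than the replication factor
  bool has_majority;      // a majority of replicas is still running
};

// Counts running replicas and compares the count against the replication
// factor (the vector size) and against the majority size (n / 2 + 1).
TabletAssessment AssessTablet(const std::vector<ReplicaHealth>& replicas) {
  std::size_t running = 0;
  for (ReplicaHealth h : replicas) {
    if (h == ReplicaHealth::kRunning) {
      ++running;
    }
  }
  const std::size_t majority = replicas.size() / 2 + 1;
  return TabletAssessment{running < replicas.size(), running >= majority};
}
```

With RF=3 and one replica that failed to load, the tablet is under-replicated but keeps a majority, so automatic re-replication is possible in this model. With two failed replicas the majority is lost and the sketch reports that automatic recovery is not possible.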
It's necessary to update the code in {{TSTabletManager}} to allow
{{TSTabletManager::Init()}} to complete successfully in such a case, marking
the corresponding tablet replicas as failed to load (similar to what's done in
the case of corrupted consensus metadata for a replica).
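The proposed change could look roughly like the sketch below. It is a simplified, self-contained model and not the actual Kudu implementation: {{Status}}, {{ReplicaState}}, {{ReplicaEntry}}, {{MetadataLoader}} and {{TabletManagerSketch}} are hypothetical stand-ins introduced only for illustration. The point is that a per-tablet metadata failure is recorded on the replica instead of being returned from {{Init()}}.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a status type; NOT the real kudu::Status.
struct Status {
  bool ok;
  std::string message;
  static Status OK() { return Status{true, ""}; }
  static Status Corruption(std::string msg) {
    return Status{false, std::move(msg)};
  }
};

// Hypothetical replica states for this sketch.
enum class ReplicaState { kReadyToBootstrap, kFailedToLoad };

struct ReplicaEntry {
  ReplicaState state;
  std::string error;  // empty unless state == kFailedToLoad
};

// Hypothetical loader: returns OK if the tablet's metadata could be read.
using MetadataLoader = std::function<Status(const std::string& tablet_id)>;

class TabletManagerSketch {
 public:
  explicit TabletManagerSketch(MetadataLoader loader)
      : loader_(std::move(loader)) {}

  // Tries to load metadata for every tablet. A failure for one tablet is
  // recorded on that replica and does not stop the loop or fail Init().
  Status Init(const std::vector<std::string>& tablet_ids) {
    for (const std::string& id : tablet_ids) {
      const Status s = loader_(id);
      if (s.ok) {
        replicas_[id] = ReplicaEntry{ReplicaState::kReadyToBootstrap, ""};
      } else {
        replicas_[id] = ReplicaEntry{ReplicaState::kFailedToLoad, s.message};
      }
    }
    return Status::OK();
  }

  // Number of replicas whose metadata failed to load.
  std::size_t CountFailed() const {
    std::size_t n = 0;
    for (const auto& kv : replicas_) {
      if (kv.second.state == ReplicaState::kFailedToLoad) {
        ++n;
      }
    }
    return n;
  }

  const std::map<std::string, ReplicaEntry>& replicas() const {
    return replicas_;
  }

 private:
  MetadataLoader loader_;
  std::map<std::string, ReplicaEntry> replicas_;
};
```

In this model, calling {{Init()}} with three tablet IDs where one loader call reports corruption still returns OK, leaves two replicas ready to bootstrap, and leaves one marked as failed to load along with its error message. Whether a real implementation should also fail on other kinds of errors (for example, I/O errors affecting the whole data directory) is left open here.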