Adar Dembo created KUDU-1362:
--------------------------------
Summary: Ensure master behaves correctly after a sys_catalog write failure
Key: KUDU-1362
URL: https://issues.apache.org/jira/browse/KUDU-1362
Project: Kudu
Issue Type: Bug
Components: master
Affects Versions: 0.7.0
Reporter: Adar Dembo
Assignee: Adar Dembo
Priority: Critical
For multi-master usage to truly be safe, we must ensure that a failure to write
to the system catalog table is handled correctly. When there's only one master,
this can only happen in the event of a disk failure or equivalent, but with
multiple masters, write failures can happen all the time (e.g. failed replicas
or network partitions).
So far I've only found one case where this is truly broken, in
catalog_manager.cc:L2444:
{noformat}
 2433 void CatalogManager::DeleteTabletsAndSendRequests(const scoped_refptr<TableInfo>& table) {
 2434   vector<scoped_refptr<TabletInfo> > tablets;
 2435   table->GetAllTablets(&tablets);
 2436
 2437   string deletion_msg = "Table deleted at " + LocalTimeAsString();
 2438
 2439   for (const scoped_refptr<TabletInfo>& tablet : tablets) {
 2440     DeleteTabletReplicas(tablet.get(), deletion_msg);
 2441
 2442     TabletMetadataLock tablet_lock(tablet.get(), TabletMetadataLock::WRITE);
 2443     tablet_lock.mutable_data()->set_state(SysTabletsEntryPB::DELETED, deletion_msg);
>2444     CHECK_OK(sys_catalog_->UpdateTablets({ tablet.get() }));
 2445     tablet_lock.Commit();
 2446   }
 2447 }
{noformat}
In this case we should batch up all of the tablet deletions into one
UpdateTablets() call, and propagate the resulting status up to the DeleteTable
caller instead of crashing the master in CHECK_OK().
Part of the work here is an integration test that provides good coverage for
the various failure paths.
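A rough sketch of the batched approach, to illustrate the idea rather than
the actual fix. Everything here is a simplified stand-in and not the real Kudu
code: Status, TabletInfo, FakeSysCatalog and DeleteTabletsBatched are
hypothetical names, the pending/committed fields only loosely model what
TabletMetadataLock does, and sending the replica deletion requests is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for a status type (hypothetical, not kudu::Status).
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status IOError(std::string msg) { return Status(false, std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

enum class TabletState { RUNNING, DELETED };

// Stand-in for TabletInfo: a committed state plus a pending (dirty) state,
// loosely modelling a metadata write lock that is committed or aborted.
struct TabletInfo {
  TabletState committed = TabletState::RUNNING;
  TabletState pending = TabletState::RUNNING;
};

// Stand-in for the system catalog. `fail_writes` simulates a write failure.
struct FakeSysCatalog {
  bool fail_writes = false;
  int num_writes = 0;
  std::size_t last_batch_size = 0;

  Status UpdateTablets(const std::vector<TabletInfo*>& tablets) {
    ++num_writes;
    last_batch_size = tablets.size();
    if (fail_writes) {
      return Status::IOError("simulated sys_catalog write failure");
    }
    return Status::OK();
  }
};

// Marks every tablet as DELETED, persists the whole batch with one write,
// and returns the write status to the caller instead of crashing.
Status DeleteTabletsBatched(
    FakeSysCatalog* sys_catalog,
    const std::vector<std::shared_ptr<TabletInfo>>& tablets) {
  assert(sys_catalog != nullptr);

  std::vector<TabletInfo*> batch;
  batch.reserve(tablets.size());
  for (const auto& tablet : tablets) {
    tablet->pending = TabletState::DELETED;
    batch.push_back(tablet.get());
  }

  Status s = sys_catalog->UpdateTablets(batch);
  if (!s.ok()) {
    // Write failed: discard the pending changes and report the error.
    for (const auto& tablet : tablets) {
      tablet->pending = tablet->committed;
    }
    return s;
  }

  // Write succeeded: make the pending state visible.
  for (const auto& tablet : tablets) {
    tablet->committed = tablet->pending;
  }
  return Status::OK();
}
```

With this shape, a failed write leaves the in-memory state untouched and the
DeleteTable caller can decide how to respond to the returned status.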