Adar Dembo created KUDU-1362:
--------------------------------

             Summary: Ensure master behaves correctly after a sys_catalog write failure
                 Key: KUDU-1362
                 URL: https://issues.apache.org/jira/browse/KUDU-1362
             Project: Kudu
          Issue Type: Bug
          Components: master
    Affects Versions: 0.7.0
            Reporter: Adar Dembo
            Assignee: Adar Dembo
            Priority: Critical


For multi-master usage to truly be safe, we must ensure that a failure to write 
to the system catalog table is handled correctly. When there's only one master, 
this can only happen in the event of a disk failure or equivalent, but with 
multiple masters, failures can happen all the time (e.g. failed replicas, 
network partitions).

So far I've only found one case where this is truly broken, in 
catalog_manager.cc:L2444:
{noformat}
   2433 void CatalogManager::DeleteTabletsAndSendRequests(const scoped_refptr<TableInfo>& table) {
   2434   vector<scoped_refptr<TabletInfo> > tablets;
   2435   table->GetAllTablets(&tablets);
   2436 
   2437   string deletion_msg = "Table deleted at " + LocalTimeAsString();
   2438 
   2439   for (const scoped_refptr<TabletInfo>& tablet : tablets) {
   2440     DeleteTabletReplicas(tablet.get(), deletion_msg);
   2441 
   2442     TabletMetadataLock tablet_lock(tablet.get(), TabletMetadataLock::WRITE);
   2443     tablet_lock.mutable_data()->set_state(SysTabletsEntryPB::DELETED, deletion_msg);
  >2444     CHECK_OK(sys_catalog_->UpdateTablets({ tablet.get() }));
   2445     tablet_lock.Commit();
   2446   }
   2447 }
{noformat}

In this case, rather than crashing the master via CHECK_OK when the write 
fails, we should batch up all of the tablet deletions into one UpdateTablets() 
call and pass the resulting status up to the DeleteTable caller.
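
As a rough illustration of the batching idea, here is a minimal, self-contained 
sketch. It is not the actual CatalogManager code: Status, Tablet, 
FakeSysCatalog and DeleteTablets below are hypothetical stand-ins invented for 
this example, and the question of when DeleteTabletReplicas() should be sent 
relative to the write is deliberately left out. The point is only that every 
tablet is marked DELETED in a pending state, a single write covers the whole 
batch, and a failed write is returned to the caller with nothing committed.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a Kudu-style Status object.
class Status {
 public:
  static Status OK() { return Status(true, ""); }
  static Status IOError(std::string msg) { return Status(false, std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status(bool ok, std::string msg) : ok_(ok), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

enum class TabletState { RUNNING, DELETED };

// Hypothetical tablet with a committed state and a pending (uncommitted)
// state, loosely imitating a metadata lock with Commit() semantics.
struct Tablet {
  TabletState committed = TabletState::RUNNING;
  TabletState pending = TabletState::RUNNING;
};

// Hypothetical system catalog whose writes can be forced to fail.
struct FakeSysCatalog {
  bool fail_writes = false;
  int num_writes = 0;
  std::size_t last_batch_size = 0;

  Status UpdateTablets(const std::vector<Tablet*>& tablets) {
    ++num_writes;
    last_batch_size = tablets.size();
    if (fail_writes) return Status::IOError("sys_catalog write failed");
    return Status::OK();
  }
};

// Marks all tablets as DELETED with one catalog write. On failure, the
// pending changes are discarded and the error is returned to the caller
// instead of crashing the process.
Status DeleteTablets(FakeSysCatalog* catalog, std::vector<Tablet>* tablets) {
  std::vector<Tablet*> batch;
  for (Tablet& t : *tablets) {
    t.pending = TabletState::DELETED;
    batch.push_back(&t);
  }

  Status s = catalog->UpdateTablets(batch);  // one write for the whole batch
  if (!s.ok()) {
    for (Tablet* t : batch) t->pending = t->committed;  // abort pending changes
    return s;
  }

  for (Tablet* t : batch) t->committed = t->pending;  // commit only on success
  return Status::OK();
}
```

A caller in this sketch would check the returned Status and fail the 
DeleteTable request if it is not OK. Flipping fail_writes on the fake catalog 
is one simple way a test could exercise the failure path.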

Part of the work here is an integration test that provides good coverage for 
the various failure paths.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
