Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14654 )

Change subject: [KUDU-2975]Spread WAL across multiple directories
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14654/3/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14654/3/src/kudu/tserver/ts_tablet_manager.cc@1508
PS3, Line 1508: void TSTabletManager::FailTabletsInDataDir(const string& uuid) {
              :   DataDirManager* dir_manager = nullptr;
              :   int uuid_idx;
              :   if (fs_manager_->dd_manager()->FindUuidIndexByUuid(uuid, 
&uuid_idx)) {
              :     dir_manager = fs_manager_->dd_manager();
              :   } else if 
(fs_manager_->wd_manager()->FindUuidIndexByUuid(uuid, &uuid_idx)) {
              :     dir_manager = fs_manager_->wd_manager();
              :   } else {
              :     LOG(FATAL) << Substitute("No directory found with UUID $0", 
uuid);
              :   }
              :   if (dir_manager->IsDataDirFailed(uuid_idx)) {
              :     LOG(WARNING) << "Data directory is already marked failed.";
              :     return;
              :   }
              :   // Fail the directory to prevent other tablets from being 
placed in it.
              :   dir_manager->MarkDataDirFailed(uuid_idx);
              :   set<string> tablets = 
dir_manager->FindTabletsByDataDirUuidIdx(uuid_idx);
              :   LOG(INFO) << Substitute("Data dir $0 has $1 tablets", uuid, 
tablets.size());
              :   for (const string& tablet_id : 
dir_manager->FindTabletsByDataDirUuidIdx(uuid_idx)) {
              :     FailTabletAndScheduleShutdown(tablet_id);
              :   }
              : }
> I'm sorry, here I misunderstood. Always has two FsErrorManager.
Are the UUIDs in the WAL directories the same as the UUIDs of the data 
directories though? I thought they wouldn't be because they're completely 
separate entities, and they don't share the same block_manager_instance files.

I would think that if we failed to write to the WAL, we would trigger failure 
handling for that WAL directory (e.g. if we know that we failed to write to a 
tablet's WAL, maybe we would lookup the tablet's WAL directory and then fail 
_all_ the tablets in that WAL directory). We should be able to do this 
completely separately from the data directory path, which will trigger failure 
handling when a block fails to write/read.

Because the handling is separate, we could probably separate the DISK_ERROR 
enum to DATA_DISK_ERROR and WAL_DISK_ERROR enums, and then share a single 
FsErrorManager.

I also think that for initial implementation, we should not focus on disk 
failure handling in the WALs at all, and just focus on the structures. We can 
always improve in later patches.



--
To view, visit http://gerrit.cloudera.org:8080/14654
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ied496804421d91ff1fa63d49979fde971071506e
Gerrit-Change-Number: 14654
Gerrit-PatchSet: 3
Gerrit-Owner: YangSong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: YangSong <[email protected]>
Gerrit-Comment-Date: Fri, 08 Nov 2019 06:11:29 +0000
Gerrit-HasComments: Yes

Reply via email to