Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8376 )
Change subject: fs: defer failure from metadata load to bootstrap when data dir is missing ...................................................................... Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@81 PS2, Line 81: Status LoadFromPB(const UuidIndexByUuidMap& uuid_idx_by_uuid, worth short docs explaining the cases of bad-status here. http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@84 PS2, Line 84: Status CopyToPB(const UuidByUuidIndexMap& uuid_by_uuid_idx, and here http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@276 PS2, Line 276: data dir : // is missing. only looked at the headers so far (not the impl) but this left me a little confused. This seems to indicate that the above LoadDataDirGroupFromPB actually has a side effect even if it returns a bad status that indicates a missing dir? Does this return a 'pb' even with a bad status? http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@318 PS2, Line 318: if (!FindCopy(uuid_by_uuid_idx, uuid_idx, &uuid)) { : return Status::NotFound(Substitute( : "could not find data dir with uuid index $0", uuid_idx)); would this not be a programmer error? how would you end up with a uuid_idx in a DataDirGroup which doesn't point to a valid entry in the UUIDs array? I would have expected a FindOrDie or LOG(FATAL) here http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@831 PS2, Line 831: RETURN_NOT_OK(group->CopyToPB(uuid_by_idx_, pb)); I guess my confusion in the header was wrong, but per my comments above, I think any failure of this code would be indicative of a bug, no? http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440 PS2, Line 440: DeleteOrphanedBlocks(orphaned_blocks); I think we discussed this elsewhere, but I"m concerned about this interacting with blockid re-use. If we have the following scenario: - block 1-5 on disk A - block 6-10 on disk B tablet has orphaned block #6 in its metadata disk B is failed or removed, and we restart the tserver - LBM sees '5' as max_block_id - allocates a new block 6 on disk A - we open the tablet, and it decides to delete block 6 -- incorrectly deletes the new block 6 instead of the old block 6 ie I think it's important not to go to proceed on a tablet when we have it pointing to a drive that has been removed. -- To view, visit http://gerrit.cloudera.org:8080/8376 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad Gerrit-Change-Number: 8376 Gerrit-PatchSet: 2 Gerrit-Owner: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <t...@apache.org> Gerrit-Comment-Date: Thu, 26 Oct 2017 20:58:36 +0000 Gerrit-HasComments: Yes