Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8376 )

Change subject: fs: defer failure from metadata load to bootstrap when data dir 
is missing
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@81
PS2, Line 81:   Status LoadFromPB(const UuidIndexByUuidMap& uuid_idx_by_uuid,
worth short docs explaining the cases of bad-status here.


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@84
PS2, Line 84:   Status CopyToPB(const UuidByUuidIndexMap& uuid_by_uuid_idx,
and here


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.h@276
PS2, Line 276: data dir
             :   // is missing.
only looked at the headers so far (not the impl) but this left me a little 
confused.

This seems to indicate that the above LoadDataDirGroupFromPB actually has a 
side effect even if it returns a bad status that indicates a missing dir? Does 
this return a 'pb' even with a bad status?


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@318
PS2, Line 318:     if (!FindCopy(uuid_by_uuid_idx, uuid_idx, &uuid)) {
             :       return Status::NotFound(Substitute(
             :           "could not find data dir with uuid index $0", 
uuid_idx));
would this not be a programmer error? how would you end up with a uuid_idx in a 
DataDirGroup which doesn't point to a valid entry in the UUIDs array? I would 
have expected a FindOrDie or LOG(FATAL) here


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/fs/data_dirs.cc@831
PS2, Line 831:   RETURN_NOT_OK(group->CopyToPB(uuid_by_idx_, pb));
I guess my confusion in the header was wrong, but per my comments above, I 
think any failure of this code would be indicative of a bug, no?


http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc
File src/kudu/tablet/tablet_metadata.cc:

http://gerrit.cloudera.org:8080/#/c/8376/2/src/kudu/tablet/tablet_metadata.cc@440
PS2, Line 440:     DeleteOrphanedBlocks(orphaned_blocks);
I think we discussed this elsewhere, but I"m concerned about this interacting 
with blockid re-use. If we have the following scenario:

- block 1-5 on disk A
- block 6-10 on disk B

tablet has orphaned block #6 in its metadata

disk B is failed or removed, and we restart the tserver

- LBM sees '5' as max_block_id
- allocates a new block 6 on disk A
- we open the tablet, and it decides to delete block 6
-- incorrectly deletes the new block 6 instead of the old block 6

ie I think it's important not to go to proceed on a tablet when we have it 
pointing to a drive that has been removed.



--
To view, visit http://gerrit.cloudera.org:8080/8376
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1e8d5697c2bb08287cce11fbdab6fb8d6e37d1ad
Gerrit-Change-Number: 8376
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Comment-Date: Thu, 26 Oct 2017 20:58:36 +0000
Gerrit-HasComments: Yes

Reply via email to