Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/10340 )

Change subject: WIP KUDU-2359: allow startup with missing data dirs
......................................................................


Patch Set 2:

(4 comments)

Sorry for the tardy response. Not pushing a PS just yet, just responding to a 
couple points you brought up.

http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@62
PS2, Line 62:   NotFound, indicating Kudu should attempt to create a new FS 
layout
> I guess what this also means is that semantically, there's now no differenc
In terms of the control flow of Open/Create/Update paths, somewhat. What 
differs is the fact that, as implemented, only NotFound() errors will indicate 
the viability to move forward with UPDATE_ON_DISK paths.

E.g. in FsManager::Open(), when opening instance files, those that fail due to 
NotFound() errors are separated into a list of missing roots that are created 
if going down the UPDATE_ON_DISK paths.

This isn't a hard requirement, but I think it's more intuitive, even if it 
means extra checking for NotFound()s. One alternative you may have alluded to 
elsewhere would be to treat NotFound() as a disk failure, in which case we 
might be able to collapse some of the NotFound()-specific codepaths. For an 
ENFORCE_CONSISTENCY FsManager::Open(), I think this makes perfect sense. I'm 
not quite sold on doing this for UPDATE_ON_DISK.

We _could_ always try to create missing roots in the face of any disk error 
codes (including ENOENT); the existing file creation rollback code should make 
that not too painful. The code changes to make that happen would then revolve 
around fetching the "failed/missing" (potentially non-canonicalized) 
roots/instance pathnames from lists that contain some unhealthy root instances 
or path instances, and using those fetched as the basis for the 
CreateFileSystemRoots() or CreateNewDataDirectoriesAndUpdateExistingOnes() 
calls. The downside here is that I think it becomes somewhat less intuitive for 
readers without knowing the context disk failures and not found errors, 
although that could be remediated through comments.

Another thing that concerns me wrt lumping ENOENT into the disk error list is 
that its effects span beyond the scope of startup and fs update. It would 
affect _any_ missing file, and thus, any data directory with a missing file. 
Maybe that's ok and desirable, but that call seems like it warrants separate 
discussion. Or maybe not, I could go either way at this point. WDYT?


http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/block_manager_util.cc
File src/kudu/fs/block_manager_util.cc:

http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/block_manager_util.cc@52
PS2, Line 52: // Evaluates 'status_expr' and if it results in a disk-related 
error (i.e. disk
> One of the challenging aspects of this change is explaining why this macro
My comment in the commit message addresses this.


http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc@603
PS2, Line 603:       continue;
> Would it be possible to omit this to simplify the control flow? AFAICT, thi
As implemented, this gives us a clear separation between missing directories, 
that should be created if we're UPDATING_ON_DISK, and failed directories, that 
should halt further UPDATE operation. This assumption is used in 
UpdateExistingInstances(), which doesn't currently handle  unhealthy instances, 
but that can be amended pretty easily.


http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc@692
PS2, Line 692:     DCHECK(missing_roots.empty());
> If the extent of our recheck is a DCHECK, we could probably omit it and mak
Ack yeah, can do.



--
To view, visit http://gerrit.cloudera.org:8080/10340
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I61a71265c3cc34a7b72320149770a814ec7f8351
Gerrit-Change-Number: 10340
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 22 May 2018 02:48:16 +0000
Gerrit-HasComments: Yes

Reply via email to