Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/10340 )
Change subject: WIP KUDU-2359: allow startup with missing data dirs ...................................................................... Patch Set 2: (4 comments) Sorry for the tardy response. Not pushing a PS just yet, just responding to a couple points you brought up. http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@62 PS2, Line 62: NotFound, indicating Kudu should attempt to create a new FS layout > I guess what this also means is that semantically, there's now no differenc In terms of the control flow of Open/Create/Update paths, somewhat. What differs is the fact that, as implemented, only NotFound() errors will indicate the viability to move forward with UPDATE_ON_DISK paths. E.g. in FsManager::Open(), when opening instance files, those that fail due to NotFound() errors are separated into a list of missing roots that are created if going down the UPDATE_ON_DISK paths. This isn't a hard requirement, but I think it's more intuitive, even if it means extra checking for NotFound()s. One alternative you may have alluded to elsewhere would be to treat NotFound() as a disk failure, in which case we might be able to collapse some of the NotFound()-specific codepaths. For an ENFORCE_CONSISTENCY FsManager::Open(), I think this makes perfect sense. I'm not quite sold on doing this for UPDATE_ON_DISK. We _could_ always try to create missing roots in the face of any disk error codes (including ENOENT); the existing file creation rollback code should make that not too painful. The code changes to make that happen would then revolve around fetching the "failed/missing" (potentially non-canonicalized) roots/instance pathnames from lists that contain some unhealthy root instances or path instances, and using those fetched as the basis for the CreateFileSystemRoots() or CreateNewDataDirectoriesAndUpdateExistingOnes() calls. The downside here is that I think it becomes somewhat less intuitive for readers without knowing the context disk failures and not found errors, although that could be remediated through comments. Another thing that concerns me wrt lumping ENOENT into the disk error list is that its effects span beyond the scope of startup and fs update. It would affect _any_ missing file, and thus, any data directory with a missing file. Maybe that's ok and desirable, but that call seems like it warrants separate discussion. Or maybe not, I could go either way at this point. WDYT? http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/block_manager_util.cc File src/kudu/fs/block_manager_util.cc: http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/block_manager_util.cc@52 PS2, Line 52: // Evaluates 'status_expr' and if it results in a disk-related error (i.e. disk > One of the challenging aspects of this change is explaining why this macro My comment in the commit message addresses this. http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc@603 PS2, Line 603: continue; > Would it be possible to omit this to simplify the control flow? AFAICT, thi As implemented, this gives us a clear separation between missing directories, that should be created if we're UPDATING_ON_DISK, and failed directories, that should halt further UPDATE operation. This assumption is used in UpdateExistingInstances(), which doesn't currently handle unhealthy instances, but that can be amended pretty easily. http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc@692 PS2, Line 692: DCHECK(missing_roots.empty()); > If the extent of our recheck is a DCHECK, we could probably omit it and mak Ack yeah, can do. -- To view, visit http://gerrit.cloudera.org:8080/10340 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I61a71265c3cc34a7b72320149770a814ec7f8351 Gerrit-Change-Number: 10340 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 22 May 2018 02:48:16 +0000 Gerrit-HasComments: Yes
