Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10340 )
Change subject: WIP KUDU-2359: allow startup with missing data dirs ...................................................................... Patch Set 2: (10 comments) http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@12 PS2, Line 12: reading instance and block : manager instance files Nit: "reading fs and block manager instance files". http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@21 PS2, Line 21: requests Nit: requested http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@35 PS2, Line 35: A missing data : directory in this case is one that has no instance or block manager : instance file Would it be useful to broaden the scope to include missing data_root directories or data_dir subdirectories as well? http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@54 PS2, Line 54: "unhealthy", the same way they treat files that fail due disk errors Nit: due to http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@58 PS2, Line 58: missing directories as : an unhealthy instance used to spawn a failed DataDir in memory Nit: singular/plural confusion here http://gerrit.cloudera.org:8080/#/c/10340/2//COMMIT_MSG@62 PS2, Line 62: NotFound, indicating Kudu should attempt to create a new FS layout I guess what this also means is that semantically, there's now no difference between "brand new node" and "node whose disks all failed because instance files could not be found". Was that important before? http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/block_manager_util.cc File src/kudu/fs/block_manager_util.cc: http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/block_manager_util.cc@52 PS2, Line 52: // Evaluates 'status_expr' and if it results in a disk-related error (i.e. disk One of the challenging aspects of this change is explaining why this macro should consider NotFound errors as disk failures, but why we don't want to modify IsDiskFailure outright to include NotFound. http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.h@423 PS2, Line 423: Instance : // files that fail to load because they are missing or because of a disk : // failure are considered loaded and are internally labeled unhealthy. Maybe reword: "It also includes instance files that failed to load because they were missing or because of a disk failure; they are still considered "loaded" and are labeled unhealthy internally". http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc@603 PS2, Line 603: continue; Would it be possible to omit this to simplify the control flow? AFAICT, this would mean loaded_instances always has missing (and unhealthy) instances in it, even if missing_roots was set. Is that a big deal? http://gerrit.cloudera.org:8080/#/c/10340/2/src/kudu/fs/data_dirs.cc@692 PS2, Line 692: DCHECK(missing_roots.empty()); If the extent of our recheck is a DCHECK, we could probably omit it and make L696 apply to both code paths, no? -- To view, visit http://gerrit.cloudera.org:8080/10340 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I61a71265c3cc34a7b72320149770a814ec7f8351 Gerrit-Change-Number: 10340 Gerrit-PatchSet: 2 Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Fri, 11 May 2018 21:23:43 +0000 Gerrit-HasComments: Yes
