Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14920 )
Change subject: KUDU-2975: Spread WAL across multiple directories ...................................................................... Patch Set 5: (34 comments) Thanks for your patience here, and for your performance analysis. I reviewed most of the non-test code, and I'll review the rest of the test code when I can. http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.h File src/kudu/fs/fs_manager.h: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.h@223 PS5, Line 223: Status GetOrphanedWalDir(const std::string& tablet_id, std::string* wal_dir); : : Status GetOrphanedWalRecoveryDir(const std::string& tablet_id, : std::string* wal_recovery_dir); Can you add comments explaining what "orphaned" means here? Or maybe rename this to GetWalDirFromDisk so it's clear to readers of the header what this is doing. http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc File src/kudu/fs/fs_manager.cc: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@a704 PS5, Line 704: : : nit: maybe use the emplace_back() pattern here too? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@153 PS5, Line 153: strings::Split nit: add "using strings::Split()" http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@149 PS5, Line 149: if (FLAGS_fs_wal_dirs.empty() && !FLAGS_fs_wal_dir.empty()) { : wal_roots.push_back(FLAGS_fs_wal_dir); : } : if (!FLAGS_fs_wal_dirs.empty() && FLAGS_fs_wal_dir.empty()) { : wal_roots = strings::Split(FLAGS_fs_wal_dirs, ",", strings::SkipEmpty()); : } How about adding a flag validator that validates that at most one of {fs_wal_dir, fs_wal_dirs} is a non-empty string? I.e. validator() { if (!FLAGS_fs_wal_dirs.empty() && !FLAGS_fs_wal_dir.empty()) { // return an error } } That way, when we get here, we can do: if (!FLAGS_fs_wal_dirs.empty()) { wal_roots = Split(FLAGS_fs_wal_dirs, ",", strings::SkipEmpty()); } if (!FLAGS_fs_wal_dir.empty()) { wal_roots = { FLAGS_fs_wal_dir }; } And this would also work with the existing tests that don't use the gflags. http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@195 PS5, Line 195: wal/wals root nit: "wal roots" http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@271 PS5, Line 271: fs_wal_dir/fs_wal_dirs) " : "as data directory"; nit: If you use Substitute() for this, you could easily do something like Substitute("...directory ($0) for data", FLAGS_fs_wal_dir.empty() ? "fs_wal_dirs" : "fs_wal_dir"); http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@282 PS5, Line 282: use the WAL root. nit: "use the first WAL root." http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@305 PS5, Line 305: t: nit: roots http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@441 PS5, Line 441: block_manager_type This should be "wal" or something similar, right? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@530 PS5, Line 530: push_back nit: prefer using emplace_back in new code http://www.cplusplus.com/reference/vector/vector/emplace_back/ http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@556 PS5, Line 556: LOG_TIMING(INFO, "creating directory manager") { nit: how about "creating WAL directory manager" and "creating data directory manager" above? Same with the "Unable to create..." messages. Same with Open(). http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@730 PS5, Line 730: if (wal_dir == nullptr) { : return Status::InvalidArgument("wal_dir cannot be nullptr"); : } We should probably consider this a programmer error, in which case a DCHECK or CHECK is more appropriate. We tend to use error statuses for errors that we expect to be possible, whereas for this, we would expect a developer to never pass in nullptr to 'wal_dir'. Same below. http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@735 PS5, Line 735: << "cannot find tablet:" << tablet_id << " wal directory." << : s.message().ToString(); nit: use Substitute() here for improved readability. http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@781 PS5, Line 781: if (segment_name) { We should use CHECK or DCHECK on this too so we don't have to condition on this, otherwise what's the point of calling this function? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h File src/kudu/fs/wal_dirs.h: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@101 PS5, Line 101: // Deserializes a WalDirPB and associates the resulting WalDirPB : // with a tablet_id. : // : // Returns an error if the tablet already exists or if a WAL dir is missing. : Status LoadWalDirFromPB(const std::string& tablet_id, : const WalDirPB& pb); : : // Serializes the WalDirPB associated with the given tablet_id. : // : // Returns an error if the tablet was not already registered or if a WAL dir : // is missing. : Status GetWalDirPB(const std::string& tablet_id, WalDirPB* pb) const; Could these be private? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@114 PS5, Line 114: has registe nit: "has already been registered" http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@117 PS5, Line 117: suffix Document what 'suffix' does? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@120 PS5, Line 120: has not a WAL dir nit: "has no WAL dir" http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@121 PS5, Line 121: e : // WalDirManager will be unchanged. nit: fix spacing http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@123 PS5, Line 123: FindAndRegisterWalDir nit: could you rename this "FindAndRegisterWalDirOnDisk()" or something similar so it's obvious that this will be looking on disk? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@135 PS5, Line 135: // : // More information on uuid and their relation to WAL directories : // can be found next to PathSetPB in fs.proto. nit: this isn't quite true, so maybe remove it? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@148 PS5, Line 148: FindWalDirByTabletId nit: could you rename this "FindWalDirFromDiskByTabletId()" or something similar so it's obvious that this will be looking on disk? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc File src/kudu/fs/wal_dirs.cc: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@113 PS5, Line 113: FLAGS_block_manager Shouldn't this be kWalDirName? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@128 PS5, Line 128: 1 nit: could you annotate this with the variable name so it's easy to understand what this is? /*num_threads_per_dir=*/1 http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@215 PS5, Line 215: { : shared_lock<rw_spinlock> lock(dir_group_lock_.get_lock()); : if (ContainsKey(uuid_idx_by_tablet_, tablet_id)) { : return Status::AlreadyPresent(Substitute("Tried to create WAL directory for tablet $0" : "but one is already registered", tablet_id)); : } : } : : // Check to see if any of our WAL directories have the given tablet ID. : string wd_uuid; : int dir_number = 0; : for (const auto& wd : dirs_) { : string tablet_path = JoinPathSegments(wd->dir(), tablet_id); : if (env_->FileExists(tablet_path)) { : ++dir_number; : if (wd->instance()->healthy()) { : wd_uuid = wd->instance()->uuid(); : } : } : } : if (wd_uuid.empty()) { : return Status::NotFound(Substitute( : "could not find a healthy WAL dir for tablet $0", tablet_id)); : } : if (dir_number != 1) { : return Status::Corruption(Substitute("Tablet $0 has at least two WAL directories.", : tablet_id)); : } Why can't we reuse FindTabletDirFromDisk() here? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@268 PS5, Line 268: for (const auto& wd : dirs_) { : string tablet_path = JoinPathSegments(wd->dir(), tablet_id); : tablet_path.append(suffix); : if (env_->FileExists(tablet_path)) { : *wal_dir = tablet_path; : return Status::OK(); : } : } Shouldn't this return an error if there is more than one directory that has the tablet ID? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@297 PS5, Line 297: if (!candidate) { : return Status::NotFound( : Substitute("unable to find WAL dir with index $0", : uuid_idx)); : } How can we ever reach this block of code? Would a CHECK or DCHECK be more appropriate here? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@318 PS5, Line 318: target_uuid_idx nit: maybe use our maps to determine the path? http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@339 PS5, Line 339: uuid_idx == nullptr nit: you can write this as if (!uuid_idx) if you prefer. Same elsewhere http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tablet/tablet_metadata.cc File src/kudu/tablet/tablet_metadata.cc: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tablet/tablet_metadata.cc@234 PS5, Line 234: data dir group nit: "data dir group and WAL dir" http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tablet/tablet_metadata.cc@474 PS5, Line 474: tablet_id_), tiny-nit: this could probably be on the previous line http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/tablet_copy_client.cc File src/kudu/tserver/tablet_copy_client.cc: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/tablet_copy_client.cc@348 PS5, Line 348: group nit: it's not much of a group if there's only one directory! http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/ts_tablet_manager.cc File src/kudu/tserver/ts_tablet_manager.cc: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/ts_tablet_manager.cc@1465 PS5, Line 1465: string wal_dir; : string wal_recovery_dir; : WARN_NOT_OK(meta->fs_manager()->GetTabletWalDir(tablet_id, &wal_dir), : Substitute("Could not find WAL directory for tablet $0", tablet_id)); : WARN_NOT_OK(meta->fs_manager()->GetTabletWalRecoveryDir( : tablet_id, &wal_recovery_dir), Substitute("Could not find WAL recovery directory" : "for tablet $0", tablet_id)); Why only WARN here? If a failure of the WAL directory is considered fatal below, why not make these fatal here? Also maybe move this down by the other WAL checks near L1484. http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/util/env_util.cc File src/kudu/util/env_util.cc: http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/util/env_util.cc@373 PS5, Line 373: Status s = env->IsDirectory(JoinPathSegments(path, *iter), &is_dir); Why not RETURN_NOT_OK() here? Why should we be permissive of errors? -- To view, visit http://gerrit.cloudera.org:8080/14920 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ic3490b2bbc0544e7c8a5f987fb83855ff155ac2c Gerrit-Change-Number: 14920 Gerrit-PatchSet: 5 Gerrit-Owner: YangSong <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Tidy Bot (241) Gerrit-Comment-Date: Sat, 04 Jan 2020 00:01:25 +0000 Gerrit-HasComments: Yes
