Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14920 )

Change subject: KUDU-2975: Spread WAL across multiple directories
......................................................................


Patch Set 5:

(34 comments)

Thanks for your patience here, and for your performance analysis. I reviewed 
most of the non-test code, and I'll review the rest of the test code when I can.

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.h
File src/kudu/fs/fs_manager.h:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.h@223
PS5, Line 223:   Status GetOrphanedWalDir(const std::string& tablet_id, 
std::string* wal_dir);
             :
             :   Status GetOrphanedWalRecoveryDir(const std::string& tablet_id,
             :                                    std::string* 
wal_recovery_dir);
Can you add comments explaining what "orphaned" means here?

Or maybe rename this to GetWalDirFromDisk so it's clear to readers of the 
header what this is doing.


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc
File src/kudu/fs/fs_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@a704
PS5, Line 704:
             :
             :
nit: maybe use the emplace_back() pattern here too?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@153
PS5, Line 153: strings::Split
nit: add "using strings::Split()"


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@149
PS5, Line 149:   if (FLAGS_fs_wal_dirs.empty() && !FLAGS_fs_wal_dir.empty()) {
             :     wal_roots.push_back(FLAGS_fs_wal_dir);
             :   }
             :   if (!FLAGS_fs_wal_dirs.empty() && FLAGS_fs_wal_dir.empty()) {
             :     wal_roots = strings::Split(FLAGS_fs_wal_dirs, ",", 
strings::SkipEmpty());
             :   }
How about adding a flag validator that validates that at most one of 
{fs_wal_dir, fs_wal_dirs} is a non-empty string? I.e.

validator() {
  if (!FLAGS_fs_wal_dirs.empty() && !FLAGS_fs_wal_dir.empty()) {
    // return an error
  }
}

That way, when we get here, we can do:

if (!FLAGS_fs_wal_dirs.empty()) {
  wal_roots = Split(FLAGS_fs_wal_dirs, ",", strings::SkipEmpty());
}
if (!FLAGS_fs_wal_dir.empty()) {
  wal_roots = { FLAGS_fs_wal_dir };
}

And this would also work with the existing tests that don't use the gflags.


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@195
PS5, Line 195: wal/wals root
nit: "wal roots"


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@271
PS5, Line 271: fs_wal_dir/fs_wal_dirs) "
             :         "as data directory";
nit: If you use Substitute() for this, you could easily do something like

 Substitute("...directory ($0) for data", FLAGS_fs_wal_dir.empty() ? 
"fs_wal_dirs" : "fs_wal_dir");


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@282
PS5, Line 282: use the WAL root.
nit: "use the first WAL root."


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@305
PS5, Line 305: t:
nit: roots


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@441
PS5, Line 441: block_manager_type
This should be "wal" or something similar, right?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@530
PS5, Line 530: push_back
nit: prefer using emplace_back in new code

http://www.cplusplus.com/reference/vector/vector/emplace_back/


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@556
PS5, Line 556:   LOG_TIMING(INFO, "creating directory manager") {
nit: how about "creating WAL directory manager" and "creating data directory 
manager" above? Same with the "Unable to create..." messages. Same with Open().


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@730
PS5, Line 730:   if (wal_dir == nullptr) {
             :     return Status::InvalidArgument("wal_dir cannot be nullptr");
             :   }
We should probably consider this a programmer error, in which case a DCHECK or 
CHECK is more appropriate. We tend to use error statuses for errors that we 
expect to be possible, whereas for this, we would expect a developer to never 
pass in nullptr to 'wal_dir'.

Same below.


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@735
PS5, Line 735: << "cannot find tablet:" << tablet_id << " wal directory." <<
             :                  s.message().ToString();
nit: use Substitute() here for improved readability.


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/fs_manager.cc@781
PS5, Line 781:   if (segment_name) {
We should use CHECK or DCHECK on this too so we don't have to condition on 
this, otherwise what's the point of calling this function?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h
File src/kudu/fs/wal_dirs.h:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@101
PS5, Line 101:   // Deserializes a WalDirPB and associates the resulting 
WalDirPB
             :   // with a tablet_id.
             :   //
             :   // Returns an error if the tablet already exists or if a WAL 
dir is missing.
             :   Status LoadWalDirFromPB(const std::string& tablet_id,
             :                           const WalDirPB& pb);
             :
             :   // Serializes the WalDirPB associated with the given tablet_id.
             :   //
             :   // Returns an error if the tablet was not already registered 
or if a WAL dir
             :   // is missing.
             :   Status GetWalDirPB(const std::string& tablet_id, WalDirPB* pb) 
const;
Could these be private?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@114
PS5, Line 114:  has registe
nit: "has already been registered"


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@117
PS5, Line 117: suffix
Document what 'suffix' does?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@120
PS5, Line 120: has not a WAL dir
nit: "has no WAL dir"


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@121
PS5, Line 121: e
             :   // WalDirManager will be unchanged.
nit: fix spacing


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@123
PS5, Line 123: FindAndRegisterWalDir
nit: could you rename this "FindAndRegisterWalDirOnDisk()" or something similar 
so it's obvious that this will be looking on disk?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@135
PS5, Line 135:   //
             :   // More information on uuid and their relation to WAL 
directories
             :   // can be found next to PathSetPB in fs.proto.
nit: this isn't quite true, so maybe remove it?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.h@148
PS5, Line 148: FindWalDirByTabletId
nit: could you rename this "FindWalDirFromDiskByTabletId()" or something 
similar so it's obvious that this will be looking on disk?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc
File src/kudu/fs/wal_dirs.cc:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@113
PS5, Line 113: FLAGS_block_manager
Shouldn't this be kWalDirName?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@128
PS5, Line 128: 1
nit: could you annotate this with the variable name so it's easy to understand 
what this is?

 /*num_threads_per_dir=*/1


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@215
PS5, Line 215: {
             :     shared_lock<rw_spinlock> lock(dir_group_lock_.get_lock());
             :     if (ContainsKey(uuid_idx_by_tablet_, tablet_id)) {
             :       return Status::AlreadyPresent(Substitute("Tried to create 
WAL directory for tablet $0"
             :                                     "but one is already 
registered", tablet_id));
             :     }
             :   }
             :
             :   // Check to see if any of our WAL directories have the given 
tablet ID.
             :   string wd_uuid;
             :   int dir_number = 0;
             :   for (const auto& wd : dirs_) {
             :     string tablet_path = JoinPathSegments(wd->dir(), tablet_id);
             :     if (env_->FileExists(tablet_path)) {
             :       ++dir_number;
             :       if (wd->instance()->healthy()) {
             :         wd_uuid = wd->instance()->uuid();
             :       }
             :     }
             :   }
             :   if (wd_uuid.empty()) {
             :     return Status::NotFound(Substitute(
             :         "could not find a healthy WAL dir for tablet $0", 
tablet_id));
             :   }
             :   if (dir_number != 1) {
             :     return Status::Corruption(Substitute("Tablet $0 has at least 
two WAL directories.",
             :                                           tablet_id));
             :   }
Why can't we reuse FindTabletDirFromDisk() here?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@268
PS5, Line 268: for (const auto& wd : dirs_) {
             :     string tablet_path = JoinPathSegments(wd->dir(), tablet_id);
             :     tablet_path.append(suffix);
             :     if (env_->FileExists(tablet_path)) {
             :       *wal_dir = tablet_path;
             :       return Status::OK();
             :     }
             :   }
Shouldn't this return an error if there is more than one directory that has the 
tablet ID?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@297
PS5, Line 297: if (!candidate) {
             :       return Status::NotFound(
             :           Substitute("unable to find WAL dir with index $0",
             :                      uuid_idx));
             :     }
How can we ever reach this block of code? Would a CHECK or DCHECK be more 
appropriate here?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@318
PS5, Line 318: target_uuid_idx
nit: maybe use our maps to determine the path?


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/fs/wal_dirs.cc@339
PS5, Line 339: uuid_idx == nullptr
nit: you can write this as

 if (!uuid_idx)

if you prefer. Same elsewhere


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tablet/tablet_metadata.cc
File src/kudu/tablet/tablet_metadata.cc:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tablet/tablet_metadata.cc@234
PS5, Line 234: data dir group
nit: "data dir group and WAL dir"


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tablet/tablet_metadata.cc@474
PS5, Line 474: tablet_id_),
tiny-nit: this could probably be on the previous line


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/tablet_copy_client.cc
File src/kudu/tserver/tablet_copy_client.cc:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/tablet_copy_client.cc@348
PS5, Line 348: group
nit: it's not much of a group if there's only one directory!


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/tserver/ts_tablet_manager.cc@1465
PS5, Line 1465:   string wal_dir;
              :   string wal_recovery_dir;
              :   WARN_NOT_OK(meta->fs_manager()->GetTabletWalDir(tablet_id, 
&wal_dir),
              :               Substitute("Could not find WAL directory for 
tablet $0", tablet_id));
              :   WARN_NOT_OK(meta->fs_manager()->GetTabletWalRecoveryDir(
              :       tablet_id, &wal_recovery_dir), Substitute("Could not find 
WAL recovery directory"
              :       "for tablet $0", tablet_id));
Why only WARN here? If a failure of the WAL directory is considered fatal 
below, why not make these fatal here?

Also maybe move this down by the other WAL checks near L1484.


http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/util/env_util.cc
File src/kudu/util/env_util.cc:

http://gerrit.cloudera.org:8080/#/c/14920/5/src/kudu/util/env_util.cc@373
PS5, Line 373:     Status s = env->IsDirectory(JoinPathSegments(path, *iter), 
&is_dir);
Why not RETURN_NOT_OK() here? Why should we be permissive of errors?



--
To view, visit http://gerrit.cloudera.org:8080/14920
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3490b2bbc0544e7c8a5f987fb83855ff155ac2c
Gerrit-Change-Number: 14920
Gerrit-PatchSet: 5
Gerrit-Owner: YangSong <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Sat, 04 Jan 2020 00:01:25 +0000
Gerrit-HasComments: Yes

Reply via email to