Andrew Wong has posted comments on this change.

Change subject: disk failure: make DataDirManager failure-aware
......................................................................


Patch Set 1:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs-test.cc
File src/kudu/fs/data_dirs-test.cc:

Line 86:   CreateBlockOptions non_existent_tablet_opts(test_block_opts_);
> Why do we need to copy test_block_opts_? Can't we just pass it directly to 
Was meant for explicitness, but it should be obvious given the comment.


http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 443:   shared_lock<rw_spinlock> health_lock(dir_health_lock_.get_lock());
> This is why I asked about the contention profiles of the locks: use and acq
Fair, it is messy and risky, adding another lock. I thought proper ordering 
would be sufficient to prevent contention, but good point that there's not much 
gained and a big potential headache.


PS1, Line 460:       group_target_size = std::min(group_target_size,
             :           static_cast<uint32_t>(data_dirs_.size() - 
failed_data_dirs_.size()));
> If someone requested a group of a particular size, why not fail that reques
As discussed, there may be cases where we want either behavior. If the user 
just wants to get things done without regard for the performance hit, we may 
want to understripe the tablet. If the user cares very much about performance, 
"overstriping" (by assigning full or failed disks) may be the way to go.

As you mentioned, in the case of understriping, if it is important to a user, 
we'll need to add functionality to resize a directory group. Left this as a 
TODO.


PS1, Line 484:     // This should only be reached by some tests; in cases where 
there is no
             :     // natural tablet_id, select a data dir from any of the 
directories.
> So in this case there's no point in considering failed_data_dirs_?
I don't think so; this isn't used by all tests, and I would expect tests that  
do expect disk failures to not use this codepath and to specify a tablet.


Line 503:         return Status::IOError("No healthy directories exist in 
tablet's "
> No ENODEV here?
Done


Line 587: bool DataDirManager::FindTabletsByDataDirUuidIdx(uint16_t uuid_idx, 
set<string>** tablet_ids) {
> This needs some locking.
Done


Line 601:   LOG(ERROR) << Substitute("Dir with uuid index $0 ($1) marked as 
failed", uuid_idx, dd->dir());
> Perhaps avoid logging this multiple times if failed_data_dirs_ already has 
Done


http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

Line 263:   bool FindTabletsByDataDirUuidIdx(uint16_t uuid_idx, 
std::set<std::string>**
same line


Line 263:   bool FindTabletsByDataDirUuidIdx(uint16_t uuid_idx, 
std::set<std::string>**
> I imagine you're trying to avoid copies by passing the caller a pointer int
Hrm good point, you're right that I was trying to avoid copies, but agreed, 
that's probably not worth the potential synchronization headache. If left as 
is, I think external locking would be necessary.

Using both suggestions.


PS1, Line 268: MarkDataDirFailure
> Nit: how about MarkDataDirFailed, for better symmetry with IsDirectoryFaile
Done


Line 271:   bool IsDirectoryFailed(uint16_t uuid_idx) const;
> Nit: Likewise, IsDataDirFailed (or HasDataDirFailed)
Done


Line 273:   void GetFailedDataDirs(std::set<uint16_t>* data_dirs) const {
> How about just returning the data_dirs? Easier for callers.
Done


Line 274:     *data_dirs = failed_data_dirs_;
> This needs to acquire a lock, no?
Done


Line 329:   // Lock protecting changes to failed_data_dirs_.
> Why do we need a different lock for this? Do you expect either this lock or
Noted in another comment that this is a good point. Removed this lock and am 
just using dir_group_lock_.


-- 
To view, visit http://gerrit.cloudera.org:8080/7028
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iee212793152de5de5198751d649ab34fb97f6aa2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to