Andrew Wong has posted comments on this change.

Change subject: disk failure: make DataDirManager failure-aware
......................................................................


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

Line 587:       candidate_indices.erase(candidate_indices.begin());
> This needs some locking.
Good point, although I think the lock may need to be external.


http://gerrit.cloudera.org:8080/#/c/7028/4/src/kudu/fs/data_dirs.cc
File src/kudu/fs/data_dirs.cc:

PS4, Line 480: LOG(WARNING) << Substitute("DataDirGroup for tablet $0 is 
smaller than targeted. "
             :                                  "Target size: $1, Actual size: 
$2", tablet_id,
             :                                  
FLAGS_fs_target_data_dirs_per_tablet, group_indices.size());
> can you add more info? like total num data dirs and total num failed data d
Done


PS4, Line 509: if (PREDICT_FALSE(!failed_data_dirs_.empty())) {
             :       for (uint16_t uuid_index : group->uuid_indices()) {
             :         if (!ContainsKey(failed_data_dirs_, uuid_index)) {
             :           valid_uuid_indices.emplace_back(uuid_index);
             :         }
             :       }
             :       if (valid_uuid_indices.empty()) {
             :         return Status::IOError("No healthy directories exist in 
tablet's "
             :                                "DataDirGroup", opts.tablet_id, 
ENODEV);
             :       }
             :       group_uuid_indices = &valid_uuid_indices;
> how about creating a helper method to filter failed dirs from a group of uu
Done, open to API suggestions, for now I have it return a bool indicating 
whether there are failed data dirs with a pointer input (to avoid an extra 
allocation).


PS4, Line 503: // Get the data dir group for the tablet.
             :     DataDirGroup* group = FindOrNull(group_by_tablet_map_, 
opts.tablet_id);
             :     if (group == nullptr) {
             :       return Status::NotFound("Tried to get directory but no 
DataDirGroup "
             :                               "registered for tablet", 
opts.tablet_id);
             :     }
             :     if (PREDICT_FALSE(!failed_data_dirs_.empty())) {
             :       for (uint16_t uuid_index : group->uuid_indices()) {
             :         if (!ContainsKey(failed_data_dirs_, uuid_index)) {
             :           valid_uuid_indices.emplace_back(uuid_index);
             :         }
             :       }
             :       if (valid_uuid_indices.empty()) {
             :         return Status::IOError("No healthy directories exist in 
tablet's "
             :                                "DataDirGroup", opts.tablet_id, 
ENODEV);
             :       }
             :       group_uuid_indices = &valid_uuid_indices;
             :     } else {
             :       group_uuid_indices = &group->uuid_indices();
             :     }
> not from this patch but can you revert the if to put the non-test logic fir
Done


PS4, Line 573: // TODO(awong): disk fullness and, to an extent, disk failure, 
are
             :     // transient states. If a disk is in either state at the 
time of group
             :     // creation, the resulting group may be below targeted size. 
Add
             :     // functionality to resize groups.
> the whole management of data dir lifecycles (retiring failed data dirs or f
Done


http://gerrit.cloudera.org:8080/#/c/7028/4/src/kudu/fs/data_dirs.h
File src/kudu/fs/data_dirs.h:

PS4, Line 258: const std::vector<std::unique_ptr<DataDir>>& data_dirs() const {
             :     return data_dirs_;
             :   }
> can you add docs stating whether this returns healthy/failed/all data dirs?
Done


PS4, Line 264: std::set<std::string> FindTabletsByDataDirUuidIdx(uint16_t 
uuid_idx);
> maybe refactor this to return a status and pass the set by pointer as an ou
I agree that when that happens, this will need to change, but since that isn't 
yet a feature, and since the status won't be needed in this patch, I think we 
should defer this until recovery is implemented.


-- 
To view, visit http://gerrit.cloudera.org:8080/7028
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Iee212793152de5de5198751d649ab34fb97f6aa2
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-HasComments: Yes

Reply via email to