Andrew Wong has posted comments on this change. Change subject: disk failure: make DataDirManager failure-aware ......................................................................
Patch Set 4: (7 comments) http://gerrit.cloudera.org:8080/#/c/7028/1/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: Line 587: candidate_indices.erase(candidate_indices.begin()); > This needs some locking. Good point, although I think the lock may need to be external. http://gerrit.cloudera.org:8080/#/c/7028/4/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: PS4, Line 480: LOG(WARNING) << Substitute("DataDirGroup for tablet $0 is smaller than targeted. " : "Target size: $1, Actual size: $2", tablet_id, : FLAGS_fs_target_data_dirs_per_tablet, group_indices.size()); > can you add more info? like total num data dirs and total num failed data d Done PS4, Line 509: if (PREDICT_FALSE(!failed_data_dirs_.empty())) { : for (uint16_t uuid_index : group->uuid_indices()) { : if (!ContainsKey(failed_data_dirs_, uuid_index)) { : valid_uuid_indices.emplace_back(uuid_index); : } : } : if (valid_uuid_indices.empty()) { : return Status::IOError("No healthy directories exist in tablet's " : "DataDirGroup", opts.tablet_id, ENODEV); : } : group_uuid_indices = &valid_uuid_indices; > how about creating a helper method to filter failed dirs from a group of uu Done, open to API suggestions, for now I have it return a bool indicating whether there are failed data dirs with a pointer input (to avoid an extra allocation). PS4, Line 503: // Get the data dir group for the tablet. : DataDirGroup* group = FindOrNull(group_by_tablet_map_, opts.tablet_id); : if (group == nullptr) { : return Status::NotFound("Tried to get directory but no DataDirGroup " : "registered for tablet", opts.tablet_id); : } : if (PREDICT_FALSE(!failed_data_dirs_.empty())) { : for (uint16_t uuid_index : group->uuid_indices()) { : if (!ContainsKey(failed_data_dirs_, uuid_index)) { : valid_uuid_indices.emplace_back(uuid_index); : } : } : if (valid_uuid_indices.empty()) { : return Status::IOError("No healthy directories exist in tablet's " : "DataDirGroup", opts.tablet_id, ENODEV); : } : group_uuid_indices = &valid_uuid_indices; : } else { : group_uuid_indices = &group->uuid_indices(); : } > not from this patch but can you revert the if to put the non-test logic fir Done PS4, Line 573: // TODO(awong): disk fullness and, to an extent, disk failure, are : // transient states. If a disk is in either state at the time of group : // creation, the resulting group may be below targeted size. Add : // functionality to resize groups. > the whole management of data dir lifecycles (retiring failed data dirs or f Done http://gerrit.cloudera.org:8080/#/c/7028/4/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS4, Line 258: const std::vector<std::unique_ptr<DataDir>>& data_dirs() const { : return data_dirs_; : } > can you add docs stating whether this returns healthy/failed/all data dirs? Done PS4, Line 264: std::set<std::string> FindTabletsByDataDirUuidIdx(uint16_t uuid_idx); > maybe refactor this to return a status and pass the set by pointer as an ou I agree that when that happens, this will need to change, but since that isn't yet a feature, and since the status won't be needed in this patch, I think we should defer this until recovery is implemented. -- To view, visit http://gerrit.cloudera.org:8080/7028 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iee212793152de5de5198751d649ab34fb97f6aa2 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
