David Ribeiro Alves has posted comments on this change. Change subject: disk failure: make DataDirManager failure-aware ......................................................................
Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/7028/4/src/kudu/fs/data_dirs.cc File src/kudu/fs/data_dirs.cc: PS4, Line 480: LOG(WARNING) << Substitute("DataDirGroup for tablet $0 is smaller than targeted. " : "Target size: $1, Actual size: $2", tablet_id, : FLAGS_fs_target_data_dirs_per_tablet, group_indices.size()); can you add more info? like total num data dirs and total num failed data dirs? PS4, Line 509: if (PREDICT_FALSE(!failed_data_dirs_.empty())) { : for (uint16_t uuid_index : group->uuid_indices()) { : if (!ContainsKey(failed_data_dirs_, uuid_index)) { : valid_uuid_indices.emplace_back(uuid_index); : } : } : if (valid_uuid_indices.empty()) { : return Status::IOError("No healthy directories exist in tablet's " : "DataDirGroup", opts.tablet_id, ENODEV); : } : group_uuid_indices = &valid_uuid_indices; how about creating a helper method to filter failed dirs from a group of uuid_indices , can potentially use that in other places PS4, Line 503: // Get the data dir group for the tablet. : DataDirGroup* group = FindOrNull(group_by_tablet_map_, opts.tablet_id); : if (group == nullptr) { : return Status::NotFound("Tried to get directory but no DataDirGroup " : "registered for tablet", opts.tablet_id); : } : if (PREDICT_FALSE(!failed_data_dirs_.empty())) { : for (uint16_t uuid_index : group->uuid_indices()) { : if (!ContainsKey(failed_data_dirs_, uuid_index)) { : valid_uuid_indices.emplace_back(uuid_index); : } : } : if (valid_uuid_indices.empty()) { : return Status::IOError("No healthy directories exist in tablet's " : "DataDirGroup", opts.tablet_id, ENODEV); : } : group_uuid_indices = &valid_uuid_indices; : } else { : group_uuid_indices = &group->uuid_indices(); : } not from this patch but can you revert the if to put the non-test logic first? PS4, Line 573: // TODO(awong): disk fullness and, to an extent, disk failure, are : // transient states. If a disk is in either state at the time of group : // creation, the resulting group may be below targeted size. Add : // functionality to resize groups. the whole management of data dir lifecycles (retiring failed data dirs or fixing them, adding new ones etc) likely merits a jira and not just a comment http://gerrit.cloudera.org:8080/#/c/7028/4/src/kudu/fs/data_dirs.h File src/kudu/fs/data_dirs.h: PS4, Line 258: const std::vector<std::unique_ptr<DataDir>>& data_dirs() const { : return data_dirs_; : } can you add docs stating whether this returns healthy/failed/all data dirs? PS4, Line 264: std::set<std::string> FindTabletsByDataDirUuidIdx(uint16_t uuid_idx); maybe refactor this to return a status and pass the set by pointer as an out param? if you plan to ever allow an admin to (hot) remove a data dir, might be useful to be able to return something like NotFound here is the uuid_idx doesn't exist. If this is not a concern, or at least not something that we'll want to do online, never mind -- To view, visit http://gerrit.cloudera.org:8080/7028 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Iee212793152de5de5198751d649ab34fb97f6aa2 Gerrit-PatchSet: 4 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Andrew Wong <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
