[
https://issues.apache.org/jira/browse/HDFS-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14987971#comment-14987971
]
Haohui Mai commented on HDFS-9103:
----------------------------------
Thanks for updating the patch.
{code}
+ auto name_match = [dn](const Entry& entry) {
+ if (dn == entry.first) {
+ return true;
+ }
+ return false;
+ };
+ auto entry = std::find_if(datanodes_.begin(), datanodes_.end(), name_match);
+
+ if (entry != datanodes_.end()) {
+ /* if entry already exists remove it so one with a new timestamp can be
+ * added */
+ datanodes_.erase(entry);
+ }
+
+ /* insert a new entry with bad node and current time */
+ datanodes_.insert(Entry(std::string(dn), Clock::now()));
{code}
The {{name_match}} function can be simplified. It looks like the comments are
unnecessary.
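For example, keeping the {{set<Entry>}} layout, the predicate collapses to a
single expression and the erase/insert sequence needs no comments (a sketch,
not the final code):
{code}
auto entry = std::find_if(
    datanodes_.begin(), datanodes_.end(),
    [dn](const Entry &e) { return dn == e.first; });

if (entry != datanodes_.end()) {
  datanodes_.erase(entry);
}
datanodes_.insert(Entry(std::string(dn), Clock::now()));
{code}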
{code}
+std::set<std::string> BadDataNodeTracker::GetNodesToExclude() {
{code}
This function is on the critical path, so the result should be cached.
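For example, a cached copy could be rebuilt only when the tracker changes
(a sketch; {{exclude_set_}} and {{exclude_set_valid_}} are hypothetical members
that would be invalidated wherever {{datanodes_}} is modified):
{code}
std::set<std::string> BadDataNodeTracker::GetNodesToExclude() {
  std::lock_guard<std::mutex> update_lock(datanodes_update_lock_);
  if (!exclude_set_valid_) {
    // Rebuild only after AddBadNode()/expiry has touched datanodes_.
    exclude_set_.clear();
    for (const auto &node : datanodes_) {
      exclude_set_.insert(node.first);
    }
    exclude_set_valid_ = true;
  }
  return exclude_set_;
}
{code}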
{code}
+void BadDataNodeTracker::Clear() {
+ std::lock_guard<std::mutex> update_lock(datanodes_update_lock_);
+ datanodes_.clear();
+}
{code}
Looks like this is a test-only function. Let's name it "TEST_Clear()" for
clarity.
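A sketch of the rename, keeping the body from the patch:
{code}
void BadDataNodeTracker::TEST_Clear() {
  std::lock_guard<std::mutex> update_lock(datanodes_update_lock_);
  datanodes_.clear();
}
{code}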
{code}
+ BadDataNodeTracker() : timeout_duration_(10 * 60) {}
+ unsigned int timeout_duration_; /* in seconds */
{code}
The timeout setting needs to be passed in through the {{Option}} object.
{{timeout_duration_}} can be marked as const.
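For example, the constructor could take the timeout from the client options
(a sketch; both the {{Options}} parameter type and the
{{host_exclusion_duration}} field name are assumptions, not existing code):
{code}
// Sketch only: the option field is hypothetical.
BadDataNodeTracker::BadDataNodeTracker(const Options &options)
    : timeout_duration_(options.host_exclusion_duration) {}

const unsigned int timeout_duration_; /* in seconds, fixed at construction */
{code}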
{code}
+ typedef std::chrono::system_clock Clock;
+ typedef std::pair<std::string, TimePoint> Entry;
+ std::set<Entry> datanodes_;
{code}
The clock needs to be a {{steady_clock}} for monotonicity. Are there any
reasons for using a {{set<Entry>}} instead of a {{vector}} / {{map}}? A
{{vector}} would have much better cache locality in this case.
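A self-contained sketch of the suggested shape, with a monotonic clock and the
container keyed by node id (the {{map}} layout is an assumption, not the patch):
{code}
#include <chrono>
#include <map>
#include <mutex>
#include <string>

class BadDataNodeTracker {
 public:
  // steady_clock is monotonic and never jumps backwards, unlike system_clock.
  typedef std::chrono::steady_clock Clock;
  typedef Clock::time_point TimePoint;

  void AddBadNode(const std::string &dn) {
    std::lock_guard<std::mutex> lock(datanodes_update_lock_);
    // operator[] inserts or refreshes the timestamp in one step, so the
    // find_if/erase/insert sequence goes away.
    datanodes_[dn] = Clock::now();
  }

 private:
  std::mutex datanodes_update_lock_;
  std::map<std::string, TimePoint> datanodes_;  // node id -> time marked bad
};
{code}
A keyed container also answers "is this node currently excluded?" with a single
lookup instead of a linear scan.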
{code}
if (!s.ok()) {
+ /* determine if DN gets marked bad */
+ if (ShouldExclude(s)) {
+ bad_datanodes_->AddBadNode(contacted_datanode);
+ }
{code}
The above code can be simplified.
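One possible shape (a sketch; it assumes the rest of the {{!s.ok()}} branch
stays as in the patch):
{code}
// ShouldExclude() already returns false for an OK status, so the call can be
// the only condition and the explanatory comment goes away.
if (ShouldExclude(s)) {
  bad_datanodes_->AddBadNode(contacted_datanode);
}
{code}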
{code}
+static bool ShouldExclude(const Status &s) {
+ if (s.ok()) {
+ return false;
+ }
+
+ switch (s.code()) {
+ // resource might free up
+ case Status::kResourceUnavailable:
+ return false;
+ case Status::kInvalidArgument:
+ case Status::kUnimplemented:
+ case Status::kException:
+ default:
+ return true;
+ }
+ return true;
+}
{code}
The function can be placed in the {{InputStream}} class.
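The declaration might then look like this (a sketch of placement only; the rest
of the {{InputStream}} interface and the {{Status}} type are elided):
{code}
class InputStream {
 public:
  // ... existing read/pread interface ...

  // Decides whether a failed read should put the DataNode on the exclusion
  // list for subsequent retries.
  static bool ShouldExclude(const Status &status);
};
{code}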
> Retry reads on DN failure
> -------------------------
>
> Key: HDFS-9103
> URL: https://issues.apache.org/jira/browse/HDFS-9103
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: Bob Hansen
> Assignee: James Clampffer
> Fix For: HDFS-8707
>
> Attachments: HDFS-9103.1.patch, HDFS-9103.2.patch,
> HDFS-9103.HDFS-8707.006.patch, HDFS-9103.HDFS-8707.3.patch,
> HDFS-9103.HDFS-8707.4.patch, HDFS-9103.HDFS-8707.5.patch
>
>
> When AsyncPreadSome fails, add the failed DataNode to the excluded list and
> try again.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)