Hi Brahma, Thanks for reporting the issue.
If your problem is really a network issue, then your proposed solution sounds reasonable to me, and it's different from what HDFS-6937 intends to solve. I think we can create a new jira for your issue. Here is why: HDFS-6937's scenario is that we keep replacing the third node during recovery and never detect that the middle node is corrupt, so adding a corruption check for the middle node solves that issue. In your case, even if we checked the middle node, it would appear to be healthy. The problem is that we don't have a check for network issues (and adding a network check may not be feasible here). On the other hand, if it's not a network issue, then it could be caused by HDFS-4660, if you don't already have that fix.

Hope my explanation makes sense. Thanks.

--Yongjun

On Sat, Jul 30, 2016 at 4:03 AM, Brahma Reddy Battula <brahmareddy.batt...@huawei.com> wrote:

> Hello
>
> We have come across an issue where a write fails even though 7 DNs are
> available, due to a network fault at one datanode which is
> LAST_IN_PIPELINE. It is similar to HDFS-6937.
>
> Scenario: (DN3 has a network fault and min replication = 2)
>
> Write pipeline:
> DN1 -> DN2 -> DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad
> DN1 -> DN4 -> DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad
> ...
> And so on (every time DN3 is LAST_IN_PIPELINE), until there are no more
> datanodes left to construct the pipeline.
>
> We are thinking of handling it like this:
>
> Instead of throwing an IOException for an ERROR_CHECKSUM ack from
> downstream, we can send back the pipeline ack, and on the client side
> replace both DN2 and DN3 with new nodes, since we can't decide which one
> has the network problem.
>
> Please give your views on the possible fix.
>
> --Brahma Reddy Battula
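To make the proposed handling concrete, below is a minimal, self-contained sketch of the client-side decision described in the quoted mail. It is not the actual DFSOutputStream/DataStreamer code; the Status enum, the reply array, and nodesToReplace are simplified stand-ins invented here for illustration. The idea is that when the last node reports ERROR_CHECKSUM, the client cannot tell whether the reporter or its upstream neighbour (or the link between them) is at fault, so it schedules both for replacement instead of only the upstream node.

import java.util.ArrayList;
import java.util.List;

public class PipelineRecoverySketch {

  // Simplified stand-in for the ack status of each datanode in the pipeline.
  enum Status { SUCCESS, ERROR, ERROR_CHECKSUM }

  /**
   * Given the per-datanode reply statuses from a pipeline ack (index 0 is
   * the first node in the pipeline), return the indices of the nodes the
   * client should replace.
   */
  static List<Integer> nodesToReplace(Status[] replies) {
    List<Integer> bad = new ArrayList<>();
    for (int i = 0; i < replies.length; i++) {
      if (replies[i] == Status.ERROR_CHECKSUM) {
        if (i > 0) {
          bad.add(i - 1);   // upstream neighbour: may have sent corrupt data
        }
        bad.add(i);         // reporter: may itself have a faulty NIC/link
        break;
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    String[] pipeline = {"DN1", "DN2", "DN3"};
    // DN3, last in the pipeline, reports a checksum error on the ack path.
    Status[] replies = {Status.SUCCESS, Status.SUCCESS, Status.ERROR_CHECKSUM};
    for (int i : nodesToReplace(replies)) {
      System.out.println("replace " + pipeline[i]);   // prints DN2 and DN3
    }
  }
}

With the pipeline DN1 -> DN2 -> DN3 from the scenario above, this picks both DN2 and DN3 for replacement, which matches the proposed behaviour and avoids repeatedly blaming only the middle node while the faulty last node stays in the pipeline.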