Hi Brahma, Thanks for reporting the issue.
If your problem is really a network issue, then your proposed solution sounds reasonable to me, and it's different from what HDFS-6937 intends to solve. I think we can create a new jira for your issue. Here is why: HDFS-6937's scenario is that we keep replacing the third node during recovery and never detect that the middle node is corrupt, so adding a corruption check for the middle node solves that issue. In your case, even if we checked the middle node, it would appear to be healthy. The problem is that we don't have a check for network issues (and adding a network check may not be feasible here). On the other hand, if it's not a network issue, then it could be caused by HDFS-4660, if you don't already have that fix.

Hope my explanation makes sense. Thanks.

--Yongjun

On Sat, Jul 30, 2016 at 4:03 AM, Brahma Reddy Battula <brahmareddy.batt...@huawei.com> wrote:

> Hello
>
> We have come across an issue where a write fails even though 7 DNs are
> available, due to a network fault at one datanode which is
> LAST_IN_PIPELINE. It is similar to HDFS-6937.
>
> Scenario: (DN3 has a network fault and min replication = 2)
>
> Write pipeline:
> DN1 -> DN2 -> DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN2 is marked as bad
> DN1 -> DN4 -> DN3 => DN3 gives an ERROR_CHECKSUM ack, so DN4 is marked as bad
> ...
> And so on (every time DN3 is LAST_IN_PIPELINE), until there are no more
> datanodes left to construct the pipeline.
>
> We are thinking of handling it like this:
>
> Instead of throwing an IOException for an ERROR_CHECKSUM ack from
> downstream, we can send back the pipeline ack, and on the client side
> replace both DN2 and DN3 with new nodes, since we can't decide which one
> has the network problem.
>
> Please give your views on the possible fix.
>
> --Brahma Reddy Battula
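To make the proposed handling concrete, below is a minimal, self-contained sketch of the client-side decision described in the quoted mail. It is not the actual DFSOutputStream/DataStreamer code; the Status enum, the reply array, and nodesToReplace are simplified stand-ins invented here for illustration. The idea is that when the last node reports ERROR_CHECKSUM, the client cannot tell whether the reporter or its upstream neighbour (or the link between them) is at fault, so it schedules both for replacement instead of only the upstream node.

import java.util.ArrayList;
import java.util.List;

public class PipelineRecoverySketch {

  // Simplified stand-in for the ack status of each datanode in the pipeline.
  enum Status { SUCCESS, ERROR, ERROR_CHECKSUM }

  /**
   * Given the per-datanode reply statuses from a pipeline ack (index 0 is
   * the first node in the pipeline), return the indices of the nodes the
   * client should replace.
   */
  static List<Integer> nodesToReplace(Status[] replies) {
    List<Integer> bad = new ArrayList<>();
    for (int i = 0; i < replies.length; i++) {
      if (replies[i] == Status.ERROR_CHECKSUM) {
        if (i > 0) {
          bad.add(i - 1);   // upstream neighbour: may have sent corrupt data
        }
        bad.add(i);         // reporter: may itself have a faulty NIC/link
        break;
      }
    }
    return bad;
  }

  public static void main(String[] args) {
    String[] pipeline = {"DN1", "DN2", "DN3"};
    // DN3, last in the pipeline, reports a checksum error on the ack path.
    Status[] replies = {Status.SUCCESS, Status.SUCCESS, Status.ERROR_CHECKSUM};
    for (int i : nodesToReplace(replies)) {
      System.out.println("replace " + pipeline[i]);   // prints DN2 and DN3
    }
  }
}

With the pipeline DN1 -> DN2 -> DN3 from the scenario above, this picks both DN2 and DN3 for replacement, which matches the proposed behaviour and avoids repeatedly blaming only the middle node while the faulty last node stays in the pipeline.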