[ https://issues.apache.org/jira/browse/HDFS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732318#comment-14732318 ]

Walter Su commented on HDFS-8383:
---------------------------------

*1. What's the difference between datanodeError and externalError?*

They are both error states of a streamer.
datanodeError is set internally, by the streamer itself. externalError is set
externally, by DFSOutputStream.
We provide one node for each internal block, and there is no node replacement,
so if one node is marked as failed, its streamer is dead.
externalError is an error signal from outside; it means another streamer has
datanodeError and is probably dead already. In this case, all the remaining
healthy streamers receive externalError from DFSOutputStream and prepare to
start a recovery.
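
To make the two flags concrete, here is a minimal sketch; the class, field,
and method names are illustrative, not the actual DataStreamer code:

{code:java}
// Minimal sketch of the two error states described above.
class StreamerErrorState {
    // Set by the streamer itself when its single datanode fails.
    // There is no node replacement, so the streamer is then dead.
    private volatile boolean datanodeError = false;

    // Set from outside by DFSOutputStream when a sibling streamer failed,
    // telling this (healthy) streamer to prepare for a recovery.
    private volatile boolean externalError = false;

    void markDatanodeError() { datanodeError = true; }
    void setExternalError()  { externalError = true; }

    boolean isDead()         { return datanodeError; }
    boolean shouldRecover()  { return externalError && !datanodeError; }
}
{code}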

*2. What's the difference between {{failed}} and datanodeError?*

Mostly no difference; {{failed}} can be removed. Some unexpected errors are
not datanodeError but should still count as {{failed}}, e.g. an NPE; in that
case the streamer will close. So {{failed}} == error && streamerClosed.
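
In other words, assuming the flags from the sketch under #1 plus a generic
unexpectedError flag for cases like NPE:

{code:java}
// Illustrative one-liner for the relation above; not the real fields.
boolean failed = (datanodeError || unexpectedError) && streamerClosed;
{code}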

*3. How does a recovery begin?*

The failed streamer, which has datanodeError, will be dead; it does not
trigger the recovery itself. When a streamer fails, it saves lastException.
When DFSOutputStream writes to this streamer, it first calls
{{checkClosed()}}, which checks lastException to see whether the streamer is
healthy. Once DFSOutputStream finds out the streamer failed, it notifies the
other streamers by setting externalError, and those streamers begin the
recovery.
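
A sketch of that detection path; the {{Streamer}} interface below stands in
for DataStreamer, and its method names are assumed, not the real API:

{code:java}
import java.io.IOException;
import java.util.List;

// Stand-in for DataStreamer; method names are assumed.
interface Streamer {
    void checkClosed() throws IOException; // rethrows saved lastException
    void enqueue(byte[] chunk);
    void setExternalError();
}

class FailurePropagationSketch {
    // The write path discovers a dead streamer via its saved exception.
    void writeChunkTo(Streamer streamer, byte[] chunk) throws IOException {
        streamer.checkClosed();
        streamer.enqueue(chunk);
    }

    // Once a failure is seen, signal every healthy streamer to recover.
    void notifyOthers(List<Streamer> streamers, Streamer failed) {
        for (Streamer s : streamers) {
            if (s != failed) {
                s.setExternalError();
            }
        }
    }
}
{code}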

*4. How does a recovery begin if DFSOutputStream doesn't write to the failed 
streamer?*

Suppose DFSOutputStream has just finished writing to streamer#3, and
streamer#5 has already failed. DFSOutputStream then happens to suspend
(possible if the client calls write(b) slowly) and never touches streamer#5
again, so it doesn't know streamer#5 failed, and no recovery starts. When it
calls {{close()}}, it checks streamer#5 one last time, and that triggers the
recovery.
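
A sketch of that last-chance check in {{close()}}, reusing the assumed
{{Streamer}} interface from the sketch under #3:

{code:java}
// Sketch: close() checks every streamer one final time, so a failure the
// write path never touched (streamer#5 above) still triggers the recovery.
void closeAll(List<Streamer> streamers) throws IOException {
    for (Streamer s : streamers) {
        s.checkClosed(); // surfaces streamer#5's saved lastException here
    }
}
{code}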

*5. What if a second streamer failed during recovery?*

The first recovery will succeed. The second failed streamer will have
datanodeError and be dead. A second recovery begins once the conditions of
#3/#4 have been met.

*6. How does a second recovery begin if the first recovery (1001-->1002) is
unfinished?*

The second recovery will be scheduled. It should bump the GS to 1003, because
some failed streamer from the first recovery may already have finished bumping
the GS to 1002; so the second recovery has to bump past it, to 1003. The
second recovery should wait for (or force) the first one to finish.

*7. How does a third recovery begin if the first recovery (1001-->1002) is
unfinished?*

The third recovery is merged with the second one; the recovery is only
scheduled once.
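
A sketch of the schedule-once and bump-past rules from #6/#7 (illustrative
only, not the actual Coordinator):

{code:java}
// Sketch for #6/#7: while the first recovery (1001->1002) is unfinished,
// a new failure schedules exactly one more recovery, and later failures
// merge into it. The scheduled recovery targets a GS strictly past
// anything the unfinished recovery may already have written.
class RecoverySchedulerSketch {
    private long highestPossibleGS = 1002; // GS the first recovery may reach
    private boolean nextRecoveryScheduled = false;

    synchronized void onFailureDuringRecovery() {
        // a third failure lands here too and is merged: schedule only once
        nextRecoveryScheduled = true;
    }

    // Runs after the first recovery finishes (we wait for it, or force it).
    synchronized long nextTargetGS() {
        return highestPossibleGS + 1; // 1003: strictly past the first recovery
    }
}
{code}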

Have I answered your questions, Jing?

==

*Follow-on:*

1. Remove {{failed}}.

2. Make the Coordinator periodically search for failed streamers and start
recovery automatically; it shouldn't depend on DFSOutputStream. A sketch of
this idea follows.
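
This is only a sketch of the idea, with assumed names (StreamerErrorState is
from the sketch under #1; scheduleRecovery() is a hypothetical entry point):

{code:java}
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of follow-on #2: the Coordinator polls the streamers itself and
// starts a recovery without waiting for DFSOutputStream to write.
class CoordinatorSketch {
    private final ScheduledExecutorService checker =
        Executors.newSingleThreadScheduledExecutor();

    void start(List<StreamerErrorState> streamers) {
        checker.scheduleWithFixedDelay(() -> {
            for (StreamerErrorState s : streamers) {
                if (s.isDead()) {
                    scheduleRecovery(); // hypothetical recovery entry point
                    break;
                }
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    private void scheduleRecovery() { /* bump GS on the healthy streamers */ }
}
{code}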

3. We fake {{DataStreamer#block}} if the streamer failed, and we also fake
{{DataStreamer#bytesCurBlock}}. But {{dataQueue}} is lost. (DFSOutputStream is
async with the streamer, so when DFSOutputStream begins writing the next
block, part of {{dataQueue}} belongs to the old block and part of it belongs
to the new block.) So it's hard to restart a failed streamer when moving on to
the next blockGroup.
We have 2 options:
3a. Replace the failed streamer with a new one. We have to cache the new-block
part of {{dataQueue}} (see the sketch after this item).
3b. Restart the failed streamer.
HDFS-8704 tries to restart the failed streamer. HDFS-8704 disables
{{checkClosed()}} and treats the failed streamer as a normal one, so
{{dataQueue}} is not lost, and we can simplify {{getBlockGroup}}.
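
A sketch of the split that option 3a would need; {{Packet}} here is a
hypothetical stand-in type, not the real packet class:

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch for option 3a: because DFSOutputStream is async with the
// streamer, packets of the old and new block coexist in dataQueue; only
// the new-block part may be handed to the replacement streamer.
class DataQueueSplitSketch {
    record Packet(long blockId, byte[] data) {}

    Deque<Packet> cacheNewBlockPart(Deque<Packet> dataQueue, long oldBlockId) {
        Deque<Packet> newBlockPart = new ArrayDeque<>();
        for (Packet p : dataQueue) {
            if (p.blockId() != oldBlockId) {
                newBlockPart.add(p); // cache for the replacement streamer
            }
        }
        return newBlockPart;
    }
}
{code}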

4. Block recovery when the block group is ending. This is nothing like
BlockConstructionStage.PIPELINE_CLOSE_RECOVERY. The fastest streamer has ended
the previous block and sent a request to the NN for a new block group, while
some streamers are still planning to bump the GS for the old blocks. There is
no way to bump the GS of an ended/finalized block. I have no clue how to solve
this. My first plan is to disable block recovery in this situation.
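
The first plan could be as simple as a guard like the following (purely
illustrative; the names are assumed, not real HDFS types):

{code:java}
// Sketch for follow-on #4: an ended/finalized block cannot have its GS
// bumped, so simply skip block recovery once the block group is ending.
class EndOfBlockGroupGuard {
    enum BlockGroupState { WRITING, ENDING }

    boolean mayStartBlockRecovery(BlockGroupState state) {
        // once the fastest streamer has asked the NN for the next block
        // group, the old blocks may be finalized: no GS bump is possible
        return state == BlockGroupState.WRITING;
    }
}
{code}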

> Tolerate multiple failures in DFSStripedOutputStream
> ----------------------------------------------------
>
>                 Key: HDFS-8383
>                 URL: https://issues.apache.org/jira/browse/HDFS-8383
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Tsz Wo Nicholas Sze
>            Assignee: Walter Su
>         Attachments: HDFS-8383.00.patch, HDFS-8383.01.patch