[
https://issues.apache.org/jira/browse/HDFS-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14732318#comment-14732318
]
Walter Su commented on HDFS-8383:
---------------------------------
*1. What's the difference between datanodeError and externalError?*
They are both error states of a streamer.
datanodeError is set internally, by the streamer itself. externalError is set
externally, by DFSOutputStream.
We provide one node for each internal block and have no node replacement, so
once a node is marked as in error, the streamer is dead.
externalError is an error signal from outside: it means another streamer has a
datanodeError and is probably dead already. In this case, all the remaining
healthy streamers receive externalError from DFSOutputStream and prepare to
start a recovery.
*2. What's the difference between {{failed}} and datanodeError?*
Mostly no difference; {{failed}} can be removed. Some unexpected errors, such
as an NPE, are not datanodeError but should still count as {{failed}}; in that
case the streamer will close. So {{failed}} == error && streamerClosed.
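A minimal sketch of how the flags in questions 1 and 2 could relate. The field
and method names are assumptions for illustration, not the actual
DataStreamer/DFSStripedOutputStream code:
{code:java}
// Illustrative sketch only; names are assumptions, not the real Hadoop code.
class StripedStreamerSketch {
  // Set by the streamer itself when its single datanode fails.
  private volatile boolean datanodeError;
  // Set from outside by the output stream when another streamer failed and
  // this (still healthy) streamer must join a group recovery.
  private volatile boolean externalError;
  // Any unexpected error (e.g. an NPE) is remembered here; the streamer
  // then closes itself.
  private volatile Throwable lastException;
  private volatile boolean streamerClosed;

  void setExternalError() { externalError = true; }

  boolean hasError() {
    return datanodeError || externalError || lastException != null;
  }

  // The redundant "failed" flag discussed above:
  // failed == error && streamerClosed.
  boolean isFailed() { return hasError() && streamerClosed; }
}
{code}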
*3. How does a recovery begin?*
The failed streamer, which has datanodeError, will be dead; it does not trigger
recovery by itself. When a streamer fails, it saves lastException. When
DFSOutputStream writes to a streamer, it first calls {{checkClosed()}}, which
checks lastException to see whether the streamer is healthy. Once
DFSOutputStream finds out that the streamer failed, it notifies the other
streamers by setting externalError on them, and those streamers begin recovery.
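A rough sketch of that write-path trigger, reusing the StripedStreamerSketch
class above. The method names are hypothetical, not the real DFSStripedOutputStream
API:
{code:java}
// Sketch of the recovery trigger on the write path. Names are assumptions
// for illustration only.
class StripedOutputStreamSketch {
  private final StripedStreamerSketch[] streamers;

  StripedOutputStreamSketch(StripedStreamerSketch[] streamers) {
    this.streamers = streamers;
  }

  void writeChunkTo(int idx, byte[] chunk) {
    StripedStreamerSketch s = streamers[idx];
    // checkClosed(): before writing, look at the streamer's error state.
    if (s.hasError()) {
      // The failed streamer is dead; tell every healthy streamer to start
      // a recovery by setting externalError on it.
      for (StripedStreamerSketch other : streamers) {
        if (other != s && !other.hasError()) {
          other.setExternalError();
        }
      }
      return; // the failed streamer gets no more data
    }
    // ... otherwise enqueue the chunk to this streamer's dataQueue ...
  }
}
{code}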
*4. How does a recovery begin if DFSOutputStream doesn't write to the failed
streamer?*
Suppose DFSOutputStream has just finished writing to streamer#3 and streamer#5
has already failed. DFSOutputStream then happens to stall (possible if the
client calls write(b) slowly) and never touches streamer#5 again.
DFSOutputStream does not know that streamer#5 failed, so no recovery starts.
When it calls {{close()}}, it checks streamer#5 one last time and triggers the
recovery then.
*5. What if a second streamer failed during recovery?*
The first recovery will succeed. The second failed streamer will have
datanodeError and be dead. A second recovery begins once the conditions of
#3/#4 are met.
*6. How does a second recovery begin if the first recovery (1001 --> 1002) is
unfinished?*
The second recovery will be scheduled. It should bump the GS to 1003, because
the streamer that triggers the second recovery may already have finished
bumping its GS to 1002 during the first recovery, so the second recovery must
bump to 1003. The second recovery should wait for (or force) the first one to
finish.
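A minimal sketch of how the next GS could be chosen, assuming the coordinator
can see the GS each streamer has reached (names are illustrative, not the
actual coordinator code):
{code:java}
// Sketch: pick the generation stamp for the next recovery so it is strictly
// newer than any GS a streamer may already have bumped to.
final class GenerationStampSketch {
  static long nextRecoveryGS(long[] streamerGenerationStamps) {
    long max = Long.MIN_VALUE;
    for (long gs : streamerGenerationStamps) {
      max = Math.max(max, gs);
    }
    // If the first recovery already bumped some streamers from 1001 to 1002,
    // the second recovery must use 1003, never 1002 again.
    return max + 1;
  }
}
{code}
With the numbers above, {{nextRecoveryGS(new long[]{1001, 1002, 1002})}}
returns 1003.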
*7. How does a third recovery begin if the first recovery (1001 --> 1002) is
unfinished?*
The third recovery is merged with the second one; it is only scheduled once.
Have I answered your questions, Jing?
==
*follow-on:*
1. Remove {{failed}}.
2. The Coordinator should periodically search for failed streamers and start
recovery automatically; it shouldn't depend on DFSOutputStream (see the sketch
after this list).
3. We fake {{DataStreamer#block}} if the streamer failed, and we also fake
{{DataStreamer#bytesCurBlock}}, but the {{dataQueue}} is lost. (DFSOutputStream
is asynchronous with the streamer, so part of the {{dataQueue}} belongs to the
old block and part of it belongs to the new block by the time DFSOutputStream
begins writing the next block.) So it's hard to restart a failed streamer when
moving on to the next block group.
We have 2 options:
3a. Replace the failed streamer with a new one; we have to cache the new-block
part of the {{dataQueue}}.
3b. Restart the failed streamer.
HDFS-8704 tries to restart the failed streamer. It disables {{checkClosed()}}
and treats the failed streamer as a normal one, so the {{dataQueue}} is not
lost, and we can simplify {{getBlockGroup}}.
4. Block recovery when the block group is ending. This is nothing like
BlockConstructionStage.PIPELINE_CLOSE_RECOVERY. The fastest streamer has
already ended the previous block and sent a request to the NN to get a new
block group, while some streamers are still planning to bump the GS for the
old blocks. There is no way to bump an ended/finalized block. I have no clue
how to solve this yet; my first plan is to disable block recovery in this
situation.
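For follow-on item 2, a rough sketch of what a polling coordinator could look
like, reusing the StripedStreamerSketch class from above. All names are
hypothetical; this is not the existing Coordinator API:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a coordinator that polls streamers for failures and
// starts recovery itself, instead of relying on DFSOutputStream's write path
// (or close()) to notice the failure.
class RecoveryCoordinatorSketch {
  private final StripedStreamerSketch[] streamers;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  RecoveryCoordinatorSketch(StripedStreamerSketch[] streamers) {
    this.streamers = streamers;
  }

  void start() {
    // Periodically scan for failed streamers.
    scheduler.scheduleWithFixedDelay(this::checkStreamers, 1, 1, TimeUnit.SECONDS);
  }

  private void checkStreamers() {
    boolean anyFailed = false;
    for (StripedStreamerSketch s : streamers) {
      anyFailed |= s.hasError();
    }
    if (anyFailed) {
      for (StripedStreamerSketch s : streamers) {
        if (!s.hasError()) {
          s.setExternalError(); // healthy streamers run the GS-bump recovery
        }
      }
    }
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
{code}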
> Tolerate multiple failures in DFSStripedOutputStream
> ----------------------------------------------------
>
> Key: HDFS-8383
> URL: https://issues.apache.org/jira/browse/HDFS-8383
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Tsz Wo Nicholas Sze
> Assignee: Walter Su
> Attachments: HDFS-8383.00.patch, HDFS-8383.01.patch
>
>