[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14876807#comment-14876807
 ] 

Jing Zhao commented on HDFS-9040:
---------------------------------

Good point about the GS bump! Some thoughts along this line:
# First, I totally agree that if the client finally commits the block, then 
based on the committed length we can identify the replicas reported from 
failed DNs and exclude them (assuming checksum is enabled). No GS bump is 
needed in this case. See the first sketch after this list.
# If the client fails before committing the block, the NN does not know the 
expected length. To recover the lease, the NN may have to contact all the 
DataNodes and identify the "safe length" of the block group (see the second 
sketch after this list). Because our current implementation (with GS bump) 
does not guarantee that an internal block with a higher GS also has a longer 
safe length (a healthy streamer may be very slow and left far behind), the 
GS bump is essentially useless here.
# Based on the above two, we have the first option: do not bump the GS for 
failures during writing; only bump the GS for append in the future.
# However, this option provides no lower bound on the safe length. When a 
failure happens and is detected by the StripedOutputStream, nothing is done 
to guarantee that some amount of data has been successfully written before 
the block gets committed. It is quite possible that when the user finally 
hits a write failure, only a little data can be recovered.
# Another issue: when the NN restarts and receives block reports from the 
DNs, it is hard for it to determine when to start the recovery. It may 
determine the safe length too early (e.g., based on only 6 of the 9 internal 
blocks having been reported) and truncate too much data. Thus the recovery 
logic based on block lengths can be complicated.
# We can have option 2: sync/flush the data periodically. Similarly to QFS, 
we can flush the data out every 1MB or every n stripes, or we can choose to 
flush the data only when failures are detected. It will definitely be slower, 
but it is safer, and some optimizations can be done: e.g., we can rule out 
slow streamers as long as we still have enough healthy streamers. The GS bump 
can then come into the picture to help simplify the recovery: we can 
guarantee that a new GS indicates some level of safe length (since we flush 
the data to the DNs before the GS bump), and when the NN does the recovery 
later, the GS can help it determine which DataNodes should be included in 
the recovery process. See the third sketch after this list.
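
To make the first point concrete, here is a minimal sketch of how the NN 
could use the committed group length to exclude replicas from failed DNs. 
This is not the branch code; the cell size, schema, and all names are made 
up for illustration, and parity blocks are left out for brevity:

{code:java}
// Hypothetical sketch (not the branch code): given the committed length of
// an RS(6,3) block group, derive the expected length of each data internal
// block, so replicas reported shorter than expected (i.e., from failed DNs)
// can be excluded. Parity internal blocks are omitted for brevity.
public class CommittedLengthCheck {
  static final int CELL_SIZE = 64 * 1024; // assumed cell size
  static final int NUM_DATA_BLOCKS = 6;   // RS(6,3)

  /** Expected length of data internal block i when the block group is
   *  committed at groupLen bytes. */
  static long expectedInternalLength(long groupLen, int i) {
    long stripeSize = (long) CELL_SIZE * NUM_DATA_BLOCKS;
    long fullStripes = groupLen / stripeSize;
    long lastStripe = groupLen % stripeSize;
    // Block i gets at most one more (possibly partial) cell from the final
    // partial stripe.
    long lastCell = Math.min(Math.max(lastStripe - (long) i * CELL_SIZE, 0L),
        CELL_SIZE);
    return fullStripes * CELL_SIZE + lastCell;
  }

  /** With checksum verifying the content, a replica is consistent with the
   *  committed length iff it is at least the expected internal length. */
  static boolean isHealthy(long reportedLen, long groupLen, int blockIndex) {
    return reportedLen >= expectedInternalLength(groupLen, blockIndex);
  }
}
{code}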
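
For the second point, the safe length computation could look roughly like 
the following (again hypothetical; it rounds down to full stripes and 
ignores the last partial stripe). With RS(6,3), every stripe needs at least 
6 of its 9 cells, so the 6th-longest reported internal block bounds the 
recoverable length, which is exactly why one very slow healthy streamer 
drags the safe length down regardless of GS:

{code:java}
// Hypothetical sketch of the lease-recovery side: compute the "safe length"
// of an RS(6,3) block group from the internal block lengths reported by the
// DataNodes. The 6th-longest internal block bounds the recoverable length.
import java.util.Arrays;

public class SafeLengthSketch {
  static final int CELL_SIZE = 64 * 1024; // assumed cell size
  static final int NUM_DATA_BLOCKS = 6;   // RS(6,3)

  static long getSafeLength(long[] reportedInternalLengths) {
    long[] lens = reportedInternalLengths.clone();
    Arrays.sort(lens); // ascending
    // The 6th-longest internal block: the shortest of the 6 best replicas.
    long kthLongest = lens[lens.length - NUM_DATA_BLOCKS];
    long fullCells = kthLongest / CELL_SIZE; // round down to full stripes
    return fullCells * CELL_SIZE * NUM_DATA_BLOCKS;
  }

  public static void main(String[] args) {
    // 9 internal blocks; some streamers fell behind or failed.
    long[] reported = {1048576, 1048576, 1048576, 1048576, 1048576,
        786432, 655360, 131072, 0};
    System.out.println(getSafeLength(reported)); // 4718592 = 12 full stripes
  }
}
{code}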
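
For option 2, the ordering between the flush and the GS bump is the key 
invariant. A sketch of that ordering (all types and methods here are 
hypothetical, not from the HDFS-7285 branch):

{code:java}
// Hypothetical sketch of option 2: flush every N stripes, and only bump the
// GS after the flush completes, so a new GS always certifies a known safe
// length on the DataNodes that carry it.
import java.io.IOException;
import java.util.List;

interface Streamer {
  void waitForAckedData() throws IOException; // wait until the DN acks all data
}

interface Coordinator {
  void bumpGenerationStamp(List<Streamer> live) throws IOException; // NN RPC
}

class PeriodicFlushPolicy {
  private final int stripesPerFlush; // e.g. 16 stripes = 1MB/block with 64KB cells
  private int stripesSinceFlush = 0;

  PeriodicFlushPolicy(int stripesPerFlush) {
    this.stripesPerFlush = stripesPerFlush;
  }

  /** Called after each full stripe is handed to the streamers. */
  boolean shouldFlush() {
    return ++stripesSinceFlush >= stripesPerFlush;
  }

  /** Flush first, bump second: slow streamers can be dropped here as long
   *  as at least 6 healthy streamers remain. */
  void flushAndBumpGS(List<Streamer> healthy, Coordinator coordinator)
      throws IOException {
    for (Streamer s : healthy) {
      s.waitForAckedData();
    }
    coordinator.bumpGenerationStamp(healthy);
    stripesSinceFlush = 0;
  }
}
{code}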

Thoughts?

> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests 
> to Coordinator)
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9040
>                 URL: https://issues.apache.org/jira/browse/HDFS-9040
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>         Attachments: HDFS-9040-HDFS-7285.002.patch, 
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, 
> HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer communicates with the NN to allocate/update 
> blocks, and the StripedDataStreamers only have to stream blocks to DNs.
> Proposal 2:
> See below the 
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
>  from [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
