[
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14877451#comment-14877451
]
Zhe Zhang commented on HDFS-9040:
---------------------------------
Thanks for the comment, Jing. The discussion is getting more and more
interesting :)
bq. our current implementation (with GS bump) does not have the guarantee that
an internal block with higher GS must have longer safe length
This is a great observation. I used to think the GS was helpful for detecting
stale UC replicas in the read-being-written scenario, but actually reading from
a "slow replica" is as bad as reading from a stale one.
bq. To recover the lease, the NN may have to contact all the DataNodes and
identify the "safe length" of the block group.
The current lease recovery algorithm searches for the minimal length among all
"good" replicas (those with the correct GS), and then truncates all other
"good" replicas to that length. Does "safe length" refer to this minimal
length? As indicated in the HADOOP-1700 design
[doc|https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc],
the option of growing all "good" replicas to the maximum length was also
considered but abandoned due to overhead concerns. We could also consider doing
some data reconstruction during EC file lease recovery.
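To make that minimal-length rule concrete, here is a small sketch of it. The
Replica holder for the (GS, length) pair each DataNode reports is hypothetical,
not the actual recovery classes:
{code:java}
import java.util.List;

/**
 * Simplified sketch of the minimal-length rule, assuming a hypothetical
 * Replica holder; not the actual HDFS recovery code.
 */
class LeaseRecoverySketch {
  static class Replica {
    final long genStamp;   // generation stamp reported by the DataNode
    final long numBytes;   // on-disk length reported by the DataNode
    Replica(long genStamp, long numBytes) {
      this.genStamp = genStamp;
      this.numBytes = numBytes;
    }
  }

  /**
   * Returns the length all "good" replicas are truncated to: the minimum
   * length among replicas carrying the expected GS. Stale-GS replicas are
   * discarded entirely, even if they happen to be longer.
   */
  static long recoveryLength(List<Replica> replicas, long expectedGS) {
    long min = Long.MAX_VALUE;
    for (Replica r : replicas) {
      if (r.genStamp == expectedGS) {
        min = Math.min(min, r.numBytes);
      }
    }
    return min == Long.MAX_VALUE ? 0 : min;
  }
}
{code}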
I'm still trying to understand why we discard replicas with stale GS during
lease recovery. Per Jing's analysis, for non-EC files a replica with a higher
GS should have a larger length anyway, so this question was not important. But
in lease recovery for EC files, shouldn't we just make the decision based on
the lengths of the internal blocks? From another angle, if internal_block_1 has
a larger GS but a smaller length than internal_block_2, doesn't that mean
internal_block_2 is fresher?
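A length-only decision for EC files could look like the following sketch. It
assumes an RS(6,3) schema with a fixed 64 KB cell size (both are assumptions
for illustration, not the patch's actual constants): a stripe is decodable iff
at least 6 of the 9 internal blocks contain its cell, so the 6th-longest
internal block bounds the safe length, regardless of any block's GS.
{code:java}
import java.util.Arrays;

/**
 * Hypothetical safe-length computation from internal block lengths alone,
 * assuming RS(6,3) striping with a fixed cell size.
 */
class StripedSafeLengthSketch {
  static final int DATA_UNITS = 6;          // RS(6,3): 6 data + 3 parity
  static final long CELL_SIZE = 64 * 1024;  // assumed 64 KB cells

  static long safeLength(long[] internalBlockLens) {
    long[] lens = internalBlockLens.clone();
    Arrays.sort(lens);  // ascending
    // 6th-longest length sits at index (length - DATA_UNITS) after sorting
    long sixthLongest = lens[lens.length - DATA_UNITS];
    long fullStripes = sixthLongest / CELL_SIZE;   // align down to stripes
    return fullStripes * DATA_UNITS * CELL_SIZE;   // decodable bytes
  }
}
{code}
Under these assumptions, an internal block with a higher GS but a shorter
length simply contributes less decodable data, which is why length rather than
GS looks like the natural ordering here.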
Without append / truncate, the only use case for GS I can think of is
"Datanodes storing legacy blocks were dead for a long time and re-join the
cluster" (HADOOP-1497). [~jingzhao] Do you think this is why we consider GS in
calculating "safe length"?
> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests
> to Coordinator)
> -------------------------------------------------------------------------------------------
>
> Key: HDFS-9040
> URL: https://issues.apache.org/jira/browse/HDFS-9040
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Walter Su
> Attachments: HDFS-9040-HDFS-7285.002.patch,
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch,
> HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with the NN to allocate/update
> blocks, while the StripedDataStreamer instances only have to stream blocks
> to DNs.
> Proposal 2:
> See below the
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
> from [~jingzhao].
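For illustration, the coordinator pattern of Proposal 1 above might be sketched
as follows; the class and method names are made up here, not the ones in the
attached patches. One coordinator performs every NameNode RPC and fans the
resulting internal blocks out, so the streamers never talk to the NN
themselves.
{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative-only sketch of Proposal 1's coordinator pattern. */
class BlockGroupCoordinatorSketch {
  // Each streamer takes its next internal block from a queue instead of
  // calling the NameNode itself.
  private final List<BlockingQueue<String>> perStreamerBlocks = new ArrayList<>();

  BlockGroupCoordinatorSketch(int numStreamers) {
    for (int i = 0; i < numStreamers; i++) {
      perStreamerBlocks.add(new LinkedBlockingQueue<>());
    }
  }

  /** One NN round trip allocates the whole group; fan out the pieces. */
  void allocateBlockGroup() {
    // placeholder for the single addBlock-style RPC returning the group
    for (int i = 0; i < perStreamerBlocks.size(); i++) {
      perStreamerBlocks.get(i).offer("internalBlock-" + i);
    }
  }

  /** Streamer i blocks here until the coordinator hands it a block. */
  String takeBlock(int i) throws InterruptedException {
    return perStreamerBlocks.get(i).take();
  }
}
{code}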