[
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900536#comment-14900536
]
Walter Su commented on HDFS-9040:
---------------------------------
bq. 5. Another issue is, when NN restarts and receives block reports from DN,
it's hard for it to determine when to start the recovery. It is possible that
it determines the safe length too early (e.g., based on 6/9 reported internal
blocks) and truncates too much data...
bq. 6.....And GS bump can come into the picture to help us simplify the
recovery: we can guarantee that a new GS indicates some level of safe length
(since we flush the data to DN before the GS bump). And when NN does the
recovery later, GS can help it determine which DataNodes should be included in
the recovery process.
Agree. bumpGS is necessary for choosing the working set. For example,
||idx0||idx1||idx2||idx3||idx4||idx5||idx6||idx7||idx8||
|10mb|20mb|30mb|90mb|90mb|89mb|90mb|89mb|89mb|
idx 0~2 are corrupted at different times. idx 3~8 are healthy. Ideally, we
truncate the last stripe so that all (healthy) internal blocks end up with 89mb.
Without bumping GS, assume idx 0~5 are reported first. If we truncate to 10mb,
we lose too much data. If we wait for idx 6~8, we can truncate to 89mb, but we
don't know whether idx 6~8 are permanently lost or just delayed.
With GS bumping, we have no such problem.
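Roughly what I mean by choosing the working set with the bumped GS (illustration only; ReportedReplica / chooseWorkingSet are made-up names, not actual NN code -- it just assumes each reported internal block carries the GS it was written with):
{code}
import java.util.ArrayList;
import java.util.List;

class ReportedReplica {
  final int blockIndex;       // idx 0~8 inside the block group
  final long numBytes;        // length reported by the DN
  final long generationStamp; // GS the replica was written with
  ReportedReplica(int idx, long len, long gs) {
    blockIndex = idx; numBytes = len; generationStamp = gs;
  }
}

class WorkingSetChooser {
  /** Keep only replicas written with the latest (bumped) GS. Replicas that
   *  failed before the bump (idx 0~2 in the example) still carry the old GS,
   *  so they are ruled out no matter when their reports arrive. */
  static List<ReportedReplica> chooseWorkingSet(
      List<ReportedReplica> reported, long latestGS) {
    List<ReportedReplica> working = new ArrayList<>();
    for (ReportedReplica r : reported) {
      if (r.generationStamp == latestGS) {
        working.add(r);
      }
    }
    return working;
  }
}
{code}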
bq. 6. We can have option 2: to sync/flush data periodically. Similarly to
QFS, we can flush the data out for every 1MB or n stripes. Or we can choose to
flush the data only when failures are detected.
writeMaxPackets=80, packetSize=64k, total ~= 5mb. write() blocks if the
dataQueue is congested. I think the delta length between healthy blocks is no
more than 10mb, so flushing every 1mb may not be necessary.
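Back-of-envelope check of the ~=5mb figure, just using the numbers above:
{code}
// Rough arithmetic only: with the queue/packet sizes quoted above, a streamer
// can have at most writeMaxPackets * packetSize bytes buffered before write()
// blocks, which bounds how far one internal block can lag behind the others.
public class BufferBound {
  public static void main(String[] args) {
    int writeMaxPackets = 80;          // from the comment above
    int packetSizeBytes = 64 * 1024;   // 64k
    long buffered = (long) writeMaxPackets * packetSizeBytes;
    System.out.println(buffered / (1024 * 1024) + " MB");   // prints 5 MB
  }
}
{code}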
For example,
||idx0||idx1||idx2||idx3||idx4||idx5||idx6||idx7||idx8||
|10mb|20mb|30mb|90mb|90mb|80mb|90mb|89mb|89mb|
idx 0~2 are corrupted at different times. idx 3~8 are healthy. idx 5 is stale.
We truncate to 80mb.
For example,
||idx0||idx1||idx2||idx3||idx4||idx5||idx6||idx7||idx8||
|10mb|20mb|89mb|90mb|90mb|80mb|90mb|89mb|89mb|
idx 0,1 are corrupted at different times. idx 2~8 are healthy. idx 5 is stale.
We discard idx 5 and truncate to 89mb.
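The rule both examples follow, assuming the RS-6-3 schema (numDataBlocks = 6), is simply "safe length = 6th-largest length among the healthy internal blocks", so that at least 6 blocks cover every stripe that is kept. Illustration only (SafeLength is a made-up class, stripe/cell alignment is ignored to keep it short):
{code}
import java.util.Arrays;

public class SafeLength {
  static final int NUM_DATA_BLOCKS = 6;   // RS-6-3

  static long safeLength(long[] healthyLengths) {
    if (healthyLengths.length < NUM_DATA_BLOCKS) {
      return -1;   // not enough replicas reported; recovery must wait/retry
    }
    long[] sorted = healthyLengths.clone();
    Arrays.sort(sorted);                               // ascending
    return sorted[sorted.length - NUM_DATA_BLOCKS];    // 6th-largest length
  }

  public static void main(String[] args) {
    long MB = 1024L * 1024;
    // First example: idx 3~8 healthy, idx 5 lagging at 80mb -> 80 (mb).
    System.out.println(
        safeLength(new long[]{90*MB, 90*MB, 80*MB, 90*MB, 89*MB, 89*MB}) / MB);
    // Second example: idx 2~8 healthy, idx 5 lagging at 80mb -> 89 (mb);
    // the 80mb replica simply falls below the cut, i.e. "discard idx 5".
    System.out.println(
        safeLength(new long[]{89*MB, 90*MB, 90*MB, 80*MB, 90*MB, 89*MB, 89*MB}) / MB);
  }
}
{code}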
BTW,
{code}
// BlockUnderConstructionFeature.java:164
public void initializeBlockRecovery(BlockInfo blockInfo, long recoveryId) {
  ...
  if (replicas == null || replicas.length == 0) {
    NameNode.blockStateChangeLog.warn("BLOCK*" +
        " BlockUnderConstructionFeature.initializeBlockRecovery:" +
        " No blocks found, lease removed.");
    // sets primary node index and return.
    primaryNodeIndex = -1;
    return;
  }
{code}
For a non-EC file, if not enough replicas are reported, we don't trigger block
recovery and don't close the file; lease recovery should retry later.
For an EC file, we should wait until 6 healthy replicas are reported before we
allow block recovery. This is why we need bumpGS (bumpGS is used to rule out
corrupt replicas). Ideally it's better to wait for 9 healthy replicas. So I
suggest increasing the soft limit for EC files to 3min (3x of the non-EC limit)
in order to allow enough time for reporting. (Well, 3min is of no use when the
whole cluster restarts, and neither is 1min for non-EC files.)
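What I'm proposing could look roughly like this check before the NN initializes recovery for a striped block (hypothetical names, not the actual BlockUnderConstructionFeature code):
{code}
// Illustration of the proposed gate: require that at least numDataBlocks
// replicas written with the latest (bumped) GS have been reported; otherwise
// don't start block recovery yet and let lease recovery retry later.
public class EcRecoveryGate {
  static final int NUM_DATA_BLOCKS = 6;   // RS-6-3

  static boolean canStartRecovery(long[] reportedGS, long latestGS) {
    int healthy = 0;
    for (long gs : reportedGS) {
      if (gs == latestGS) {
        healthy++;
      }
    }
    return healthy >= NUM_DATA_BLOCKS;
  }
}
{code}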
ClientProtocol.recoverLease() force-triggers lease recovery. It doesn't wait
for the 3min soft limit to expire, so it's likely that not all 9 are reported.
It's the user's call and we should allow it (assuming there are already 6
healthy replicas; if there are not 6 healthy replicas, recoverLease() returns
false and the user should retry). Append() should wait 3min.
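On the client side the retry would look roughly like this (standard DistributedFileSystem.recoverLease(); the interval and attempt count are arbitrary examples, not recommended values):
{code}
// recoverLease() returns false while the NN cannot (yet) close the file,
// so the caller polls until it succeeds or gives up.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoveryRetry {
  public static boolean recoverWithRetry(DistributedFileSystem dfs, Path file)
      throws Exception {
    for (int attempt = 0; attempt < 10; attempt++) {
      if (dfs.recoverLease(file)) {
        return true;          // NN closed the file; recovery finished
      }
      Thread.sleep(5000L);    // not enough healthy replicas yet; retry
    }
    return false;
  }
}
{code}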
*What if 6 healthy replicas are still not reported after retries?*
If the block is not committed, we discard the whole lastBlock. If it is
committed, it's a file-level corruption.
In short, bumpGS is useful for choosing the working set (healthy replicas).
It's not useful for calculating the safe length given a working set. (I think
that's what [~jingzhao] just said, if I understand correctly.)
We can discuss lease recovery in another jira, and commit this if we at least
agree that bumpGS is useful.
> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests
> to Coordinator)
> -------------------------------------------------------------------------------------------
>
> Key: HDFS-9040
> URL: https://issues.apache.org/jira/browse/HDFS-9040
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Walter Su
> Attachments: HDFS-9040-HDFS-7285.002.patch,
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch,
> HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with NN to allocate/update block, and
> StripedDataStreamer s only have to stream blocks to DNs.
> Proposal 2:
> See below the
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
> from [~jingzhao].
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)