[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14900536#comment-14900536
 ] 

Walter Su commented on HDFS-9040:
---------------------------------


bq. 5. Another issue is, when NN restarts and receives block reports from DN, 
it's hard for it to determine when to start the recovery. It is possible that 
it determines the safe length too early (e.g., based on 6/9 reported internal 
blocks) and truncates too much data...
bq. 6.....And GS bump can come into the picture to help us simplify the 
recovery: we can guarantee that a new GS indicates some level of safe length 
(since we flush the data to DN before the GS bump). And when NN does the 
recovery later, GS can help it determine which DataNodes should be included in 
the recovery process.
Agree. bumpGS is necessary for choosing the working set. For example,
||idx0||idx1||idx2||idx3||idx4||idx5||idx6||idx7||idx8||
|10mb|20mb|30mb|90mb|90mb|89mb|90mb|89mb|89mb|
idx 0~2 were corrupted at different times; idx 3~8 are healthy. Ideally, we 
truncate the last stripe so that all (healthy) internal blocks have 89mb.
Without bumping GS, assume idx 0~5 are reported first. If we truncate to 10mb 
we lose too much data. If we wait for idx 6~8 we can truncate to 89mb, but we 
don't know whether idx 6~8 are permanently lost or just delayed.
With bumping GS, we have no such problem: replicas carrying an old GS can be 
ruled out immediately.
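As a hedged sketch (illustrative only, not HDFS code; class and method names are invented), the working set the NN would pick is exactly the reported replicas carrying the latest GS, regardless of arrival order:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not actual HDFS code: choose the working set as the
// reported internal blocks that carry the latest generation stamp.
public class WorkingSetSketch {

  /** Returns the indices of replicas whose GS equals the latest reported GS. */
  static List<Integer> workingSet(long[] gs) {
    long latest = Long.MIN_VALUE;
    for (long g : gs) {
      latest = Math.max(latest, g);
    }
    List<Integer> healthy = new ArrayList<>();
    for (int i = 0; i < gs.length; i++) {
      if (gs[i] == latest) { // replicas corrupted before the bump kept an old GS
        healthy.add(i);
      }
    }
    return healthy;
  }

  public static void main(String[] args) {
    // Table above: idx 0~2 corrupted before the last GS bump, idx 3~8 healthy.
    long[] gs = {1, 1, 1, 2, 2, 2, 2, 2, 2};
    System.out.println(workingSet(gs)); // [3, 4, 5, 6, 7, 8]
  }
}
```

Even if idx 0~5 happen to report first, idx 0~2 never enter the working set, so the NN cannot pick 10mb by mistake.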


bq. 6. We can have option 2: to sync/flush data periodically. Similarly to 
QFS, we can flush the data out for every 1MB or n stripes. Or we can choose to 
flush the data only when failures are detected. 
writeMaxPackets=80 and packetSize=64k, so in-flight data totals ~5mb (80 x 
64k); write() blocks if the dataQueue is congested. I think the delta length 
between healthy blocks is no more than 10mb, so flushing every 1mb may not be 
necessary.
For example, 
||idx0||idx1||idx2||idx3||idx4||idx5||idx6||idx7||idx8||
|10mb|20mb|30mb|90mb|90mb|80mb|90mb|89mb|89mb|
idx 0~2 are corrupted at different times. idx 3~8 are healthy. idx 5 is stale. 
We truncate to 80mb.


For example, 
||idx0||idx1||idx2||idx3||idx4||idx5||idx6||idx7||idx8||
|10mb|20mb|89mb|90mb|90mb|80mb|90mb|89mb|89mb|
idx 0,1 are corrupted at different times. idx 2~8 are healthy. idx 5 is stale. 
We dispose idx 5 and truncate to 89mb.
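A hedged sketch of the safe-length choice the three tables above suggest (assuming the RS-6-3 schema, so 6 data units; names are invented, this is not HDFS code): among the working set chosen by the bumped GS, the safe length is the 6th-largest reported length, which reproduces 89mb, 80mb and 89mb for the three examples:

```java
import java.util.Arrays;

// Illustrative sketch, not HDFS code: with RS(6,3) we need 6 internal blocks
// to reconstruct, so the safe length is the 6th-largest length among the
// healthy (latest-GS) replicas -- the longest prefix at least 6 can serve.
public class SafeLengthSketch {
  static final int DATA_UNITS = 6; // assumption: RS(6,3) schema

  /** Takes lengths of the working-set replicas; returns -1 if too few. */
  static long safeLength(long[] healthyLengths) {
    if (healthyLengths.length < DATA_UNITS) {
      return -1; // not enough healthy replicas reported yet; retry later
    }
    long[] sorted = healthyLengths.clone();
    Arrays.sort(sorted); // ascending
    return sorted[sorted.length - DATA_UNITS]; // 6th-largest length
  }

  public static void main(String[] args) {
    // 2nd table: idx 3~8 healthy, idx 5 stale at 80mb -> truncate to 80mb
    System.out.println(safeLength(new long[]{90, 90, 80, 90, 89, 89})); // 80
    // 3rd table: idx 2~8 healthy, idx 5 stale -> dispose idx 5, take 89mb
    System.out.println(safeLength(new long[]{89, 90, 90, 80, 90, 89, 89})); // 89
  }
}
```

With exactly 6 healthy replicas the stale one is the minimum and must be kept (80mb); with 7 or more, the shortest can be disposed of, as in the third table.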

BTW,
{code}
  // BlockUnderConstructionFeature.java:164
  public void initializeBlockRecovery(BlockInfo blockInfo, long recoveryId) {...
    if (replicas == null || replicas.length == 0) {
      NameNode.blockStateChangeLog.warn("BLOCK*" +
          " BlockUnderConstructionFeature.initializeBlockRecovery:" +
          " No blocks found, lease removed.");
      // sets primary node index and returns.
      primaryNodeIndex = -1;
      return;
    }
{code}
For a non-EC file, if not enough replicas are reported, we don't trigger block 
recovery and don't close the file; lease recovery should retry later.
For an EC file, we should wait until 6 healthy replicas are reported before we 
allow block recovery. This is why we need bumpGS (bumpGS is used to rule out 
corrupt replicas). Ideally it's better to wait until 9 healthy replicas are 
reported. So I suggest increasing the soft limit for EC files to 3min (3x that 
of non-EC files) in order to allow enough time for reporting. (Well, 3min is 
of no use when the cluster restarts, but neither is 1min for non-EC files.)
ClientProtocol.recoverLease() force-triggers lease recovery. It doesn't wait 
for the 3min soft limit to expire, so it's likely that not all 9 replicas have 
reported. It's the user's decision and we should allow it (assuming there are 
already 6 healthy replicas; if there are not, recoverLease() returns false and 
the user should retry). append() should wait the full 3min.

*What if no 6 healthy replicas are reported even after retrying?*
If the block is not committed, we dispose the whole lastBlock. If it is 
committed, it's a file-level corruption.
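The policy above can be sketched as a small decision function (a hedged illustration of this comment's proposal; the enum and names are invented, not HDFS code):

```java
// Illustrative sketch of the recovery policy proposed in this comment;
// class, enum and method names are invented for illustration.
public class RecoveryDecisionSketch {
  enum Decision { START_RECOVERY, RETRY_LATER, DISPOSE_LAST_BLOCK, FILE_CORRUPT }

  static final int MIN_HEALTHY = 6; // RS(6,3): need 6 healthy internal blocks

  static Decision decide(int healthyReported, boolean retriesExhausted,
                         boolean blockCommitted) {
    if (healthyReported >= MIN_HEALTHY) {
      return Decision.START_RECOVERY;   // enough to compute a safe length
    }
    if (!retriesExhausted) {
      return Decision.RETRY_LATER;      // wait for more block reports
    }
    // still fewer than 6 healthy replicas after all retries
    return blockCommitted ? Decision.FILE_CORRUPT : Decision.DISPOSE_LAST_BLOCK;
  }

  public static void main(String[] args) {
    System.out.println(decide(7, false, false)); // START_RECOVERY
    System.out.println(decide(5, false, false)); // RETRY_LATER
    System.out.println(decide(5, true, false));  // DISPOSE_LAST_BLOCK
    System.out.println(decide(5, true, true));   // FILE_CORRUPT
  }
}
```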

In short, bumpGS is useful for choosing the working set (healthy replicas). It 
is not useful for calculating the safe length once the working set is given. 
(I think that is what [~jingzhao] said, if I understand correctly.)

We can discuss lease recovery in another jira, and commit this if we at least 
agree that bumpGS is useful.

> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests 
> to Coordinator)
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9040
>                 URL: https://issues.apache.org/jira/browse/HDFS-9040
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>         Attachments: HDFS-9040-HDFS-7285.002.patch, 
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040.00.patch, HDFS-9040.001.wip.patch, 
> HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with NN to allocate/update block, and 
> StripedDataStreamer s only have to stream blocks to DNs.
> Proposal 2:
> See below the 
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
>  from [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
