[ 
https://issues.apache.org/jira/browse/HDFS-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903889#comment-14903889
 ] 

Walter Su commented on HDFS-9040:
---------------------------------

bq. 1. Flush out all the enqueued data to DataNodes before handling failures 
and bumping GS.
Great. It's much simpler. In checkStreamerFailures(boolean toClose), you call 
flushAllInternals anyway before you start handling failures, and it doesn't 
hurt to flush twice. So {{toClose}} is unnecessary?
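For illustration, dropping the flag could look like the sketch below. Only the method names {{checkStreamerFailures}}/{{flushAllInternals}} come from the patch; the bodies are stubs, not the real implementation:

```java
// Hypothetical sketch: checkStreamerFailures() without the toClose flag.
// Since flushAllInternals() always runs before failure handling, the
// caller no longer needs to distinguish the close path; an extra flush
// is harmless.
class StreamerFailureSketch {
  int flushCount = 0;                 // counts flushes, for demonstration only

  void flushAllInternals() { flushCount++; }
  boolean hasFailedStreamers() { return false; }        // stub
  void handleFailures() { /* bump GS, replace failed streamers, ... */ }

  void checkStreamerFailures() {      // no boolean toClose parameter
    flushAllInternals();              // flushing twice doesn't hurt
    if (hasFailedStreamers()) {
      handleFailures();
    }
  }
}
```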
bq. 3. During the test I found that some data streamer may take a long time to 
close/create datanode connections. This may cause other streamers' connections 
timeout. Thus the new patch adds an upper bound for the total waiting time of 
creating datanode connections during failure handling.
bq. +   && remaingTime > waitInterval * 2) {
This approach isn't good enough. {{socketTimeout}} is 6s by default, but here 
you wait at most 4s. Remember that you just called flushAllInternals(). When 
dataQueue.size()==0, a healthy streamer can sleep for up to 
{{halfSocketTimeout}}, i.e. 3s. So you leave that streamer only 1s to create 
the blockStream and offer to updateStreamerMap. If it doesn't finish within 
1s, you kill it.
I think we should notify every dataQueue to wake up the streamers after 
markExternalErrorOnStreamers(), so every streamer has the full 4s. It would be 
even better if streamers sent heartbeat packets while waiting for the other 
streamers, but that's too hard.
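To make the arithmetic concrete, here is a sketch of the budget and the proposed wake-up. All class and method names are illustrative, and the 1s {{waitInterval}} is an assumption inferred from the 6s/4s numbers above:

```java
import java.util.List;

// Hypothetical sketch: wake all streamers after marking the external
// error, so each one gets the full remaining budget instead of losing
// up to halfSocketTimeout sleeping on an empty dataQueue.
class WakeStreamersSketch {
  static final long SOCKET_TIMEOUT_MS = 6000;   // dfs default: 6s
  static final long WAIT_INTERVAL_MS = 1000;    // assumed: 1s

  // Budget left for a streamer that first slept halfSocketTimeout:
  static long budgetWithoutNotifyMs() {
    long maxWait = SOCKET_TIMEOUT_MS - 2 * WAIT_INTERVAL_MS;  // 4s cap
    return maxWait - SOCKET_TIMEOUT_MS / 2;                   // 4s - 3s = 1s
  }

  // With an immediate wake-up, the streamer keeps the whole 4s:
  static long budgetWithNotifyMs() {
    return SOCKET_TIMEOUT_MS - 2 * WAIT_INTERVAL_MS;
  }

  // The wake-up itself: notify each streamer's dataQueue monitor right
  // after markExternalErrorOnStreamers(). Plain Objects stand in for
  // the real per-streamer dataQueue lists.
  static void wakeAll(List<Object> dataQueues) {
    for (Object q : dataQueues) {
      synchronized (q) {
        q.notifyAll();  // wakes streamers blocked in dataQueue.wait()
      }
    }
  }
}
```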
bq. 2. Instead of letting each DataStreamer write its own last empty packet of 
the block, we do it at the StripedOutputStream level so that we can still bump 
GS for failure handling before some streamers close their internal blocks.
{code}
        if (shouldEndBlockGroup()) {
          for (int i = 0; i < numAllBlocks; i++) {
            final StripedDataStreamer s = setCurrentStreamer(i);
            if (s.isHealthy()) {
              endBlock();
            }
          }
        }
{code}
The logic looks good. Before we have a solution for PIPELINE_CLOSE_RECOVERY, 
should we catch the exception thrown by endBlock() and ignore it?
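Concretely, the guard could look like the sketch below. The stub bodies and the simulated failure are invented for illustration; whether swallowing the exception is acceptable is exactly the open question:

```java
// Hypothetical sketch: tolerate an endBlock() failure on one streamer
// until PIPELINE_CLOSE_RECOVERY is supported, so the remaining healthy
// streamers still end their blocks.
class EndBlockSketch {
  int ended = 0;                      // demonstration counter
  final int numAllBlocks = 3;

  boolean shouldEndBlockGroup() { return true; }
  boolean isHealthy(int i) { return true; }       // stub: all look healthy

  void endBlock(int i) {
    if (i == 1) {                     // simulate a close-time failure
      throw new RuntimeException("simulated close failure");
    }
    ended++;
  }

  void endBlockGroup() {
    if (shouldEndBlockGroup()) {
      for (int i = 0; i < numAllBlocks; i++) {
        try {
          if (isHealthy(i)) {
            endBlock(i);
          }
        } catch (RuntimeException e) {
          // Ignore until PIPELINE_CLOSE_RECOVERY exists; the failed
          // streamer is left to the external-error handling path.
        }
      }
    }
  }
}
```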

> Erasure coding: Refactor DFSStripedOutputStream (Move Namenode RPC Requests 
> to Coordinator)
> -------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9040
>                 URL: https://issues.apache.org/jira/browse/HDFS-9040
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Walter Su
>            Assignee: Jing Zhao
>         Attachments: HDFS-9040-HDFS-7285.002.patch, 
> HDFS-9040-HDFS-7285.003.patch, HDFS-9040-HDFS-7285.004.patch, 
> HDFS-9040.00.patch, HDFS-9040.001.wip.patch, HDFS-9040.02.bgstreamer.patch
>
>
> The general idea is to simplify error handling logic.
> Proposal 1:
> A BlockGroupDataStreamer to communicate with NN to allocate/update block, and 
> StripedDataStreamer s only have to stream blocks to DNs.
> Proposal 2:
> See below the 
> [comment|https://issues.apache.org/jira/browse/HDFS-9040?focusedCommentId=14741388&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14741388]
>  from [~jingzhao].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
