[
https://issues.apache.org/jira/browse/HDFS-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909974#comment-13909974
]
Kihwal Lee commented on HDFS-5924:
----------------------------------
h3. Batch upgrades and upgrade speed
bq. Shutting down a datanode after previous shutdown datanode is up might be
able to minimize the write failure but would increase the total upgrade time.
The upgrade batch size has come up multiple times in past discussions. The
conclusion was that DN rolling upgrades should not be done in batches, in order
to minimize data loss. When upgrading a big cluster in the traditional way, a
number of DNs don't come back online, mainly due to latent hardware issues.
This results in missing blocks, and manual recovery becomes necessary. Most
cases are recoverable by fixing hardware and manually copying block files, but
not all; sometimes a number of blocks are permanently lost. Since the block
placement policy is not sophisticated enough to consider failure domains,
several simultaneous permanent node failures can cause this.
During DN rolling upgrades, admins should watch out for data availability
issues caused by failed restarts. Data loss from permanent failures can only be
avoided if only one or two nodes are upgraded at a time. Once the normal
failure rate of non-upgrading nodes is factored in, one soon realizes that
upgrading one node at a time is the preferred way.
The upgrade timing requirement (req-3) in the design doc was specified based on
the serial DN upgrade scenario for this reason.
Regarding upgrade speed, it largely depends on the number of blocks on each DN.
The restart of a DN with about half a million blocks across 4-6 volumes used to
take minutes. After a number of optimizations and HDFS-5498, it can come back
up in about 10 seconds.
h3. Write pipeline recovery during DN upgrades
There are three things a client can do when a node in the pipeline is being
upgraded/restarted:
# Exclude the node and continue: this is just like the regular pipeline
recovery. If the pipeline has a sufficient number of nodes, clients can do this.
Even without an OOB ack, the majority of writers will survive this way as long
as they don't hit double/triple failures combined with upgrades. But
single-replica writes will fail.
# Copy the block from the upgrading node to a new one and continue: this is a
variation of the "add additional node" path of the pipeline recovery. The
client asks the NN for an additional node and then issues a copy command with
the upgrading node as the source. This does not add any value unless there is
only one node in the pipeline. Writes with only one replica will fail if the
copy cannot be carried out.
# Wait until the node comes back: this is an alternative to the above approach.
It avoids extra data movement and maintains locality. I've talked to the HBase
folks and they said they would prefer this. Writes with only one replica will
fail if the restart does not happen in time. (A sketch of this decision logic
follows the list.)
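To make the trade-off concrete, here is a minimal sketch (in Java, but not the
actual DFSOutputStream code) of how a writer might choose among the three
options when it sees a restart OOB ack. The {{Action}} enum, {{chooseAction()}}
method, and its parameters are illustrative names only, not existing HDFS
identifiers.
{code:java}
// Minimal sketch of the client-side decision among options (1)-(3).
// All names here are illustrative; this is not the actual HDFS client code.
public class RestartRecoverySketch {

  enum Action { EXCLUDE_AND_CONTINUE, COPY_TO_NEW_NODE, WAIT_FOR_RESTART }

  /**
   * Decide how to react when one node in the pipeline has announced an
   * upgrade restart (e.g. via an OOB ack).
   *
   * @param pipelineSize       number of datanodes currently in the pipeline
   * @param restartingIsLocal  whether the restarting node is the local DN
   */
  static Action chooseAction(int pipelineSize, boolean restartingIsLocal) {
    if (pipelineSize > 1 && !restartingIsLocal) {
      // Option (1): enough replicas remain, treat it like a normal failure.
      return Action.EXCLUDE_AND_CONTINUE;
    }
    // Options (2)/(3) matter when the restarting node is the last one left,
    // or when it is local and we want to preserve locality. Waiting (3)
    // avoids extra data movement, so prefer it here.
    return Action.WAIT_FOR_RESTART;
  }
}
{code}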
If min_replica is set to 2 for a cluster, (1) may take care of the service
availability side of the issue, i.e. all writes succeed if nodes are upgraded
one by one. There are use cases, however, that require min_replica to be 1, so
we need a way to make single-replica writes survive upgrades.
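For reference, here is what raising that minimum to 2 might look like, assuming
min_replica corresponds to the namenode's minimum replication setting
({{dfs.namenode.replication.min}}). This is illustrative only, not a
recommendation to change the default.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class MinReplicationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A block is not considered complete until it has at least this many
    // replicas; 2 shields writers from losing their only replica.
    conf.setInt("dfs.namenode.replication.min", 2);
    System.out.println(conf.getInt("dfs.namenode.replication.min", 1));
  }
}
{code}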
A single-replica write may be a write that was initiated with one replica from
the beginning, or may be the result of node failures during the write. The
former is usually considered okay to fail, since users generally understand the
risk of a replication factor of 1 and make that choice. The more problematic
case is the latter: writes that started with more than one replica but
currently have only one due to node failures. I believe (2) or (3) can solve
most of these cases, but neither is ideal.
In theory, (2) sounds useful, but there are some difficulties in realizing it.
First, a DN has to stop accepting new requests, yet still accept copy requests.
This means DataXceiverServer's server socket cannot be closed until the copy
requests have been received. Since the knowledge about a specific client
(writer) is only known to each DataXceiver thread, and the copy command will
spawn another thread on the DN, coordinating this is not simple. Secondly, the
DN should be able to detect when it is safe to shut down; to get this right, it
has to be told by the clients. A slow client can also slow down the progress.
In the end, the extra coordination and dependencies are not exactly cheap.
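Roughly, the DN-side bookkeeping for (2) would look something like the sketch
below. None of these classes or methods exist in HDFS; the point is only to
show that new writes must be turned away while copy requests are still
admitted, and that the shutdown path has to wait, with some bound, for
outstanding copies to drain.
{code:java}
import java.util.concurrent.TimeUnit;

// Hypothetical gatekeeper a DN would need for option (2).
public class UpgradeCopyGate {
  private boolean draining = false;
  private int pendingCopies = 0;

  /** Called when the DN is told to shut down for an upgrade restart. */
  public synchronized void startDraining() {
    draining = true;
  }

  /** New write requests are refused once draining has begun. */
  public synchronized boolean admitNewWrite() {
    return !draining;
  }

  /** A copy request whose source is this upgrading node is still admitted. */
  public synchronized void copyStarted() {
    pendingCopies++;
  }

  public synchronized void copyFinished() {
    pendingCopies--;
    notifyAll();
  }

  /** Shutdown path: wait (bounded) for outstanding copies to finish. */
  public synchronized boolean awaitCopies(long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (pendingCopies > 0) {
      long remainingMs =
          TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs <= 0) {
        return false; // a slow client can hold up the restart indefinitely
      }
      wait(remainingMs);
    }
    return true;
  }
}
{code}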
The mechanism in (3) works fine with no additional run-time traffic overhead
and no dependencies on clients. Timing-wise it is also more deterministic. The
downside is that if the datanode does not come back up in time, outstanding
writes will time out and fail if the node was the only one left in the
pipeline. In addition to keeping writes going, this mechanism preserves
locality, so (2) cannot substitute for (3) completely.
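For illustration, the client-side wait in (3) boils down to something like the
following. The {{restartTimeoutMs}}/{{pollIntervalMs}} parameters and the
{{rebuildPipeline()}} hook are hypothetical stand-ins for whatever the client
implementation ends up using; they are not existing DFSOutputStream APIs.
{code:java}
import java.io.IOException;

// Sketch of option (3): on a restart OOB ack from the last node in the
// pipeline, sleep up to a configured bound, then try to rebuild the pipeline
// to the same datanode.
public class WaitForRestartSketch {

  interface PipelineRebuilder {
    boolean rebuildPipeline() throws IOException;
  }

  /** Returns true if the pipeline could be re-established in time. */
  static boolean waitForRestart(PipelineRebuilder rebuilder,
      long restartTimeoutMs, long pollIntervalMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + restartTimeoutMs;
    while (System.currentTimeMillis() < deadline) {
      Thread.sleep(pollIntervalMs);
      try {
        if (rebuilder.rebuildPipeline()) {
          return true; // DN came back; resume streaming to the same replica
        }
      } catch (IOException e) {
        // node still restarting; keep waiting until the deadline
      }
    }
    return false; // restart took too long; the write fails if no node is left
  }
}
{code}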
I suggest an additional change that keeps the mechanism simple while addressing
its deficiency.
h3. Suggested improvement
While using (3) for write pipeline recovery during upgrade restarts, we can
reduce the chance of permanent failures for writers whose replica count has
been reduced to one due to prior node failures. If a block write was started by
the user with no replication (i.e. a single replica), it is assumed that the
user understands the increased risk of read or write failures. Thus we focus on
the cases where the specified replication factor is greater than one.
The datanode replacement policy is defined in {{ReplaceDatanodeOnFailure}}. If
the default policy is modified to maintain at least two replicas for r > 1, the
chance of write failures for r > 1 during DN rolling upgrades will be greatly
reduced. Note that this does not incur additional replication traffic.
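As a rough sketch of the proposed condition (not the actual
{{ReplaceDatanodeOnFailure}} code, which defines its own policy types), the
check would be something like:
{code:java}
// Illustrative only: ask for a replacement datanode whenever a file written
// with replication r > 1 would otherwise be left with fewer than two replicas
// in the pipeline.
public class MaintainTwoReplicasSketch {

  /**
   * @param replication          replication factor requested for the file
   * @param liveNodesInPipeline  datanodes still in the write pipeline
   * @return true if a replacement datanode should be requested from the NN
   */
  static boolean shouldReplace(short replication, int liveNodesInPipeline) {
    if (replication <= 1) {
      // r == 1: the user accepted the single-replica risk; keep current behavior.
      return false;
    }
    // r > 1: never let the pipeline shrink below two replicas if avoidable.
    return liveNodesInPipeline <= 1;
  }
}
{code}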
If you think this proposal makes sense, I will file a separate jira to update
{{ReplaceDatanodeOnFailure}}.
> Utilize OOB upgrade message processing for writes
> -------------------------------------------------
>
> Key: HDFS-5924
> URL: https://issues.apache.org/jira/browse/HDFS-5924
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ha, hdfs-client, namenode
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-5924_RBW_RECOVERY.patch,
> HDFS-5924_RBW_RECOVERY.patch
>
>
> After HDFS-5585 and HDFS-5583, clients and datanodes can coordinate
> shutdown-restart in order to minimize failures or locality loss.
> In this jira, the HDFS client is made aware of the restart OOB ack and performs
> a special write pipeline recovery. The datanode is also modified to load marked
> RBW replicas as RBW instead of RWR, as long as the restart did not take long.
> Clients consider doing this kind of recovery only when there is only one node
> left in the pipeline or the restarting node is a local datanode.
> For both clients and datanodes, the timeout or expiration is configurable,
> meaning this feature can be turned off by setting timeout variables to 0.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)