[
https://issues.apache.org/jira/browse/HDFS-5924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909974#comment-13909974
]
Kihwal Lee commented on HDFS-5924:
----------------------------------
h3. Batch upgrades and upgrade speed
bq. Shutting down a datanode after previous shutdown datanode is up might be
able to minimize the write failure but would increase the total upgrade time.
The upgrade batch size has come up multiple times in past discussions. The
conclusion was that DN rolling upgrades should not be done in batches, in order
to minimize data loss. When upgrading a big cluster in the traditional way, a
number of DNs don't come back online, mainly due to latent hardware issues.
This results in missing blocks, and manual recovery becomes necessary. Most
cases are recoverable by fixing hardware and manually copying block files, but
not all; sometimes a number of blocks are permanently lost. Since the block
placement policy is not sophisticated enough to consider failure domains,
several simultaneous permanent node failures can cause this.
During DN rolling upgrades, admins should watch out for data availability
issues caused by failed restarts. Data loss from permanent failures can only be
avoided if only one or two nodes are upgraded at a time. Once the normal
failure rate of non-upgrading nodes is factored in, one soon realizes that
upgrading one node at a time is the preferred way.
The upgrade timing requirement (req-3) in the design doc was specified based on
the serial DN upgrade scenario for this reason.
Regarding upgrade speed, it largely depends on the number of blocks on each DN.
The restart of a DN with about half a million blocks across 4-6 volumes used to
take minutes. After a number of optimizations and HDFS-5498, it can come back
up in about 10 seconds.
h3. Write pipeline recovery during DN upgrades
There are three things a client can do when a node in the pipeline is being
upgraded/restarted:
# Exclude the node and continue: this is just like the regular pipeline
recovery. If the pipeline has a sufficient number of nodes, clients can do this.
Even without an OOB ack, the majority of writers will survive this way as long
as they don't hit double/triple failures combined with upgrades. But
single-replica writes will fail.
# Copy the block from the upgrading node to a new one and continue: this is a
variation of the "add additional node" path of the pipeline recovery. The
client asks the NN for an additional node and then issues a copy command with
the upgrading node as the source. This does not add any value unless there is
only one node in the pipeline. Writes with only one replica will fail if the
copy cannot be carried out.
# Wait until the node comes back: this is an alternative to the above approach.
It avoids extra data movement and maintains locality. I've talked to the HBase
folks and they said they would prefer this. Writes with only one replica will
fail if the restart does not happen in time. (A sketch of this decision logic
follows the list.)
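To make the trade-off concrete, here is a minimal sketch (in Java, but not the
actual DFSOutputStream code) of how a writer might choose among the three
options when it sees a restart OOB ack. The {{Action}} enum, {{chooseAction()}}
method, and its parameters are illustrative names only, not existing HDFS
identifiers.
{code:java}
// Minimal sketch of the client-side decision among options (1)-(3).
// All names here are illustrative; this is not the actual HDFS client code.
public class RestartRecoverySketch {

  enum Action { EXCLUDE_AND_CONTINUE, COPY_TO_NEW_NODE, WAIT_FOR_RESTART }

  /**
   * Decide how to react when one node in the pipeline has announced an
   * upgrade restart (e.g. via an OOB ack).
   *
   * @param pipelineSize       number of datanodes currently in the pipeline
   * @param restartingIsLocal  whether the restarting node is the local DN
   */
  static Action chooseAction(int pipelineSize, boolean restartingIsLocal) {
    if (pipelineSize > 1 && !restartingIsLocal) {
      // Option (1): enough replicas remain, treat it like a normal failure.
      return Action.EXCLUDE_AND_CONTINUE;
    }
    // Options (2)/(3) matter when the restarting node is the last one left,
    // or when it is local and we want to preserve locality. Waiting (3)
    // avoids extra data movement, so prefer it here.
    return Action.WAIT_FOR_RESTART;
  }
}
{code}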
If min_replica is set to 2 for a cluster, (1) may take care of the service
availability side of the issue, i.e. all writes succeed if nodes are upgraded
one by one. There are use cases, however, that require min_replica to be 1, so
we need a way to make single-replica writes survive upgrades.
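For reference, here is what raising that minimum to 2 might look like, assuming
min_replica corresponds to the namenode's minimum replication setting
({{dfs.namenode.replication.min}}). This is illustrative only, not a
recommendation to change the default.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class MinReplicationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // A block is not considered complete until it has at least this many
    // replicas; 2 shields writers from losing their only replica.
    conf.setInt("dfs.namenode.replication.min", 2);
    System.out.println(conf.getInt("dfs.namenode.replication.min", 1));
  }
}
{code}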
A single-replica write may be a write that was initiated with one replica from
the beginning, or may be the result of node failures during the write. The
former is usually considered okay to fail, since users generally understand the
risk of a replication factor of 1 and make that choice. The more problematic
case is the latter: writes that started with more than one replica but
currently have only one due to node failures. I believe (2) or (3) can solve
most of these cases, but neither is ideal.
In theory, (2) sounds useful, but there are some difficulties in realizing it.
First, a DN has to stop accepting new requests, yet still accept copy requests.
This means DataXceiverServer's server socket cannot be closed until the copy
requests have been received. Since the knowledge about a specific client
(writer) is only known to each DataXceiver thread, and the copy command will
spawn another thread on the DN, coordinating this is not simple. Secondly, the
DN should be able to detect when it is safe to shut down; to get this right, it
has to be told by the clients. A slow client can also slow down the progress.
In the end, the extra coordination and dependencies are not exactly cheap.
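Roughly, the DN-side bookkeeping for (2) would look something like the sketch
below. None of these classes or methods exist in HDFS; the point is only to
show that new writes must be turned away while copy requests are still
admitted, and that the shutdown path has to wait, with some bound, for
outstanding copies to drain.
{code:java}
import java.util.concurrent.TimeUnit;

// Hypothetical gatekeeper a DN would need for option (2).
public class UpgradeCopyGate {
  private boolean draining = false;
  private int pendingCopies = 0;

  /** Called when the DN is told to shut down for an upgrade restart. */
  public synchronized void startDraining() {
    draining = true;
  }

  /** New write requests are refused once draining has begun. */
  public synchronized boolean admitNewWrite() {
    return !draining;
  }

  /** A copy request whose source is this upgrading node is still admitted. */
  public synchronized void copyStarted() {
    pendingCopies++;
  }

  public synchronized void copyFinished() {
    pendingCopies--;
    notifyAll();
  }

  /** Shutdown path: wait (bounded) for outstanding copies to finish. */
  public synchronized boolean awaitCopies(long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (pendingCopies > 0) {
      long remainingMs =
          TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
      if (remainingMs <= 0) {
        return false; // a slow client can hold up the restart indefinitely
      }
      wait(remainingMs);
    }
    return true;
  }
}
{code}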
The mechanism in (3) works fine with no additional run-time traffic overhead
and no dependencies on clients. Timing-wise it is also more deterministic. The
downside is that if the datanode does not come back up in time, outstanding
writes will time out and fail if the node was the only one left in the
pipeline. In addition to keeping writes going, this mechanism preserves
locality, so (2) cannot substitute for (3) completely.
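For illustration, the client-side wait in (3) boils down to something like the
following. The {{restartTimeoutMs}}/{{pollIntervalMs}} parameters and the
{{rebuildPipeline()}} hook are hypothetical stand-ins for whatever the client
implementation ends up using; they are not existing DFSOutputStream APIs.
{code:java}
import java.io.IOException;

// Sketch of option (3): on a restart OOB ack from the last node in the
// pipeline, sleep up to a configured bound, then try to rebuild the pipeline
// to the same datanode.
public class WaitForRestartSketch {

  interface PipelineRebuilder {
    boolean rebuildPipeline() throws IOException;
  }

  /** Returns true if the pipeline could be re-established in time. */
  static boolean waitForRestart(PipelineRebuilder rebuilder,
      long restartTimeoutMs, long pollIntervalMs) throws InterruptedException {
    long deadline = System.currentTimeMillis() + restartTimeoutMs;
    while (System.currentTimeMillis() < deadline) {
      Thread.sleep(pollIntervalMs);
      try {
        if (rebuilder.rebuildPipeline()) {
          return true; // DN came back; resume streaming to the same replica
        }
      } catch (IOException e) {
        // node still restarting; keep waiting until the deadline
      }
    }
    return false; // restart took too long; the write fails if no node is left
  }
}
{code}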
I suggest an additional change that keeps the mechanism simple while addressing
its deficiency.
h3. Suggested improvement
While using (3) for write pipeline recovery during upgrade restarts, we can
reduce the chance of permanent failures for writers whose replica count has
been reduced to one due to prior node failures. If a block write was started by
the user with no replication (i.e. a single replica), it is assumed that the
user understands the increased risk of read or write failures. Thus we focus on
the cases where the specified replication factor is greater than one.
The datanode replacement policy is defined in {{ReplaceDatanodeOnFailure}}. If
the default policy is modified to maintain at least two replicas for r > 1, the
chance of write failures for r > 1 during DN rolling upgrades will be greatly
reduced. Note that this does not incur additional replication traffic.
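As a rough sketch of the proposed condition (not the actual
{{ReplaceDatanodeOnFailure}} code, which defines its own policy types), the
check would be something like:
{code:java}
// Illustrative only: ask for a replacement datanode whenever a file written
// with replication r > 1 would otherwise be left with fewer than two replicas
// in the pipeline.
public class MaintainTwoReplicasSketch {

  /**
   * @param replication          replication factor requested for the file
   * @param liveNodesInPipeline  datanodes still in the write pipeline
   * @return true if a replacement datanode should be requested from the NN
   */
  static boolean shouldReplace(short replication, int liveNodesInPipeline) {
    if (replication <= 1) {
      // r == 1: the user accepted the single-replica risk; keep current behavior.
      return false;
    }
    // r > 1: never let the pipeline shrink below two replicas if avoidable.
    return liveNodesInPipeline <= 1;
  }
}
{code}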
If you think this proposal makes sense, I will file a separate jira to update
{{ReplaceDatanodeOnFailure}}.
> Utilize OOB upgrade message processing for writes
> -------------------------------------------------
>
> Key: HDFS-5924
> URL: https://issues.apache.org/jira/browse/HDFS-5924
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, ha, hdfs-client, namenode
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-5924_RBW_RECOVERY.patch,
> HDFS-5924_RBW_RECOVERY.patch
>
>
> After HDFS-5585 and HDFS-5583, clients and datanodes can coordinate
> shutdown-restart in order to minimize failures or locality loss.
> In this jira, the HDFS client is made aware of the restart OOB ack and performs
> a special write pipeline recovery. The datanode is also modified to load marked
> RBW replicas as RBW instead of RWR, as long as the restart did not take long.
> Clients consider doing this kind of recovery only when there is only one node
> left in the pipeline or the restarting node is a local datanode.
> For both clients and datanodes, the timeout or expiration is configurable,
> meaning this feature can be turned off by setting timeout variables to 0.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)