[ https://issues.apache.org/jira/browse/HDFS-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13912153#comment-13912153 ]

Kihwal Lee commented on HDFS-5535:
----------------------------------

{quote}
[~andrew.wang] : Can you comment on how riding out DN restarts interacts with 
the HBase MTTR work? I know they've done a lot of work to reduce timeouts 
throughout the stack, and riding out restarts sounds like we need to keep the 
timeouts up. It might help to specify your target restart time, for example 
with a DN with 500k blocks.
{quote}

There are multiple factors that affect the DN upgrade latency. First, shutdown 
can take up to about 2 seconds. The time is bounded so that slow 
clients/network cannot delay the shutdown. That is, the shutdown notification 
(via the OOB acks) is advisory and may not reach all clients. HDFS does not 
provide an integrated tool for software/config upgrade & restart; the process 
will differ depending on how Hadoop and its configuration are deployed. 
Whatever the method is, it adds a bit of delay to the whole process.
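(For reference, this is the kind of restart the dfsadmin support added under 
this umbrella is meant to drive: something like {{hdfs dfsadmin 
-shutdownDatanode <DATANODE_HOST:IPC_PORT> upgrade}} to shut the DN down with 
the OOB notification, and {{hdfs dfsadmin -getDatanodeInfo 
<DATANODE_HOST:IPC_PORT>}} to check whether it has come back up.)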

The actual DN startup time varies depending on the number of blocks and the 
number of disks (which typically corresponds to the number of volumes). Before 
HDFS-5498, the DN would run "du" on each volume serially and also list and stat 
block and meta files serially. Now "du" does not have to run if the DN is 
restarted within 5 minutes of shutdown, and block/meta scans are done 
concurrently when multiple volumes are present. So the startup time will be 
shorter if blocks are spread across more disks. In a test carried out at the 
beginning of HDFS-5498, the restart time of a node shrank from over a minute to 
about 12 seconds. We could make restart even shorter by saving more state on 
shutdown.
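To illustrate the concurrency point (this is only a sketch with made-up names, 
not the actual DataNode volume-scanning code): each volume gets its own scan 
task, so the startup scan is bounded by the slowest volume rather than the sum 
over all volumes.

{code}
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only -- not the actual DataNode code. Scans each volume's block
// files in its own task so the total scan time is roughly that of the
// slowest volume, not the sum over all volumes.
public class ParallelVolumeScanSketch {
  static List<String> scanBlockAndMetaFiles(File volume) {
    List<String> names = new ArrayList<>();
    File[] files = volume.listFiles();
    if (files != null) {
      for (File f : files) {
        if (f.getName().startsWith("blk_")) {
          names.add(f.getName()); // list/stat each block (and meta) file
        }
      }
    }
    return names;
  }

  public static void main(String[] args) throws Exception {
    List<File> volumes = new ArrayList<>();
    for (String dir : args) {
      volumes.add(new File(dir)); // one argument per data directory (volume)
    }
    ExecutorService pool =
        Executors.newFixedThreadPool(Math.max(1, volumes.size()));
    List<Future<List<String>>> results = new ArrayList<>();
    for (File v : volumes) {
      results.add(pool.submit(() -> scanBlockAndMetaFiles(v)));
    }
    List<String> all = new ArrayList<>();
    for (Future<List<String>> r : results) {
      all.addAll(r.get()); // merge per-volume results into the block map
    }
    pool.shutdown();
    System.out.println("Scanned " + all.size() + " block files");
  }
}
{code}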

The final step before being able to serve clients is registration. If block 
tokens are in use, a DN cannot validate them until registration with the NN is 
complete. This usually happens immediately after scanning and adding blocks. We 
could save the shared secret before shutdown, but that would obviously be a 
security risk.

I will be more than happy to further improve the DN restart latency if it 
becomes a major hurdle for HBase.

{quote}
[~stack] :  + "3. Clients" It says "For shutdown, clients may choose to copy 
partially written replica to another node..." The DFSClient would do this 
internally (or 'may' do this)?
{quote}

This is done internally. It is what happens today for clients writing with 
#replicas >= 3 when node failures leave only (#replicas / 2) nodes in the 
pipeline. If the restarting node is not the local node and there is more than 
one node in the pipeline, the client will simply exclude the restarting node 
and, if necessary, add more nodes to the pipeline and copy the partially 
written data over to the new node. If the restarting node is the local node or 
is the only node in the pipeline, the client will wait for the node to restart. 
There is a configurable timeout for the wait.
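In other words, the decision is roughly the following (an illustrative sketch 
only, with made-up names; the real logic is in DFSOutputStream's pipeline 
recovery):

{code}
// Illustrative sketch only -- not the actual DFSOutputStream code.
public class PipelineRestartDecisionSketch {
  /**
   * Wait for the restarting DN only when it is the local node or the last
   * node left in the pipeline; otherwise exclude it and, if necessary, add
   * a replacement node and transfer the partial replica to it.
   */
  static boolean waitForRestart(boolean restartingNodeIsLocal, int nodesInPipeline) {
    return restartingNodeIsLocal || nodesInPipeline == 1;
  }

  public static void main(String[] args) {
    System.out.println(waitForRestart(false, 3)); // remote DN, 3-node pipeline -> exclude/replace
    System.out.println(waitForRestart(true, 2));  // local DN -> wait for it to come back
  }
}
{code}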

If the restarting node was the only node in the pipeline, the write will fail 
after the restart timeout. This is very unlikely for blocks with 3 or more 
replicas because of the DN replacement policy. Blocks being written with a 
replication factor of 2 can suffer, however, if the writer has already lost one 
node to a failure and then loses another to a failed restart. HDFS-6016 has 
been filed to address this, but it is not a blocker for the rolling upgrade 
feature.

Regarding the use of shortened read/write socket timeouts, I can see use cases 
that favor quickly failing over to a (possibly slower) node rather than waiting 
longer in the hope of a faster recovery on the same node. Reads will fail over 
to a different source during a DN restart. For writes, the restart timeout is 
independent of the socket write timeout, so the client may block for longer 
than the user wants. If that is undesirable, the configuration should be 
changed so that DFSOutputStream times out on a restart much more quickly. The 
default is currently 30 seconds.
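As an illustration (not from the design doc), a latency-sensitive writer could 
shorten that wait with something like the following, assuming the client-side 
key is {{dfs.client.datanode-restart.timeout}} and the value is in seconds:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: shorten the DFSClient's wait for a restarting DN.
// Assumes the client-side key is dfs.client.datanode-restart.timeout
// (seconds; default 30).
public class ShortRestartTimeoutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.client.datanode-restart.timeout", 5); // wait at most ~5s
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/tmp/restart-timeout-demo"));
    out.writeBytes("hello");
    out.close();
  }
}
{code}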

> Umbrella jira for improved HDFS rolling upgrades
> ------------------------------------------------
>
>                 Key: HDFS-5535
>                 URL: https://issues.apache.org/jira/browse/HDFS-5535
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, ha, hdfs-client, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Nathan Roberts
>         Attachments: HDFSRollingUpgradesHighLevelDesign.pdf, 
> h5535_20140219.patch, h5535_20140220-1554.patch, h5535_20140220b.patch, 
> h5535_20140221-2031.patch, h5535_20140224-1931.patch, 
> h5535_20140225-1225.patch
>
>
> In order to roll a new HDFS release through a large cluster quickly and 
> safely, a few enhancements are needed in HDFS. An initial High level design 
> document will be attached to this jira, and sub-jiras will itemize the 
> individual tasks.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
