[ https://issues.apache.org/jira/browse/HDFS-5535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913361#comment-13913361 ]

Colin Patrick McCabe commented on HDFS-5535:
--------------------------------------------

bq. Stack said: Chatting w/ Colin too, it sounds like SSR, if it fails a local 
read, will then retry the local read again after some number of minutes have 
elapsed.

Yeah.  The {{DomainSocketFactory}} has a blacklist of domain socket paths, but 
entries expire after 10 minutes.

bq. Kihwal said: If the local DN was added to deadNodes in a DFSInputStream 
because it was restarted, we may be able to (asynchronously?) probe and remove 
it from deadNodes.

Why not just have a time limit on the blacklist, like 15 minutes?  This would 
also help with the case where a DN is temporarily overloaded, gets added to 
the blacklist by a long-open file, and is never removed.  We should do the 
simple things first, and then perhaps move on to more complex schemes.
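A time-limited blacklist along those lines could be as simple as a map from socket path to expiry time.  This is just an illustrative sketch, not the actual {{DomainSocketFactory}} code; the class and method names here are hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a blacklist whose entries expire after a fixed
// window, so a DN that was restarting or temporarily overloaded gets
// retried automatically instead of being shunned forever.
public class ExpiringBlacklist {
  private final long expiryMs;
  // path -> wall-clock time (ms) at which the entry stops applying
  private final Map<String, Long> blacklistedUntil = new ConcurrentHashMap<>();

  public ExpiringBlacklist(long expiryMs) {
    this.expiryMs = expiryMs;
  }

  /** Blacklist a domain socket path as of nowMs. */
  public void add(String path, long nowMs) {
    blacklistedUntil.put(path, nowMs + expiryMs);
  }

  /** True if the path is still blacklisted; expired entries are purged. */
  public boolean contains(String path, long nowMs) {
    Long until = blacklistedUntil.get(path);
    if (until == null) {
      return false;
    }
    if (nowMs >= until) {
      blacklistedUntil.remove(path, until);  // expired, drop the entry
      return false;
    }
    return true;
  }
}
```

With a 15-minute window, a path blacklisted at time t would be retried on the first access after t + 15 min, with no probing thread needed.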

If you really want to get fancy, you could have a separate daemon running on 
the DN which stays up for the duration of the upgrade and tells clients, when 
asked, what the status of the upgrade is.  But that seems like a big project, 
when we haven't even done the simple things, like sharing information about 
deadNodes between different DFSInputStreams in the same client.

> Umbrella jira for improved HDFS rolling upgrades
> ------------------------------------------------
>
>                 Key: HDFS-5535
>                 URL: https://issues.apache.org/jira/browse/HDFS-5535
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, ha, hdfs-client, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Nathan Roberts
>         Attachments: HDFSRollingUpgradesHighLevelDesign.pdf, 
> h5535_20140219.patch, h5535_20140220-1554.patch, h5535_20140220b.patch, 
> h5535_20140221-2031.patch, h5535_20140224-1931.patch, 
> h5535_20140225-1225.patch
>
>
> In order to roll a new HDFS release through a large cluster quickly and 
> safely, a few enhancements are needed in HDFS. An initial High level design 
> document will be attached to this jira, and sub-jiras will itemize the 
> individual tasks.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
