[ 
https://issues.apache.org/jira/browse/HDFS-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13667257#comment-13667257
 ] 

Nicolas Liochon commented on HDFS-4754:
---------------------------------------

bq. Should markStale return an int signifying what the max value is for the 
stale duration as well (that way the client can adjust itself if needed)? 
Not a major one though.
Ok, will do.

bq.  1. The exception msg in DFSClient could be improved a bit 
ok, will do.

bq.  2. On isStale(ConcurrentMap, long), you should update the javadoc to reflect 
the new changes in the method implementation. isStale probably should belong to 
DatanodeManager. See if it makes sense to you.
Ok, will do.

bq.  3. May not be immediately required, but at some point, we should probably 
lump all the stuff to do with "stale" in a class and pass that around in the 
methods (like in BlockPlacement* classes). Would ease readability.
Agreed. We could do it in another jira.

bq.  5. Just wondering - if the NameNode's RPC queue is long, and getting to 
the RPC for markStale takes long, the DNs would be marked stale in a different 
window of time than the one the client originally intended. We could fix this 
by having synced times in the cluster and passing client's view of the current 
time to the namenode; the namenode could make some corrections in the duration 
just before marking the datanode stale...
bq.  6. You say that after the desired duration for remaining stale the 
namenode would rely on its view whether a datanode is stale or not. I am 
wondering if we should cap the max duration for the user controlled stale state 
to be the configured value of the namenode's configured interval for staleness 
based on heartbeat (and not have the new configuration you introduced) .. 
I need a configuration, as Aaron prefers that the feature can be disabled.
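
For reference, the correction suggested in point 5 could be sketched roughly as below. This is only an illustration, not part of the patch: the class name, the method name and the extra clientSendTimeMs parameter are all hypothetical, and the arithmetic is only meaningful if the client and namenode clocks are synchronised.

```java
/** Hypothetical sketch of the point 5 correction; not part of the HDFS-4754 patch. */
final class StaleDurationCorrection {

    private StaleDurationCorrection() {
    }

    /**
     * Returns how much of the requested stale window is left once the RPC
     * reaches the namenode, assuming synchronised clocks.
     *
     * @param requestedDurationMs duration the client asked for
     * @param clientSendTimeMs    client's clock when it sent the call (assumed extra parameter)
     * @param nameNodeNowMs       namenode's clock when it processes the call
     */
    static long remainingDurationMs(long requestedDurationMs,
                                    long clientSendTimeMs,
                                    long nameNodeNowMs) {
        // Time the call spent in flight and in the RPC queue. A negative value
        // would mean clock skew, so it is clamped to zero.
        long queuedMs = Math.max(0L, nameNodeNowMs - clientSendTimeMs);
        // Never return a negative window: the mark may already have expired.
        return Math.max(0L, requestedDurationMs - queuedMs);
    }
}
```

With a 20s request that waited 3s in the queue, the namenode would mark the datanode stale for the remaining 17s only.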

For the duration, that's the hard question. I chose this interface for two 
reasons:
1) It makes the code easier to test and to understand: there are no 
synchronisation issues.
2) I was thinking about scenarios of a machine being shut down more or less 
cleanly: the processes will be killed in a random order. So we could have the DN 
still responding to the heartbeat for a few seconds when we make the call to 
markStale.

That's why I've chosen this option. I still prefer it to the other, but it's 
not "over my dead body", just a slight preference for this trade-off vs. the 
other.
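
To make the duration-based option concrete, here is a minimal sketch of the semantics discussed above. It is not the code in 4754.v1.patch: the class StaleOverrides, its fields and the explicit nowMs parameters are hypothetical and only there to keep the example testable. It assumes the window starts when the namenode processes the call, that a configured cap of 0 disables the feature, and that markStale returns the duration actually applied, which is one possible reading of the "return the max value" suggestion.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of a time-bounded stale mark; not the HDFS-4754 patch. */
final class StaleOverrides {
    // Assumed configuration: maximum client-requested window, 0 disables the feature.
    private final long maxDurationMs;
    // Heartbeat-based staleness interval (the issue mentions a 30s default).
    private final long heartbeatStaleMs;
    // "ip:port" -> time (ms) until which the node is forced stale.
    private final ConcurrentMap<String, Long> staleUntilMs = new ConcurrentHashMap<>();

    StaleOverrides(long maxDurationMs, long heartbeatStaleMs) {
        this.maxDurationMs = maxDurationMs;
        this.heartbeatStaleMs = heartbeatStaleMs;
    }

    /** Marks the node stale from nowMs on; returns the capped duration actually applied. */
    long markStale(String ipAddress, int port, long durationInMs, long nowMs) {
        long applied = Math.max(0L, Math.min(durationInMs, maxDurationMs));
        if (applied > 0) {
            staleUntilMs.put(ipAddress + ":" + port, nowMs + applied);
        }
        return applied;
    }

    /** Stale if inside a client-requested window, otherwise decided by the heartbeat only. */
    boolean isStale(String ipAddress, int port, long lastHeartbeatMs, long nowMs) {
        String key = ipAddress + ":" + port;
        Long until = staleUntilMs.get(key);
        if (until != null) {
            if (nowMs < until) {
                return true; // still inside the requested window, even if heartbeats arrive
            }
            staleUntilMs.remove(key, until); // window expired: forget the mark
        }
        return nowMs - lastHeartbeatMs > heartbeatStaleMs;
    }

    public static void main(String[] args) {
        StaleOverrides overrides = new StaleOverrides(20_000L, 30_000L);
        long applied = overrides.markStale("10.0.0.1", 50010, 60_000L, 1_000L);
        System.out.println("applied duration (ms): " + applied);
    }
}
```

A datanode that keeps answering heartbeats for a few seconds during a shutdown still reads as stale inside the window, which is the scenario in 2) above. Once the window expires, only the heartbeat decides.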

> Add an API in the namenode to mark a datanode as stale
> ------------------------------------------------------
>
>                 Key: HDFS-4754
>                 URL: https://issues.apache.org/jira/browse/HDFS-4754
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, namenode
>            Reporter: Nicolas Liochon
>            Assignee: Nicolas Liochon
>            Priority: Critical
>         Attachments: 4754.v1.patch
>
>
> There is a detection of the stale datanodes in HDFS since HDFS-3703, with a 
> timeout, defaulted to 30s.
> There are two reasons to add an API to mark a node as stale even if the 
> timeout is not yet reached:
>  1) ZooKeeper can detect that a client is dead at any moment. So, for HBase, 
> we sometimes start the recovery before a node is marked stale (even with 
> reasonable settings such as stale: 20s; HBase ZK timeout: 30s).
>  2) Some third parties could detect that a node is dead before the timeout, 
> hence saving us the cost of retrying. An example of such hardware is Arista, 
> presented here by [~tsuna] 
> http://tsunanet.net/~tsuna/fsf-hbase-meetup-april13.pdf, and confirmed in 
> HBASE-6290.
> As usual, even if the node is dead it can come back before the 10-minute 
> limit. So I would propose setting a time bound. The API would be
> namenode.markStale(String ipAddress, int port, long durationInMs);
> After durationInMs, the namenode would again rely only on its heartbeat to 
> decide.
> Thoughts?
> If there are no objections, and if nobody in the hdfs dev team has time to 
> spend on it, I will give it a try for branch 2 & 3.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
