[ 
https://issues.apache.org/jira/browse/HDFS-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Clampffer updated HDFS-10310:
-----------------------------------
    Attachment: HDFS-10310.HDFS-8707.000.patch

Initial fix posted, plus a tiny unrelated refactor to reduce the time spent 
holding the lock in the DN connection.

It turns out the problem wasn't due to timeouts; it was asio throwing when we 
called close/cancel on an asio socket that never connected.  The exception 
would propagate up to the catch-all blocks that wrap the worker threads.  The 
catch block would handle it and issue a warning, and then the worker threads 
would happily continue waiting for more work.  I replaced the close/cancel 
socket calls with SafeDisconnect, which handles the exceptions thrown by asio.  
I checked the rest of the library for unchecked closes/shutdowns/cancels and it 
looks like it's good to go.
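
For readers who don't have the patch open, here is a rough sketch of the 
pattern, not the code from the patch.  FakeSocket is a hypothetical stand-in 
for an asio socket that throws from cancel()/close() when it never connected, 
and the SafeDisconnect signature and return value shown here are assumptions 
made for illustration only.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an asio socket: cancel()/close() throw if the
// socket never connected, mimicking the failure described above.
class FakeSocket {
 public:
  explicit FakeSocket(bool connected) : connected_(connected) {}
  void cancel() {
    if (!connected_) throw std::runtime_error("cancel: socket not connected");
  }
  void close() {
    if (!connected_) throw std::runtime_error("close: socket not connected");
    connected_ = false;
  }
  bool is_open() const { return connected_; }

 private:
  bool connected_;
};

// Sketch of a SafeDisconnect-style helper (assumed shape, not the patch).
// Each teardown step runs in its own try block so a throw from cancel()
// does not skip close(), and nothing escapes to the caller.
// Returns an empty string on success, otherwise the collected messages.
template <typename Socket>
std::string SafeDisconnect(Socket* sock) {
  std::string errors;
  if (sock == nullptr) return errors;
  try {
    sock->cancel();
  } catch (const std::exception& e) {
    errors += e.what();
  }
  try {
    sock->close();
  } catch (const std::exception& e) {
    if (!errors.empty()) errors += "; ";
    errors += e.what();
  }
  if (!errors.empty()) std::cerr << "SafeDisconnect: " << errors << '\n';
  return errors;
}
```

Call sites that previously did sock->cancel(); sock->close(); would call 
SafeDisconnect(sock) instead, so a failed teardown is logged where it happens 
rather than surfacing in a worker thread's catch-all.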

At some point we do need to add a watchdog to enforce timeouts, but I'd like to 
get some burn-in time on the current cancel logic, which a watchdog would 
presumably end up using.
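
To make the watchdog idea concrete, here is one possible shape using only the 
standard library.  Nothing here exists in libhdfs++: the Watchdog class, its 
Disarm() method, and the callback are all hypothetical, and a real version 
would more likely be built on the asio timers already available to the 
library.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical watchdog: runs on_timeout once unless Disarm() is called
// before the deadline.  on_timeout would be where the existing cancel
// logic gets invoked.
class Watchdog {
 public:
  Watchdog(std::chrono::milliseconds timeout, std::function<void()> on_timeout)
      : thread_([this, timeout, cb = std::move(on_timeout)] {
          std::unique_lock<std::mutex> lock(mu_);
          // wait_for returns the predicate value: true means disarmed in time.
          bool disarmed =
              cv_.wait_for(lock, timeout, [this] { return disarmed_; });
          if (!disarmed) {
            fired_ = true;
            lock.unlock();
            cb();  // deadline passed: cancel the pending operation
          }
        }) {}

  ~Watchdog() { Disarm(); }

  // Stops the watchdog and waits for its thread to exit.
  // Returns true if the timeout callback already ran.
  bool Disarm() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      disarmed_ = true;
    }
    cv_.notify_one();
    if (thread_.joinable()) thread_.join();
    return fired_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool disarmed_ = false;
  bool fired_ = false;
  std::thread thread_;  // declared last so the members above exist first
};
```

A connect path could arm the watchdog before starting the connect, pass a 
callback that cancels the pending connection, and call Disarm() once the 
connect completes.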

> libhdfs++: hdfsConnect needs timeout logic
> ------------------------------------------
>
>                 Key: HDFS-10310
>                 URL: https://issues.apache.org/jira/browse/HDFS-10310
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: James Clampffer
>            Assignee: James Clampffer
>         Attachments: HDFS-10310.HDFS-8707.000.patch
>
>
> hdfsConnect will hang when it attempts to connect to a non-existent NN; right 
> now the client has to wait on a TCP timeout to get unstuck.  Adding a 
> reasonable timeout to FileSystem::Connect will fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)