[ https://issues.apache.org/jira/browse/HDFS-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
James Clampffer updated HDFS-10310:
-----------------------------------
Attachment: HDFS-10310.HDFS-8707.000.patch
Initial fix posted, plus a tiny unrelated refactor to reduce the time spent
holding the lock in the DN connection.
It turns out the problem wasn't due to timeouts; it was asio throwing when we
called close/cancel on an asio socket that never connected. The exception would
propagate up to the catch-all blocks that wrap the worker threads. The catch
would handle it and issue a warning, and then the worker threads would happily
continue waiting for more things to do. I replaced the close/cancel socket
calls with SafeDisconnect, which handles the exceptions thrown by asio. I
checked the rest of the library for unchecked closes/shutdowns/cancels, and it
looks like it's good to go.
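
For readers following along, here is a minimal sketch of the wrapping pattern
described above. It is not the code from the patch: FakeSocket is a
hypothetical stand-in for an asio socket so the example builds with only the
standard library, and SafeDisconnectSketch is an assumed name and signature,
not the library's actual SafeDisconnect.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an asio socket. cancel() and close() throw when
// the socket never connected, mimicking the failure described above.
class FakeSocket {
 public:
  explicit FakeSocket(bool connected) : connected_(connected), open_(true) {}

  void cancel() {
    if (!connected_) throw std::runtime_error("cancel: socket not connected");
  }

  void close() {
    if (!connected_) throw std::runtime_error("close: socket not connected");
    open_ = false;
  }

  bool is_open() const { return open_; }

 private:
  bool connected_;
  bool open_;
};

// Sketch of a SafeDisconnect-style helper (name and signature are assumptions).
// Attempts cancel() and then close(), catching anything either call throws so
// the exception never reaches the caller. Returns true if both calls
// succeeded; otherwise stores the first error message in *error (when error
// is non-null) and returns false.
template <typename Socket>
bool SafeDisconnectSketch(Socket& sock, std::string* error = nullptr) {
  bool ok = true;
  auto record = [&](const char* what) {
    if (ok && error) *error = what;  // keep only the first failure message
    ok = false;
  };

  try {
    sock.cancel();
  } catch (const std::exception& e) {
    record(e.what());
  } catch (...) {
    record("unknown error in cancel");
  }

  try {
    sock.close();
  } catch (const std::exception& e) {
    record(e.what());
  } catch (...) {
    record("unknown error in close");
  }

  return ok;
}
```

With this shape, a failed disconnect is reported to the caller as a return
value instead of unwinding into the worker thread's catch-all. Real asio
sockets also provide close/cancel overloads that take an error_code argument
instead of throwing, which would be another way to get the same effect.
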
At some point we do need to add a watchdog to do timeouts, but I'd like to get
some burn-in time on the current cancel logic, which a watchdog would
presumably end up using.
> libhdfs++: hdfsConnect needs timeout logic
> ------------------------------------------
>
> Key: HDFS-10310
> URL: https://issues.apache.org/jira/browse/HDFS-10310
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs-client
> Reporter: James Clampffer
> Assignee: James Clampffer
> Attachments: HDFS-10310.HDFS-8707.000.patch
>
>
> hdfsConnect will hang when it attempts to connect to a non-existent NN. Right
> now the client has to wait on a TCP timeout to get unstuck. Adding a
> reasonable timeout on FileSystem::Connect will fix this.