[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Allen Wittenauer updated HDFS-6166:
-----------------------------------
Fix Version/s: (was: 3.0.0)
> revisit balancer so_timeout
> ----------------------------
>
> Key: HDFS-6166
> URL: https://issues.apache.org/jira/browse/HDFS-6166
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer
> Affects Versions: 3.0.0, 2.3.0
> Reporter: Nathan Roberts
> Assignee: Nathan Roberts
> Priority: Blocker
> Fix For: 0.23.11, 2.4.0
>
> Attachments: HDFS-6166-branch23.patch, HDFS-6166.patch
>
>
> HDFS-5806 changed the socket read timeout for the balancer connection to DN
> to 60 seconds. This works as long as balancer bandwidth is such that it's
> safe to assume that the DN will easily complete the operation within this
> time. Obviously this isn't a good assumption. When this assumption isn't
> valid, the balancer will time out the command BUT it will then be out-of-sync
> with the datanode (balancer thinks the DN has room to do more work, DN is still
> working on the request and will fail any subsequent requests with "threads
> quota exceeded" errors). This causes expensive NN traffic via getBlocks() and
> also causes lots of WARNs in the balancer log.
> Unfortunately the protocol is such that it's impossible to tell if the DN is
> busy working on replacing the block, OR is in bad shape and will never finish.
> So, in the interest of a small change to deal with both situations, I propose
> the following two changes:
> * Crank up the socket read timeout to 20 minutes
> * Delay looking at a node for a bit if we did time out in this way (the DN
> could still have xceiver threads working on the replace
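
The two proposed changes could be sketched roughly as follows. This is an illustration only, not the code in HDFS-6166.patch: the class name, the method names, and the back-off length are all hypothetical (the description only says "for a bit"). The one real API call is java.net.Socket#setSoTimeout.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Hadoop source.
final class BalancerTimeoutSketch {
    // Change 1: socket read timeout of 20 minutes, in milliseconds.
    static final int READ_TIMEOUT_MS = 20 * 60 * 1000;

    // Change 2: how long to leave a datanode alone after a read timeout.
    // The value is an assumption supplied by the caller.
    private final long delayMs;

    // Datanode id -> earliest time (ms) at which it may be picked again.
    private final Map<String, Long> delayedUntil = new HashMap<>();

    BalancerTimeoutSketch(long delayMs) {
        this.delayMs = delayMs;
    }

    // Apply the longer read timeout to a balancer-to-DN socket.
    static void applyReadTimeout(Socket socket) throws SocketException {
        socket.setSoTimeout(READ_TIMEOUT_MS);
    }

    // Call when a request to this datanode hit the read timeout. The DN may
    // still have xceiver threads busy, so it is skipped for delayMs.
    void recordTimeout(String datanode, long nowMs) {
        delayedUntil.put(datanode, nowMs + delayMs);
    }

    // True if the datanode may be given new work at time nowMs.
    boolean isEligible(String datanode, long nowMs) {
        Long until = delayedUntil.get(datanode);
        if (until == null) {
            return true;
        }
        if (nowMs >= until) {
            delayedUntil.remove(datanode); // back-off has expired
            return true;
        }
        return false;
    }
}
```

With a 10-second back-off, a datanode that timed out at t=1s would be skipped at t=5s and considered again at t=11s, while other datanodes are unaffected.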
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)