[jira] [Updated] (HDFS-6166) revisit balancer so_timeout

Allen Wittenauer (JIRA) Tue, 09 Sep 2014 13:18:42 -0700

     [ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Allen Wittenauer updated HDFS-6166:
-----------------------------------
    Fix Version/s:     (was: 3.0.0)

> revisit balancer so_timeout 
> ----------------------------
>
>                 Key: HDFS-6166
>                 URL: https://issues.apache.org/jira/browse/HDFS-6166
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>            Priority: Blocker
>             Fix For: 0.23.11, 2.4.0
>
>         Attachments: HDFS-6166-branch23.patch, HDFS-6166.patch
>
>
> HDFS-5806 changed the socket read timeout for the balancer connection to DN 
> to 60 seconds. This works as long as balancer bandwidth is such that it's 
> safe to assume that the DN will easily complete the operation within this 
> time. Obviously this isn't a good assumption. When this assumption isn't 
> valid, the balancer will time out the command BUT it will then be out-of-sync with 
> the datanode (balancer thinks the DN has room to do more work, DN is still 
> working on the request and will fail any subsequent requests with "threads 
> quota exceeded errors"). This causes expensive NN traffic via getBlocks() and 
> also causes lots of WARNs in the balancer log.
> Unfortunately the protocol is such that it's impossible to tell if the DN is 
> busy working on replacing the block, OR is in bad shape and will never finish.
> So, in the interest of a small change to deal with both situations, I propose 
> the following two changes:
> * Crank up the socket read timeout to 20 minutes
> * Delay looking at a node for a bit if we did time out in this way (the DN 
> could still have xceiver threads working on the replace 
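
The two proposed changes can be sketched roughly as follows. This is a minimal illustration only, not the code in the attached patches: the class name, the method names, and the 10-second back-off are all assumptions (the description only says "for a bit"). The one real API used is java.net.Socket.setSoTimeout(int), which sets the socket read timeout in milliseconds.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the proposal; not taken from the HDFS-6166 patches. */
class BalancerTimeoutSketch {
    // Change 1: raise the socket read timeout to 20 minutes (in milliseconds).
    static final int READ_TIMEOUT_MS = 20 * 60 * 1000;

    // Change 2: back-off after a read timeout. The value is an assumption;
    // the issue description does not give a number.
    static final long DELAY_AFTER_TIMEOUT_MS = 10_000L;

    // Datanode identifier -> earliest time (ms) it may be considered again.
    private final Map<String, Long> delayedUntil = new HashMap<>();

    /** Applies the longer read timeout to a balancer-to-DN socket. */
    static void applyReadTimeout(Socket socket) throws SocketException {
        socket.setSoTimeout(READ_TIMEOUT_MS);
    }

    /** Called when a request to this datanode hit the read timeout. */
    void recordTimeout(String datanode, long nowMs) {
        delayedUntil.put(datanode, nowMs + DELAY_AFTER_TIMEOUT_MS);
    }

    /** True if the datanode may be scheduled for more work at nowMs. */
    boolean isEligible(String datanode, long nowMs) {
        Long until = delayedUntil.get(datanode);
        return until == null || nowMs >= until;
    }
}
```

The back-off exists because, after a timeout, the balancer cannot tell whether the DN is still busy with the replace or will never finish. Skipping the node briefly avoids sending requests that would likely fail against the thread quota.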



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
