[jira] [Commented] (HDFS-7704) DN heartbeat to Active NN may be blocked and expire if connection to Standby NN continues to time out.

Kihwal Lee (JIRA) Wed, 04 Feb 2015 13:29:55 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305997#comment-14305997
 ]


Kihwal Lee commented on HDFS-7704:
----------------------------------

{{processQueueMessages()}} is called outside of the try-catch block of 
{{offerService()}}. It would be better to move it inside, at the end of the try 
block, and let the existing exception handling take care of the rest. Then we 
can get rid of the {{nnAddress}} arguments that are passed only for logging. Make 
{{reportTo()}} throw and have {{offerService()}} always log exceptions with 
{{nnAddress}}.
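
To illustrate the shape being suggested, here is a minimal sketch, not the actual patch. {{ActorSketch}}, {{Report}}, {{sendHeartbeat()}}, {{offerServiceOnce()}} and the in-memory {{log}} list are hypothetical stand-ins for the real actor code. It assumes {{reportTo()}} takes no address argument and simply throws {{IOException}}, so the one catch block in the service loop is the only place that needs {{nnAddress}}.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for an actor thread; NOT the real Hadoop class. */
class ActorSketch {
    /** A queued report; it throws on failure instead of logging by itself. */
    interface Report {
        void reportTo() throws IOException;
    }

    private final String nnAddress;
    private final Queue<Report> queue = new ArrayDeque<>();
    /** In-memory log so the sketch can be inspected without a logging library. */
    final List<String> log = new ArrayList<>();

    ActorSketch(String nnAddress) {
        this.nnAddress = nnAddress;
    }

    void enqueue(Report r) {
        queue.add(r);
    }

    private void sendHeartbeat() {
        log.add("heartbeat sent to " + nnAddress);
    }

    /** Drains the queue; any IOException propagates to the caller. */
    private void processQueueMessages() throws IOException {
        Report r;
        while ((r = queue.poll()) != null) {
            r.reportTo();
        }
    }

    /** One iteration of the service loop (assumed shape). */
    void offerServiceOnce() {
        try {
            sendHeartbeat();
            // Called at the end of the try block, so the existing handler
            // below covers report failures as well.
            processQueueMessages();
        } catch (IOException e) {
            // The single place where nnAddress is needed for logging.
            log.add("IOException in offerService for " + nnAddress
                + ": " + e.getMessage());
        }
    }
}
```

Note that this sketch drops the failed report once it has been polled; that is a separate concern from where the exception is logged.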

When a report fails, is it okay to simply drop it? Probably not. Take a look 
at {{reportReceivedDeletedBlocks()}}. We should add similar logic to 
{{processQueueMessages()}} that puts failed reports back on the queue.
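
One possible way to put a failed report back is a success flag with try/finally, sketched below. This is an assumption about how it could look, not code copied from {{reportReceivedDeletedBlocks()}}; {{RequeueSketch}}, {{Report}} and {{pending()}} are hypothetical names. The failed report goes back to the head of the queue so ordering is preserved, and the exception still propagates so the caller can log it.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical queue holder; NOT the real Hadoop class. */
class RequeueSketch {
    interface Report {
        void reportTo() throws IOException;
    }

    private final Deque<Report> queue = new ArrayDeque<>();

    synchronized void enqueue(Report r) {
        queue.addLast(r);
    }

    synchronized int pending() {
        return queue.size();
    }

    /**
     * Sends queued reports in order. If one fails, it is put back at the
     * head of the queue and the exception propagates to the caller.
     */
    void processQueueMessages() throws IOException {
        while (true) {
            Report r;
            synchronized (this) {
                r = queue.pollFirst();
            }
            if (r == null) {
                return; // queue drained
            }
            boolean success = false;
            try {
                r.reportTo(); // the lock is not held during the (slow) call
                success = true;
            } finally {
                if (!success) {
                    synchronized (this) {
                        queue.addFirst(r); // retry on the next pass
                    }
                }
            }
        }
    }
}
```

A real version would probably also want a bound on retries or on queue size, so that an unreachable NN cannot make the queue grow without limit.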

> DN heartbeat to Active NN may be blocked and expire if connection to Standby 
> NN continues to time out. 
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7704
>                 URL: https://issues.apache.org/jira/browse/HDFS-7704
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.5.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-7704-v2.patch, HDFS-7704-v3.patch, 
> HDFS-7704-v4.patch, HDFS-7704.patch
>
>
> There are a couple of synchronous calls in BPOfferService (i.e., reportBadBlocks 
> and trySendErrorReport) which wait for both of the actor threads to 
> process these calls.
> These calls are made with the writeLock acquired.
> When reportBadBlocks() is blocked at the RPC layer due to an unreachable NN, 
> subsequent heartbeat response processing has to wait for the write lock. It 
> eventually gets through, but takes too long and blocks the next heartbeat.
> In our HA cluster setup, the standby namenode was taking a long time to 
> process the request.
> Requesting an improvement in the datanode to make the above calls asynchronous, since 
> these reports don't have any specific
> deadlines, so a few extra seconds of delay should be acceptable.
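
For reference, the asynchronous behaviour requested in the description can be sketched on the caller side as follows. This is only an illustration under assumptions: {{AsyncReportSketch}}, {{Actor}} and the string-valued {{pending}} queue are hypothetical, and the real classes carry far more state. The point is that the write lock is held only long enough to enqueue, never across an RPC.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the per-block-pool service; NOT the real class. */
class AsyncReportSketch {
    /** One actor per NameNode, each with its own queue of pending reports. */
    static class Actor {
        final String nnAddress;
        final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();

        Actor(String nnAddress) {
            this.nnAddress = nnAddress;
        }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    final List<Actor> actors = new ArrayList<>();

    AsyncReportSketch(String... nnAddresses) {
        for (String a : nnAddresses) {
            actors.add(new Actor(a));
        }
    }

    /**
     * Enqueues the report for every actor and returns at once. No RPC is
     * made here, so a slow or unreachable NN cannot stall the lock holder.
     */
    void reportBadBlocks(String blockId) {
        lock.writeLock().lock();
        try {
            for (Actor a : actors) {
                a.pending.add("bad block " + blockId);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Each actor thread would then drain its own {{pending}} queue from its service loop, so a timeout against the standby only delays that actor's reports.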



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
