[jira] [Commented] (HDFS-7704) DN heartbeat to Active NN may be blocked and expire if connection to Standby NN continues to time out.

Kihwal Lee (JIRA) Mon, 02 Feb 2015 14:30:50 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302067#comment-14302067
 ]


Kihwal Lee commented on HDFS-7704:
----------------------------------

Adding to what Charles said: If {{BPServiceActorAction}} is specifically for 
reporting to the namenode, you could add a method such as 
{{reportTo(DatanodeProtocolClientSideTranslatorPB bpNamenode)}} and have 
each subclass implement its own behavior. Apart from the point where an 
instance is created, the rest of the code won't need to know its specific 
type. The downside is that this makes batching of bad block reports difficult.
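
A minimal sketch of that idea, not the code in the patch. {{NamenodeClient}} is a hypothetical stand-in for {{DatanodeProtocolClientSideTranslatorPB}}, and the two action subclasses, the recording client and {{ActionDemo}} are invented for illustration only. It shows that the caller only ever sees the {{BPServiceActorAction}} type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for DatanodeProtocolClientSideTranslatorPB.
// The real class issues RPCs; this interface only models two report calls.
interface NamenodeClient {
    void reportBadBlock(String blockId);
    void errorReport(int errorCode, String message);
}

// Assumed shape of the unified action: one method, implemented per subclass.
interface BPServiceActorAction {
    void reportTo(NamenodeClient bpNamenode);
}

// Hypothetical action that reports a single bad block.
final class ReportBadBlockAction implements BPServiceActorAction {
    private final String blockId;

    ReportBadBlockAction(String blockId) {
        this.blockId = blockId;
    }

    @Override
    public void reportTo(NamenodeClient bpNamenode) {
        bpNamenode.reportBadBlock(blockId);
    }
}

// Hypothetical action that sends an error report.
final class ErrorReportAction implements BPServiceActorAction {
    private final int errorCode;
    private final String message;

    ErrorReportAction(int errorCode, String message) {
        this.errorCode = errorCode;
        this.message = message;
    }

    @Override
    public void reportTo(NamenodeClient bpNamenode) {
        bpNamenode.errorReport(errorCode, message);
    }
}

// Illustration-only client that records calls instead of doing RPC.
final class RecordingNamenodeClient implements NamenodeClient {
    final List<String> calls = new ArrayList<>();

    @Override
    public void reportBadBlock(String blockId) {
        calls.add("badBlock:" + blockId);
    }

    @Override
    public void errorReport(int errorCode, String message) {
        calls.add("error:" + errorCode + ":" + message);
    }
}

final class ActionDemo {
    static List<String> run() {
        RecordingNamenodeClient nn = new RecordingNamenodeClient();
        List<BPServiceActorAction> pending = new ArrayList<>();
        // Only the creation site knows the concrete types.
        pending.add(new ReportBadBlockAction("blk_1"));
        pending.add(new ErrorReportAction(1, "disk error"));
        // The processing loop depends only on the common interface.
        for (BPServiceActorAction action : pending) {
            action.reportTo(nn);
        }
        return nn.calls;
    }
}
```

Because each action issues its own call, nothing in this shape aggregates several bad blocks into one report, which is the batching difficulty mentioned above.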

In the current patch, the way {{bpThreadQueue}} is synchronized will block 
{{bpThreadEnqueue()}} if an RPC call blocks. Instead, you could copy the 
contents of the queue into a new collection inside a synchronized block, and 
then call the reporting method (e.g. {{reportTo()}}) of each element outside 
the synchronized block.
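
The copy-then-process pattern could look roughly like the sketch below. {{ActionQueue}} and {{processQueue()}} are hypothetical names, {{Runnable}} stands in for the action type, and clearing the queue after copying is an assumption about the intended behavior. The names {{bpThreadQueue}} and {{bpThreadEnqueue()}} are borrowed from the patch, but this is not the patch's code.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;

final class ActionQueue {
    // Guarded by its own monitor.
    private final Queue<Runnable> bpThreadQueue = new LinkedList<>();

    // Producers hold the lock only long enough to append.
    void bpThreadEnqueue(Runnable action) {
        synchronized (bpThreadQueue) {
            bpThreadQueue.add(action);
        }
    }

    // Hypothetical consumer: copy the queue under the lock, run the actions
    // without it. Returns the number of actions processed.
    int processQueue() {
        List<Runnable> snapshot;
        synchronized (bpThreadQueue) {
            snapshot = new ArrayList<>(bpThreadQueue);
            // Assumption: the queue is emptied once its contents are copied.
            bpThreadQueue.clear();
        }
        for (Runnable action : snapshot) {
            // A slow or blocked call here no longer holds the queue lock,
            // so bpThreadEnqueue() on other threads is not blocked.
            action.run();
        }
        return snapshot.size();
    }
}
```

Anything enqueued while the actions are running is picked up by the next {{processQueue()}} call.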

If you want to make batching of bad block reporting work, it may not be worth 
trying to introduce the unified {{BPServiceActorAction}} concept.  For bad 
blocks, "{{enqueue()}}" or "{{add()}}" can put things directly into an 
ArrayList, which simplifies the aggregation at reporting time.  If you believe 
the {{BPServiceActorAction}}-based abstraction will provide more value in the 
future, giving up on batched bad block reporting is okay. After all, the 
datanode is not doing it today.
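
For comparison, a sketch of the batching alternative under the same caveats: {{BadBlockBatcher}}, {{add()}} and {{drain()}} are hypothetical names, and block IDs are plain strings here rather than the real block types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for bad blocks waiting to be reported.
final class BadBlockBatcher {
    private final List<String> pendingBadBlocks = new ArrayList<>();

    // Called by whoever detects a bad block; just appends to the list.
    synchronized void add(String blockId) {
        pendingBadBlocks.add(blockId);
    }

    // Called at reporting time; returns everything accumulated so far
    // so the caller can send it as a single report, and resets the list.
    synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(pendingBadBlocks);
        pendingBadBlocks.clear();
        return batch;
    }
}
```

The reporting thread would call {{drain()}} and, if the result is non-empty, send the whole batch in one call made outside the lock.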

> DN heartbeat to Active NN may be blocked and expire if connection to Standby 
> NN continues to time out. 
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7704
>                 URL: https://issues.apache.org/jira/browse/HDFS-7704
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.5.0
>            Reporter: Rushabh S Shah
>            Assignee: Rushabh S Shah
>         Attachments: HDFS-7704-v2.patch, HDFS-7704.patch
>
>
> There are a couple of synchronous calls in BPOfferService (i.e. 
> reportBadBlocks and trySendErrorReport) which wait for both of the actor 
> threads to process these calls.
> These calls are made with the writeLock acquired.
> When reportBadBlocks() is blocked at the RPC layer due to an unreachable NN, 
> subsequent heartbeat response processing has to wait for the write lock. It 
> eventually gets through, but takes too long and blocks the next heartbeat.
> In our HA cluster setup, the standby namenode was taking a long time to 
> process the request.
> Requesting an improvement in the datanode to make the above calls 
> asynchronous. These reports don't have any specific deadlines, so an extra 
> few seconds of delay should be acceptable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
