[ 
https://issues.apache.org/jira/browse/HADOOP-4584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676355#action_12676355
 ] 

Brian Bockelman commented on HADOOP-4584:
-----------------------------------------

Hey Raghu,

- Regarding your point above that periodic block verification handles the 
various things that can go wrong with a block:  Currently, it's woefully 
insufficient as a replacement for the directory scan, especially on large 
data nodes.  If we wait 3 weeks (or several months for some of our large 
nodes) before we find a block is missing, we're going to see lots and lots of 
issues crop up!

- I have seen the 'rm -r' in practice, by the way :).

- With a reasonably sized block, we've had 48TB servers take only a few 
minutes for a scan: no heartbeats lost.  That said, I do like your argument 
that the DN should handle things to the best of its abilities and not die.

I like the idea of the patch, but only if it's combined with an occasional 
offline scan (even once a day!).  Creeping inconsistency bugs in the NN seem to 
make very accurate block reports a precious commodity, one that I'd gladly pay 
an expensive scan for (though I agree that once an hour is probably excessive).
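
To make the value of an occasional scan concrete, here is a rough sketch of 
the reconciliation it enables: compare the block IDs the DN believes it holds 
with the block IDs actually found on disk.  The class and method names 
(BlockReconcileSketch, missingOnDisk, unknownOnDisk) are hypothetical and are 
not taken from the patch or from the DataNode code; real blocks also carry 
generation stamps and lengths, which this ignores.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: reconcile an in-memory block map with a directory scan. */
class BlockReconcileSketch {

    /** Blocks the DN thinks it has that the scan did not find (e.g. after an 'rm -r'). */
    static Set<Long> missingOnDisk(Set<Long> inMemory, Set<Long> onDisk) {
        Set<Long> missing = new TreeSet<>(inMemory);
        missing.removeAll(onDisk);
        return missing;
    }

    /** Blocks found by the scan that the in-memory map does not know about. */
    static Set<Long> unknownOnDisk(Set<Long> inMemory, Set<Long> onDisk) {
        Set<Long> unknown = new TreeSet<>(onDisk);
        unknown.removeAll(inMemory);
        return unknown;
    }

    public static void main(String[] args) {
        Set<Long> inMemory = new TreeSet<>(Set.of(1L, 2L, 3L));
        Set<Long> onDisk = new TreeSet<>(Set.of(2L, 3L, 4L));
        System.out.println("missing on disk: " + missingOnDisk(inMemory, onDisk));
        System.out.println("unknown on disk: " + unknownOnDisk(inMemory, onDisk));
    }
}
```

Either non-empty set means the next block report built purely from memory 
would be inaccurate, which is the case the occasional scan is meant to catch.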

> Slow generation of blockReport at DataNode causes delay of sending heartbeat 
> to NameNode
> ----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4584
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4584
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>            Reporter: Hairong Kuang
>            Assignee: Suresh Srinivas
>             Fix For: 0.20.0
>
>         Attachments: 4584.patch, 4584.patch, 4584.patch, 4584.patch, 
> 4584.patch, 4584.patch
>
>
> Sometimes, due to disk or other problems, a datanode takes minutes or tens 
> of minutes to generate a block report. This prevents the datanode from 
> sending a heartbeat to the NameNode every 3 seconds. In the worst case, the 
> NameNode detects a lost heartbeat and wrongly decides that the datanode is 
> dead.
> It would be nice to have two threads instead. One thread scans the data 
> directories, generates the block report, and executes the requests sent by 
> the NameNode; another thread sends heartbeats and block reports and picks 
> up the requests from the NameNode. With these two threads, the sending of 
> heartbeats will not be delayed by a slow block report or slow execution of 
> NameNode requests.
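
The two-thread split proposed in the description can be sketched roughly as 
follows.  This is a minimal illustration under stated assumptions, not the 
attached patch: the class HeartbeatSketch and the method heartbeatsDuringScan 
are hypothetical, a sleep stands in for the slow directory scan, and a counter 
stands in for the heartbeat RPC to the NameNode.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: heartbeats and block-report generation on separate threads. */
class HeartbeatSketch {

    /**
     * Runs a slow "scan" on one thread while a second thread keeps sending
     * heartbeats. Returns how many heartbeats went out before the scan finished.
     */
    static int heartbeatsDuringScan(long scanMillis, long heartbeatIntervalMillis) {
        AtomicInteger heartbeats = new AtomicInteger();
        CountDownLatch scanDone = new CountDownLatch(1);

        // Thread 1: sends a heartbeat at a fixed rate, independent of the scan.
        ScheduledExecutorService heartbeatThread =
                Executors.newSingleThreadScheduledExecutor();
        heartbeatThread.scheduleAtFixedRate(
                heartbeats::incrementAndGet, 0, heartbeatIntervalMillis,
                TimeUnit.MILLISECONDS);

        // Thread 2: the slow work (assumption: sleep stands in for the disk scan).
        Thread scanner = new Thread(() -> {
            try {
                Thread.sleep(scanMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                scanDone.countDown();
            }
        });
        scanner.start();

        try {
            scanDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            heartbeatThread.shutdownNow();
        }
        return heartbeats.get();
    }

    public static void main(String[] args) {
        System.out.println("heartbeats sent during scan: "
                + heartbeatsDuringScan(500, 50));
    }
}
```

Because the heartbeat runs on its own thread, a scan that takes far longer 
than the heartbeat interval does not stop heartbeats from going out.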

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
