[ https://issues.apache.org/jira/browse/HDFS-5809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062385#comment-14062385 ]
Aaron T. Myers commented on HDFS-5809:
--------------------------------------
+1, the patch looks good to me. I agree that writing a unit test for this would
be fairly difficult, and the fix is really quite clear, so I'm OK committing it
without a test.
Thanks a lot for taking care of this, Colin, and thanks much to ikweesung for
reporting this issue.
> BlockPoolSliceScanner and high speed hdfs appending make datanode to drop
> into infinite loop
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-5809
> URL: https://issues.apache.org/jira/browse/HDFS-5809
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.0.0-alpha
> Environment: jdk1.6, centos6.4, 2.0.0-cdh4.5.0
> Reporter: ikweesung
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Labels: blockpoolslicescanner, datanode, infinite-loop
> Attachments: HDFS-5809.001.patch
>
>
> {{BlockPoolSliceScanner#scan}} contains a "while" loop that continues to
> verify (i.e. scan) blocks until the {{blockInfoSet}} is empty (or some other
> conditions like a timeout have occurred.) In order to do this, it calls
> {{BlockPoolSliceScanner#verifyFirstBlock}}. This is intended to grab the
> first block in the {{blockInfoSet}}, verify it, and remove it from that set.
> ({{blockInfoSet}} is sorted by last scan time.) Unfortunately, if we hit a
> certain bug in {{updateScanStatus}}, the block may never be removed from
> {{blockInfoSet}}. When this happens, we keep rescanning the exact same block
> until the timeout hits.
> The bug is triggered when a block winds up in {{blockInfoSet}} but not in
> {{blockMap}}. You can see it clearly in this code:
> {code}
>   private synchronized void updateScanStatus(Block block,
>                                              ScanType type,
>                                              boolean scanOk) {
>     BlockScanInfo info = blockMap.get(block);
>     if ( info != null ) {
>       delBlockInfo(info);
>     } else {
>       // It might already be removed. Thats ok, it will be caught next time.
>       info = new BlockScanInfo(block);
>     }
> {code}
> If {{info == null}}, we never call {{delBlockInfo}}, the function which is
> intended to remove the {{blockInfoSet}} entry.
> Luckily, there is a simple fix here: the variable that {{updateScanStatus}}
> is being passed is actually a {{BlockScanInfo}} object, so we can simply call
> {{delBlockInfo}} on it directly, without doing a lookup in the {{blockMap}}.
> This is both faster and more robust.
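
For readers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of the behaviour described above. It is not the actual Hadoop code: the class ScanLoopSketch, the ScanInfo stand-in for BlockScanInfo, the loop's termination condition, and the maxIterations cap (standing in for the timeout) are all assumptions made for illustration. Only the shape of the two update methods follows the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

final class ScanLoopSketch {
    /** Hypothetical stand-in for BlockScanInfo: a block id plus its last scan time. */
    static final class ScanInfo implements Comparable<ScanInfo> {
        final long blockId;
        long lastScanTime;

        ScanInfo(long blockId, long lastScanTime) {
            this.blockId = blockId;
            this.lastScanTime = lastScanTime;
        }

        @Override
        public int compareTo(ScanInfo o) {
            // Sorted by last scan time, with the block id as a tie-breaker.
            int c = Long.compare(lastScanTime, o.lastScanTime);
            return c != 0 ? c : Long.compare(blockId, o.blockId);
        }
    }

    final TreeSet<ScanInfo> blockInfoSet = new TreeSet<>();
    final Map<Long, ScanInfo> blockMap = new HashMap<>();

    void addBlockInfo(ScanInfo info) {
        blockInfoSet.add(info);
        blockMap.put(info.blockId, info);
    }

    void delBlockInfo(ScanInfo info) {
        blockInfoSet.remove(info);
        blockMap.remove(info.blockId);
    }

    /** Pre-patch shape: only removes whatever the map lookup happens to find. */
    void updateBuggy(ScanInfo scanned, long now) {
        ScanInfo info = blockMap.get(scanned.blockId);
        if (info != null) {
            delBlockInfo(info);
        } else {
            // 'scanned' is left behind in blockInfoSet with its old scan time.
            info = new ScanInfo(scanned.blockId, 0L);
        }
        info.lastScanTime = now;
        addBlockInfo(info);
    }

    /** Post-patch shape: removes the entry that was actually scanned. */
    void updateFixed(ScanInfo scanned, long now) {
        delBlockInfo(scanned);
        scanned.lastScanTime = now;
        addBlockInfo(scanned);
    }

    /** Scans until the oldest entry is up to date, or until the iteration cap is hit. */
    int scan(boolean fixed, long now, int maxIterations) {
        int iterations = 0;
        while (!blockInfoSet.isEmpty()
                && blockInfoSet.first().lastScanTime < now
                && iterations < maxIterations) {
            ScanInfo first = blockInfoSet.first();
            if (fixed) {
                updateFixed(first, now);
            } else {
                updateBuggy(first, now);
            }
            iterations++;
        }
        return iterations;
    }

    /** Builds the bad state: one entry in blockInfoSet that is missing from blockMap. */
    static int run(boolean fixed) {
        ScanLoopSketch s = new ScanLoopSketch();
        s.blockInfoSet.add(new ScanInfo(1L, 0L));
        return s.scan(fixed, 100L, 1000);
    }

    public static void main(String[] args) {
        System.out.println("buggy iterations: " + run(false));
        System.out.println("fixed iterations: " + run(true));
    }
}
```

In this model the buggy variant rescans the same stale entry until the cap is reached (1000 iterations), while the fixed variant finishes after a single pass, because the entry returned by blockInfoSet.first() is the one that gets removed and re-added.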