[
https://issues.apache.org/jira/browse/HDFS-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375781#comment-14375781
]
Yi Liu commented on HDFS-7960:
------------------------------
This is a good fix and improvement. Some comments:
*1.* In {{BlockManager}}, the logic for detecting zombie datanode storages has
a problem.
{code}
if (context != null) {
  storageInfo.setLastBlockReportId(context.getReportId());
  if (lastStorageInRpc) {
    int rpcsSeen = node.updateBlockReportContext(context);
    if (rpcsSeen >= context.getTotalRpcs()) {
      List<DatanodeStorageInfo> zombies = node.removeZombieStorages();
      if (zombies.isEmpty()) {
        ...
{code}
In the patch, *rpcsSeen* is computed on the NN by counting every RPC received
for the same block report, which is not safe for split reports.
{{DatanodeProtocol#blockReport}} is {{@Idempotent}}, so an RPC may be retried
and counted more than once. In that case
{{if (rpcsSeen >= context.getTotalRpcs())}} can be *true* even though some
datanode storages have not yet sent their part of the split report, and those
storages will be treated as zombies and wrongly removed from the NN.
I suggest checking that every RPC id of the block report has been received
before checking for zombie storages.
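To make the suggestion concrete, here is a minimal sketch of tracking which RPCs of a split report have actually arrived, rather than counting calls. It is not taken from the patch: the class {{BlockReportRpcTracker}} and its method {{recordRpc}} are hypothetical names, and the sketch assumes each RPC carries its report id, the total number of RPCs in the report, and its own zero-based index within the report.

```java
import java.util.BitSet;

/**
 * Hypothetical helper (not part of the HDFS-7960 patch): remembers which
 * RPC indices of the current block report have been received, so that a
 * retried RPC is not counted twice.
 */
final class BlockReportRpcTracker {
    private long currentReportId = 0;
    private BitSet rpcsSeen = new BitSet();

    /**
     * Records one RPC of a (possibly split) block report.
     *
     * @param reportId  id shared by all RPCs of one block report (assumed)
     * @param totalRpcs number of RPCs the report was split into (assumed)
     * @param curRpc    zero-based index of this RPC in the report (assumed)
     * @return true only when every index in [0, totalRpcs) has been seen
     */
    synchronized boolean recordRpc(long reportId, int totalRpcs, int curRpc) {
        if (curRpc < 0 || curRpc >= totalRpcs) {
            throw new IllegalArgumentException("RPC index out of range: " + curRpc);
        }
        if (reportId != currentReportId) {
            // A new block report has started; forget the previous one.
            currentReportId = reportId;
            rpcsSeen = new BitSet(totalRpcs);
        }
        // Setting a bit is idempotent, so a retry of the same RPC is harmless.
        rpcsSeen.set(curRpc);
        return rpcsSeen.cardinality() == totalRpcs;
    }
}
```

With something along these lines, the zombie check would run only after {{recordRpc}} returns true, so a retried RPC cannot push the count past the threshold while other splits are still outstanding.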
*2.* Another comment is in {{removeZombieReplicas}}:
{code}
removeStoredBlock(block, zombie.getDatanodeDescriptor());
{code}
When removing the stored block, we should also remove it from
{{InvalidateBlocks}}. How about calling
{{removeBlocksAssociatedTo(final DatanodeDescriptor node)}}? That would also
save some lines of code.
> The full block report should prune zombie storages even if they're not empty
> ----------------------------------------------------------------------------
>
> Key: HDFS-7960
> URL: https://issues.apache.org/jira/browse/HDFS-7960
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Lei (Eddy) Xu
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HDFS-7960.002.patch, HDFS-7960.003.patch,
> HDFS-7960.004.patch, HDFS-7960.005.patch, HDFS-7960.006.patch
>
>
> The full block report should prune zombie storages even if they're not empty.
> We have seen cases in production where zombie storages have not been pruned
> subsequent to HDFS-7575. This could arise any time the NameNode thinks there
> is a block in some old storage which is actually not there. In this case,
> the block will not show up in the "new" storage (once old is renamed to new)
> and the old storage will linger forever as a zombie, even with the HDFS-7596
> fix applied. This also happens with datanode hotplug, when a drive is
> removed. In this case, an entire storage (volume) goes away but the blocks
> do not show up in another storage on the same datanode.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)