[
https://issues.apache.org/jira/browse/HDFS-9818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153225#comment-15153225
]
Tsz Wo Nicholas Sze commented on HDFS-9818:
-------------------------------------------
- Should we check all targets instead of only the first target in
validateReconstructionWork(..)?
{code}
    if (!isInNewRack(rw.getSrcNodes(), targets[0].getDatanodeDescriptor())) {
      // No use continuing, unless a new rack in this case
      return false;
    }
{code}
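For comparison, an all-targets version of the check could look roughly like the sketch below. This is a hedged illustration using plain rack strings instead of the real DatanodeDescriptor/isInNewRack(..) machinery; the class and method names are made up for the example:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NewRackCheck {
  /**
   * Returns true if at least one target sits on a rack that none of the
   * source nodes occupy. Racks are plain strings here; in BlockManager the
   * same idea would go through DatanodeDescriptor.getNetworkLocation().
   */
  public static boolean anyTargetOnNewRack(String[] srcRacks, String[] targetRacks) {
    Set<String> srcSet = new HashSet<>(Arrays.asList(srcRacks));
    for (String target : targetRacks) {
      if (!srcSet.contains(target)) {
        return true; // this target adds a new rack
      }
    }
    return false; // every target shares a rack with some source
  }
}
```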
- We may move some of the code from validateReconstructionWork(..) to
ErasureCodingWork. Then, we can eliminate DatanodeAndBlockIndex and make some
methods private.
{code}
//validateReconstructionWork(..)
    // Add block to the to be reconstructed list
    if (block.isStriped()) {
      assert rw.getTargets().length > 0;
      assert pendingNum == 0 : "Should wait the previous reconstruction"
          + " to finish";
      ((ErasureCodingWork) rw).addBlockToBeReconstructed(
          (BlockInfoStriped) block, getBlockPoolId());
    } else {
      rw.getSrcNodes()[0].addBlockToBeReplicated(block, targets);
    }
{code}
{code}
//ErasureCodingWork
  void addBlockToBeReconstructed(BlockInfoStriped blk, String bpid) {
    // if we already have all the internal blocks, but not enough racks,
    // we only need to replicate one internal block to a new rack
    if (hasAllInternalBlocks()) {
      final int i = chooseSource4SimpleReplication();
      final int blkIdx = getLiveBlockIndicies()[i];
      final DatanodeDescriptor dn = getSrcNodes()[i];
      final long len = StripedBlockUtil.getInternalBlockLength(
          blk.getNumBytes(), blk.getCellSize(), blk.getDataBlockNum(), blkIdx);
      final long id = blk.getBlockId() + blkIdx;
      final Block targetBlk = new Block(id, len, blk.getGenerationStamp());
      dn.addBlockToBeReplicated(targetBlk, getTargets());
    } else {
      getTargets()[0].getDatanodeDescriptor().addBlockToBeErasureCoded(
          new ExtendedBlock(bpid, blk),
          getSrcNodes(), getTargets(), getLiveBlockIndicies(),
          blk.getErasureCodingPolicy());
    }
  }
{code}
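The simple-replication branch sizes the target block via StripedBlockUtil.getInternalBlockLength(..). A simplified, hedged sketch of that computation (the class name here is illustrative, not the actual utility): data cells go round-robin across the data blocks, and a parity block is as long as the longest data block (index 0).

```java
public class InternalBlockLength {
  /**
   * Simplified sketch: length of internal block {@code idx} of a striped
   * block group holding {@code numBytes} data bytes, laid out round-robin
   * in cells of {@code cellSize} across {@code dataBlkNum} data blocks.
   * Parity blocks (idx >= dataBlkNum) match the length of data block 0.
   */
  public static long internalBlockLength(long numBytes, int cellSize,
      int dataBlkNum, int idx) {
    final long stripeSize = (long) cellSize * dataBlkNum;
    final long lastStripeLen = numBytes % stripeSize;
    if (lastStripeLen == 0) {
      return numBytes / dataBlkNum; // all internal blocks are full
    }
    final long fullStripes = numBytes / stripeSize;
    final int cellIdx = idx < dataBlkNum ? idx : 0; // parity matches block 0
    final long lastCell = Math.min(cellSize,
        Math.max(0, lastStripeLen - (long) cellSize * cellIdx));
    return fullStripes * cellSize + lastCell;
  }
}
```

For example, 10 bytes striped in 4-byte cells over 3 data blocks yields internal lengths 4, 4, 2, with each parity block sized 4.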
{code}
//ErasureCodingWork
  private int chooseSource4SimpleReplication() {
    Map<String, List<Integer>> map = new HashMap<>();
    for (int i = 0; i < getSrcNodes().length; i++) {
      final String rack = getSrcNodes()[i].getNetworkLocation();
      List<Integer> dnList = map.get(rack);
      if (dnList == null) {
        dnList = new ArrayList<>();
        map.put(rack, dnList);
      }
      dnList.add(i);
    }
    int max = 0;
    String rack = null;
    for (Map.Entry<String, List<Integer>> entry : map.entrySet()) {
      if (entry.getValue().size() > max) {
        max = entry.getValue().size();
        rack = entry.getKey();
      }
    }
    assert rack != null;
    return map.get(rack).get(0);
  }
{code}
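Stripped of the HDFS types, chooseSource4SimpleReplication() groups source indices by rack and returns the first index from the most crowded rack (replicating from there is most likely to leave a rack over-represented anyway). The core logic, as a self-contained sketch with an illustrative class name:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CrowdedRackPicker {
  /** Returns the index of the first element belonging to the most frequent rack. */
  public static int pickFromMostCrowdedRack(String[] racks) {
    Map<String, List<Integer>> byRack = new HashMap<>();
    for (int i = 0; i < racks.length; i++) {
      // computeIfAbsent replaces the get/null-check/put dance in the patch
      byRack.computeIfAbsent(racks[i], r -> new ArrayList<>()).add(i);
    }
    int max = 0;
    String best = null;
    for (Map.Entry<String, List<Integer>> e : byRack.entrySet()) {
      if (e.getValue().size() > max) {
        max = e.getValue().size();
        best = e.getKey();
      }
    }
    return byRack.get(best).get(0);
  }
}
```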
{code}
//ErasureCodingWork
  private boolean hasAllInternalBlocks() {
    ...
  }
{code}
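The body of hasAllInternalBlocks() is elided above. One plausible implementation, sketched here with a plain index array standing in for the work's live-block state (names are illustrative), checks that the live indices cover every internal block of the group, i.e. only the rack count is deficient, not the data:

```java
public class InternalBlockCoverage {
  /**
   * Sketch: true if every internal block index in [0, totalBlkNum) appears
   * among the live indices, meaning no internal block is actually missing.
   */
  public static boolean hasAllInternalBlocks(byte[] liveBlockIndices, int totalBlkNum) {
    boolean[] seen = new boolean[totalBlkNum];
    for (byte idx : liveBlockIndices) {
      if (idx >= 0 && idx < totalBlkNum) {
        seen[idx] = true;
      }
    }
    for (boolean s : seen) {
      if (!s) {
        return false; // at least one internal block is genuinely missing
      }
    }
    return true;
  }
}
```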
> Correctly handle EC reconstruction work caused by not enough racks
> ------------------------------------------------------------------
>
> Key: HDFS-9818
> URL: https://issues.apache.org/jira/browse/HDFS-9818
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, namenode
> Affects Versions: 3.0.0
> Reporter: Takuya Fukudome
> Assignee: Jing Zhao
> Attachments: HDFS-9818.000.patch, HDFS-9818.001.patch
>
>
> This is reported by [~tfukudom]:
> In a system test where 1 of 7 datanode racks were stopped,
> {{HadoopIllegalArgumentException}} was seen on DataNode side while
> reconstructing missing EC blocks:
> {code}
> 2016-02-16 11:09:06,672 WARN datanode.DataNode (ErasureCodingWorker.java:run(482)) - Failed to recover striped block: BP-480558282-172.29.4.13-1453805190696:blk_-9223372036850962784_278270
> org.apache.hadoop.HadoopIllegalArgumentException: Inputs not fully corresponding to erasedIndexes in null places. erasedOrNotToReadIndexes: [1, 2, 6], erasedIndexes: [3]
>         at org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder.doDecode(RSRawDecoder.java:166)
>         at org.apache.hadoop.io.erasurecode.rawcoder.AbstractRawErasureDecoder.decode(AbstractRawErasureDecoder.java:84)
>         at org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder.decode(RSRawDecoder.java:89)
>         at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.recoverTargets(ErasureCodingWorker.java:683)
>         at org.apache.hadoop.hdfs.server.datanode.erasurecode.ErasureCodingWorker$ReconstructAndTransferBlock.run(ErasureCodingWorker.java:465)
> {code}
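The exception fires because the erased index (3) is not among the indices the reader planned to skip ([1, 2, 6]), so no null input slot lines up with it. The precondition the decoder enforces amounts to a containment check, roughly as follows (illustrative code, not the RSRawDecoder source):

```java
import java.util.Arrays;

public class ErasedIndexCheck {
  /**
   * Sketch of the decoder's precondition: every index to be reconstructed
   * must be one of the indices that were not read (i.e. a null input slot).
   */
  public static boolean erasedIndexesCovered(int[] erasedOrNotToRead, int[] erased) {
    int[] sorted = erasedOrNotToRead.clone();
    Arrays.sort(sorted);
    for (int e : erased) {
      if (Arrays.binarySearch(sorted, e) < 0) {
        return false; // mismatch: would trigger HadoopIllegalArgumentException
      }
    }
    return true;
  }
}
```

With the values from the log, erasedOrNotToReadIndexes = [1, 2, 6] and erasedIndexes = [3], the check fails, matching the reported failure.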