Yao Guangdong created HDFS-15186:
------------------------------------
Summary: Erasure Coding: Decommission may generate the parity
block's content with all 0 in some case
Key: HDFS-15186
URL: https://issues.apache.org/jira/browse/HDFS-15186
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, erasure-coding
Affects Versions: 3.1.3, 3.2.1, 3.0.3
Reporter: Yao Guangdong
Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2
I found that some parity blocks contain all zeros after decommissioning
more than one DataNode from a cluster. The probability is high (on the order
of parts per thousand). This is a serious problem: if we read data from an
all-zero parity block, or use it to recover another block, we end up using
corrupt data without knowing it.
Some example cases are below:
B: Busy DataNode,
D: Decommissioning DataNode,
Others are normal.
1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
....
In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D),
7, 8(D)], the DN may receive a reconstruct-block command with liveIndices=[0,
1, 2, 3, 4, 5, 7, 8] and a targets field (in the class
StripedReconstructionInfo) of length 2.
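The mismatch in this first case can be checked with a small standalone sketch.
It is not Hadoop code: the class MissingIndexCheck and the method
missingIndices are hypothetical names, and the total of 9 internal blocks is
an assumption taken from the indices 0..8 shown above.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class MissingIndexCheck {
    // Returns every index in [0, totalBlocks) that is absent from liveIndices.
    static List<Integer> missingIndices(int[] liveIndices, int totalBlocks) {
        BitSet live = new BitSet(totalBlocks);
        for (int i : liveIndices) {
            live.set(i);
        }
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < totalBlocks; i++) {
            if (!live.get(i)) {
                missing.add(i);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        int[] liveIndices = {0, 1, 2, 3, 4, 5, 7, 8}; // from the first case
        int targetsLength = 2;                        // length reported above
        List<Integer> missing = missingIndices(liveIndices, 9); // 9 blocks: assumption
        // Prints: missing = [6], targets.length = 2
        System.out.println("missing = " + missing
                + ", targets.length = " + targetsLength);
    }
}
```

Only index 6 is missing, yet two targets were requested, so one target slot
has no real block to recover.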
A targets length of 2 means that, in the current code, the DataNode needs to
recover 2 internal blocks. But from liveIndices we can find only 1 missing
block, so the method StripedWriter#initTargetIndices uses 0 as the default
block to recover, without checking whether index 0 is already among the source
indices.
The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to
recover indices [6, 0]. Index 0 is in both the source indices and the target
indices in this case, and the target buffer returned by the EC algorithm for
index 6 is always all 0. My first thought was that this is the EC algorithm's
problem, because it should be more fault tolerant. I tried to fix it there,
but that was too hard because there are too many cases. The second case in the
example above is another one (source indices [1, 2, 3, 4, 5, 7] used to
recover indices [0, 6, 0]). So I changed my mind: invoke the EC algorithm with
correct parameters, which means removing the duplicate target index 0 in this
case. In the end, I fixed it this way.
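The target-index behavior described above can be modeled roughly as follows.
This is a simplified sketch, not the actual StripedWriter code or the
committed patch: the class TargetIndexSketch and its methods are hypothetical,
9 internal blocks (indices 0..8) are assumed, and the "fix" shown is just one
way to drop the padded 0 entries, namely by keeping only the slots that were
actually filled with a missing index.

```java
import java.util.Arrays;
import java.util.BitSet;

class TargetIndexSketch {
    // Assumption: 9 internal blocks per group, indices 0..8.
    static final int TOTAL_BLOCKS = 9;

    static BitSet liveSet(int... liveIndices) {
        BitSet live = new BitSet(TOTAL_BLOCKS);
        for (int i : liveIndices) {
            live.set(i);
        }
        return live;
    }

    // Models the problem: the array is sized by the number of targets, so any
    // slot without a real missing block keeps Java's default value 0.
    static short[] paddedTargetIndices(BitSet live, int targetCount) {
        short[] targetIndices = new short[targetCount];
        int m = 0;
        for (int i = 0; i < TOTAL_BLOCKS && m < targetCount; i++) {
            if (!live.get(i)) {
                targetIndices[m++] = (short) i;
            }
        }
        return targetIndices;
    }

    // Models the idea of the fix: pass the coder only the filled slots.
    static short[] realTargetIndices(BitSet live, int targetCount) {
        short[] targetIndices = new short[targetCount];
        int m = 0;
        for (int i = 0; i < TOTAL_BLOCKS && m < targetCount; i++) {
            if (!live.get(i)) {
                targetIndices[m++] = (short) i;
            }
        }
        return Arrays.copyOf(targetIndices, m); // drop the unfilled tail
    }

    public static void main(String[] args) {
        BitSet case1 = liveSet(0, 1, 2, 3, 4, 5, 7, 8);
        System.out.println(Arrays.toString(paddedTargetIndices(case1, 2))); // [6, 0]
        System.out.println(Arrays.toString(realTargetIndices(case1, 2)));   // [6]

        BitSet case2 = liveSet(1, 2, 3, 4, 5, 7, 8);
        System.out.println(Arrays.toString(paddedTargetIndices(case2, 3))); // [0, 6, 0]
        System.out.println(Arrays.toString(realTargetIndices(case2, 3)));   // [0, 6]
    }
}
```

In the first case the padded array [6, 0] names index 0 as a target although
it is also a source; in the second case [0, 6, 0] lists index 0 twice. The
trimmed arrays contain each missing index exactly once.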