[
https://issues.apache.org/jira/browse/HDFS-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966668#comment-16966668
]
Ayush Saxena commented on HDFS-14946:
-------------------------------------
v004 LGTM +1
> Erasure Coding: Block recovery failed during decommissioning
> ------------------------------------------------------------
>
> Key: HDFS-14946
> URL: https://issues.apache.org/jira/browse/HDFS-14946
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Fei Hui
> Assignee: Fei Hui
> Priority: Major
> Attachments: HDFS-14946.001.patch, HDFS-14946.002.patch,
> HDFS-14946.003.patch, HDFS-14946.004.patch
>
>
> DataNode logs as follow
> {quote}
> org.apache.hadoop.HadoopIllegalArgumentException: No enough valid inputs are
> provided, not recoverable
> at
> org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.checkInputBuffers(ByteBufferDecodingState.java:119)
> at
> org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.<init>(ByteBufferDecodingState.java:47)
> at
> org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:86)
> at
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstructTargets(StripedBlockReconstructor.java:126)
> at
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:97)
> at
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {quote}
> Block recovery always failed because of srcNodes in the wrong order
> Reproduce steps are:
> # ec block (b0, b1, b2, b3, b4, b5, b6, b7, b8), b[0-8] are on dn[0-8],
> dn[0-3] are decommissioning
> # dn[1-3] are decommissioned, dn0 are in decommissioning, ec block is
> [b0(decommissioning), b[1-3](decommissioned), b[4-8](live), b[0-3](live)]
> # dn4 is crash, and b4 will be recovery, ec block is [b0(decommissioning),
> b[1-3](decommissioned), null, b[5-8](live), b[0-3](live)]
> We can see error log as above, and b4 is not recovery successfuly. Because
> srcNodes transfered to recovery datanode contains block [b0, b[5-8],b[0-3]],
> and datanode use [b0, b[5-8], b0](minRequiredSources Readers to reconstruct,
> minRequiredSources = Math.min(cellsNum, dataBlkNum)) to recovery the missing
> block.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]