[ 
https://issues.apache.org/jira/browse/HDFS-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-14946:
---------------------------
    Attachment: HDFS-14946.003.patch

> Erasure Coding: Block recovery failed during decommissioning
> ------------------------------------------------------------
>
>                 Key: HDFS-14946
>                 URL: https://issues.apache.org/jira/browse/HDFS-14946
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.3, 3.2.1, 3.1.3
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: HDFS-14946.001.patch, HDFS-14946.002.patch, 
> HDFS-14946.003.patch
>
>
> The DataNode logs the following error:
> {quote}
> org.apache.hadoop.HadoopIllegalArgumentException: No enough valid inputs are provided, not recoverable
>       at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.checkInputBuffers(ByteBufferDecodingState.java:119)
>       at org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.<init>(ByteBufferDecodingState.java:47)
>       at org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:86)
>       at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstructTargets(StripedBlockReconstructor.java:126)
>       at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:97)
>       at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:748)
> {quote}
> Block recovery always fails because the srcNodes are in the wrong order.
> Steps to reproduce:
> # An EC block group (b0, b1, b2, b3, b4, b5, b6, b7, b8) has b[0-8] on 
> dn[0-8], and dn[0-3] are decommissioning.
> # dn[1-3] finish decommissioning while dn0 is still decommissioning. The 
> block group is now [b0(decommissioning), b[1-3](decommissioned), 
> b[4-8](live), b[0-3](live)].
> # dn4 crashes, so b4 must be recovered. The block group is now 
> [b0(decommissioning), b[1-3](decommissioned), null, b[5-8](live), 
> b[0-3](live)].
> The error above is then logged and b4 is not recovered successfully. The 
> srcNodes transferred to the recovering DataNode contain blocks 
> [b0, b[5-8], b[0-3]], and the DataNode uses only the first 
> minRequiredSources readers, [b0, b[5-8], b0], to reconstruct the missing 
> block (minRequiredSources = Math.min(cellsNum, dataBlkNum)). Because b0 
> appears twice in that selection, the decoder receives fewer distinct inputs 
> than it needs.
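
The ordering problem described above can be illustrated with a small standalone sketch. It is not the HDFS code and not the attached patch: the class SrcOrderSketch and the method distinctAmongFirst are hypothetical names, and it assumes a 6+3 layout (dataBlkNum = 6) and a full stripe (cellsNum >= 6). It only counts how many distinct block indices fall within the first minRequiredSources entries of a source list.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not taken from Hadoop or from the patch.
class SrcOrderSketch {
    // Assumption: a 6 data + 3 parity layout, as suggested by b0..b8.
    static final int DATA_BLK_NUM = 6;

    // Counts distinct block indices among the first n sources, in list order.
    static int distinctAmongFirst(int[] srcBlockIndices, int n) {
        Set<Integer> distinct = new HashSet<>();
        int limit = Math.min(n, srcBlockIndices.length);
        for (int i = 0; i < limit; i++) {
            distinct.add(srcBlockIndices[i]);
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        int cellsNum = 9; // assumption: full stripe, so min(...) is dataBlkNum
        int minRequiredSources = Math.min(cellsNum, DATA_BLK_NUM);

        // Order from the report: b0 (decommissioning), b5..b8, then live b0..b3.
        int[] reported = {0, 5, 6, 7, 8, 0, 1, 2, 3};
        // A different order with the duplicate b0 last, for comparison only.
        int[] duplicateLast = {5, 6, 7, 8, 0, 1, 2, 3, 0};

        System.out.println("reported order:      "
            + distinctAmongFirst(reported, minRequiredSources)
            + " distinct of " + minRequiredSources + " required");
        System.out.println("duplicate-last order: "
            + distinctAmongFirst(duplicateLast, minRequiredSources)
            + " distinct of " + minRequiredSources + " required");
    }
}
```

Under these assumptions the reported order yields only 5 distinct blocks among the first 6 sources, which matches the "No enough valid inputs" failure, while an order that places the duplicate b0 last yields 6.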



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
