[ 
https://issues.apache.org/jira/browse/HDFS-14946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fei Hui updated HDFS-14946:
---------------------------
    Status: Patch Available  (was: Open)

> Erasure Coding: Block recovery failed during decommissioning
> ------------------------------------------------------------
>
>                 Key: HDFS-14946
>                 URL: https://issues.apache.org/jira/browse/HDFS-14946
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.1.3, 3.2.1, 3.0.3
>            Reporter: Fei Hui
>            Assignee: Fei Hui
>            Priority: Major
>         Attachments: HDFS-14946.001.patch
>
>
> DataNode logs as follow
> {quote}
> org.apache.hadoop.HadoopIllegalArgumentException: No enough valid inputs are 
> provided, not recoverable
>       at 
> org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.checkInputBuffers(ByteBufferDecodingState.java:119)
>       at 
> org.apache.hadoop.io.erasurecode.rawcoder.ByteBufferDecodingState.<init>(ByteBufferDecodingState.java:47)
>       at 
> org.apache.hadoop.io.erasurecode.rawcoder.RawErasureDecoder.decode(RawErasureDecoder.java:86)
>       at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstructTargets(StripedBlockReconstructor.java:126)
>       at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:97)
>       at 
> org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:748)
> {quote}
> Block recovery always failed because of srcNodes in the wrong order
> Reproduce steps are
> * ec block (b0, b1, b2, b3, b4, b5, b6, b7, b8), b[0-8] are on dn[0-8], 
> dn[0-3] are decommissioning
> * dn[1-3] are decommissioned, dn0 are in decommissioning, ec block is 
> (b0{decommissioning}, b[1-3]{decommissioned}, b[4-8]{live}, b[0-3]{live})
> * dn4 is crash, and b4 will be recovery, ec block is (b0{decommissioning}, 
> b[1-3]{decommissioned}, null, b[5-8]{live}, b[0-3]{live})
> We can see error log as above, and b4 is not recovery successfuly. Because 
> srcNodes transfered to recovery datanode contains block (b0, b[5-8],b[0-3]), 
> and datanode use {b0, b[5-8], b0} to recovery the missing block.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to