[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853065#comment-17853065 ]

ruiliang commented on HDFS-15759:
---------------------------------

When I verify a block group whose data has already been corrupted multiple times, why does it still appear normal? The ByteBuffer hb shows [0..] !image-2024-06-07-15-52-26-294.png! Can this situation be detected as an anomaly?

{code:bash}
hdfs debug verifyEC -file /file.orc
24/06/07 15:40:29 WARN erasurecode.ErasureCodeNative: ISA-L support is not available in your platform... using builtin-java codec where applicable
Checking EC block group: blk_-9223372036492703744
Status: OK
{code}

Checking the ORC file:

{code:java}
Structure for skip_ip/_skip_file
File Version: 0.12 with ORC_517 by ORC Java
Exception in thread "main" java.io.IOException: Problem opening stripe 0 footer in skip_ip/_skip_file.
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:360)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:879)
	at org.apache.orc.impl.ReaderImpl.rows(ReaderImpl.java:873)
	at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:345)
	at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:276)
	at org.apache.orc.tools.FileDump.main(FileDump.java:137)
	at org.apache.orc.tools.Driver.main(Driver.java:124)
Caused by: java.lang.IllegalArgumentException: Buffer size too small. size = 131072 needed = 7752508 in column 3 kind LENGTH
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:481)
	at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:528)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:507)
	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:59)
	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:333)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.readDictionaryLengthStream(TreeReaderFactory.java:2221)
	at org.apache.orc.impl.TreeReaderFactory$StringDictionaryTreeReader.startStripe(TreeReaderFactory.java:2201)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.startStripe(TreeReaderFactory.java:1943)
	at org.apache.orc.impl.reader.tree.StructBatchReader.startStripe(StructBatchReader.java:112)
	at org.apache.orc.impl.RecordReaderImpl.readStripe(RecordReaderImpl.java:1251)
	at org.apache.orc.impl.RecordReaderImpl.advanceStripe(RecordReaderImpl.java:1290)
	at org.apache.orc.impl.RecordReaderImpl.advanceToNextRow(RecordReaderImpl.java:1333)
	at org.apache.orc.impl.RecordReaderImpl.<init>(RecordReaderImpl.java:355)
	... 6 more
{code}

> EC: Verify EC reconstruction correctness on DataNode
> ----------------------------------------------------
>
>                 Key: HDFS-15759
>                 URL: https://issues.apache.org/jira/browse/HDFS-15759
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, ec, erasure-coding
>    Affects Versions: 3.4.0
>            Reporter: Toshihiko Uchida
>            Assignee: Toshihiko Uchida
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.3.1, 3.4.0, 3.2.3
>
>          Time Spent: 10h 20m
>  Remaining Estimate: 0h
>
> EC reconstruction on DataNode has caused data corruption: HDFS-14768, HDFS-15186 and HDFS-15240. Those issues occur under specific conditions and the corruption is neither detected nor auto-healed by HDFS.
> It is obviously hard for users to monitor data integrity by themselves, and even if they find corrupted data, it is difficult or sometimes impossible to recover it.
> To prevent further data corruption issues, this feature proposes a simple and effective way to verify EC reconstruction correctness on DataNode at each reconstruction process.
> It verifies the correctness of outputs decoded from inputs as follows:
> 1. Decode an input from the outputs;
> 2. Compare the decoded input with the original input.
> For instance, in RS-6-3, assume that outputs [d1, p1] are decoded from inputs [d0, d2, d3, d4, d5, p0]. Then the verification is done by decoding d0 from [d1, d2, d3, d4, d5, p1], and comparing the original and decoded data of d0.
> When an EC reconstruction task goes wrong, the comparison will fail with high probability.
> Then the task will also fail and be retried by NameNode.
> The next reconstruction will succeed if the condition that triggered the failure is gone.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
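The decode-then-verify scheme described above can be sketched with a toy single-parity XOR code. This is NOT the real HDFS implementation (the DataNode path uses the RS-6-3 codec and internal erasure-coding APIs); the class and method names below are hypothetical, purely to illustrate re-decoding one surviving input from the reconstructed output and comparing it with the original:

{code:java}
import java.util.Arrays;

/**
 * Minimal sketch of HDFS-15759's decode-then-verify idea, using a toy
 * single-parity XOR code (k data units + 1 parity) instead of RS-6-3.
 */
public class EcVerifySketch {

    /** XOR the given units together. For a single-parity code this decodes
     *  any one missing unit from the surviving data + parity units. */
    static byte[] xorDecode(byte[][] units) {
        byte[] out = new byte[units[0].length];
        for (byte[] u : units)
            for (int i = 0; i < out.length; i++)
                out[i] ^= u[i];
        return out;
    }

    /** Reconstruct units[missing] from all the other units. */
    static byte[] reconstruct(byte[][] units, int missing) {
        byte[][] inputs = new byte[units.length - 1][];
        int j = 0;
        for (int i = 0; i < units.length; i++)
            if (i != missing) inputs[j++] = units[i];
        return xorDecode(inputs);
    }

    /**
     * The verification step: re-decode one surviving input ("probe") from the
     * remaining survivors plus the freshly reconstructed output, and compare
     * it with the original. A corrupted reconstruction output fails this
     * comparison with high probability.
     */
    static boolean verify(byte[][] units, int missing, byte[] output, int probe) {
        byte[][] check = new byte[units.length - 1][];
        int j = 0;
        for (int i = 0; i < units.length; i++)
            if (i != missing && i != probe) check[j++] = units[i];
        check[j] = output;  // swap the reconstructed output in for the probe
        return Arrays.equals(xorDecode(check), units[probe]);
    }
}
{code}

A buggy reconstruction can be simulated by flipping a bit in the output after `reconstruct` returns; `verify` then reports a mismatch, which in the real feature would fail the reconstruction task so the NameNode retries it.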
[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848806#comment-17848806 ]

ruiliang commented on HDFS-15759:
---------------------------------

Hello, our production data has also run into this kind of EC data corruption; the problem is described at https://github.com/apache/orc/issues/1939. If we cherry-pick your current code (GitHub pull request #2869), can we skip backporting the patches for the related issues HDFS-14768, HDFS-15186 and HDFS-15240? Our current HDFS version is 3.1.0. Thank you!
[jira] [Commented] (HDFS-15759) EC: Verify EC reconstruction correctness on DataNode
[ https://issues.apache.org/jira/browse/HDFS-15759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315247#comment-17315247 ]

Wei-Chiu Chuang commented on HDFS-15759:
----------------------------------------

This is a great tool. If there are no objections, I intend to backport it to lower branches.