[ https://issues.apache.org/jira/browse/HDDS-11171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shilun Fan updated HDDS-11171:
------------------------------
    Description:
We rely heavily on EC (Erasure Coding) in our internal Ozone deployment. When a DN (DataNode) disk fails, some EC replica data is lost and has to be reconstructed on other DNs. This reconstruction can either succeed or fail. To see the outcome of each EC block reconstruction quickly, I intend to add an audit log dedicated to EC reconstruction. This matters most when reconstruction fails, because the audit entry lets us pinpoint the cause promptly.

Success log:
{code:java}
2024-07-13 12:06:25,371 | INFO | DNAudit | user=null | ip=null | op=RECOVER_EC_BLOCK { blockId={blockID={conID: 964637 locID: 113750155051714398 bcsId: 0}, length=4766503, offset=0, token=null, pipeline=Pipeline[ Id: 622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone-m1/xx.xx.xxx.xx) 7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-ozone-m2/xx.xx.xxx.xx) d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone-m3/xx.xx.xxx.xx) ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-ozone-m4/xx.xx.xxx.xx) 7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone-m5/xx.xx.xxx.xx) 6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-ozone-m6/xx.xx.xxx.xx) 791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-ozone-m7/xx.xx.xxx.xx) b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-ozone-m8/xx.xx.xxx.xx), excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=SUCCESS |
{code}

Failure log:
{code:java}
2024-07-13 12:06:25,751 | ERROR | DNAudit | user=null | ip=null | op=RECOVER_EC_BLOCK {blockId={blockID={conID: 964637 locID: 113750155051715549 bcsId: 0}, length=163577856, offset=0, token=null, pipeline=Pipeline[Id: 622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: df941469-8358-402a-8600-0d3f508f9cda(bigdata-s2102-m1/xx.xx.xxx.xx) 7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-s2102-m2/xx.xx.xxx.xx) d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-s2102-m3/xx.xx.xxx.xx) ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-s2102-m4/xx.xx.xxx.xx) 7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-s2102-m5/xx.xx.xxx.xx) 6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-s2102-m6/xx.xx.xxx.xx) 791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-s2102-m7/xx.xx.xxx.xx) b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-s2102-m8/xx.xx.xxx.xx), excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | java.lang.IllegalArgumentException: The chunk list has 26 entries, but the checksum chunks has 27 entries. They should be equal in size.
	at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
	at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:147)
{code}

> [DN] Add EC Block Recover Audit Log
> -----------------------------------
>
>                 Key: HDDS-11171
>                 URL: https://issues.apache.org/jira/browse/HDDS-11171
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major
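The audit lines shown in this issue are pipe-delimited, so failed reconstructions can be picked out mechanically. The following is only a rough sketch of such a filter, assuming the field order seen in the sample logs (timestamp, level, logger, user, ip, op, ret, optional exception text). The class `AuditLineSketch` and its members are hypothetical names for illustration and are not part of Ozone.

```java
import java.util.Optional;

/** Illustrative parser for the DNAudit lines quoted in this issue (not Ozone code). */
public final class AuditLineSketch {

  /** One parsed audit entry; the field names are assumptions made for this sketch. */
  public static final class Entry {
    public final String timestamp;
    public final String level;
    public final String op;
    public final String result;
    public final String detail;

    Entry(String timestamp, String level, String op, String result, String detail) {
      this.timestamp = timestamp;
      this.level = level;
      this.op = op;
      this.result = result;
      this.detail = detail;
    }
  }

  /** Splits a line on '|' into at most 8 fields; returns empty if the layout does not match. */
  public static Optional<Entry> parse(String line) {
    if (line == null) {
      return Optional.empty();
    }
    // Limit 8 keeps everything after "ret=..." (the exception text) in one field.
    String[] f = line.trim().split("\\s*\\|\\s*", 8);
    if (f.length < 7 || !f[5].startsWith("op=") || !f[6].startsWith("ret=")) {
      return Optional.empty();
    }
    // The op field looks like "op=RECOVER_EC_BLOCK {params}"; keep only the name.
    String opField = f[5].substring("op=".length());
    int brace = opField.indexOf('{');
    String op = (brace < 0 ? opField : opField.substring(0, brace)).trim();
    String result = f[6].substring("ret=".length()).trim();
    String detail = f.length == 8 ? f[7].trim() : "";
    return Optional.of(new Entry(f[0], f[1], op, result, detail));
  }

  /** True when the line records a failed RECOVER_EC_BLOCK operation. */
  public static boolean isFailedEcRecovery(String line) {
    return parse(line)
        .map(e -> "RECOVER_EC_BLOCK".equals(e.op) && "FAILURE".equals(e.result))
        .orElse(false);
  }

  public static void main(String[] args) {
    // Shortened sample line; the parameter block is abbreviated for readability.
    String sample = "2024-07-13 12:06:25,751 | ERROR | DNAudit | user=null | ip=null | "
        + "op=RECOVER_EC_BLOCK {blockId={blockID={conID: 964637}}} | ret=FAILURE | "
        + "java.lang.IllegalArgumentException: The chunk list has 26 entries";
    parse(sample).ifPresent(e -> System.out.println(e.op + " " + e.result + " " + e.detail));
  }
}
```

Splitting with a limit keeps the exception message intact even if it contains a `|` character. The sketch assumes that the parameter block itself never contains a pipe, which holds for the samples above but is not guaranteed.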