[ 
https://issues.apache.org/jira/browse/HDDS-11171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDDS-11171:
------------------------------
    Description: 
We rely heavily on EC (Erasure Coding) in our internal Ozone deployment. When 
a disk on a DN (DataNode) fails, some EC replica data is lost and has to be 
reconstructed on other DataNodes. This reconstruction can either succeed or 
fail. To make the outcome of each EC block reconstruction easy to check, I 
intend to add an audit log dedicated to EC reconstruction. This matters most 
when reconstruction fails, because the audit entry records the exception and 
lets us pinpoint the cause quickly.
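
As a rough illustration of the idea, the sketch below wraps a reconstruction 
task and produces one audit line for success and one for failure, using the 
pipe-delimited layout of the sample entries in this description. It is only a 
sketch under assumptions: the class EcRecoverAuditSketch and its methods are 
hypothetical names, it does not use Ozone's real audit classes, and the 
timestamp is left to the logging framework.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch; this is not Ozone's actual audit API. */
public final class EcRecoverAuditSketch {

  private EcRecoverAuditSketch() {
  }

  /** Builds one pipe-delimited audit line (timestamp omitted on purpose). */
  public static String format(String level, String op, String params,
      String ret, Throwable error) {
    StringBuilder sb = new StringBuilder();
    sb.append(level)
        .append(" | DNAudit | user=null | ip=null | op=")
        .append(op)
        .append(" {").append(params).append("} | ret=")
        .append(ret)
        .append(" |");
    if (error != null) {
      // On failure, append the exception so the cause is visible in the audit line.
      sb.append(' ').append(error);
    }
    return sb.toString();
  }

  /** Runs the reconstruction task and returns the audit line describing its outcome. */
  public static String recoverWithAudit(String blockParams,
      Callable<Void> reconstruction) {
    try {
      reconstruction.call();
      return format("INFO", "RECOVER_EC_BLOCK", blockParams, "SUCCESS", null);
    } catch (Exception e) {
      return format("ERROR", "RECOVER_EC_BLOCK", blockParams, "FAILURE", e);
    }
  }
}
```

A real implementation would hand the line to the DataNode audit logger instead 
of returning it; returning a string here just keeps the sketch self-contained.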


{code:java}
2024-07-13 12:06:25,371 | INFO  | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK { blockId={blockID={conID: 964637 locID: 113750155051714398 
bcsId: 0}, length=4766503, offset=0, token=null, pipeline=Pipeline[ Id: 
622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: 
df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone-m1/xx.xx.xxx.xx) 
7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-ozone-m2/xx.xx.xxx.xx) 
d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone-m3/xx.xx.xxx.xx) 
ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-ozone-m4/xx.xx.xxx.xx) 
7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone-m5/xx.xx.xxx.xx) 
6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-ozone-m6/xx.xx.xxx.xx) 
791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-ozone-m7/xx.xx.xxx.xx) 
b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-ozone-m8/xx.xx.xxx.xx), 
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], 
createVersion=0, partNumber=0}} | ret=SUCCESS |
{code}


{code:java}
2024-07-13 12:06:25,751 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockId={blockID={conID: 964637 locID: 113750155051715549 
bcsId: 0}, length=163577856, offset=0, token=null, pipeline=Pipeline[Id: 
622e027d-ed89-4b25-9704-17b71ed0cf6b, Nodes: 
df941469-8358-402a-8600-0d3f508f9cda(bigdata-s2102-m1/xx.xx.xxx.xx) 
7c557397-6e8e-413f-ad0c-282634ce84f9(bigdata-s2102-m2/xx.xx.xxx.xx) 
d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-s2102-m3/xx.xx.xxx.xx) 
ca5b50fd-4538-430f-85f3-6b2b61ae51d0(bigdata-s2102-m4/xx.xx.xxx.xx) 
7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-s2102-m5/xx.xx.xxx.xx) 
6a0dbf31-d80b-464a-aba8-b964d807e5c3(bigdata-s2102-m6/xx.xx.xxx.xx) 
791f3257-bffb-4e46-b0bb-c122192bb0ba(bigdata-s2102-m7/xx.xx.xxx.xx) 
b3a06978-c73e-4f17-af0b-a890aca2d51c(bigdata-s2102-m8/xx.xx.xxx.xx), 
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-13T12:05:55.014859701+08:00[Asia/Shanghai]], 
createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 26 entries, but the 
checksum chunks has 27 entries. They should be equal in size.
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:143)
    at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:147)
{code}
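
To show how entries like the two above could be consumed, here is a minimal 
sketch of a parser that pulls the level, operation, result, and trailing 
exception text out of one audit entry. The class name DnAuditLineSketch and 
the regular expression are assumptions based only on the sample entries shown 
here, and the sketch assumes each entry has been joined into a single string 
(the line wrapping in this message comes from the email).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for audit entries shaped like the samples above. */
public final class DnAuditLineSketch {

  // timestamp | LEVEL | DNAudit | ... op=NAME ... | ret=RESULT | optional detail
  private static final Pattern ENTRY = Pattern.compile(
      "^(\\S+ \\S+) \\| (\\w+)\\s*\\| DNAudit \\|.*?op=(\\w+).*\\| ret=(\\w+) \\|\\s*(.*)$",
      Pattern.DOTALL);

  public final String timestamp;
  public final String level;
  public final String op;
  public final String ret;
  public final String detail;

  private DnAuditLineSketch(String timestamp, String level, String op,
      String ret, String detail) {
    this.timestamp = timestamp;
    this.level = level;
    this.op = op;
    this.ret = ret;
    this.detail = detail;
  }

  /** Returns the parsed entry, or empty if the text does not look like an audit entry. */
  public static Optional<DnAuditLineSketch> parse(String entry) {
    Matcher m = ENTRY.matcher(entry);
    if (!m.matches()) {
      return Optional.empty();
    }
    return Optional.of(new DnAuditLineSketch(
        m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)));
  }

  /** True when the entry reports a failed operation. */
  public boolean isFailure() {
    return "FAILURE".equals(ret);
  }
}
```

Filtering parsed entries on isFailure() and op equal to RECOVER_EC_BLOCK would 
list the failed reconstructions together with the exception text in detail.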



> [DN] Add EC Block Recover Audit Log
> -----------------------------------
>
>                 Key: HDDS-11171
>                 URL: https://issues.apache.org/jira/browse/HDDS-11171
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Datanode
>            Reporter: Shilun Fan
>            Assignee: Shilun Fan
>            Priority: Major


