slfan1989 commented on PR #7009:
URL: https://github.com/apache/ozone/pull/7009#issuecomment-2275335492

   @sodonnel 
   
   I think I may not have expressed myself very clearly. I’d like to describe 
the functionality of this PR again and provide an example to illustrate the 
entire recovery process.
   
   > Background
   
   We discovered EC reconstruction errors in our online cluster; they are 
described in detail in HDDS-10985. 
   
   ```
   java.lang.IllegalArgumentException: The chunk list has 2 entries, but the 
checksum chunks has 3 entries. 
   They should be equal in size.  
   ```
   
   After encountering these errors, we took the following steps:
   
   1. We carefully reviewed the `read` and `write` code for EC and raised some 
questions about its implementation. After you provided answers, I understood 
your approach and concluded that there were no obvious issues with the EC write 
code.
   
   2. We added extra auditLog entries for certain DataNodes (DNs) because 
finding EC reconstruction errors in DN logs is very challenging. I wrote the 
reconstruction error information into the audit.log. See the relevant PR 
(#6936) for details.
   
   After deploying that change, I discovered many block reconstruction 
errors.
   
   ```
   2024-07-25 07:25:15,830 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 951772 locID: 
113750155032021583 bcsId: 0}, length=25165824, offset=0, token=null, 
pipeline=Pipeline[ Id: cc205dc9-49a1-4c07-92b9-0c27504268b6, Nodes: 
62fa9baf-6002-4da9-bb06-eaf3e9549774(bigdata-ozone1431.online/10.77.218.54)d6b32839-fd22-4ee9-b1ef-0a2f7ae5e8d7(bigdata-ozone1302.online/10.77.213.32)8d7e54be-c94c-4bfd-ba11-4b618a1ca332(bigdata-ozone1718.online/10.77.233.41)e6599058-c6cf-4d3c-b669-4155bedc6631(bigdata-ozone1455.online/10.77.219.38)94bfe561-11b8-40b4-89e7-f3279f6b73e6(bigdata-ozone1474.online/10.77.220.18)f4bd3569-fbb4-40fa-a792-863a34df2cb4(bigdata-ozone1245.online/10.77.211.34)c1e06adb-46d9-4d08-ab65-bd9ef83607f8(bigdata-ozone1330.online/10.77.214.31)0e41e3ce-bc8d-4185-b956-d1d445b25cb9(bigdata-ozone1382.online/10.77.216.56),
 excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-25T07:23:12.013617185+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 4 entries, but the 
checksum chunks has 5 entries. They should be equal in size.
   2024-07-25 08:44:13,601 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 837314 locID: 
113750154862943175 bcsId: 0}, length=100663296, offset=0, token=null, 
pipeline=Pipeline[ Id: 8aeaf9ec-7537-4ad0-82d3-be7b5e7702aa, Nodes: 
6d856eab-b1ce-428e-b91c-b8eb950807ae(bigdata-ozone1354.online/10.77.215.15)1b29d271-631b-4703-9d63-a0b65bf30480(bigdata-ozone1258.online/10.77.211.57)447297bf-f236-42da-babd-6affdff5e845(bigdata-ozone1404.online/10.77.217.53)105617a7-5fa5-41fd-bd51-61b5200c3e1d(bigdata-ozone1695.online/10.77.232.12)dedf7b87-667b-4c84-b98b-94fb8bb8a2bc(bigdata-ozone1295.online/10.77.213.15)f76dda7e-d639-465c-a1ab-e9ef6ec4421c(bigdata-ozone1418.online/10.77.218.31)86e30199-e1b6-41ee-936a-c9c7638af580(bigdata-ozone1476.online/10.77.220.20)df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone1366.online/10.77.216.18),
 excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-25T08:44:04.377586980+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 16 entries, but the 
checksum chunks has 17 entries. They should be equal in size.
   2024-07-25 08:45:07,392 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 951070 locID: 
113750155030944419 bcsId: 0}, length=56623104, offset=0, token=null, 
pipeline=Pipeline[ Id: 7d0a854b-dbeb-4958-91fb-16c2dcf2fdfa, Nodes: 
a2ddcc2d-8aa9-4030-9121-af6f9426ac04(bigdata-ozone1249.online/10.77.211.38)cac84fc4-835b-4c49-9566-10b3b252b44f(bigdata-ozone1408.online/10.77.217.58)e6599058-c6cf-4d3c-b669-4155bedc6631(bigdata-ozone1455.online/10.77.219.38)cffa6746-ae46-4e5f-8f54-c37bebdb36d6(bigdata-ozone1712.online/10.77.234.53)93435586-39df-4e4b-88e6-f25e3d926bba(bigdata-ozone1329.online/10.77.214.20)795e2ef6-4f22-44de-aad0-78bae5a153b3(bigdata-ozone1513.online/10.77.221.52)4f5b5892-63f4-4cf8-b617-2fb26e9e0ef5(bigdata-ozone1480.online/10.77.220.34)7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone1316.online/10.77.213.57),
 excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-25T08:44:04.377586831+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 9 entries, but the 
checksum chunks has 10 entries. They should be equal in size.
   2024-07-25 08:55:26,974 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 932402 locID: 
113750154996121646 bcsId: 0}, length=18874368, offset=0, token=null, 
pipeline=Pipeline[ Id: 9a6bb8f1-8322-4846-990b-81d5b5de78e0, Nodes: 
4dc2a747-2be3-4a58-ab7c-3fbe504a1189(bigdata-ozone1502.online/10.77.221.20)fcd1cc61-b679-4853-87f6-be6cbfd94b32(bigdata-ozone1352.online/10.77.215.13)6c2057a9-061a-44c2-bd5e-b7e430f5257d(bigdata-ozone1334.online/10.77.214.35)fe24377f-91a9-403a-b654-1f7c8cd184e4(bigdata-ozone1663.online/10.77.230.37)33c6fb6a-6ed6-4246-b8b9-814309b59750(bigdata-ozone1255.online/10.77.211.54)5b5316b6-0be1-4ff6-8c2b-52c3cb53a702(bigdata-ozone1414.online/10.77.218.17)6ea28f7b-462f-4b35-994a-33d1d7d288de(bigdata-ozone1315.online/10.77.213.56)91662b72-aa0d-4087-a1e9-f71b55c4a64f(bigdata-ozone1468.online/10.77.220.11),
 excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-25T08:55:26.562523718+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 3 entries, but the 
checksum chunks has 4 entries. They should be equal in size.
   2024-07-25 09:15:06,692 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 925553 locID: 
113750154985174863 bcsId: 0}, length=56623104, offset=0, token=null, 
pipeline=Pipeline[ Id: 819c7cb8-a4f2-43a3-a2c1-c692c15f1c7f, Nodes: 
77eaf094-d67b-40cc-a1c0-5eff52292a22(bigdata-ozone1351.online/10.77.215.12)14e3d6be-dd8f-476d-90b9-043d37e8d735(bigdata-ozone1472.online/10.77.220.16)cf38675f-987b-47c2-baa3-e6afdd3884fe(bigdata-ozone1451.online/10.77.219.34)003c3831-ffd7-4d6a-8b8c-f71a4d01bf94(bigdata-ozone1291.online/10.77.213.11)d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone1425.online/10.77.218.38)c31aca19-69cc-4cb6-9776-c995ee3e3132(bigdata-ozone1509.online/10.77.221.38)cde3f346-7a0d-42e1-9c53-2084e71febfd(bigdata-ozone1381.online/10.77.216.55)c27c68ff-966c-4d55-a92c-52ce9214ca40(bigdata-ozone1343.online/10.77.214.54),
 excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-25T09:11:21.535920299+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 9 entries, but the 
checksum chunks has 10 entries. They should be equal in size.
   2024-07-25 09:27:51,679 | ERROR | DNAudit | user=null | ip=null | 
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 955780 locID: 
113750155037274967 bcsId: 0}, length=12582912, offset=0, token=null, 
pipeline=Pipeline[ Id: 8431e3cf-7812-49bc-89f9-d43e1e3cdcf0, Nodes: 
b98a0679-cab6-4f1e-98ee-a04d27f0f1e4(bigdata-ozone1506.online/10.77.221.34)41ba9acc-b97c-4e7a-b0df-d1d05daebefb(bigdata-ozone1807.online/10.77.223.60)19c5d2a7-1fee-4e6c-8395-ee6d09cfd86a(bigdata-ozone1332.online/10.77.214.33)7509ee25-be69-49bb-8d03-38457712f2e9(bigdata-ozone1296.online/10.77.213.16)94bfe561-11b8-40b4-89e7-f3279f6b73e6(bigdata-ozone1474.online/10.77.220.18)60fe53ee-de63-41c3-b5f0-613a6b55444c(bigdata-ozone1423.online/10.77.218.36)fcd1cc61-b679-4853-87f6-be6cbfd94b32(bigdata-ozone1352.online/10.77.215.13)3ccd78c7-047f-4a45-ac88-891936f29dfe(bigdata-ozone1363.online/10.77.216.15),
 excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:, 
CreationTimestamp2024-07-25T09:27:36.505353695+08:00[Asia/Shanghai]], createVersion=0, partNumber=0}} | ret=FAILURE | 
java.lang.IllegalArgumentException: The chunk list has 2 entries, but the 
checksum chunks has 3 entries. They should be equal in size.
   ```
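
   In every entry above, the checksum list is exactly one entry longer than 
the chunk list. As a rough sketch of how that difference could be pulled out 
of an audit message when scanning many entries, something like the following 
might work. `ChunkMismatchParser` and `checksumSurplus` are hypothetical names 
used only for illustration; they are not part of Ozone or of this PR.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the audit entries; not part of Ozone. */
final class ChunkMismatchParser {

  // Matches the exception text seen in the audit log. The whitespace
  // class tolerates a line wrap falling between "the" and "checksum".
  private static final Pattern MISMATCH = Pattern.compile(
      "The chunk list has (\\d+) entries, but the\\s+"
          + "checksum chunks has (\\d+) entries");

  private ChunkMismatchParser() { }

  /**
   * Returns (checksum entries - chunk entries) for one audit message, or
   * empty if the message does not contain the mismatch text.
   */
  static OptionalInt checksumSurplus(String auditMessage) {
    Matcher m = MISMATCH.matcher(auditMessage);
    if (!m.find()) {
      return OptionalInt.empty();
    }
    int chunks = Integer.parseInt(m.group(1));
    int checksums = Integer.parseInt(m.group(2));
    return OptionalInt.of(checksums - chunks);
  }
}
```

   A surplus of 1 on an entry is consistent with one chunk missing from the 
chunk list for that block.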
   
   After reviewing the error logs, we found that most issues were caused by a 
single block missing the final chunk, leading to reconstructed blocks not 
meeting expectations. After some discussion, we proposed a recovery strategy: 
if among the participating blocks, there is exactly one block with a different 
`blockgrouplength` from the others (where all other blocks have the same 
`blockgrouplength`), we will treat this inconsistent block as a lost block and 
use the remaining blocks to complete its recovery.
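
   The selection rule described above can be sketched roughly as follows. 
This is only an illustration of the idea, not the code in this PR: 
`OutlierBlockFinder`, `findSingleOutlier`, and the use of a plain `long[]` of 
block group lengths are assumptions made for the example, and the lengths in 
`main` are made-up values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper; not part of the Ozone code base. */
final class OutlierBlockFinder {

  private OutlierBlockFinder() { }

  /**
   * Returns the position of the single block whose block group length
   * differs from all the others, or empty if the blocks all agree or if
   * there is no single clear outlier.
   */
  static OptionalInt findSingleOutlier(long[] blockGroupLengths) {
    // Group block positions by the length each block reports.
    Map<Long, List<Integer>> positionsByLength = new HashMap<>();
    for (int i = 0; i < blockGroupLengths.length; i++) {
      positionsByLength
          .computeIfAbsent(blockGroupLengths[i], k -> new ArrayList<>())
          .add(i);
    }
    // Exactly two distinct lengths are needed: the shared value and the
    // value reported by the inconsistent block.
    if (positionsByLength.size() != 2) {
      return OptionalInt.empty();
    }
    List<Integer> single = null;
    for (List<Integer> group : positionsByLength.values()) {
      if (group.size() == 1) {
        if (single != null) {
          // Only two blocks in total, so neither is a majority.
          return OptionalInt.empty();
        }
        single = group;
      }
    }
    return single == null
        ? OptionalInt.empty()
        : OptionalInt.of(single.get(0));
  }

  public static void main(String[] args) {
    // Made-up lengths: eight participating blocks, one of them shorter.
    long full = 25165824L;
    long[] lengths = {full, full, full, full, full, full - 1048576L, full, full};
    OptionalInt outlier = findSingleOutlier(lengths);
    System.out.println(outlier.isPresent()
        ? "treat block at position " + outlier.getAsInt() + " as lost"
        : "no single outlier; do not apply the recovery strategy");
  }
}
```

   The sketch deliberately returns empty when more than one block disagrees, 
since the strategy only applies when exactly one block is inconsistent.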
   
   Overall, this process is idempotent: for EC rs-6-3-1024k, any 6 of the 9 
blocks in a group are enough to rebuild a missing one, so using 8 blocks to 
recover 1 lost block is equivalent to using 7 blocks to recover 1 lost block.
   
   
   
   
   
   

