slfan1989 commented on PR #7009:
URL: https://github.com/apache/ozone/pull/7009#issuecomment-2274643856
> > This section of code selects the minimum BlockGroupLength, thus choosing
> > the BlockGroupLength of DataBlock3, leading to the recovery of an incorrect
> > data block.
>
> I am not sure this statement is correct.
>
> When a stripe is written, the client must wait for all DNs to ack the
> write, indicating it was successfully saved. If any of the DNs do not return
> successfully, the stripe is abandoned by the client, and a new block is
> requested and the stripe is written to the new block again. This may duplicate
> some data.
>
> To know for sure if this is happening, you need to look at the block
> length stored in OM for this block and see if it aligns with 2 stripes (as
> stated by DN 3) or is greater than 3 stripes, which would indicate that the
> "failed write handling" on the client did not do the correct thing.
> Unfortunately, given the block ID, it's not easy to find the key it is
> associated with. I believe it can be done via Recon.

Thank you very much for your reply! A lot of data in our online cluster is
currently in this situation. It seems that our client is not behaving as
expected. I added audit logs on a specific DN and found a large number of
failed reconstructions.
> cat dn-audit.log|grep "RECOVER_EC_BLOCK"|grep "FAIL"
353
> example:
```
2024-07-25 07:25:15,830 | ERROR | DNAudit | user=null | ip=null |
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 951772 locID:
113750155032021583 bcsId: 0}, length=25165824, offset=0, token=null,
pipeline=Pipeline[ Id: cc205dc9-49a1-4c07-92b9-0c27504268b6, Nodes:
62fa9baf-6002-4da9-bb06-eaf3e9549774(bigdata-ozone1431.online/10.77.218.54)d6b32839-fd22-4ee9-b1ef-0a2f7ae5e8d7(bigdata-ozone1302.online/10.77.213.32)8d7e54be-c94c-4bfd-ba11-4b618a1ca332(bigdata-ozone1718.online/10.77.233.41)e6599058-c6cf-4d3c-b669-4155bedc6631(bigdata-ozone1455.online/10.77.219.38)94bfe561-11b8-40b4-89e7-f3279f6b73e6(bigdata-ozone1474.online/10.77.220.18)f4bd3569-fbb4-40fa-a792-863a34df2cb4(bigdata-ozone1245.online/10.77.211.34)c1e06adb-46d9-4d08-ab65-bd9ef83607f8(bigdata-ozone1330.online/10.77.214.31)0e41e3ce-bc8d-4185-b956-d1d445b25cb9(bigdata-ozone1382.online/10.77.216.56),
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:,
CreationTimestamp2024-07-25T07:23:12.013617185+08:00[Asia/Shanghai]],
createVersion=0, partNumber=0}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 4 entries, but the
checksum chunks has 5 entries. They should be equal in size.
2024-07-25 08:44:13,601 | ERROR | DNAudit | user=null | ip=null |
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 837314 locID:
113750154862943175 bcsId: 0}, length=100663296, offset=0, token=null,
pipeline=Pipeline[ Id: 8aeaf9ec-7537-4ad0-82d3-be7b5e7702aa, Nodes:
6d856eab-b1ce-428e-b91c-b8eb950807ae(bigdata-ozone1354.online/10.77.215.15)1b29d271-631b-4703-9d63-a0b65bf30480(bigdata-ozone1258.online/10.77.211.57)447297bf-f236-42da-babd-6affdff5e845(bigdata-ozone1404.online/10.77.217.53)105617a7-5fa5-41fd-bd51-61b5200c3e1d(bigdata-ozone1695.online/10.77.232.12)dedf7b87-667b-4c84-b98b-94fb8bb8a2bc(bigdata-ozone1295.online/10.77.213.15)f76dda7e-d639-465c-a1ab-e9ef6ec4421c(bigdata-ozone1418.online/10.77.218.31)86e30199-e1b6-41ee-936a-c9c7638af580(bigdata-ozone1476.online/10.77.220.20)df941469-8358-402a-8600-0d3f508f9cda(bigdata-ozone1366.online/10.77.216.18),
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:,
CreationTimestamp2024-07-25T08:44:04.377586980+08:00[Asia/Shanghai]],
createVersion=0, partNumber=0}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 16 entries, but the
checksum chunks has 17 entries. They should be equal in size.
2024-07-25 08:45:07,392 | ERROR | DNAudit | user=null | ip=null |
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 951070 locID:
113750155030944419 bcsId: 0}, length=56623104, offset=0, token=null,
pipeline=Pipeline[ Id: 7d0a854b-dbeb-4958-91fb-16c2dcf2fdfa, Nodes:
a2ddcc2d-8aa9-4030-9121-af6f9426ac04(bigdata-ozone1249.online/10.77.211.38)cac84fc4-835b-4c49-9566-10b3b252b44f(bigdata-ozone1408.online/10.77.217.58)e6599058-c6cf-4d3c-b669-4155bedc6631(bigdata-ozone1455.online/10.77.219.38)cffa6746-ae46-4e5f-8f54-c37bebdb36d6(bigdata-ozone1712.online/10.77.234.53)93435586-39df-4e4b-88e6-f25e3d926bba(bigdata-ozone1329.online/10.77.214.20)795e2ef6-4f22-44de-aad0-78bae5a153b3(bigdata-ozone1513.online/10.77.221.52)4f5b5892-63f4-4cf8-b617-2fb26e9e0ef5(bigdata-ozone1480.online/10.77.220.34)7c8f10a6-8027-488c-b187-8e4b3afadce3(bigdata-ozone1316.online/10.77.213.57),
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:,
CreationTimestamp2024-07-25T08:44:04.377586831+08:00[Asia/Shanghai]],
createVersion=0, partNumber=0}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 9 entries, but the
checksum chunks has 10 entries. They should be equal in size.
2024-07-25 08:55:26,974 | ERROR | DNAudit | user=null | ip=null |
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 932402 locID:
113750154996121646 bcsId: 0}, length=18874368, offset=0, token=null,
pipeline=Pipeline[ Id: 9a6bb8f1-8322-4846-990b-81d5b5de78e0, Nodes:
4dc2a747-2be3-4a58-ab7c-3fbe504a1189(bigdata-ozone1502.online/10.77.221.20)fcd1cc61-b679-4853-87f6-be6cbfd94b32(bigdata-ozone1352.online/10.77.215.13)6c2057a9-061a-44c2-bd5e-b7e430f5257d(bigdata-ozone1334.online/10.77.214.35)fe24377f-91a9-403a-b654-1f7c8cd184e4(bigdata-ozone1663.online/10.77.230.37)33c6fb6a-6ed6-4246-b8b9-814309b59750(bigdata-ozone1255.online/10.77.211.54)5b5316b6-0be1-4ff6-8c2b-52c3cb53a702(bigdata-ozone1414.online/10.77.218.17)6ea28f7b-462f-4b35-994a-33d1d7d288de(bigdata-ozone1315.online/10.77.213.56)91662b72-aa0d-4087-a1e9-f71b55c4a64f(bigdata-ozone1468.online/10.77.220.11),
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:,
CreationTimestamp2024-07-25T08:55:26.562523718+08:00[Asia/Shanghai]],
createVersion=0, partNumber=0}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 3 entries, but the
checksum chunks has 4 entries. They should be equal in size.
2024-07-25 09:15:06,692 | ERROR | DNAudit | user=null | ip=null |
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 925553 locID:
113750154985174863 bcsId: 0}, length=56623104, offset=0, token=null,
pipeline=Pipeline[ Id: 819c7cb8-a4f2-43a3-a2c1-c692c15f1c7f, Nodes:
77eaf094-d67b-40cc-a1c0-5eff52292a22(bigdata-ozone1351.online/10.77.215.12)14e3d6be-dd8f-476d-90b9-043d37e8d735(bigdata-ozone1472.online/10.77.220.16)cf38675f-987b-47c2-baa3-e6afdd3884fe(bigdata-ozone1451.online/10.77.219.34)003c3831-ffd7-4d6a-8b8c-f71a4d01bf94(bigdata-ozone1291.online/10.77.213.11)d8f3179c-7629-48f2-9030-45a89de389ab(bigdata-ozone1425.online/10.77.218.38)c31aca19-69cc-4cb6-9776-c995ee3e3132(bigdata-ozone1509.online/10.77.221.38)cde3f346-7a0d-42e1-9c53-2084e71febfd(bigdata-ozone1381.online/10.77.216.55)c27c68ff-966c-4d55-a92c-52ce9214ca40(bigdata-ozone1343.online/10.77.214.54),
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:,
CreationTimestamp2024-07-25T09:11:21.535920299+08:00[Asia/Shanghai]],
createVersion=0, partNumber=0}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 9 entries, but the
checksum chunks has 10 entries. They should be equal in size.
2024-07-25 09:27:51,679 | ERROR | DNAudit | user=null | ip=null |
op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 955780 locID:
113750155037274967 bcsId: 0}, length=12582912, offset=0, token=null,
pipeline=Pipeline[ Id: 8431e3cf-7812-49bc-89f9-d43e1e3cdcf0, Nodes:
b98a0679-cab6-4f1e-98ee-a04d27f0f1e4(bigdata-ozone1506.online/10.77.221.34)41ba9acc-b97c-4e7a-b0df-d1d05daebefb(bigdata-ozone1807.online/10.77.223.60)19c5d2a7-1fee-4e6c-8395-ee6d09cfd86a(bigdata-ozone1332.online/10.77.214.33)7509ee25-be69-49bb-8d03-38457712f2e9(bigdata-ozone1296.online/10.77.213.16)94bfe561-11b8-40b4-89e7-f3279f6b73e6(bigdata-ozone1474.online/10.77.220.18)60fe53ee-de63-41c3-b5f0-613a6b55444c(bigdata-ozone1423.online/10.77.218.36)fcd1cc61-b679-4853-87f6-be6cbfd94b32(bigdata-ozone1352.online/10.77.215.13)3ccd78c7-047f-4a45-ac88-891936f29dfe(bigdata-ozone1363.online/10.77.216.15),
excludedSet: , ReplicationConfig: EC{rs-6-3-1024k}, State:CLOSED, leaderId:,
CreationTimestamp2024-07-25T09:27:36.505353695+08:00[Asia/Shanghai]],
createVersion=0, partNumber=0}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 2 entries, but the
checksum chunks has 3 entries. They should be equal in size.
```
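
As a rough cross-check of the entries above against the "does the length align with whole stripes" question, here is a minimal Python sketch. It assumes that `rs-6-3-1024k` means 6 data chunks of 1 MiB per stripe and that the logged `length` counts data bytes only; both are assumptions, not something confirmed in this thread. `summarize` and `SAMPLE` are hypothetical names, and `SAMPLE` is the first entry above with the pipeline details cut out.

```python
import re

# Assumption: rs-6-3-1024k = 6 data chunks of 1024 KiB per stripe, and the
# logged block length counts data bytes only (parity excluded).
DATA_CHUNKS = 6
CHUNK_BYTES = 1024 * 1024
STRIPE_DATA_BYTES = DATA_CHUNKS * CHUNK_BYTES

# Pulls the block length and the two entry counts out of one audit entry.
MISMATCH_RE = re.compile(
    r"length=(\d+).*?"
    r"chunk list has (\d+) entries.*?"
    r"checksum chunks has (\d+) entries"
)


def summarize(entry):
    """Relate the block length of one failed entry to its chunk counts."""
    # The entries are wrapped across lines, so collapse whitespace first.
    flat = " ".join(entry.split())
    match = MISMATCH_RE.search(flat)
    if match is None:
        raise ValueError("not a chunk/checksum mismatch entry")
    length, chunks, checksums = (int(g) for g in match.groups())
    full_stripes, remainder = divmod(length, STRIPE_DATA_BYTES)
    return {
        "length": length,
        "full_stripes": full_stripes,
        "remainder_bytes": remainder,
        "chunk_entries": chunks,
        "checksum_entries": checksums,
    }


# First entry from the log above, abbreviated (pipeline details removed).
SAMPLE = """op=RECOVER_EC_BLOCK {blockLocationInfo={blockID={conID: 951772 locID:
113750155032021583 bcsId: 0}, length=25165824, offset=0, token=null,
ReplicationConfig: EC{rs-6-3-1024k}} | ret=FAILURE |
java.lang.IllegalArgumentException: The chunk list has 4 entries, but the
checksum chunks has 5 entries. They should be equal in size."""

print(summarize(SAMPLE))
```

Under those assumptions, each of the six lengths shown above is an exact multiple of 6 MiB, the number of whole stripes equals the "chunk list" count, and the "checksum chunks" count is one higher. This is only arithmetic on the logged values and does not by itself show which side is wrong.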