[ 
https://issues.apache.org/jira/browse/RATIS-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuuka updated RATIS-2147:
-------------------------
    Description: 
We encountered an MD5 mismatch in IoTDB. After several rounds of
investigation, we found that the MD5 digest object (the digester) had been
contaminated.

We have verified that this is not a network or disk problem.

In the implementation, a received snapshot is first written to a temporary
file. When an MD5 mismatch occurred, we read the data back from this temporary
file and computed its MD5 with a new digest. The result of that computation
matched the MD5 hash value that was sent.
!image-2024-09-03-10-35-28-617.png!
 
!image-2024-09-03-10-35-08-315.png!
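
A minimal sketch of the recheck described above, assuming the temporary file
and the expected MD5 bytes are both at hand. The class and method names are
hypothetical and are not taken from Ratis or IoTDB; only the JDK
MessageDigest API is used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
final class SnapshotMd5Check {
  private SnapshotMd5Check() {}

  /** Computes the MD5 of a file with a brand-new MessageDigest instance. */
  static byte[] md5OfFile(Path file) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("MD5");
    byte[] buffer = new byte[8192];
    try (InputStream in = Files.newInputStream(file)) {
      int n;
      while ((n = in.read(buffer)) != -1) {
        digest.update(buffer, 0, n);
      }
    }
    return digest.digest();
  }

  /** Returns true if the file's freshly computed MD5 equals the MD5 that was sent. */
  static boolean matchesExpected(Path tmpFile, byte[] expectedMd5)
      throws IOException, NoSuchAlgorithmException {
    return MessageDigest.isEqual(md5OfFile(tmpFile), expectedMd5);
  }
}
```

If matchesExpected returns true while the long-lived digest reported a
mismatch, the data on disk is intact and the long-lived digest is the suspect.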
 
 
We used the name of the saved corrupt file to locate the relevant log, taking
tlog.txt.snapshot.snapshot.corrupt20240831-094107_735 as an example.
!https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=MDhjNDQ1OWY5NGVlM2YzYTEwOWE1ZWU5MDlmZjNmMmRfTHE1T3lFSnllTFR6Mm5Pc2oyQUpsWUxJTmM4SEhodVBfVG9rZW46RHJlbmJHQlRkb2daakp4RHZMVWNEOVFPbmhiXzE3MjUzODYwMzQ6MTcyNTM4OTYzNF9WNA!

Before the corruption was reported, the sender had sent several consecutive
snapshot installation requests to the receiver.

The receiver processed some of these requests successfully, then encountered a
request that was judged corrupt, and printed "recompute again" as it started
recalculating.

After the recalculation, the ERROR log for the rename is printed, and the data
is read back from the file and compared with the received chunk data.

If any byte did not match, a corresponding message would be printed. No such
message was printed, which means that the content written to disk is identical
to the content that was sent.
!https://timechor.feishu.cn/space/api/box/stream/download/asynccode/?code=ZDQ3NmJhNWZiYjEyYjU1MWYxOGI3MTFjNjNjMjAyMmJfUnAwMjB5dloxODlGRG52RFdZUTBCSUc0NjBPaWc3VXdfVG9rZW46TUxFeGJxTjBqbzIxNUx4eUZrUGNHMk55bjhkXzE3MjUzODYwNjA6MTcyNTM4OTY2MF9WNA!
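
The byte-by-byte comparison can be sketched roughly as follows. This is an
illustration under assumptions, not the actual debugging code: the class name,
method name, and the offset parameter are hypothetical.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class ChunkCompare {
  private ChunkCompare() {}

  /**
   * Compares the file region starting at offset with the received chunk.
   * Returns the file position of the first differing byte, or -1 if the
   * region and the chunk are identical.
   */
  static long firstMismatch(Path file, long offset, byte[] chunk) throws IOException {
    byte[] onDisk = new byte[chunk.length];
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      raf.seek(offset);
      raf.readFully(onDisk); // throws EOFException if the file is too short
    }
    for (int i = 0; i < chunk.length; i++) {
      if (onDisk[i] != chunk[i]) {
        return offset + i;
      }
    }
    return -1;
  }
}
```

A return value of -1 for every chunk corresponds to the case observed here,
where no mismatch message was logged.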

This makes the problem clear: the fault lies in the MD5 calculation class. The
reasoning is as follows:

    If a byte in the middle of the data had been corrupted by the network, the
MD5 recomputed from the file and the hash that was sent would have to differ.

    If the part that carries the hash value had been corrupted, the recomputed
MD5 and the hash would also differ.

In our case the recomputed MD5 equals the hash that was sent, so neither the
data nor the hash was damaged in transit.

 
I suggest creating a new digest each time a follower receives a snapshot, so
that a contaminated digest cannot affect later snapshots. Under normal network
and disk conditions, a corrupt snapshot should then not be reported.
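
To illustrate the suggestion, here is a small sketch using the JDK
MessageDigest API. It assumes one possible way a reused digest could become
contaminated: bytes from an earlier, unfinished transfer are left in it. The
actual mechanism in Ratis is not identified in this report, and the class and
method names below are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo, for illustration only.
final class DigestReuseDemo {
  private DigestReuseDemo() {}

  /** Receiver that keeps a single digest for its whole lifetime. */
  static final class SharedDigestReceiver {
    private final MessageDigest digest;

    SharedDigestReceiver() throws NoSuchAlgorithmException {
      this.digest = MessageDigest.getInstance("MD5");
    }

    /** Feeds a chunk into the long-lived digest. */
    void receiveChunk(byte[] chunk) {
      digest.update(chunk);
    }

    /** Returns the MD5 of everything fed so far; digest() also resets the state. */
    byte[] finish() {
      return digest.digest();
    }
  }

  /** Computes the MD5 of one snapshot's chunks with a digest created just for it. */
  static byte[] md5WithFreshDigest(byte[]... chunks) throws NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("MD5");
    for (byte[] chunk : chunks) {
      digest.update(chunk);
    }
    return digest.digest();
  }
}
```

With the shared receiver, any chunk fed before the current snapshot and never
finished changes the result for that snapshot, whereas md5WithFreshDigest
depends only on the chunks passed to it.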


> MD5 mismatch when accept snapshot
> ---------------------------------
>
>                 Key: RATIS-2147
>                 URL: https://issues.apache.org/jira/browse/RATIS-2147
>             Project: Ratis
>          Issue Type: Bug
>          Components: snapshot
>    Affects Versions: 3.1.0, 3.2.0
>            Reporter: yuuka
>            Priority: Major
>         Attachments: image-2024-09-03-10-35-08-315.png, 
> image-2024-09-03-10-35-28-617.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
