[ 
https://issues.apache.org/jira/browse/RATIS-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545755#comment-17545755
 ] 

Song Ziyang commented on RATIS-1587:
------------------------------------

I tried naming the tmp dir 'snapshot-' + request.uuid(), and *as expected, 
all snapshot files are placed in the tmp dir and then moved to the sm dir.*
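
A minimal sketch of that naming change, to make the idea concrete. This is 
not the Ratis code: the class SnapshotTmpDirSketch and the methods tmpDirFor 
and receiveChunk are hypothetical names used only for illustration, and the 
request id is passed in as a plain string.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not Ratis code: every chunk of one install request
// lands in a single tmp dir named after the shared request id.
class SnapshotTmpDirSketch {
  // Same request id -> same directory for the whole chunk sequence.
  static Path tmpDirFor(Path tmpRoot, String requestId) {
    return tmpRoot.resolve("snapshot-" + requestId);
  }

  // Append one chunk to its file in the tmp dir. On the last chunk (done),
  // move every received file into the state machine dir.
  static void receiveChunk(Path tmpRoot, Path smDir, String requestId,
      String fileName, byte[] data, boolean done) throws IOException {
    Path tmpDir = tmpDirFor(tmpRoot, requestId);
    Files.createDirectories(tmpDir);
    Files.write(tmpDir.resolve(fileName), data,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    if (!done) {
      return;
    }
    Files.createDirectories(smDir);
    List<Path> received;
    try (Stream<Path> files = Files.list(tmpDir)) {
      received = files.collect(Collectors.toList());
    }
    for (Path f : received) {
      Files.move(f, smDir.resolve(f.getFileName().toString()),
          StandardCopyOption.REPLACE_EXISTING);
    }
    Files.delete(tmpDir); // empty once all files have been moved
  }
}
```

With a random dir per RPC, only the file written by the final request would 
be visible at move time. With the shared request id, earlier chunks are 
still in the same dir when the last one arrives.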
 
However, I encountered a new bug. The *follower successfully installed this 
snapshot and replied*, but the *leader received a reply with the RST_STREAM 
CANCEL error code*. The leader assumes this installSnapshot failed, but the 
follower has already updated its term and index. Subsequent RPCs from the 
leader to the follower will fail.
 
Details:
*My Scenario:* Originally the Raft group contained 1 member. I added a new 
member to this group and triggered the InstallSnapshot process from the 
original leader to the new follower. The snapshot contains two files. The 
follower received two InstallSnapshot RPCs and replied SUCCESS to both. The 
leader received the first SUCCESS reply from the follower, but then received 
RST_STREAM CANCEL as the result of the second reply.
 
Logs are attached.
[^log.txt]
 

> InstallSnapshot fails when snapshot has multiple chunks
> -------------------------------------------------------
>
>                 Key: RATIS-1587
>                 URL: https://issues.apache.org/jira/browse/RATIS-1587
>             Project: Ratis
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 2.3.0
>            Reporter: Song Ziyang
>            Priority: Major
>         Attachments: image-2022-05-31-08-56-35-373.png, 
> image-2022-05-31-08-58-44-717.png, image-2022-05-31-09-00-41-286.png
>
>
> *Bug Description*
> Currently, when a follower installs a snapshot from the leader, the leader 
> divides the snapshot files into a sequence of fixed-size chunks (16MB) and 
> sends each one through an rpc.
> !image-2022-05-31-08-56-35-373.png!
> Only the last rpc request in the sequence is tagged with 'Done'.
> !image-2022-05-31-08-58-44-717.png!
> However, when the follower handles this sequence of rpcs, it creates *a 
> random temp dir for each rpc request, stores the chunk in it, and only moves 
> the last chunk from the tmp dir to the sm dir.*
> *!image-2022-05-31-09-00-41-286.png!*
> Thus, when a snapshot contains multiple files or a single file larger than 
> 16MB, InstallSnapshot fails because only the last chunk is stored in the /sm 
> dir and the others are left behind in many tmp dirs.
>  
> *How To Fix*
> Instead of using a random uuid to name the tmp dir every time, it is 
> possible to use the *request-uuid* to name the tmp dir. The request-uuid is 
> generated by the leader once and is shared among the sequence of requests.
>  
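
To illustrate why a snapshot with several files, or one file over 16MB, 
turns into more than one request, here is a rough model of the chunking 
described in the quoted issue. It is an assumption-laden sketch, not the 
Ratis implementation: ChunkPlanSketch, Chunk, and plan are hypothetical 
names, and only the 16MB chunk size and the 'Done' flag on the last request 
come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the chunking described in the issue; not Ratis code.
class ChunkPlanSketch {
  static final long CHUNK_SIZE = 16L * 1024 * 1024; // 16MB, per the issue

  static final class Chunk {
    final String file;
    final long offset;
    final long length;
    final boolean done; // true only for the last chunk of the whole sequence

    Chunk(String file, long offset, long length, boolean done) {
      this.file = file;
      this.offset = offset;
      this.length = length;
      this.done = done;
    }
  }

  // Split each file into chunks of at most CHUNK_SIZE bytes, in file order.
  static List<Chunk> plan(String[] names, long[] sizes) {
    List<Chunk> chunks = new ArrayList<>();
    for (int i = 0; i < names.length; i++) {
      long offset = 0;
      do {
        long length = Math.min(CHUNK_SIZE, sizes[i] - offset);
        boolean lastOfFile = offset + length >= sizes[i];
        boolean done = lastOfFile && i == names.length - 1;
        chunks.add(new Chunk(names[i], offset, length, done));
        offset += length;
      } while (offset < sizes[i]);
    }
    return chunks;
  }
}
```

In this model a 40MB file followed by a 1-byte file yields four chunks, and 
only the fourth carries done. Any receiver that keeps just the data from 
the done request would lose the first three.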



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
