[
https://issues.apache.org/jira/browse/RATIS-1074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Glen Geng updated RATIS-1074:
-----------------------------
Description:
GrpcLogAppender improperly decrease nextIndex to 1, which will trigger
installSnapshot request to followers, and make pipeline be vulnerable.
*Below is the interaction between leader and the follower.*
Follower side, received an install snapshot request, and closed the pipeline.
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
[grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: receive
installSnapshot:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb->591f38ce-0b11-4003-8528-7a5802f89ea8#0-t514,notify:(t:392,
i:992920)
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
[grpc-default-executor-628] INFO
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
Leader change notification received for group: group-622E07D885D8 with new
leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
[grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: change Leader from
null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 514 for installSnapshot,
leader elected after 50593ms
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
[grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: notifyInstallSnapshot:
nextIndex is 2248004 but the leader's first available index is 992920.
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
[grpc-default-executor-628] WARN
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
Install snapshot notification received from Leader with termIndex: (t:392,
i:992920), terminating pipeline: group-622E07D885D8
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
[grpc-default-executor-628] ERROR
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
pipeline Action CLOSE on pipeline
PipelineID=433c67af-f834-4959-bcf7-622e07d885d8.Reason :
591f38ce-0b11-4003-8528-7a5802f89ea8 closes pipeline when installSnapshot from
leader because leader snapshot doesn't contain any data to replay, all the log
entries prior to the snapshot might have been purged.So follower should not try
to install snapshot from leader butcan close the pipeline here. It's in
follower state for 1761ms
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
[grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: StateMachine
successfully installed snapshot index 992920. Reloading the StateMachine.
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
[grpc-default-executor-628] INFO
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
flushIndex: setUnconditionally 2248003 -> 992920
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
[grpc-default-executor-628] INFO
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
safeCacheEvictIndex: setUnconditionally 2248001 -> 992920
ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
[grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: reply installSnapshot:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-591f38ce-0b11-4003-8528-7a5802f89ea8#0:FAIL-t514,IN_PROGRESS
Leader side, just take charge, got vote from another follower, but miss vote
from this follower
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:21,499
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.impl.LeaderElection:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143:
begin an election at term 510 for 2248004:
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0,
591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1,
e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.impl.LeaderElection:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143:
Election PASSED; received 1 response(s)
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-e5596d61-c021-4a4b-89bc-6d8e6d4f1baf#0:OK-t510]
and 0 exception(s);
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8:t510, leader=null,
voted=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb,
raftlog=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-SegmentedRaftLog:OPENED:c2248003,f2248004,i2248004,
conf=2248004: [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0,
591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1,
e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.impl.RoleInfo:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb: shutdown LeaderElection
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.impl.RaftServerImpl:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: changes role from
CANDIDATE to LEADER at term 510 for changeToLeader
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO
org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
Leader change notification received for group: group-622E07D885D8 with new
leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.impl.RaftServerImpl:
6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: change Leader from
null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 510 for becomeLeader,
leader elected after 20419ms
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.RaftServerConfigKeys:
raft.server.staging.catchup.gap = 1000 (default)
ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
[6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
INFO org.apache.ratis.server.RaftServerConfigKeys: raft.server.rpc.sleep.time =
25ms (default)
Due to Grpc exception, leader decrease next index of follower to 1. Since
leader has purged log, it has no choice but send out an install snapsho
was:
GrpcLogAppender improperly decrease nextIndex to 1, which make pipeline to be
vulnerable.
> GrpcLogAppender improperly decrease nextIndex to 1, which will trigger
> installSnapshot request to followers.
> -------------------------------------------------------------------------------------------------------------
>
> Key: RATIS-1074
> URL: https://issues.apache.org/jira/browse/RATIS-1074
> Project: Ratis
> Issue Type: Bug
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Major
> Labels: MultiRaft, dense-storage
> Fix For: 1.1.0
>
>
> GrpcLogAppender improperly decrease nextIndex to 1, which will trigger
> installSnapshot request to followers, and make pipeline be vulnerable.
>
> *Below is the interaction between leader and the follower.*
> Follower side, received an install snapshot request, and closed the pipeline.
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: receive
> installSnapshot:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb->591f38ce-0b11-4003-8528-7a5802f89ea8#0-t514,notify:(t:392,
> i:992920)
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Leader change notification received for group: group-622E07D885D8 with new
> leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: change Leader from
> null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 514 for installSnapshot,
> leader elected after 50593ms
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8:
> notifyInstallSnapshot: nextIndex is 2248004 but the leader's first available
> index is 992920.
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,719
> [grpc-default-executor-628] WARN
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Install snapshot notification received from Leader with termIndex: (t:392,
> i:992920), terminating pipeline: group-622E07D885D8
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] ERROR
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> pipeline Action CLOSE on pipeline
> PipelineID=433c67af-f834-4959-bcf7-622e07d885d8.Reason :
> 591f38ce-0b11-4003-8528-7a5802f89ea8 closes pipeline when installSnapshot
> from leader because leader snapshot doesn't contain any data to replay, all
> the log entries prior to the snapshot might have been purged.So follower
> should not try to install snapshot from leader butcan close the pipeline
> here. It's in follower state for 1761ms
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: StateMachine
> successfully installed snapshot index 992920. Reloading the StateMachine.
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
> flushIndex: setUnconditionally 2248003 -> 992920
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8-SegmentedRaftLogWorker:
> safeCacheEvictIndex: setUnconditionally 2248001 -> 992920
> ozone-root-datanode-172.16.90.69.log.9:2020-09-21 19:28:49,720
> [grpc-default-executor-628] INFO org.apache.ratis.server.impl.RaftServerImpl:
> 591f38ce-0b11-4003-8528-7a5802f89ea8@group-622E07D885D8: reply
> installSnapshot:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-591f38ce-0b11-4003-8528-7a5802f89ea8#0:FAIL-t514,IN_PROGRESS
>
> Leader side, just take charge, got vote from another follower, but miss vote
> from this follower
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:21,499
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.LeaderElection:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143:
> begin an election at term 510 for 2248004:
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0,
> 591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1,
> e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.LeaderElection:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143:
> Election PASSED; received 1 response(s)
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb<-e5596d61-c021-4a4b-89bc-6d8e6d4f1baf#0:OK-t510]
> and 0 exception(s);
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8:t510, leader=null,
> voted=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb,
> raftlog=6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-SegmentedRaftLog:OPENED:c2248003,f2248004,i2248004,
> conf=2248004: [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb:172.16.90.54:9858:0,
> 591f38ce-0b11-4003-8528-7a5802f89ea8:172.16.90.69:9858:1,
> e5596d61-c021-4a4b-89bc-6d8e6d4f1baf:172.16.90.56:9858:0], old=null
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.RoleInfo:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb: shutdown LeaderElection
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.RaftServerImpl:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: changes role from
> CANDIDATE to LEADER at term 510 for changeToLeader
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,580
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
> Leader change notification received for group: group-622E07D885D8 with new
> leaderId: 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.impl.RaftServerImpl:
> 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8: change Leader from
> null to 6769b0ef-fb4a-4235-bcf0-f1623fcd74cb at term 510 for becomeLeader,
> leader elected after 20419ms
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.RaftServerConfigKeys:
> raft.server.staging.catchup.gap = 1000 (default)
> ozone-root-datanode-172.16.90.54.log.3:2020-09-21 19:28:26,588
> [6769b0ef-fb4a-4235-bcf0-f1623fcd74cb@group-622E07D885D8-LeaderElection143]
> INFO org.apache.ratis.server.RaftServerConfigKeys: raft.server.rpc.sleep.time
> = 25ms (default)
>
> Due to Grpc exception, leader decrease next index of follower to 1. Since
> leader has purged log, it has no choice but send out an install snapsho
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)