Sammi Chen created RATIS-1411:
---------------------------------
Summary: Alleviate slow follower issue
Key: RATIS-1411
URL: https://issues.apache.org/jira/browse/RATIS-1411
Project: Ratis
Issue Type: Improvement
Reporter: Sammi Chen
Assignee: Sammi Chen
We observed a slow follower issue in our stress test. For example, when
intensively writing 1TB of data, the next_index of the leader and one follower
is over 1,000,000 (100w+), while the next_index of the slow follower is only
over 500,000 (50w+). This huge gap causes many WatchForCommit timeout
exceptions.
After rerunning the test and investigating, the Ozone stateMachineDataCache
turned out to be the key point. With stateMachineDataCache set to 1024 or
more, as soon as a majority (the leader and one follower) has committed the
index of a write request, that request's data is removed from
stateMachineDataCache. The leader then has to fetch that chunk of data from
the on-disk chunk file when the grpcLogAppender of the second follower wants
to send that write request out.
Reading from the chunk file is much more expensive than reading from the
cache. Once a follower cannot get the data from stateMachineDataCache, it
never catches up until the write finishes.
I tried using Guava Cache to replace the ResourceLimitCache (the
stateMachineDataCache). It didn't make an obvious difference since the cache
size is limited. Once the request at the follower's next_index has been
evicted from the cache, the follower becomes slower and slower.
Then I tried using the PriorityBlockingList to replace the
LinkedBlockingDeque in chunkExecutors, to put readStatemachine tasks ahead
of other blocks' write tasks and to execute tasks in entryIndex order.
Although readStatemachine tasks get priority to execute first, there are so
many of them that the overall effect is less than expected.
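
For illustration, the ordering described above (reads ahead of writes, then by
entryIndex) can be sketched with the JDK's PriorityBlockingQueue. This is only
a minimal sketch, not the code used in the experiment: the Task and Kind types
and the drainInPriorityOrder helper are hypothetical names, and it assumes the
priority structure behaves like a standard priority queue.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch: none of these names come from Ratis or Ozone.
final class PriorityTaskSketch {
    // Declaration order matters: READ_STATE_MACHINE sorts before WRITE_CHUNK.
    enum Kind { READ_STATE_MACHINE, WRITE_CHUNK }

    static final class Task {
        final Kind kind;
        final long entryIndex;

        Task(Kind kind, long entryIndex) {
            this.kind = kind;
            this.entryIndex = entryIndex;
        }
    }

    // Reads first, then lower log entry index first.
    static final Comparator<Task> ORDER =
        Comparator.comparing((Task t) -> t.kind)
                  .thenComparingLong(t -> t.entryIndex);

    // Puts all tasks in a priority queue and returns them in dequeue order.
    static List<Task> drainInPriorityOrder(List<Task> submitted) {
        PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>(16, ORDER);
        queue.addAll(submitted);
        List<Task> result = new ArrayList<>();
        Task next;
        while ((next = queue.poll()) != null) {
            result.add(next);
        }
        return result;
    }
}
```

Even with this ordering, every queued read still has to hit the disk, which
matches the observation that prioritizing reads alone does not close the gap.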
So the key to resolving the slow follower issue is to make sure that all of
its data stays in the cache as long as possible.
My solution is to set a threshold between the majority committed index and
the slow follower's committed index to guarantee that the data stays in the
cache. I used 0.75 as the ratio in my test. It worked very well: I wrote 2TB
of data to a 3 DN cluster, each DN with 10 HDDs, and the task finished in
40 minutes without any watchForCommit timeout.
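
One possible reading of the threshold check, as a minimal sketch: the slow
follower counts as too far behind when its index drops below ratio times the
majority committed index. The class and method names below are hypothetical,
and what the leader does when the check fails (for example delaying new writes
or keeping entries in the cache) is not specified above and is left out.

```java
// Hypothetical sketch of a ratio-based gap check; not the actual Ratis change.
final class FollowerGapSketch {
    private final double ratio;

    FollowerGapSketch(double ratio) {
        if (!(ratio > 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException("ratio must be in (0, 1]: " + ratio);
        }
        this.ratio = ratio;
    }

    // Assumption: "too far behind" means slowFollowerIndex < ratio * majorityIndex.
    boolean isTooFarBehind(long majorityCommitIndex, long slowFollowerIndex) {
        return slowFollowerIndex < ratio * majorityCommitIndex;
    }

    // Smallest follower index that still passes the check for a given majority index.
    long minFollowerIndex(long majorityCommitIndex) {
        return (long) Math.ceil(ratio * majorityCommitIndex);
    }
}
```

With a ratio of 0.75 and the indexes from the stress test (1,000,000 versus
500,000), this check would flag the slow follower, since it would need to be
at index 750,000 or higher to pass.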
--
This message was sent by Atlassian Jira
(v8.3.4#803005)