Sammi Chen created RATIS-1411:
---------------------------------

             Summary: Alleviate slow follower issue
                 Key: RATIS-1411
                 URL: https://issues.apache.org/jira/browse/RATIS-1411
             Project: Ratis
          Issue Type: Improvement
            Reporter: Sammi Chen
            Assignee: Sammi Chen


We observed a slow follower issue in our stress test.  For example, when 
intensively writing 1TB of data, the next_index of the leader and one follower 
is above 1,000,000 (100w+), while the next_index of the slow follower is only 
above 500,000 (50w+).  This huge gap causes a lot of WatchForCommit timeout 
exceptions. 

After rerunning the test and investigating, the Ozone stateMachineDataCache 
turned out to be the key point.  With stateMachineDataCache set to 1024 or 
more, as soon as a majority (the leader and one follower) has committed the 
index of a write request, the write request data is removed from 
stateMachineDataCache.  The leader then has to fetch that chunk of data from 
the on-disk chunk file when the grpcLogAppender of the second follower wants 
to send that write request out. 

Reading from the chunk file is much more expensive than reading from the 
cache.  Once a follower can no longer get the data from 
stateMachineDataCache, it never catches up until the write finishes. 
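
The eviction behavior described above can be modeled roughly as follows. This 
is only an illustrative sketch, not the Ozone code: the class 
MajorityEvictingCache and its methods are hypothetical names, and the disk 
read is a placeholder that only counts how often the slow path is taken. 

```java
import java.util.TreeMap;

// Hypothetical model of a cache that drops write data once a majority commits.
final class MajorityEvictingCache {
    private final TreeMap<Long, byte[]> data = new TreeMap<>();
    private long diskReads = 0;

    // Leader caches the write request data for a log index.
    void put(long index, byte[] chunk) {
        data.put(index, chunk);
    }

    // Called when the leader and one follower have committed up to this index:
    // everything at or below it is removed, regardless of the slow follower.
    void onMajorityCommitted(long committedIndex) {
        data.headMap(committedIndex, true).clear();
    }

    // Called by the log appender of a follower that needs the entry at index.
    byte[] readForFollower(long index) {
        byte[] cached = data.get(index);
        if (cached != null) {
            return cached;          // fast path: served from memory
        }
        diskReads++;                // slow path: entry was already evicted
        return readFromDisk(index);
    }

    // Placeholder for the expensive on-disk chunk file read.
    private byte[] readFromDisk(long index) {
        return new byte[0];
    }

    long diskReads() {
        return diskReads;
    }
}
```

In this model, a follower whose next_index is behind the majority committed 
index takes the slow path for every entry it requests. 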

I tried using Guava Cache to replace the ResourceLimitCache (the 
stateMachineDataCache).  It doesn't make an obvious difference since the cache 
size is limited.  As soon as the request at the follower's next_index is 
evicted from the cache, the follower becomes slower and slower. 

Then I tried using the PriorityBlockingList to replace the 
LinkedBlockingDeque in chunkExecutors, to put the readStatemachine tasks ahead 
of other blocks' write tasks and execute the tasks in entryIndex order.  
Although the readStatemachine tasks get priority and execute first, there are 
so many of them that the overall effect is less than expected. 
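
The ordering described above can be sketched with a standard 
java.util.concurrent.PriorityBlockingQueue. This is an assumption about the 
general approach, not the actual patch: ChunkTask, its Kind enum and 
drainOrder are hypothetical names used only for illustration. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical task descriptor: reads sort before writes, then by entryIndex.
final class ChunkTask implements Comparable<ChunkTask> {
    enum Kind { READ_STATE_MACHINE, WRITE_CHUNK }

    final Kind kind;
    final long entryIndex;

    ChunkTask(Kind kind, long entryIndex) {
        this.kind = kind;
        this.entryIndex = entryIndex;
    }

    @Override
    public int compareTo(ChunkTask other) {
        // READ_STATE_MACHINE has the lower ordinal, so it is polled first.
        int byKind = Integer.compare(kind.ordinal(), other.kind.ordinal());
        return byKind != 0 ? byKind : Long.compare(entryIndex, other.entryIndex);
    }

    // Returns the order in which a priority queue would hand out the tasks.
    static List<String> drainOrder(List<ChunkTask> tasks) {
        PriorityBlockingQueue<ChunkTask> queue = new PriorityBlockingQueue<>(tasks);
        List<String> order = new ArrayList<>();
        ChunkTask next;
        while ((next = queue.poll()) != null) {
            order.add(next.kind + "@" + next.entryIndex);
        }
        return order;
    }
}
```

If such a queue backs a ThreadPoolExecutor, tasks need to be passed with 
execute() rather than submit(), because submit() wraps them in a FutureTask 
that is not Comparable. 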

So the key to resolving the slow follower issue is to make sure that all of 
its data stays in the cache as long as possible. 

My solution is to set a threshold between the majority committed index and 
the slow follower's committed index to guarantee that the data stays in the 
cache.  I use 0.75 as the ratio in my test.  The effect is very good.  I 
wrote 2TB of data to a 3 DN cluster, each DN with 10 HDDs.  The task finished 
in 40 minutes without any watchForCommit timeout. 
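
One possible reading of this threshold is sketched below. It assumes that the 
ratio is applied to the cache capacity and that the result caps how far the 
index used for eviction may run ahead of the slow follower. The description 
above does not spell out the exact rule, so the class GapThreshold, the method 
evictableIndex and its parameters are all hypothetical. 

```java
// Hypothetical helper illustrating a gap threshold between the majority
// committed index and the slowest follower's committed index.
final class GapThreshold {
    // Returns the highest index whose data may be dropped from the cache.
    // Assumption: the allowed gap is ratio * cacheCapacity entries.
    static long evictableIndex(long majorityCommitted, long slowestCommitted,
                               int cacheCapacity, double ratio) {
        long maxGap = (long) (cacheCapacity * ratio);
        // Never evict further than maxGap entries past the slow follower.
        return Math.min(majorityCommitted, slowestCommitted + maxGap);
    }
}
```

With a capacity of 1024 and a ratio of 0.75, the allowed gap under this 
assumption is 768 entries, so the data that the slow follower needs next is 
still in the cache when its log appender asks for it. 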








--
This message was sent by Atlassian Jira
(v8.3.4#803005)
