[ 
https://issues.apache.org/jira/browse/HDDS-8324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17706728#comment-17706728
 ] 

Sumit Agrawal commented on HDDS-8324:
-------------------------------------

After adding logs and simulating slowness with Thread.sleep(), I observed the 
following: the cache entry for index 117 was added, then removed in apply 
transaction with current index "111". Later, when Ratis reads the entry to 
sync it to a follower, it is found missing.
{code:java}
2023-03-29 21:21:04,902 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:read(772)) - found cache for index: 116
2023-03-29 21:21:05,905 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:handleWriteChunk(476)) - add cache for index 117
2023-03-29 21:21:05,909 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$applyTransaction$10(900)) - removing cache 
of index: 116, 111
2023-03-29 21:21:05,909 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$applyTransaction$10(900)) - removing cache 
of index: 117, 111
2023-03-29 21:21:05,909 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:read(778)) - missing cache for index: 116
2023-03-29 21:21:06,921 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:handleWriteChunk(476)) - add cache for index 118
2023-03-29 21:21:06,926 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:read(778)) - missing cache for index: 117
2023-03-29 21:21:06,926 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:lambda$applyTransaction$10(900)) - removing cache 
of index: 118, 112
2023-03-29 21:21:07,931 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:handleWriteChunk(476)) - add cache for index 119
2023-03-29 21:21:07,952 ERROR ratis.ContainerStateMachine 
(ContainerStateMachine.java:read(778)) - missing cache for index: 117 {code}
With the change, the "missing cache" condition is no longer observed.
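
To illustrate why the log shows index 117 being removed while the bound is 111, here is a minimal standalone sketch. It is not the Ozone code: the class name, the method names evictBuggy and evictProposed, and the sample indices and minIndex value are all hypothetical, and a plain TreeSet stands in for stateMachineDataCache. It only contrasts the two predicates from the issue description quoted below.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; a TreeSet of log indices stands in for the real cache.
public class CacheEvictionSketch {

  // Predicate as quoted in the issue: removes every key at or ABOVE the bound,
  // so entries that followers have not yet received are dropped.
  static Set<Long> evictBuggy(Set<Long> cachedIndices, long minIndex, long index) {
    long bound = Math.min(minIndex, index);
    Set<Long> remaining = new TreeSet<>(cachedIndices);
    remaining.removeIf(k -> k >= bound);
    return remaining;
  }

  // Predicate proposed in the issue: removes every key at or BELOW the bound,
  // so only entries already applied and synced are dropped.
  static Set<Long> evictProposed(Set<Long> cachedIndices, long minIndex, long index) {
    long bound = Math.min(minIndex, index);
    Set<Long> remaining = new TreeSet<>(cachedIndices);
    remaining.removeIf(k -> k <= bound);
    return remaining;
  }

  public static void main(String[] args) {
    // Assumed sample state: older entries 110 and 111, newer entries 116 and 117.
    Set<Long> cache = new TreeSet<>(Arrays.asList(110L, 111L, 116L, 117L));
    long minIndex = 111L; // assumed follower next index
    long index = 111L;    // index passed to apply transaction

    System.out.println("buggy:    " + evictBuggy(cache, minIndex, index));    // prints "buggy:    [110]"
    System.out.println("proposed: " + evictProposed(cache, minIndex, index)); // prints "proposed: [116, 117]"
  }
}
```

With the original predicate, the newer entries 116 and 117 are evicted and only the stale entry survives, which matches the "removing cache of index: 117, 111" lines above. With the proposed predicate, the newer entries stay cached until the bound catches up with them.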

> DN data cache gets removed randomly asking for data from disk
> -------------------------------------------------------------
>
>                 Key: HDDS-8324
>                 URL: https://issues.apache.org/jira/browse/HDDS-8324
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Sumit Agrawal
>            Assignee: Sumit Agrawal
>            Priority: Major
>             Fix For: 1.4.0
>
>
> ContainerStateMachine in the DN has a stateMachineDataCache, which caches data 
> keyed by the Ratis log index; the leader uses it to send data to the followers.
> Issue: the cache is cleared with incorrect logic, where all entries with a 
> higher index are cleared when apply transaction is called with a lower index.
>  
> {code:java}
> if (division.getInfo().isLeader()) {
>   long minIndex = Arrays.stream(division.getInfo()
>       .getFollowerNextIndices()).min().getAsLong();
>   LOG.debug("Removing data corresponding to log index {} min index {} "
>           + "from cache", index, minIndex);
>   stateMachineDataCache.removeIf(k -> k >= (Math.min(minIndex, index)));
> } {code}
>  
>  
> Impact: 
>  * With this clearing, when the leader sends data it has to read from disk, 
> adding pressure on disk IO.
> As a solution, the check should be k <= (Math.min(minIndex, index)), so that 
> all earlier indices are cleared, since follower sync is already done for them.
>  
> Impact with this change:
>  * The cache is bounded by LeaderNumPendingRequests (write.element-limit), 
> default 1024, and pendingRequestsBytesLimit 
> (dfs.container.ratis.leader.pending.bytes.limit), default 1GB. Further 
> caching will therefore block until all followers are in sync. This correctly 
> throttles the write load on the DN until the cached entries are in sync with 
> a majority of followers.


