Sumit Agrawal created HDDS-8324:
-----------------------------------

             Summary: DN data cache gets removed randomly asking for data from 
disk
                 Key: HDDS-8324
                 URL: https://issues.apache.org/jira/browse/HDDS-8324
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Sumit Agrawal


ContainerStateMachine in the DN has a stateMachineDataCache, which caches data
keyed by the Ratis log index. The leader uses this cache to send data to its
followers.

Issue: the cache is cleared with incorrect logic. When apply transaction is
called with a lower index, all entries at higher indices are cleared, so
entries the leader may still need to send are dropped.

 
{code:java}
if (division.getInfo().isLeader()) {
  long minIndex = Arrays.stream(division.getInfo()
      .getFollowerNextIndices()).min().getAsLong();
  LOG.debug("Removing data corresponding to log index {} min index {} "
          + "from cache", index, minIndex);
  stateMachineDataCache.removeIf(k -> k >= (Math.min(minIndex, index)));
} {code}
 

 

Impact:
 * After this clearing, when the leader sends data to a follower it has to read
it back from disk, adding pressure on disk IO.

As a solution, the check should be k <= (Math.min(minIndex, index)), so that
all earlier indices are cleared, since follower sync is already done for them.
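
To make the difference between the two predicates concrete, here is a minimal sketch that applies both to a plain set of log indices. The TreeSet is only a hypothetical stand-in for the keys of stateMachineDataCache, and the class and method names are invented for illustration. They are not taken from the Ozone code.

```java
import java.util.Set;
import java.util.TreeSet;

public class CachePruneSketch {
  // Hypothetical stand-in for the log indices held in stateMachineDataCache.
  static Set<Long> cacheWithIndices(long from, long to) {
    Set<Long> keys = new TreeSet<>();
    for (long i = from; i <= to; i++) {
      keys.add(i);
    }
    return keys;
  }

  // Predicate as quoted in the report: drops the threshold and everything above it.
  static void pruneReported(Set<Long> keys, long minIndex, long index) {
    keys.removeIf(k -> k >= Math.min(minIndex, index));
  }

  // Predicate proposed in the report: drops the threshold and everything below it.
  static void pruneProposed(Set<Long> keys, long minIndex, long index) {
    keys.removeIf(k -> k <= Math.min(minIndex, index));
  }

  public static void main(String[] args) {
    // Cache holds indices 1..10, slowest follower is at 6, applied index is 4.
    Set<Long> reported = cacheWithIndices(1, 10);
    pruneReported(reported, 6, 4);
    System.out.println("reported: " + reported); // [1, 2, 3]

    Set<Long> proposed = cacheWithIndices(1, 10);
    pruneProposed(proposed, 6, 4);
    System.out.println("proposed: " + proposed); // [5, 6, 7, 8, 9, 10]
  }
}
```

With the reported predicate the newest entries (4 to 10) are evicted and only the oldest remain. With the proposed one the oldest entries are evicted and the newer ones stay cached.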

 

Impact with this change:
 * The cache is bounded by LeaderNumPendingRequests (write.element-limit),
default 1024, and pendingRequestsBytesLimit
(dfs.container.ratis.leader.pending.bytes.limit), default 1GB. Once the cache
is full, further writes will block until all followers are in sync. This
correctly throttles the write load on the DN until the cached entries are in
sync with a majority of followers.
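
The back-pressure described above can be sketched with a counting semaphore. This is an illustrative sketch under stated assumptions, not the Ozone implementation: BoundedCacheSketch, tryPut and removeUpTo are hypothetical names, only the element limit is modelled (the byte limit is left out), and tryPut returns false where a real writer would block.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Semaphore;

public class BoundedCacheSketch {
  // Cached data keyed by log index (hypothetical stand-in for the DN cache).
  private final ConcurrentSkipListMap<Long, byte[]> entries =
      new ConcurrentSkipListMap<>();
  // One permit per cache slot, playing the role of the element limit.
  private final Semaphore permits;

  BoundedCacheSketch(int elementLimit) {
    this.permits = new Semaphore(elementLimit);
  }

  // Non-blocking for demonstration; a blocking writer would call acquire().
  boolean tryPut(long logIndex, byte[] data) {
    if (!permits.tryAcquire()) {
      return false; // cache is full, caller has to wait for a prune
    }
    if (entries.putIfAbsent(logIndex, data) != null) {
      permits.release(); // index already cached, give the slot back
    }
    return true;
  }

  // Drop every entry with index <= upTo and free the slots they held.
  int removeUpTo(long upTo) {
    int removed = 0;
    for (Long k : entries.headMap(upTo, true).keySet()) {
      if (entries.remove(k) != null) {
        removed++;
      }
    }
    permits.release(removed);
    return removed;
  }

  int size() {
    return entries.size();
  }

  public static void main(String[] args) {
    BoundedCacheSketch cache = new BoundedCacheSketch(2);
    cache.tryPut(1L, new byte[1]);
    cache.tryPut(2L, new byte[1]);
    System.out.println(cache.tryPut(3L, new byte[1])); // false, limit reached
    cache.removeUpTo(1L);                              // followers caught up to 1
    System.out.println(cache.tryPut(3L, new byte[1])); // true
  }
}
```

In this sketch new writes are admitted only after older indices have been pruned, which is the throttling behaviour the change relies on.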


