bshashikant commented on pull request #2538:
URL: https://github.com/apache/ozone/pull/2538#issuecomment-904271996


   > > > The issue eventually turned out to be race among write chunk/readChunk 
threads all using the same file channel. With concurrent writers and readers on 
the same file, the behaviour seems unpredictable and leads to block file 
sparseness on disk.
   > > 
   > > 
   > > @bshashikant , for ContainerStateMachine#read function,I'm not clear 
about one thing,when stateMachineDataCache.get(entry.getIndex()) return null, 
why we cannot think that the data is already written to the disk.
   > > Thanks @ChenSammi . Usually, the entry will be in stateMachine cahe till 
the write chunk is applied which means, if a redaStateMachine comes in between 
till write chunk is not finished, it should be in the cache. But the cache has 
the entry added only for the leader. Now assume a case like this: 3 servers s1, 
s2 and s3. S1 being leader, s2 is fast follower and s3 being the slow follower. 
S1 sends entry 1 - 100 to S2 and S3 and then S1 dies, S2 has received all the 
log entries, (yet to complete the write to disk part ) being a follower doesn't 
update the data cache. Now S2 will be sending append Entries to S3,. S3 might 
ask for entries from s2 for which the write to disk is not completed neither 
these entries are in cache, as this node was follower when it received those 
entries.
   > > BTW, if read chunk gets executed on a thread which is different than the 
thread which has done write chunk, it doesn't get the same view of the 
underlying file as the file handle is not closed on write. So, even if Read 
chunk gets executed post the write chunk completes, it may not necessarily see 
the same data unless its the same thread. That's why its important to remove 
the concurrency across multiple chunk writes/read for the same block.
   > 
   > I didn't aware that stateMachineDataCache is only for leader. Now I got 
the point. I'm wondering if we can have such an cache for follower too. It 
seems the data is still in memory before the write finish. The cache just holds 
a reference to this data. What do you think?
   > 
   > Besides, several changes in your previous commit are also good improvement 
to current code, such as the way of chunkExecutors allocation, and put 
raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   
   Well, from my point of view, followers maintaining cache as well to avoid 
few read calls in advent of failure case may not be the right thing to do. 
Performance improvement efforts can be done with focus on streaming rather. 
   
   > Besides, several changes in your previous commit are also good improvement 
to current code, such as the way of chunkExecutors allocation, and put 
raftFuture instead of writeChunkFuture in the writeChunkFutureMap.
   
   The idea is to go back and study the perf characteristics with this change 
and also about file channel concurrency model as well. All other changes can be 
done as an afterthought of that. For now, i would like to stick to solve the 
data integrity issue.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to