[ 
https://issues.apache.org/jira/browse/HDDS-13933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039037#comment-18039037
 ] 

Ivan Andika edited comment on HDDS-13933 at 12/16/25 1:17 AM:
--------------------------------------------------------------

[~szetszwo] Thanks for checking this.

> Could you give details?

[~Symious] Could you help provide details on the internal performance results 
and the performance issues? We discussed this last week, and [~yiyang0203] 
mentioned that enabling follower read by itself can actually affect write 
throughput, which is concerning to me since leader write throughput should not 
be affected by follower read (my understanding is that, in principle, only 
clients that enable follower read should encounter higher latency). -We are 
still not sure about the reason.- (I think the reason is that OM reads block 
the handler threads, which in turn causes the write requests to be blocked.)

Therefore, for now we have decided to try the LocalLease approach that you 
suggested in 
[https://github.com/apache/ratis/pull/1296#issuecomment-3416501339] . However, 
this sacrifices consistency, which I think might cause some subtle production 
bugs (for example, one workflow writes to a Hive table and immediately 
triggers another workflow that reads from the same Hive table; the reading 
task might see stale data, which affects correctness).

Looking at the current ReadIndex implementation, I don't see any obvious 
issues there. So I asked why the performance difference of HDFS observer 
consistent reads is considered acceptable, and [~Symious] mentioned that 
Hadoop RPC will requeue the call (non-blocking) if the NN's applied index has 
not caught up to the client's state ID, whereas OM blocks the handler thread 
while waiting for the read. Therefore, the performance issue might be due to 
this RPC blocking rather than an inherent issue in Ratis ReadIndex 
performance. I am not sure whether we can support this RPC requeueing logic 
from OM / Ratis, since it requires Hadoop RPC changes.
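
To make the difference concrete, here is a minimal, self-contained sketch (not 
actual Hadoop or Ozone code; the class and field names are purely 
illustrative) contrasting a handler that blocks on the applied index with one 
that requeues the call:

{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class RequeueSketch {

  /** A read call carrying the state id the client last saw from the leader. */
  static class Call {
    final long clientStateId;
    Call(long clientStateId) { this.clientStateId = clientStateId; }
  }

  static final AtomicLong lastAppliedIndex = new AtomicLong();
  static final BlockingQueue<Call> callQueue = new LinkedBlockingQueue<>();

  /** Blocking style: the handler thread waits and cannot serve other calls. */
  static void handleBlocking(Call call) throws InterruptedException {
    while (lastAppliedIndex.get() < call.clientStateId) {
      Thread.sleep(1); // handler is stuck here; writes queue up behind it
    }
    serveRead(call);
  }

  /** Requeue style (HDFS observer read): put the call back and move on. */
  static void handleWithRequeue(Call call) throws InterruptedException {
    if (lastAppliedIndex.get() < call.clientStateId) {
      callQueue.put(call); // defer; the handler is immediately free again
      return;
    }
    serveRead(call);
  }

  static void serveRead(Call call) {
    System.out.println("served read at state id " + call.clientStateId);
  }

  public static void main(String[] args) throws InterruptedException {
    lastAppliedIndex.set(5);
    handleWithRequeue(new Call(10));     // deferred: applied index 5 < 10
    lastAppliedIndex.set(10);
    handleWithRequeue(callQueue.take()); // now served
  }
}
{code}

In the requeue style the handler thread returns to the pool immediately, so 
write requests queued behind the deferred read are not held up, which matches 
the behaviour [~Symious] described for HDFS observer reads.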


> Consistent Read from OM Followers
> ---------------------------------
>
>                 Key: HDDS-13933
>                 URL: https://issues.apache.org/jira/browse/HDDS-13933
>             Project: Apache Ozone
>          Issue Type: New Feature
>            Reporter: Ivan Andika
>            Assignee: Ivan Andika
>            Priority: Major
>
> HDDS-9279 introduces Ratis / Raft based follower reads by giving an option 
> to enable the Ratis Linearizable Read and Leader Lease features. However, 
> based on previous performance tests, there are significant performance 
> regressions on both the client and the server, which make it not production 
> ready. Currently, the root cause of this performance hit has yet to be 
> discovered.
> Therefore, I suggest we pursue two concurrent approaches for consistent 
> follower read. We can then compare the two approaches and enable the one 
> with the better performance.
>  * Improve the Ratis Linearizable Read / ReadIndex implementation (following 
> up on HDDS-9279)
>  ** In my opinion, since we are stuck improving this, we might try the 
> second approach
>  * (This ticket) We follow the HDFS observer read implementation (HDFS-12943)
>  ** General flow
>  *** The client sends an msync to the OM leader, and the OM leader replies 
> with its last applied index as part of AlignmentContext#getLastSeenStateId
>  *** When the client sends a request to an OM follower, the Hadoop RPC 
> mechanism will detect the AlignmentContext and will requeue the call if 
> Call#getClientStateId() > AlignmentContext#getLastSeenStateId() (see Hadoop 
> Server.Handler#run)
>  **** This implies that the server RPC handler will not get blocked, unlike 
> linearizable read
>  ** Pros
>  *** There is already a reference implementation in HDFS; we can adapt it to 
> the Ozone context
>  **** e.g. FSImage#getLastAppliedOrWrittenTxId can be translated to 
> StateMachine#getLastAppliedTermIndex
>  *** This is similar to the HDFS observer read implementation, so we know 
> that this level of consistency is acceptable and we do not need much effort 
> to prove that it is correct (or at least acceptable)
>  **** Since HDFS observer read has been deployed to production with no 
> consistency issues and with acceptable performance, we can expect the same 
> on Ozone
>  *** Possibly better performance, since the RPC server will requeue the call 
> instead of blocking a handler thread
>  *** We can adapt the client proxy provider implementation from 
> ObserverReadProxyProvider
>  ** Cons
>  *** Only supports Hadoop RPC based clients (gRPC based clients are not 
> supported and require their own development)
>  *** Will drift the implementation away from Raft / Ratis
>  ** Current Implementation plans
>  *** Ozone side 
>  **** AlignmentContext for client
>  ***** Implementation is similar to Hadoop ClientGSIContext
>  **** AlignmentContext for OM
>  ***** Implementation is similar to Hadoop GlobalStateIdContext (a rough 
> sketch of both contexts follows this list)
>  **** Create a new proxy provider similar to ObserverReadProxyProvider to 
> allow routing read requests to the OM followers (a routing sketch follows at 
> the end of this description)
>  *** Hadoop side
>  **** Support parsing AlignmentContext for ProtobufRpcEngine (raised 
> HADOOP-19741)
>  ***** See Hadoop Server.Connection#processRequest
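> A minimal sketch of what the two alignment contexts could track, assuming 
> hypothetical class names (OmClientStateIdContext and OmServerStateIdContext 
> are not existing Ozone or Hadoop classes):
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> public class AlignmentContextSketch {
>
>   /** Client side, analogous to Hadoop's ClientGSIContext. */
>   static class OmClientStateIdContext {
>     private final AtomicLong lastSeenStateId = new AtomicLong(Long.MIN_VALUE);
>
>     /** Remember the highest applied index seen in any OM response (e.g. msync). */
>     void receiveResponseState(long responseStateId) {
>       lastSeenStateId.accumulateAndGet(responseStateId, Math::max);
>     }
>
>     /** State id to attach to outgoing read requests. */
>     long stateIdForRequest() {
>       return lastSeenStateId.get();
>     }
>   }
>
>   /** OM follower side, analogous to Hadoop's GlobalStateIdContext. */
>   static class OmServerStateIdContext {
>     private final AtomicLong lastAppliedIndex = new AtomicLong();
>
>     /** Fed from the state machine, e.g. StateMachine#getLastAppliedTermIndex. */
>     void advance(long appliedIndex) {
>       lastAppliedIndex.accumulateAndGet(appliedIndex, Math::max);
>     }
>
>     /** True if the call must be requeued instead of served right away. */
>     boolean mustDefer(long clientStateId) {
>       return clientStateId > lastAppliedIndex.get();
>     }
>   }
>
>   public static void main(String[] args) {
>     OmClientStateIdContext client = new OmClientStateIdContext();
>     OmServerStateIdContext follower = new OmServerStateIdContext();
>     client.receiveResponseState(42);  // msync reply from the OM leader
>     follower.advance(40);             // follower has only applied index 40
>     System.out.println(follower.mustDefer(client.stateIdForRequest())); // true
>     follower.advance(42);
>     System.out.println(follower.mustDefer(client.stateIdForRequest())); // false
>   }
> }
> {code}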
> Additionally, I hope that implementing one approach can uncover issues and 
> improvements in the other.
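> A rough sketch of the follower-first routing idea (hypothetical names; this 
> is not the actual ObserverReadProxyProvider API, only the shape of the 
> decision):
> {code:java}
> import java.util.List;
> import java.util.concurrent.atomic.AtomicInteger;
>
> public class FollowerReadRouterSketch {
>
>   /** Illustrative stand-in for an OM RPC endpoint. */
>   interface OmEndpoint {
>     String submit(String request);
>     boolean isLeader();
>   }
>
>   static class FollowerReadRouter {
>     private final List<OmEndpoint> endpoints;
>     private final AtomicInteger next = new AtomicInteger();
>
>     FollowerReadRouter(List<OmEndpoint> endpoints) { this.endpoints = endpoints; }
>
>     /** Writes (and msync) always go to the leader. */
>     String submitWrite(String request) {
>       return leader().submit(request);
>     }
>
>     /** Reads are spread over followers, falling back to the leader. */
>     String submitRead(String request) {
>       for (int i = 0; i < endpoints.size(); i++) {
>         int idx = Math.floorMod(next.getAndIncrement(), endpoints.size());
>         OmEndpoint e = endpoints.get(idx);
>         if (e.isLeader()) {
>           continue; // prefer followers for reads
>         }
>         try {
>           return e.submit(request);
>         } catch (RuntimeException retryNextFollower) {
>           // e.g. the follower kept deferring the call; try another node
>         }
>       }
>       return leader().submit(request); // last resort: read from the leader
>     }
>
>     private OmEndpoint leader() {
>       return endpoints.stream().filter(OmEndpoint::isLeader).findFirst()
>           .orElseThrow(() -> new IllegalStateException("no leader known"));
>     }
>   }
> }
> {code}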


