[
https://issues.apache.org/jira/browse/RATIS-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated RATIS-2524:
-------------------------------
Summary: Implement ReadIndex batching (was: Implement ReadIndex coalescing)
> Implement ReadIndex batching
> ----------------------------
>
> Key: RATIS-2524
> URL: https://issues.apache.org/jira/browse/RATIS-2524
> Project: Ratis
> Issue Type: Improvement
> Components: Linearizable Read, server
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
>
> We recently found that, after an optimization that pushed OM pure read
> performance up 5x (to ~240K), pure follower read performance is now worse
> than leader read performance (previously follower reads could improve pure
> read throughput by 50%; now they decrease it by 40%). I suspect this is due
> to network and ReadIndexProto serde overhead, since the read QPS is now much
> higher. Each read triggers a ReadIndex call, so if the network and serde
> overhead is high, it becomes the bottleneck.
> One improvement is to batch reads together into a single ReadIndex call.
> Rule: a ReadIndex result may only serve reads whose invocation happened
> before the ReadIndex request was logically issued.
> {code:java}
> t1: read A arrives at follower
> t2: read B arrives at follower
> t3: follower sends one ReadIndex request for batch [A, B]
> t4: leader processes ReadIndex and returns index I
> t5: follower waits until appliedIndex >= I
> t6: A and B query local state and complete
> {code}
> It is not:
> {code:java}
> t1: read A arrives
> t2: follower sends ReadIndex request
> t3: leader processes it
> t4: read B arrives
> t5: follower attaches B to A's ReadIndex result
> {code}
> This can be implemented using a batching window with a small batching
> interval (e.g. 500 microseconds or less, depending on the average latency)
> or a batch size limit (e.g. 64 reads). We batch the reads that arrive during
> the batching interval into one ReadIndex batch. Once the interval elapses,
> we seal the batch (i.e. no more reads will be added to it) and send a single
> ReadIndex that covers all the reads in the sealed window (e.g. if the window
> holds 5 read requests, 1 ReadIndex call amortizes the cost of 5 ReadIndex
> calls). New reads go into the next ReadIndex batch.
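> The batching window above can be sketched as follows. This is a minimal,
> hypothetical sketch, not an actual Ratis API: ReadIndexBatcher, submitRead,
> tick and sendReadIndexToLeader are illustrative names, and a real
> implementation would hook into the follower read path and the gRPC client:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.concurrent.CompletableFuture;
>
> public class ReadIndexBatcher {
>   private final int maxBatchSize;        // e.g. 64 reads per batch
>   private final long batchIntervalNanos; // e.g. 500 microseconds
>   private List<CompletableFuture<Long>> pending = new ArrayList<>();
>   private long batchStartNanos;
>   private int readIndexCalls; // number of ReadIndex RPCs actually issued
>
>   public ReadIndexBatcher(int maxBatchSize, long batchIntervalNanos) {
>     this.maxBatchSize = maxBatchSize;
>     this.batchIntervalNanos = batchIntervalNanos;
>   }
>
>   /** A read arrives at the follower; it joins the current batch. */
>   public synchronized CompletableFuture<Long> submitRead(long nowNanos) {
>     if (pending.isEmpty()) {
>       batchStartNanos = nowNanos; // first read opens a new window
>     }
>     CompletableFuture<Long> future = new CompletableFuture<>();
>     pending.add(future);
>     if (pending.size() >= maxBatchSize) {
>       sealAndSend(); // seal eagerly when the batch is full
>     }
>     return future;
>   }
>
>   /** Driven by a timer: seal the batch once the interval elapses. */
>   public synchronized void tick(long nowNanos) {
>     if (!pending.isEmpty()
>         && nowNanos - batchStartNanos >= batchIntervalNanos) {
>       sealAndSend();
>     }
>   }
>
>   private void sealAndSend() {
>     final List<CompletableFuture<Long>> sealed = pending;
>     pending = new ArrayList<>(); // new reads go to the next batch
>     readIndexCalls++;
>     long index = sendReadIndexToLeader(); // one RPC for the whole batch
>     // In the real flow, the sealed reads would additionally wait until
>     // the follower's appliedIndex >= index before executing locally.
>     sealed.forEach(f -> f.complete(index));
>   }
>
>   /** Placeholder for the single batched ReadIndex RPC to the leader. */
>   protected long sendReadIndexToLeader() {
>     return 42L; // hypothetical index returned by the leader
>   }
>
>   public synchronized int getReadIndexCalls() {
>     return readIndexCalls;
>   }
> }
> {code}
> With a 500-microsecond window, reads arriving inside one window share a
> single ReadIndex call instead of each issuing their own.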
> Correctness is preserved: once the batch is sealed and the ReadIndex
> completes, all writes that precede the sync point in the global order have
> been applied locally, so the reads can be linearized at or after the sync
> point.
> This idea is similar to the paper
> https://www.vldb.org/pvldb/vol18/p2831-giortamis.pdf
> (https://law-theorem.com/) mentioned in RATIS-2403, where the lightweight
> "sync" write operation is replaced with ReadIndex (which is itself a form of
> "sync"). We can compare the two approaches below:
> Our batched ReadIndex design:
> {code:java}
> seal pending reads
> send one ReadIndex
> wait follower appliedIndex >= readIndex
> execute all reads locally
> {code}
> Lazy-ALR:
> {code:java}
> seal pending reads
> send one fake write / sync through write protocol
> wait until sync would apply locally
> execute all reads locally
> {code}
> So, while RATIS-2403 batches writes together into a single RepliedIndex to
> reduce the bottleneck introduced by the much higher ReadIndex rate (which
> causes a longer follower waitForAdvance), this patch focuses on amortizing
> the network latency bottleneck for reads. I therefore think we can try to
> combine both improvements.
> One possible weakness is that reads become more bursty, since the reads in
> a batch are executed at around the same time (instead of being served
> immediately as they arrive, as before).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)