[
https://issues.apache.org/jira/browse/HDDS-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113488#comment-17113488
]
Bharat Viswanadham commented on HDDS-3354:
------------------------------------------
{quote}That's an interesting question. If I understood well, your objection is
that if we do a ratis log snapshot and a rocksdb snapshot (=checkpoint) at the
same time, they can be inconsistent with each other in case of any error.
I don't think it's a problem. Writing the ratis log snapshot can fail even now,
which should be handled. The only question is whether we can finalize both
snapshots in one step, which should be possible: for example, write the ratis
log snapshot file and the rocksdb snapshot file to the same directory and move
it to the final location.{quote}
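The atomic-finalize idea in the quote could be sketched roughly as follows. This is a hedged illustration, not the actual OzoneManager code: the class name, file names, and the byte-array stand-in for the RocksDB checkpoint are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/**
 * Sketch of finalizing the Ratis snapshot file and the RocksDB
 * checkpoint in one step: write both into a temporary directory,
 * then rename the directory into place atomically.
 */
public class SnapshotFinalizer {

  /** Writes both artifacts under a temp dir, then renames it to the final dir. */
  public static Path finalizeSnapshot(Path parent, long snapshotIndex,
      byte[] checkpointData) throws IOException {
    Path tmpDir = parent.resolve("snapshot.tmp");
    Path finalDir = parent.resolve("snapshot." + snapshotIndex);
    Files.createDirectories(tmpDir);
    // 1. Ratis snapshot file: records the last applied log index.
    Files.write(tmpDir.resolve("ratis-snapshot-index"),
        Long.toString(snapshotIndex).getBytes());
    // 2. Stand-in for the RocksDB checkpoint (really a hard-linked SST dir).
    Files.write(tmpDir.resolve("om.db.checkpoint"), checkpointData);
    // 3. Single atomic rename: either both artifacts become visible or neither.
    Files.move(tmpDir, finalDir, StandardCopyOption.ATOMIC_MOVE);
    return finalDir;
  }
}
```

The rename is atomic only when the temp and final directories are on the same filesystem, which is why both live under the same parent here.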
*Let me share the complete thought details here.*
1. Even if we write both the ratis log snapshot file (which holds the snapshot
index) and the RocksDB checkpoint to a temporary directory, one step can still
fail: say the checkpoint succeeds but the snapshot file write fails. The
current snapshot directory then still holds the old checkpoint and old snapshot
file. During OM restart, if we start from the current OM DB, we cannot avoid
the replay logic; so in this case, whenever the OM restarts, it must come up
from the last checkpoint DB and snapshot file. If we agree to that, startup is
delayed until the leader has applied all OM entries from the snapshot index up
to the latest log entry, and clients will get LeaderNotReadyException in the
meantime. These kinds of issues will not be seen with the proposed approach.
2. A single step failing is not the only issue; it is one of them. If snapshot
taking is controlled by ratis, then while a checkpoint is happening we must not
allow any transactions to be flushed to the DB, because we need to capture the
exact last transaction applied to the DB so that on restart we know where to
resume. That means every time a checkpoint happens, we need to stop the double
buffer, take the checkpoint, and write the snapshot file. Stopping the double
buffer today just sends a signal to interrupt the flush thread; with this
change we would additionally have to either retain the unflushed transactions
that the flush has not completed or wait for the flush to finish. Meanwhile,
applyTransaction will still continue applying transactions to the StateMachine,
so the double-buffer queue length can grow. This looks more complex than what
is proposed, and it comes with its own disadvantages of startup slowness and
double-buffer queue growth.
Another approach: instead of putting the transaction info into the DB, repeat
the above checkpoint-and-snapshot-file process on every flush iteration, so
that the double buffer is never stopped and applyTransaction keeps feeding it.
But this is not a great solution either: it makes the double buffer slow,
increases the number of checkpoints (just to point that out), and needs another
background thread for cleanup. It would not have the startup slowness problem,
though.
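To make the pause concern in point 2 concrete, here is a hedged sketch (hypothetical names, not the real OzoneManagerDoubleBuffer) of why a ratis-controlled checkpoint forces the flush path to stop: while flushing is paused, applyTransaction keeps enqueuing, so the unflushed queue grows for the whole duration of the checkpoint.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Toy double buffer that can be paused for a checkpoint. While paused,
 * applyTransaction still adds entries, but nothing reaches the DB, so
 * the queue of unflushed transactions grows.
 */
public class PausableDoubleBuffer {
  private final Queue<Long> unflushed = new ArrayDeque<>();
  private boolean paused = false;

  /** Called from applyTransaction; never blocked by a checkpoint. */
  public synchronized void add(long txIndex) {
    unflushed.add(txIndex);
  }

  /**
   * Flush thread drains the queue only while not paused.
   * Returns the number of transactions flushed (a stand-in for
   * committing a RocksDB write batch).
   */
  public synchronized int flush() {
    if (paused) {
      return 0; // checkpoint in progress: nothing may reach the DB
    }
    int flushed = unflushed.size();
    unflushed.clear();
    return flushed;
  }

  /** Checkpoint needs an exact last-applied index, so flushing stops. */
  public synchronized void pauseForCheckpoint() { paused = true; }

  public synchronized void resume() { paused = false; }

  public synchronized int queueLength() { return unflushed.size(); }
}
```

The longer the checkpoint takes, the more entries pile up between pauseForCheckpoint() and resume(), which is the queue-length growth described above.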
Testing has shown that with HDDS-3474 + HDDS-3475 performance is not degraded
and is on par, and with this we can remove the replay logic from the actual
request logic. So even if we want to revisit this later, the code will be
simpler, and developers implementing new non-idempotent write requests will not
need to know about handling the replay case.
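As I understand the proposed approach, the key idea is to write the last applied transaction index into the DB in the same atomic batch as the request data itself, so that on restart the OM can read that index back and skip already-applied log entries instead of replaying them. The sketch below uses a plain map as a stand-in for a RocksDB write batch; the class and key names are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of flushing request data and the transaction info
 * together in one atomic batch. Because both land in the DB in the
 * same commit, the recorded index is always exact.
 */
public class TransactionInfoBatch {
  static final String TX_INFO_KEY = "#TRANSACTIONINFO";
  private final Map<String, String> db = new HashMap<>();

  /** One atomic flush: request updates plus the transaction index. */
  public void flushBatch(Map<String, String> updates, long txIndex) {
    Map<String, String> batch = new HashMap<>(updates);
    batch.put(TX_INFO_KEY, Long.toString(txIndex));
    db.putAll(batch); // stand-in for committing a RocksDB write batch
  }

  /** On restart: everything up to this index is known to be applied. */
  public long lastAppliedIndex() {
    String v = db.get(TX_INFO_KEY);
    return v == null ? -1L : Long.parseLong(v);
  }
}
```

With this in place, a new write request never needs its own replay check: the state machine simply discards log entries at or below lastAppliedIndex() on restart.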
Let me know your thoughts.
> OM HA replay optimization
> -------------------------
>
> Key: HDDS-3354
> URL: https://issues.apache.org/jira/browse/HDDS-3354
> Project: Hadoop Distributed Data Store
> Issue Type: Improvement
> Reporter: Bharat Viswanadham
> Assignee: Bharat Viswanadham
> Priority: Major
> Attachments: OM HA Replay.pdf, Screen Shot 2020-05-20 at 1.28.48
> PM.png
>
>
> This Jira is to improve the OM HA replay scenario.
> Attached the design document, which discusses the proposal and the issue in
> detail.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]