errose28 commented on pull request #1836:
URL: https://github.com/apache/ozone/pull/1836#issuecomment-769303605
I think there are some subtle failure cases that can cause the cancel
prepare request and the --upgrade flag to behave differently depending on
where the failure occurs.
We should probably pick an expected behavior for each request and make sure
that it is observed regardless of where a failure occurs.
My current understanding is that if an OM goes down/gets disconnected and
misses a cancel prepare request, it should come back cancelled, but if the
majority of OMs are started with the --upgrade flag, the downed OM should
still come back up in its original prepared or unprepared state from before
the crash. This matches the behavior we would see if all 3 OMs were up during
the whole process, because cancel prepare is a ratis request for the cluster,
while --upgrade applies only to the OM it is passed to.
If this is the behavior we want, we can use the marker file to indicate
preparedness on restart, and the DB to indicate preparedness on snapshot
install.
Below I have outlined some usages for cancel prepare and --upgrade that may
lead to different results on an OM depending on when in the process it fails,
based on the current implementation.
## Cases for cancel prepare
### Cancel Prepare Case 1 (expected)
1. om1 down
2. om2 and om3 prepared
3. om2 and om3 cancelled by request
4. om1 resumed
- om1 will come back **unprepared**, as expected, because it will not
see a DB entry for preparation in its snapshot.
### Cancel Prepare Case 2 (unexpected)
1. om1 om2 and om3 prepared
2. om1 down.
3. om2 and om3 cancelled by request
4. om1 resumed
- om1 comes back **prepared**, because the marker file is read to indicate
it is prepared, and the value received in the snapshot is never used.
- This is not expected behavior because cancel prepare is a ratis
request and should be acknowledged when received in the snapshot.
- om1 now has all txns and could become leader, moving the cluster to a
pseudo-prepared state where only the leader is prepared.
## Cases for --upgrade
### --upgrade Case 1 (expected)
1. om1 down
2. om2 and om3 prepared
3. om2 and om3 cancelled by --upgrade
4. om1 resumed
- om1 comes back **prepared** because the snapshot it receives still has
a prepare DB entry, and the txn index and prepare index in the DB match.
- This is expected, because prepare is a ratis request, and --upgrade is
not.
### --upgrade Case 2 (maybe unexpected?)
1. om1 down
2. om2 and om3 prepared
3. om2 and om3 cancelled by --upgrade
4. Transaction applied on om2 and om3
5. om1 resumed
- om1 sees the prepare DB entry, but the prepare index does not match
the transaction index.
- The current implementation does not have an explicit case to handle
this, so the default behavior would be the OM is **unprepared**.
- This could be unexpected because om1 behaves as if both the prepare
request and --upgrade were ratis requests applied from the snapshot, even
though this is not the case.
- I think this case is atypical usage, however, since we are expecting all
three OMs to be started with --upgrade before the cluster is used again.
### --upgrade Case 3 (expected)
1. om1 om2 and om3 prepared
2. om1 down.
3. om2 and om3 cancelled by --upgrade
4. om1 resumed
- om1 comes back **prepared** because it has a marker file; the prepare
entry in the DB is ignored.
- This is expected behavior because --upgrade is not a ratis request.
## Summary
Below is an outline of what I think should happen on each event to cause the
desired behavior. Almost all of this is present in the current implementation,
but it is outlined here for clarity with *(new)* marking the changes. This
should handle *cancel prepare request case 2*. It will also maintain current
behavior for *--upgrade case 3*: om1 will now come back prepared first
because of the marker file, and the following snapshot install will also be
processed, leaving it in the prepared state.
This will still not handle *--upgrade case 2*, but I am not sure if this is
a case we need to handle.
Unlike the current implementation, this will leave the prepare entry in the
DB, and use whether the indices match to indicate whether that entry means
the OM is prepared or unprepared. I think this is less confusing since
--upgrade was leaving the entry in the DB anyways, but we can discuss.
- On prepare request:
1. Write marker file.
2. Update index of prepare DB entry
- On cancel prepare request (new):
1. Delete marker file.
2. Do not touch prepare DB entry
- On --upgrade startup:
1. Delete marker file
2. Do not touch prepare DB entry
- On normal startup:
1. Prepare if valid marker file.
2. Do not read prepare DB entry.
- On install snapshot (new):
- If no prepare DB entry:
- Do nothing. Cluster has never been prepared before.
- Else:
- If prepare DB index == txn info DB index:
- Prepare and write marker file.
- Else:
- Delete marker file and cancel prepare.
- These steps should succeed even if the OM was not prepared.
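The install-snapshot decision above could be sketched roughly as follows. This is a minimal, hypothetical sketch: the class, method, and parameter names here are invented for illustration and are not the actual Ozone APIs.

```java
// Hypothetical sketch of the install-snapshot decision described above.
// Class and method names are illustrative only, not real Ozone code.
public class PrepareSnapshotSketch {

  enum Result { PREPARED, UNPREPARED }

  /**
   * Decide the OM's prepared state after installing a snapshot.
   *
   * @param hasPrepareDbEntry whether the snapshot DB contains a prepare entry
   * @param prepareIndex      the index stored in the prepare DB entry
   * @param txnInfoIndex      the last applied txn index in the txn info DB entry
   */
  static Result onInstallSnapshot(boolean hasPrepareDbEntry,
      long prepareIndex, long txnInfoIndex) {
    if (!hasPrepareDbEntry) {
      // Cluster has never been prepared; nothing to do.
      return Result.UNPREPARED;
    }
    if (prepareIndex == txnInfoIndex) {
      // Prepare was the last applied txn: prepare and write the marker file.
      return Result.PREPARED;
    }
    // A later txn superseded the prepare: delete marker file, cancel prepare.
    return Result.UNPREPARED;
  }

  public static void main(String[] args) {
    // --upgrade case 1: prepare entry present and indices match.
    System.out.println(onInstallSnapshot(true, 100, 100));
    // --upgrade case 2: a txn was applied after the prepare.
    System.out.println(onInstallSnapshot(true, 100, 101));
    // Cluster never prepared.
    System.out.println(onInstallSnapshot(false, -1, 50));
  }
}
```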
Let me know your thoughts on this, and if I missed/misunderstood anything.