Re: [PR] HDDS-10657. Design Doc for overwriting a key if it has not changed [ozone]

via GitHub Tue, 23 Apr 2024 18:12:29 -0700


errose28 commented on code in PR #6482:
URL: https://github.com/apache/ozone/pull/6482#discussion_r1576968299



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+---
+title: Overwriting an Ozone Key only if it has not changed.
+summary: A minimal design illustrating how to replace a key in Ozone only if 
it has not changed since it was read.
+date: 2024-04-05
+jira: HDDS-10657
+status: accepted
+author: Stephen ODonnell
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+Ozone offers write semantics where the last writer to commit a key wins. 
Therefore multiple writers can concurrently write the same key, and whichever 
commits last will effectively overwrite all data that came before it.
+
+As an extension of this, there is no "locking" on a key which is being 
replaced.
+
+For any key, but especially a large key, it can take significant time to read 
and write it. There are scenarios where it would be desirable to replace a key 
in Ozone, but only if the key has not changed since it was read. With the 
absence of a lock, such an operation is not possible today.
+
+## As Things Stand
+
+Internally, all Ozone keys have both an objectID and an updateID, which are 
stored in OM as part of the key metadata.
+
+Each time something changes on the key, whether it is data or metadata, the 
updateID is changed. It comes from the Ratis transaction ID and is generally an 
increasing number.
+
+When an existing key is overwritten, its existing metadata, including the 
objectID and ACLs, is mirrored onto the new key version. The only metadata 
which is replaced is any custom metadata stored on the key by the user. Upon 
commit, the updateID is also changed to the current Ratis transaction ID.
+
+Writing a key in Ozone is a 3 step process:
+
+1. The key is opened via an Open Key request from the client to OM
+2. The client writes data to the data nodes
+3. The client commits the key to OM via a Commit Key call.
+
+Note that, as things stand, it is possible to lose metadata updates (e.g. ACL 
changes) when a key is overwritten:
+
+1. If the key exists, then a new copy of the key is opened for writing.
+2. While the new copy is open, another process updates the ACLs for the key.
+3. On commit, the new ACLs are not copied to the new key, as the new key made 
a copy of the existing metadata at the time the key was opened.
+
+With the technique described in the next section, that problem is removed, as 
the ACL update will change the updateID, and the key will not be committed.
+
+## Atomic Key Replacement
+
+In relational database applications, records are often assigned an update 
counter similar to the updateID for a key in Ozone. The data record can be read 
and displayed on a UI to be edited, and then written back to the database. 
However, another user could have made an edit to the same record in the 
meantime, and if the record is written back without any checks, those edits 
could be lost.
+
+To combat this, "optimistic locking" is used. With optimistic locking, no 
locks are actually involved. The client reads the data along with the update 
counter. When it attempts to write the data back, it validates the record has 
not changed by including the update counter in the update statement, e.g.:
+
+```
+update customerDetails
+set <columns = values>
+where customerID = :b1
+and updateCounter = :b2
+```
+If no records are updated, the application must display an error or reload the 
customer record to handle the problem.
+
+In Ozone the same concept can be used to perform an atomic update of a key 
only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended 
to include the existing updateID as it is currently not passed to the client. 
This field already exists, but when exposed to the client it will be referred 
to as the key generation.
+2. The client opens a new key for writing with the same key name as the 
original, passing the previously read generation in a new field. Call this new 
field expectedGeneration.
+3. OM receives the openKey request as usual and detects the presence of the 
expectedGeneration field.
+4. OM first ensures that a key is present with the given key name and has an 
updateID == expectedGeneration. If so, it opens the key and stores the 
details, including the expectedGeneration, in the openKeyTable. As things stand, 
the other existing key metadata copied from the original key is stored in the 
openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the expectedGeneration 
again, as the open key contains it.
+7. On commit key, OM validates that the key still exists with the given key 
name and that its stored updateID is unchanged when compared with the 
expectedGeneration. If so, the key is committed; otherwise an error is returned 
to the client.
+
+Note that any change to a key will change the updateID. This is existing 
behaviour, and committing a rewritten key will also modify the updateID. Note 
this also offers protection against concurrent rewrites. 
+
+### Alternative Proposal
+
+1. Pass the expectedGeneration to the rewrite API, which passes it 
down to the relevant key stream, effectively saving it on the client
+2. Client attaches the expectedGeneration to the commit request to indicate a 
rewrite instead of a put
+3. OM checks the passed generation against the stored update ID and returns 
the corresponding success/fail result
+
+The advantage of this alternative approach is that it does not require the 
expectedGeneration to be stored in the openKey table.
+
+However the client code required to implement this appears more complex due to 
having different key commit logic for Ratis and EC and the parameter needing to 
be passed through many method calls.
+
+The existing implementation for key creation stores various attributes 
(metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so 
storing the expectedGeneration keeps with that convention, which is less 
confusing for future developers.
+
+In terms of forward / backward compatibility both solutions are equivalent. 
Only a new parameter is required within the KeyArgs passed to create and commit 
Key.
+
+If an upgraded server is rolled back, it will still be able to deal with an 
openKey entry containing expectedGeneration, but it will not process it 
atomically.
+
+### Scope
+
+The intention is to first implement this for OBS buckets, then address FSO 
buckets. FSO bucket handling will reuse the same fields, but the handlers on OM 
are different. We also need to decide what should happen if a key is renamed 
or moved to a different folder during the rewrite.
+
+Multi-part keys need more investigation and hence are also excluded in the 
initial version.
+
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The expectedGeneration needs to be added to the KeyInfo protobuf object so 
it can be stored in the openKey table.

Review Comment:
   `OmKeyInfo` is used in many places outside of just the open key table:
   - All open key, committed key, deleted key tables. I wouldn't really 
consider these "wire protocol" since they aren't part of the network.
   - On the client as part of `RpcClient#getKeyInfo`, where it is then 
wrapped/converted to `OzoneKeyDetails`



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The expectedGeneration needs to be added to the KeyInfo protobuf object so 
it can be stored in the openKey table.
+2. The expectedGeneration needs to be added to the keyArgs protobuf object, 
which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers 
will receive the new expectedGeneration and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are 
existing locks taken to ensure the key open / commit is atomic. The new checks 
are performed under those locks, and come down to a couple of long comparisons, 
so they add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID (called generation on the client) of an 
existing key to be accessible when the key's details are read, by adding it 
to OzoneKey and OzoneKeyDetails. These are internal object changes and do not 
impact any APIs.
+ 2. To pass the expectedGeneration to OM on key open, it would be possible to 
overload the existing OzoneBucket.createKey() method, which already has several 
overloaded versions, or create a new explicit method on OzoneBucket called 
rewriteKey, passing the expectedGeneration, e.g.:
+ 
+ ```
+ public OzoneOutputStream rewriteKey(String volumeName, String bucketName,
+     String keyName, long size, long expectedGeneration,
+     ReplicationConfig replicationConfigOfNewKey)
+     throws IOException
+
+ // Can also add an overloaded version of these methods to pass a metadata
+ // map, as with the existing create key method.
+ ```
+This specification is roughly in line with the existing createKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig,
+      Map<String, String> metadata)
+```
+
+An alternative is to create a new overloaded createKey, but it is probably 
less confusing to have the new rewriteKey method:

Review Comment:
   +1 for the new method.



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the expectedGeneration 
again, as the open key contains it.
+7. On commit key, OM validates that the key still exists with the given key 
name and that its stored updateID is unchanged when compared with the 
expectedGeneration. If so, the key is committed; otherwise an error is returned 
to the client.

Review Comment:
   If the original key is deleted, this also counts as an update ID "change" 
that will fail the commit operation, right?
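
   A minimal sketch of how I read the step 7 check, assuming a hypothetical 
   in-memory map from key name to updateID in place of the real OM key table 
   (`CommitCheckSketch` and its methods are illustrative names, not Ozone APIs). 
   Under that reading, a deleted key fails the check because no entry with the 
   given name exists to compare against:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalLong;

/** Hypothetical stand-in for the OM key table: key name -> updateID. */
final class CommitCheckSketch {
  private final Map<String, Long> keyTable = new HashMap<>();

  /** Simulates any committed change to a key, which assigns a new updateID. */
  void put(String keyName, long updateId) {
    keyTable.put(keyName, updateId);
  }

  /** Simulates a key delete. */
  void delete(String keyName) {
    keyTable.remove(keyName);
  }

  /**
   * Returns true if a commit may proceed. An empty expectedGeneration models
   * a plain put (last writer wins). Otherwise the key must still exist and
   * its updateID must equal the generation read by the client.
   */
  boolean canCommit(String keyName, OptionalLong expectedGeneration) {
    if (!expectedGeneration.isPresent()) {
      return true;
    }
    Long current = keyTable.get(keyName);
    return current != null
        && current.longValue() == expectedGeneration.getAsLong();
  }
}
```

   With this shape, "key deleted" and "key modified" both surface as a failed 
   commit, and the client cannot tell them apart unless OM returns distinct 
   error codes.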



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+### Alternative Proposal
+
+1. Pass the expectedGeneration to the rewrite API, which passes it 
down to the relevant key stream, effectively saving it on the client
+2. Client attaches the expectedGeneration to the commit request to indicate a 
rewrite instead of a put
+3. OM checks the passed generation against the stored update ID and returns 
the corresponding success/fail result
+
+The advantage of this alternative approach is that it does not require the 
expectedGeneration to be stored in the openKey table.

Review Comment:
   More generally, it is not required to be stored in `OmKeyInfo`, which is 
stored in all key related tables. I know an empty protobuf field will not take 
up extra space, but it still reduces the scope of the change.



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+---
+title: Overwriting an Ozone Key only if it has not changed.
+summary: A minimal design illustrating how to replace a key in Ozone only if 
it has not changes since it was read.
+date: 2024-04-05
+jira: HDDS-10657
+status: accepted
+author: Stephen ODonnell
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+Ozone offers write semantics where the last writer to commit a key wins. 
Therefore multiple writers can concurrently write the same key, and which ever 
commits last will effectively overwrite all data that came before it.
+
+As an extension of this, there is no "locking" on a key which is being 
replaced.
+
+For any key, but especially a large key, it can take significant time to read 
and write it. There are scenarios where it would be desirable to replace a key 
in Ozone, but only if the key has not changed since it was read. With the 
absence of a lock, such an operation is not possible today.
+
+## As Things Stand
+
+Internally, all Ozone keys have both an objectID and UpdateID which are stored 
in OM as part of the key metadata.
+
+Each time something changes on the key, whether it is data or metadata, the 
updateID is changed. It comes from the ratis transactionID and is generally an 
increasing number.
+
+When an existing key is over written, its existing metadata including the 
ObjectID and ACLs are mirrored onto the new key version. The only metadata 
which is replaced is any custom metadata stored on the key by the user. Upon 
commit, the updateID is also changed to the current Ratis transaction ID.
+
+Writing a key in Ozone is a 3 step process:
+
+1. The key is opened via an Open Key request from the client to OM
+2. The client writes data to the data nodes
+3. The client commits the key to OM via a Commit Key call.
+
+Note that, as things stand, it is possible to lose metadata updates (e.g. ACL 
changes) when a key is overwritten:
+
+1. If the key exists, then a new copy of the key is opened for writing.
+2. While the new copy is open, another process updates the ACLs for the key.
+3. On commit, the new ACLs are not copied to the new key, as the new key made a 
copy of the existing metadata at the time the key was opened.
+
+The technique described in the next section removes that problem, as the ACL 
update will change the updateID, and the key will not be committed.
+
+## Atomic Key Replacement
+
+In relational database applications, records are often assigned an update 
counter similar to the updateID for a key in Ozone. The data record can be read 
and displayed on a UI to be edited, and then written back to the database. 
However, another user could have made an edit to the same record in the 
meantime, and if the record is written back without any checks, those edits 
could be lost.
+
+To combat this, "optimistic locking" is used. With optimistic locking, no 
locks are actually involved. The client reads the data along with the update 
counter. When it attempts to write the data back, it validates the record has 
not changed by including the update counter in the update statement, e.g.:
+
+```
+update customerDetails
+set <columns = values>
+where customerID = :b1
+and updateCounter = :b2
+```
+If no records are updated, the application must display an error or reload the 
customer record to handle the problem.
+
+In Ozone the same concept can be used to perform an atomic update of a key 
only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended 
to include the existing updateID as it is currently not passed to the client. 
This field already exists, but when exposed to the client it will be referred 
to as the key generation.
+2. The client opens a new key for writing with the same key name as the 
original, passing the previously read generation in a new field. Call this new 
field expectedGeneration.
+3. On OM, it receives the openKey request as usual and detects the presence of 
the expectedGeneration field.
+4. On OM, it first ensures that a key is present with the given key name and 
having an updateID == expectedGeneration. If so, it opens the key and stores 
the details including the expectedGeneration in the openKeyTable. As things 
stand, the other existing key metadata copied from the original key is stored 
in the openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the expectedGeneration 
again, as the open key contains it.
+7. On OM, on commit key, it validates the key still exists with the given key 
name and its stored updateID is unchanged when compared with the 
expectedGeneration. If so the key is committed, otherwise an error is returned 
to the client.
+
+Note that any change to a key will change the updateID. This is existing 
behaviour, and committing a rewritten key will also modify the updateID. Note 
this also offers protection against concurrent rewrites. 
+
+### Alternative Proposal
+
+1. Pass the expectedGeneration to the rewrite API which passes it 
down to the relevant key stream, effectively saving it on the client
+2. Client attaches the expectedGeneration to the commit request to indicate a 
rewrite instead of a put
+3. OM checks the passed generation against the stored update ID and returns 
the corresponding success/fail result
+
+The advantage of this alternative approach is that it does not require the 
expectedGeneration to be stored in the openKey table.
+
+However the client code required to implement this appears more complex due to 
having different key commit logic for Ratis and EC and the parameter needing to 
be passed through many method calls.
+
+The existing implementation for key creation stores various attributes 
(metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so 
storing the expectedGeneration keeps with that convention, which is less 
confusing for future developers.
+
+In terms of forward / backward compatibility both solutions are equivalent. 
Only a new parameter is required within the KeyArgs passed to create and commit 
Key.
+
+If an upgraded server is rolled back, it will still be able to deal with an 
openKey entry containing expectedGeneration, but it will not process it 
atomically.
+
+### Scope
+
+The intention is to first implement this for OBS buckets, and then address 
FSO buckets. FSO bucket handling will reuse the same fields, but the handlers 
on OM are different. We also need to decide on what should happen if a key is 
renamed or moved between folders during the rewrite.
+
+Multi-part keys need more investigation and hence are also excluded in the 
initial version.
+
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The expectedGeneration needs to be added to the KeyInfo protobuf object so 
it can be stored in the openKey table.
+2. The expectedGeneration needs to be added to the keyArgs protobuf object, 
which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers 
will receive the new expectedGeneration and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are 
existing locks taken to ensure the key open / commit is atomic. The new checks 
are performed under those locks, and come down to a couple of long comparisons, 
so add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID (called generation on the client) of an 
existing key to be accessible when the existing key details are read, by adding 
it to OzoneKey and OzoneKeyDetails. These are internal object changes and do 
not impact any APIs.
+ 2. To pass the expectedGeneration to OM on key open, it would be possible to 
overload the existing OzoneBucket.createKey() method, which already has several 
overloaded versions, or create a new explicit method on Ozone bucket called 
rewriteKey, passing the expectedGeneration, eg:
+ 
+ ```
+
+ public OzoneOutputStream rewriteKey(String volumeName, String bucketName, 
String keyName, long size, long expectedGeneration, ReplicationConfig 
replicationConfigOfNewKey)
+      throws IOException 
+      
+// Can also add an overloaded version of these methods to pass a metadata map, 
as with the existing
+// create key method.      
+
+         
+ ```
+This specification is roughly in line with the existing createKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig,
+      Map<String, String> metadata)
+```
+
+An alternative is to create a new overloaded createKey, but it is probably 
less confusing to have the new rewriteKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig, long expectedUpdateID)
+```
+
+The intended usage of this API is that the existing key details are read, 
then used to open the new key, and then data is written, e.g.:
+
+```
+OzoneKeyDetails existingKey = bucket.getKey(keyName);
+try (OutputStream os = bucket.rewriteKey(existingKey.getVolume(), 
existingKey.getBucket(), 
+    existingKey.getKeyName(), existingKey.getSize(), 
existingKey.getGeneration(), newRepConfig)) {
+  os.write(bucket.readKey(keyName));
+}
+```
+
+## Upgrade and Compatibility
+
+If a newer client is talking to an older server, it could call the new atomic 
API but the server will ignore it without error. This is the case for any API 
change.

Review Comment:
   This is not a new API; it is a new method on the client that uses the 
existing get/put APIs with a new field. In this distinction lies the problem: 
the new client will think it has done a consistent, atomic rewrite because the 
server acks all requests, but it may actually have overwritten newer data 
because the server does not support this functionality. We need to use the 
client/server versioning framework so that the client fails if the server's 
component version is too old to support rewrite.
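
A minimal sketch of the kind of client-side gate described above, assuming the client can learn a version from OM before sending a rewrite. `ServerVersion`, `ATOMIC_REWRITE_KEY` and `RewriteGuard` are hypothetical names used only for illustration; they are not taken from Ozone's versioning framework.

```java
import java.io.IOException;

// Hypothetical stand-in for the component version that OM advertises to clients.
enum ServerVersion {
  LEGACY(0),
  ATOMIC_REWRITE_KEY(1); // assumed name for the first version that honours expectedGeneration

  private final int level;

  ServerVersion(int level) {
    this.level = level;
  }

  boolean supports(ServerVersion required) {
    return level >= required.level;
  }
}

final class RewriteGuard {
  private RewriteGuard() { }

  /**
   * Fails fast on the client rather than letting an older server
   * silently ignore the expectedGeneration field and ack the write.
   */
  static void checkRewriteSupported(ServerVersion server) throws IOException {
    if (!server.supports(ServerVersion.ATOMIC_REWRITE_KEY)) {
      throw new IOException("OM version " + server
          + " does not support atomic key rewrite; refusing to send expectedGeneration");
    }
  }
}
```

With a check like this in the rewrite path, a new client talking to an old server gets an error up front instead of a false sense of atomicity.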



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+The intended usage of this API is that the existing key details are read, 
then used to open the new key, and then data is written, e.g.:
+
+```
+OzoneKeyDetails existingKey = bucket.getKey(keyName);
+try (OutputStream os = bucket.rewriteKey(existingKey.getVolume(), 
existingKey.getBucket(), 
+    existingKey.getKeyName(), existingKey.getSize(), 
existingKey.getGeneration(), newRepConfig)) {
+  os.write(bucket.readKey(keyName));

Review Comment:
   Basically, the size and generation parameters could be removed, and the 
method could pull them from the existing key itself.
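
One possible reading of this suggestion, sketched against an in-memory stand-in rather than the real client: an overload accepts the previously read key details and takes the generation from them, so callers do not pass it separately. `SketchBucket` and `KeyDetails` are hypothetical and are not the OzoneBucket or OzoneKeyDetails classes; the transaction counter only imitates how updateID changes on every commit.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory bucket used only to illustrate the method shape.
final class SketchBucket {

  /** Minimal key record: payload plus the generation (updateID) it was committed at. */
  static final class KeyDetails {
    final byte[] data;
    final long generation;

    KeyDetails(byte[] data, long generation) {
      this.data = data;
      this.generation = generation;
    }
  }

  private final Map<String, KeyDetails> keys = new HashMap<>();
  private long nextTransactionId = 1;

  /** Plain put: last writer wins, and the generation always changes. */
  synchronized void put(String name, byte[] data) {
    keys.put(name, new KeyDetails(data.clone(), nextTransactionId++));
  }

  synchronized KeyDetails getKey(String name) {
    return keys.get(name);
  }

  /** Explicit form: the caller supplies the generation it read earlier. */
  synchronized boolean rewriteKey(String name, long expectedGeneration, byte[] newData) {
    KeyDetails current = keys.get(name);
    if (current == null || current.generation != expectedGeneration) {
      return false; // key was deleted or changed since it was read
    }
    keys.put(name, new KeyDetails(newData.clone(), nextTransactionId++));
    return true;
  }

  /** Convenience form: pulls the generation from the previously read details. */
  boolean rewriteKey(String name, KeyDetails existing, byte[] newData) {
    return rewriteKey(name, existing.generation, newData);
  }
}
```

The convenience overload keeps the optimistic check unchanged; it only removes the chance of a caller passing a generation that does not belong to the key it read.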



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,149 @@
+In Ozone the same concept can be used to perform an atomic update of a key 
only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended 
to include the existing updateID as it is currently not passed to the client.
+2. The client opens a new key for writing with the same key name as the 
original, passing the previously read updateID in a new field. Call this new 
field overwriteExpectedUpdateID.
+3. On OM, it receives the openKey request as usual and detects the presence of 
the overwriteExpectedUpdateID.
+4. On OM, it first ensures that a key is present with the given key name and 
having an updateID == overwriteExpectedUpdateID. If so, it opens the key and 
stores the details including the overwriteExpectedUpdateID in the openKeyTable. 
As things stand, the other existing key metadata copied from the original key 
is stored in the openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the 
overwriteExpectedUpdateID again, as the open key contains it.
+7. On OM, on commit key, it validates the key still exists with the given key 
name and its updateID is unchanged. If so the key is committed, otherwise an 
error is returned to the client.
+
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The overwriteExpectedUpdateID needs to be added to the KeyInfo protobuf 
object so it can be stored in the openKey table.
+2. The overwriteExpectedUpdateID needs to be added to the keyArgs protobuf 
object, which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers 
will receive the new overwriteExpectedUpdateID and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are 
existing locks taken to ensure the key open / commit is atomic. The new checks 
are performed under those locks, and come down to a couple of long comparisons, 
so add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID of an existing key to be accessible when the 
existing key details are read, by adding it to OzoneKey and OzoneKeyDetails. 
These are internal object changes and do not impact any APIs.
+ 2. To pass the overwriteExpectedUpdateID to OM on key open, it would be 
possible to overload the existing OzoneBucket.createKey() method, which already 
has several overloaded versions, or create a new explicit method on Ozone 
bucket called replaceKeyIfUnchanged, passing either the OzoneKeyDetails of the 
existing key (which includes the key name and existing updateID), or the key 
name and updateID explicitly, e.g.:
+ 
+ ```
+ public OzoneOutputStream replaceKeyIfUnchanged(OzoneKeyDetails 
keyToOverwrite, ReplicationConfig replicationConfigOfNewKey)
+      throws IOException 
+         
+// Alternatively or additionally
+
+ public OzoneOutputStream replaceKeyIfUnchanged(String volumeName, String 
bucketName, String keyName, long size, long expectedUpdateID, ReplicationConfig 
replicationConfigOfNewKey)
+      throws IOException 
+
+         
+ ```
+This specification is roughly in line with the existing createKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig,
+      Map<String, String> metadata)
+```
+
+An alternative is to create a new overloaded createKey:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig, long expectedUpdateID)
+```
+
+Note the omission of the metadata map, as the intention of this API is to copy 
that from what already exists on the server.

Review Comment:
   I think we are good here. Sounds like we are in agreement that metadata 
copying will work as usual from the server and API perspective, but the 
client's methods don't need to expose this functionality right now.



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+1. The expectedGeneration needs to be added to the KeyInfo protobuf object so 
it can be stored in the openKey table.
+2. The expectedGeneration needs to be added to the keyArgs protobuf object, 
which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers 
will receive the new expectedGeneration and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are 
existing locks taken to ensure the key open / commit is atomic. The new checks 
are performed under those locks, and come down to a couple of long comparisons, 
so add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID (called generation on the client) of an 
existing key to be accessible when an existing details are read, by adding it 
to OzoneKey and OzoneKeyDetails. There are internal object changes and do no 
impact any APIs.
+ 2. To pass the expectedGeneration to OM on key open, it would be possible to 
overload the existing OzoneBucket.createKey() method, which already has several 
overloaded versions, or create a new explicit method on Ozone bucket called 
rewriteKey, passing the expectedGeneration, eg:
+ 
+ ```
+
+ public OzoneOutputStream rewriteKey(String volumeName, String bucketName, 
String keyName, long size, long expectedGeneration, ReplicationConfig 
replicationConfigOfNewKey)
+      throws IOException 
+      
+// Can also add an overloaded version of these methods to pass a metadata map, 
as with the existing
+// create key method.      
+
+         
+ ```
+This specification is roughly in line with the exiting createKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig,
+      Map<String, String> metadata)
+```
+
+An alternative, is to create a new overloaded createKey, but it is probably 
less confusing to have the new rewriteKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig, long expectedUpdateID)
+```
+
+The intended usage of this API, is that the existing key details are read, 
then used to open the new key, and then data is written, eg:
+
+```
+OzoneKeyDetails exisitingKey = bucket.getKey(keyName);
+try (OutputStream os = bucket.rewriteKey(existingKey.getBucket, 
existingKey.getVolume, 
+    existingKey.getKeyName, existingKey.getSize(), 
existingKey.getGeneration(), newRepConfig) {
+  os.write(bucket.readKey(keyName))
+}
+```
+
+## Upgrade and Compatibility
+
+If a newer client is talking to an older server, it could call the new atomic 
API but the server will ignore it without error. This is the case for any API 
change.
+
+There are no changes to protobuf methods.
+
+A single extra field is added to the KeyArgs object, which is passed from the 
client to OM on key open and commit. This is a new field, so it will be null if 
not set, and the server will ignore it if it does not expect it.
+
+A single extra field is added to the OMKeyInfo object which is stored in the 
openKey table. This is a new field, so it will be null if not set, and the 
server will ignore it if it does not expect it.
+
+There should be not impact on upgrade / downgrade with the new field added in 
this way.

Review Comment:
   It would be easier to follow if this section were separated into 
client/server compatibility and disk layout compatibility. I think disk layout 
compatibility is fine without extra handling, but client/server compatibility 
will need a new version.
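
   As a rough illustration of the client/server half, a client could check the server's advertised version before sending `expectedGeneration`, rather than relying on an older server silently ignoring the field. This is only a sketch: `SketchServerVersion`, `ATOMIC_REWRITE` and `RewriteGate` are hypothetical names, not Ozone's actual version enums or client classes.

```java
/** Hypothetical server feature levels; illustrative names, not Ozone's real version enums. */
enum SketchServerVersion {
  INITIAL(0),
  ATOMIC_REWRITE(1);

  private final int level;

  SketchServerVersion(int level) {
    this.level = level;
  }

  /** True if this server version is at or beyond the version that introduced the feature. */
  boolean supports(SketchServerVersion feature) {
    return level >= feature.level;
  }
}

/** Client-side guard: fail fast instead of letting an old server ignore the new field. */
final class RewriteGate {
  private RewriteGate() { }

  static void requireAtomicRewrite(SketchServerVersion server) {
    if (!server.supports(SketchServerVersion.ATOMIC_REWRITE)) {
      throw new UnsupportedOperationException(
          "Server version " + server + " does not support atomic rewrite");
    }
  }
}
```

   With a gate like this, a new client talking to an old server would get an explicit error instead of a plain, non-atomic overwrite.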



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+---
+title: Overwriting an Ozone Key only if it has not changed.
+summary: A minimal design illustrating how to replace a key in Ozone only if 
it has not changes since it was read.
+date: 2024-04-05
+jira: HDDS-10657
+status: accepted
+author: Stephen ODonnell
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+Ozone offers write semantics where the last writer to commit a key wins. 
Therefore multiple writers can concurrently write the same key, and which ever 
commits last will effectively overwrite all data that came before it.
+
+As an extension of this, there is no "locking" on a key which is being 
replaced.
+
+For any key, but especially a large key, it can take significant time to read 
and write it. There are scenarios where it would be desirable to replace a key 
in Ozone, but only if the key has not changed since it was read. With the 
absence of a lock, such an operation is not possible today.
+
+## As Things Stand
+
+Internally, all Ozone keys have both an objectID and UpdateID which are stored 
in OM as part of the key metadata.
+
+Each time something changes on the key, whether it is data or metadata, the 
updateID is changed. It comes from the ratis transactionID and is generally an 
increasing number.
+
+When an existing key is over written, its existing metadata including the 
ObjectID and ACLs are mirrored onto the new key version. The only metadata 
which is replaced is any custom metadata stored on the key by the user. Upon 
commit, the updateID is also changed to the current Ratis transaction ID.
+
+Writing a key in Ozone is a 3 step process:
+
+1. The key is opened via an Open Key request from the client to OM
+2. The client writes data to the data nodes
+3. The client commits the key to OM via a Commit Key call.
+
+Note, that as things stand, it is possible to lose metadata updates (eg ACL 
changes) when a key is overwritten.
+
+1. If the key exists, then a new copy of the key is open for writing.
+2. While the new copy is open, another process updates the ACLs for the key
+3. On commit, the new ACLs are not copied to the new key as the new key made a 
copy of the existing metadata at the time the key was opened.
+
+With the technique described in the next section, that problem is removed in 
this design, as the ACL update will change the updateID, and the key will not 
be committed.
+
+## Atomic Key Replacement
+
+In relational database applications, records are often assigned an update 
counter similar to the updateID for a key in Ozone. The data record can be read 
and displayed on a UI to be edited, and then written back to the database. 
However another user could have made an edit to the same record in the mean 
time, and if the record is written back without any checks, those edits could 
be lost.
+
+To combat this, "optimistic locking" is used. With Optimistic locking, no 
locks are actually involved. The client reads the data along with the update 
counter. When it attempts to write the data back, it validates the record has 
not change by including the updateID in the update statement, eg:
+
+```
+update customerDetails
+set <columns = values>
+where customerID = :b1
+and updateCounter = :b2
+```
+If no records are updated, the application must display an error or reload the 
customer record to handle the problem.
+
+In Ozone the same concept can be used to perform an atomic update of a key 
only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended 
to include the existing updateID as it is currently not passed to the client. 
This field already exists, but when exposed to the client it will be referred 
to as the key generation.
+2. The client opens a new key for writing with the same key name as the 
original, passing the previously read generation in a new field. Call this new 
field expectedGeneration.
+3. On OM, it receives the openKey request as usual and detects the presence of 
the expectedGeneration field.
+4. On OM, it first ensures that a key is present with the given key name and 
having a updateID == expectedGeneration. If so, it opens the key and stored the 
details including the expectedGeneration in the openKeyTable. As things stand, 
the other existing key metadata copied from the original key is stored in the 
openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the expectedGeneration 
again, as the open key contains it.
+7. On OM, on commit key, it validates the key still exists with the given key 
name and its stored updateID is unchanged when compared with the 
expectedGeneration. If so the key is committed, otherwise an error is returned 
to the client.
+
+Note that any change to a key will change the updateID. This is existing 
behaviour, and committing a rewritten key will also modify the updateID. Note 
this also offers protection against concurrent rewrites. 
+
+### Alternative Proposal
+
+1. Pass the expected expectedGeneration to the rewrite API which passes it 
down to the relevant key stream, effectively saving it on the client
+2. Client attaches the expectedGeneration to the commit request to indicate a 
rewrite instead of a put
+3. OM checks the passed generation against the stored update ID and returns 
the corresponding success/fail result
+
+The advantage of this alternative approach is that it does not require the 
expectedGeneration to be stored in the openKey table.
+
+However the client code required to implement this appears more complex due to 
having different key commit logic for Ratis and EC and the parameter needing to 
be passed through many method calls.
+
+The existing implementation for key creation stores various attributes 
(metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so 
storing the expectedGeneration keeps with that convention, which is less 
confusing for future developers.
+
+In terms of forward / backward compatibility both solutions are equivalent. 
Only a new parameter is required within the KeyArgs passed to create and commit 
Key.
+
+If an upgraded server is rolled back, it will still be able to deal with an 
openKey entry containing expectedGeneration, but it will not process it 
atomically.
+
+### Scope
+
+The intention is to first implement this for OBS buckets. Then address FSO 
buckets. FSO bucket handling will reuse the same fields, but the handlers on OM 
are different. We also need to decide on what should happen if a key is renamed 
or moved folders during the rewrite.
+
+Multi-part keys need more investigation and hence are also excluded in the 
initial version.
+
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The expectedGeneration needs to be added to the KeyInfo protobuf object so 
it can be stored in the openKey table.
+2. The expectedGeneration needs to be added to the keyArgs protobuf object, 
which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers 
will receive the new expectedGeneration and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are 
existing locks taken to ensure the key open / commit is atomic. The new checks 
are performed under those locks, and come down to a couple of long comparisons, 
so add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID (called generation on the client) of an 
existing key to be accessible when an existing details are read, by adding it 
to OzoneKey and OzoneKeyDetails. There are internal object changes and do no 
impact any APIs.
+ 2. To pass the expectedGeneration to OM on key open, it would be possible to 
overload the existing OzoneBucket.createKey() method, which already has several 
overloaded versions, or create a new explicit method on Ozone bucket called 
rewriteKey, passing the expectedGeneration, eg:
+ 
+ ```
+
+ public OzoneOutputStream rewriteKey(String volumeName, String bucketName, 
String keyName, long size, long expectedGeneration, ReplicationConfig 
replicationConfigOfNewKey)
+      throws IOException 
+      
+// Can also add an overloaded version of these methods to pass a metadata map, 
as with the existing
+// create key method.      
+
+         
+ ```
+This specification is roughly in line with the exiting createKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig,
+      Map<String, String> metadata)
+```
+
+An alternative, is to create a new overloaded createKey, but it is probably 
less confusing to have the new rewriteKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig, long expectedUpdateID)
+```
+
+The intended usage of this API, is that the existing key details are read, 
then used to open the new key, and then data is written, eg:
+
+```
+OzoneKeyDetails exisitingKey = bucket.getKey(keyName);
+try (OutputStream os = bucket.rewriteKey(existingKey.getBucket, 
existingKey.getVolume, 
+    existingKey.getKeyName, existingKey.getSize(), 
existingKey.getGeneration(), newRepConfig) {
+  os.write(bucket.readKey(keyName))

Review Comment:
   Wouldn't it be easier to just give `rewriteKey` the path to the key and the 
fields you want to change, and have the method do the get and put operations 
inside of it? This seems like a lot of parameter copying for the common use 
case.
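
   To make the suggestion concrete, here is a minimal in-memory sketch of that shape of API. It is not the Ozone client: `SketchBucket`, `KeySnapshot` and `commitIfUnchanged` are hypothetical stand-ins, the replication config is modelled as a plain string, and the generation is a simple counter standing in for the updateID.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for a bucket; hypothetical, not the Ozone client API. */
final class SketchBucket {

  /** Snapshot of a key: payload, replication setting, and the generation it was read at. */
  static final class KeySnapshot {
    final byte[] data;
    final String replication;
    final long generation;

    KeySnapshot(byte[] data, String replication, long generation) {
      this.data = data;
      this.replication = replication;
      this.generation = generation;
    }
  }

  private final Map<String, KeySnapshot> keys = new HashMap<>();
  private long nextGeneration = 1;

  /** Unconditional put: last writer wins, and the generation always changes. */
  synchronized void put(String name, byte[] data, String replication) {
    keys.put(name, new KeySnapshot(data.clone(), replication, nextGeneration++));
  }

  synchronized KeySnapshot get(String name) {
    return keys.get(name);
  }

  /** Commits only if the key still exists and its generation equals expectedGeneration. */
  synchronized boolean commitIfUnchanged(
      String name, byte[] data, String replication, long expectedGeneration) {
    KeySnapshot current = keys.get(name);
    if (current == null || current.generation != expectedGeneration) {
      return false;
    }
    keys.put(name, new KeySnapshot(data.clone(), replication, nextGeneration++));
    return true;
  }

  /** Convenience wrapper: the caller names the key and the field to change, nothing else. */
  boolean rewriteKey(String name, String newReplication) {
    KeySnapshot existing = get(name);
    if (existing == null) {
      return false;
    }
    // The get and the conditional put both happen inside the method.
    return commitIfUnchanged(name, existing.data, newReplication, existing.generation);
  }
}
```

   The caller then only writes `bucket.rewriteKey(keyName, newRepConfig)`, and the generation never has to leave the client library.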



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+---
+title: Overwriting an Ozone Key only if it has not changed.
+summary: A minimal design illustrating how to replace a key in Ozone only if 
it has not changes since it was read.
+date: 2024-04-05
+jira: HDDS-10657
+status: accepted
+author: Stephen ODonnell
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+Ozone offers write semantics where the last writer to commit a key wins. 
Therefore multiple writers can concurrently write the same key, and which ever 
commits last will effectively overwrite all data that came before it.
+
+As an extension of this, there is no "locking" on a key which is being 
replaced.
+
+For any key, but especially a large key, it can take significant time to read 
and write it. There are scenarios where it would be desirable to replace a key 
in Ozone, but only if the key has not changed since it was read. With the 
absence of a lock, such an operation is not possible today.
+
+## As Things Stand
+
+Internally, all Ozone keys have both an objectID and UpdateID which are stored 
in OM as part of the key metadata.
+
+Each time something changes on the key, whether it is data or metadata, the 
updateID is changed. It comes from the ratis transactionID and is generally an 
increasing number.
+
+When an existing key is over written, its existing metadata including the 
ObjectID and ACLs are mirrored onto the new key version. The only metadata 
which is replaced is any custom metadata stored on the key by the user. Upon 
commit, the updateID is also changed to the current Ratis transaction ID.
+
+Writing a key in Ozone is a 3 step process:
+
+1. The key is opened via an Open Key request from the client to OM
+2. The client writes data to the data nodes
+3. The client commits the key to OM via a Commit Key call.
+
+Note, that as things stand, it is possible to lose metadata updates (eg ACL 
changes) when a key is overwritten.
+
+1. If the key exists, then a new copy of the key is open for writing.
+2. While the new copy is open, another process updates the ACLs for the key
+3. On commit, the new ACLs are not copied to the new key as the new key made a 
copy of the existing metadata at the time the key was opened.
+
+With the technique described in the next section, that problem is removed in 
this design, as the ACL update will change the updateID, and the key will not 
be committed.
+
+## Atomic Key Replacement
+
+In relational database applications, records are often assigned an update 
counter similar to the updateID for a key in Ozone. The data record can be read 
and displayed on a UI to be edited, and then written back to the database. 
However another user could have made an edit to the same record in the mean 
time, and if the record is written back without any checks, those edits could 
be lost.
+
+To combat this, "optimistic locking" is used. With Optimistic locking, no 
locks are actually involved. The client reads the data along with the update 
counter. When it attempts to write the data back, it validates the record has 
not change by including the updateID in the update statement, eg:
+
+```
+update customerDetails
+set <columns = values>
+where customerID = :b1
+and updateCounter = :b2
+```
+If no records are updated, the application must display an error or reload the 
customer record to handle the problem.
+
+In Ozone the same concept can be used to perform an atomic update of a key 
only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended 
to include the existing updateID as it is currently not passed to the client. 
This field already exists, but when exposed to the client it will be referred 
to as the key generation.
+2. The client opens a new key for writing with the same key name as the 
original, passing the previously read generation in a new field. Call this new 
field expectedGeneration.
+3. On OM, it receives the openKey request as usual and detects the presence of 
the expectedGeneration field.
+4. On OM, it first ensures that a key is present with the given key name and 
having a updateID == expectedGeneration. If so, it opens the key and stored the 
details including the expectedGeneration in the openKeyTable. As things stand, 
the other existing key metadata copied from the original key is stored in the 
openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the expectedGeneration 
again, as the open key contains it.
+7. On OM, on commit key, it validates the key still exists with the given key 
name and its stored updateID is unchanged when compared with the 
expectedGeneration. If so the key is committed, otherwise an error is returned 
to the client.
+
+Note that any change to a key will change the updateID. This is existing 
behaviour, and committing a rewritten key will also modify the updateID. Note 
this also offers protection against concurrent rewrites. 
+
+### Alternative Proposal
+
+1. Pass the expected expectedGeneration to the rewrite API which passes it 
down to the relevant key stream, effectively saving it on the client
+2. Client attaches the expectedGeneration to the commit request to indicate a 
rewrite instead of a put
+3. OM checks the passed generation against the stored update ID and returns 
the corresponding success/fail result
+
+The advantage of this alternative approach is that it does not require the 
expectedGeneration to be stored in the openKey table.
+
+However the client code required to implement this appears more complex due to 
having different key commit logic for Ratis and EC and the parameter needing to 
be passed through many method calls.

Review Comment:
   This needs to be quantified. "appears more complex" reads as though this 
approach was not actually investigated. The doc can cite #5524 and the 
`atomicKeyCreation` field it added. Only 3 files were changed to add this field:
   - `ECKeyOutputStream`
   - `KeyDataStreamOutput`
   - `KeyOutputStream`
   Whether that is an excessive amount of change, enough to rule out this 
approach, is debatable, but the doc should at least give readers all the 
information.



##########
hadoop-hdds/docs/content/design/overwrite-key-only-if-unchanged.md:
##########
@@ -0,0 +1,190 @@
+---
+title: Overwriting an Ozone Key only if it has not changed.
+summary: A minimal design illustrating how to replace a key in Ozone only if 
it has not changes since it was read.
+date: 2024-04-05
+jira: HDDS-10657
+status: accepted
+author: Stephen ODonnell
+---
+
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+
+Ozone offers write semantics where the last writer to commit a key wins. 
Therefore multiple writers can concurrently write the same key, and which ever 
commits last will effectively overwrite all data that came before it.
+
+As an extension of this, there is no "locking" on a key which is being 
replaced.
+
+For any key, but especially a large key, it can take significant time to read 
and write it. There are scenarios where it would be desirable to replace a key 
in Ozone, but only if the key has not changed since it was read. With the 
absence of a lock, such an operation is not possible today.
+
+## As Things Stand
+
+Internally, all Ozone keys have both an objectID and UpdateID which are stored 
in OM as part of the key metadata.
+
+Each time something changes on the key, whether it is data or metadata, the 
updateID is changed. It comes from the ratis transactionID and is generally an 
increasing number.
+
+When an existing key is over written, its existing metadata including the 
ObjectID and ACLs are mirrored onto the new key version. The only metadata 
which is replaced is any custom metadata stored on the key by the user. Upon 
commit, the updateID is also changed to the current Ratis transaction ID.
+
+Writing a key in Ozone is a 3 step process:
+
+1. The key is opened via an Open Key request from the client to OM
+2. The client writes data to the data nodes
+3. The client commits the key to OM via a Commit Key call.
+
+Note, that as things stand, it is possible to lose metadata updates (eg ACL 
changes) when a key is overwritten.
+
+1. If the key exists, then a new copy of the key is open for writing.
+2. While the new copy is open, another process updates the ACLs for the key
+3. On commit, the new ACLs are not copied to the new key as the new key made a 
copy of the existing metadata at the time the key was opened.
+
+With the technique described in the next section, that problem is removed in 
this design, as the ACL update will change the updateID, and the key will not 
be committed.
+
+## Atomic Key Replacement
+
+In relational database applications, records are often assigned an update 
counter similar to the updateID for a key in Ozone. The data record can be read 
and displayed on a UI to be edited, and then written back to the database. 
However another user could have made an edit to the same record in the mean 
time, and if the record is written back without any checks, those edits could 
be lost.
+
+To combat this, "optimistic locking" is used. With Optimistic locking, no 
locks are actually involved. The client reads the data along with the update 
counter. When it attempts to write the data back, it validates the record has 
not change by including the updateID in the update statement, eg:
+
+```
+update customerDetails
+set <columns = values>
+where customerID = :b1
+and updateCounter = :b2
+```
+If no records are updated, the application must display an error or reload the 
customer record to handle the problem.
+
+In Ozone the same concept can be used to perform an atomic update of a key 
only if it has not changed since the key details were originally read.
+
+To do this:
+
+1. The client reads the key details as usual. The key details can be extended 
to include the existing updateID as it is currently not passed to the client. 
This field already exists, but when exposed to the client it will be referred 
to as the key generation.
+2. The client opens a new key for writing with the same key name as the 
original, passing the previously read generation in a new field. Call this new 
field expectedGeneration.
+3. On OM, it receives the openKey request as usual and detects the presence of 
the expectedGeneration field.
+4. On OM, it first ensures that a key is present with the given key name and 
having a updateID == expectedGeneration. If so, it opens the key and stored the 
details including the expectedGeneration in the openKeyTable. As things stand, 
the other existing key metadata copied from the original key is stored in the 
openKeyTable too.
+5. The client continues to write the data as usual.
+6. On commit key, the client does not need to send the expectedGeneration 
again, as the open key contains it.
+7. On OM, on commit key, it validates the key still exists with the given key 
name and its stored updateID is unchanged when compared with the 
expectedGeneration. If so the key is committed, otherwise an error is returned 
to the client.
+
+Note that any change to a key will change the updateID. This is existing 
behaviour, and committing a rewritten key will also modify the updateID. Note 
this also offers protection against concurrent rewrites. 
+
+### Alternative Proposal
+
+1. Pass the expected expectedGeneration to the rewrite API which passes it 
down to the relevant key stream, effectively saving it on the client
+2. Client attaches the expectedGeneration to the commit request to indicate a 
rewrite instead of a put
+3. OM checks the passed generation against the stored update ID and returns 
the corresponding success/fail result
+
+The advantage of this alternative approach is that it does not require the 
expectedGeneration to be stored in the openKey table.
+
+However the client code required to implement this appears more complex due to 
having different key commit logic for Ratis and EC and the parameter needing to 
be passed through many method calls.
+
+The existing implementation for key creation stores various attributes 
(metadata, creation time, ACLs, ReplicationConfig) in the openKey table, so 
storing the expectedGeneration keeps with that convention, which is less 
confusing for future developers.
+
+In terms of forward / backward compatibility both solutions are equivalent. 
Only a new parameter is required within the KeyArgs passed to create and commit 
Key.
+
+If an upgraded server is rolled back, it will still be able to deal with an 
openKey entry containing expectedGeneration, but it will not process it 
atomically.
+
+### Scope
+
+The intention is to first implement this for OBS buckets. Then address FSO 
buckets. FSO bucket handling will reuse the same fields, but the handlers on OM 
are different. We also need to decide on what should happen if a key is renamed 
or moved folders during the rewrite.
+
+Multi-part keys need more investigation and hence are also excluded in the 
initial version.
+
+## Changes Required
+
+In order to enable the above steps on Ozone, several small changes are needed.
+
+### Wire Protocol
+
+1. The expectedGeneration needs to be added to the KeyInfo protobuf object so 
it can be stored in the openKey table.
+2. The expectedGeneration needs to be added to the keyArgs protobuf object, 
which is passed from the client to OM when creating a key.
+
+No new messages need to be defined.
+
+### On OM
+
+No new OM handlers are needed. The existing OpenKey and CommitKey handlers 
will receive the new expectedGeneration and perform the checks.
+
+No new locks are needed on OM. As part of the openKey and commitKey, there are 
existing locks taken to ensure the key open / commit is atomic. The new checks 
are performed under those locks, and come down to a couple of long comparisons, 
so add negligible overhead.
+
+### On The Client
+
+ 1. We need to allow the updateID (called generation on the client) of an 
existing key to be accessible when an existing details are read, by adding it 
to OzoneKey and OzoneKeyDetails. There are internal object changes and do no 
impact any APIs.
+ 2. To pass the expectedGeneration to OM on key open, it would be possible to 
overload the existing OzoneBucket.createKey() method, which already has several 
overloaded versions, or create a new explicit method on Ozone bucket called 
rewriteKey, passing the expectedGeneration, eg:
+ 
+ ```
+ public OzoneOutputStream rewriteKey(String volumeName, String bucketName,
+     String keyName, long size, long expectedGeneration,
+     ReplicationConfig replicationConfigOfNewKey)
+     throws IOException
+
+ // Can also add an overloaded version of this method to pass a metadata map,
+ // as with the existing create key method.
+ ```
+This specification is roughly in line with the existing createKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig,
+      Map<String, String> metadata)
+```
+
+An alternative is to create a new overloaded createKey, but it is probably 
less confusing to have the new rewriteKey method:
+
+```
+  public OzoneOutputStream createKey(
+      String volumeName, String bucketName, String keyName, long size,
+      ReplicationConfig replicationConfig, long expectedUpdateID)
+```
+
+The intended usage of this API is that the existing key details are read, 
then used to open the new key, and then data is written, e.g.:
+
+```
+OzoneKeyDetails existingKey = bucket.getKey(keyName);
+try (OutputStream os = bucket.rewriteKey(existingKey.getVolume(),
+    existingKey.getBucket(), existingKey.getKeyName(), existingKey.getSize(),
+    existingKey.getGeneration(), newRepConfig)) {
+  os.write(bucket.readKey(keyName));
+}
+```
+
+## Upgrade and Compatibility
+
+If a newer client is talking to an older server, it could call the new atomic 
API but the server will ignore it without error. This is the case for any API 
change.
+
+There are no changes to protobuf methods.

Review Comment:
   What do you mean by "protobuf methods"? The new protobuf fields will cause 
protobuf to generate new methods. Do you mean there are no new methods that 
take protobuf parameters, as in no new APIs? Is this referring to OM to DB 
protocol?


