Hi Ozone dev,

I proposed a fix for HDDS-5905 a while ago. Now that our cluster has
become stable after some work, I have time to resume my work on
HDDS-5905, and I am facing a design decision on key formatting again,
as I have learned more about Ozone internals.

Bharat once advised me [1] to use object IDs instead of the
transaction index (and instead of timestamps), to handle restarts and
cluster upgrades to Ratis. But that has a drawback on object overwrite,
so I came up with another design choice. The options are:

1. Use object IDs as a key in the delete table
Pros: object IDs are used consistently in OM and are easy to pick up in
a RocksDB batch.
Cons:
 - When an object is overwritten, the object ID of the key is not
   updated, while the previous blocks of the overwritten key become
   eligible for deletion (see HDDS-5461 and HDDS-5656). Under this
   condition, there is a race where blocks get lost and are never
   collected. An example scenario:

key open  oid=1
key commit
key open (overwrite) oid=1’  #<= oid must be updated on overwrite, or use update id
key delete oid=1
key commit
key delete oid=1’ (<= overwritten and previous block gets leaked)
deletion service deletes 1’

   This behavior should be changed to assign a new oid=2 on overwrite.
 - Even with this fix, blocks are deleted in the order of key open,
   not in the order of key deletion. That is better than alphabetical
   order, but not perfect.
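
To illustrate the collision in the scenario above, here is a toy sketch.
It is not Ozone code: DeleteTableSketch, markDeleted and run are
hypothetical names, and modelling the delete table as a map where a
second put with the same ID replaces the first entry is my assumption
about how the leak happens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DeleteTableSketch {
    // Hypothetical stand-in for the delete table: ID -> blocks awaiting deletion.
    private final Map<Long, List<String>> table = new TreeMap<>();

    // Assumption: an entry with the same ID silently replaces the earlier one.
    void markDeleted(long id, List<String> blocks) {
        table.put(id, new ArrayList<>(blocks));
    }

    // Blocks the deletion service would still be able to find and collect.
    List<String> collectible() {
        List<String> all = new ArrayList<>();
        for (List<String> blocks : table.values()) {
            all.addAll(blocks);
        }
        return all;
    }

    // Replays the scenario; newOidOnOverwrite selects the proposed fix.
    static List<String> run(boolean newOidOnOverwrite) {
        DeleteTableSketch deleteTable = new DeleteTableSketch();
        long firstOid = 1L;
        long overwriteOid = newOidOnOverwrite ? 2L : firstOid;

        // The overwrite makes the first version's blocks eligible for deletion.
        deleteTable.markDeleted(firstOid, Arrays.asList("block-A"));
        // The overwritten key is then deleted as well.
        deleteTable.markDeleted(overwriteOid, Arrays.asList("block-B"));
        return deleteTable.collectible();
    }

    public static void main(String[] args) {
        System.out.println("same oid on overwrite: " + run(false));
        System.out.println("new oid on overwrite:  " + run(true));
    }
}
```

With the same oid, only block-B stays collectible and block-A leaks.
With a new oid on overwrite, both entries survive.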

2. Use update IDs as a key in the delete table
Pros: The design is cleaner and the order of block deletion will be correct.
Cons:
 - Currently, the assignment of update IDs is somewhat inconsistent. In most
   places the raw transaction index is used, but in some places the object ID
   is used as-is, e.g. directory creation (see OMDirectoryCreateRequest.java).
 - A fix for the update ID assignment would be to prefix them with the epoch
   number, as is done for object IDs, but most of the code that sets update
   IDs would have to be changed.
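
As a rough sketch of what an epoch-prefixed update ID and an
order-preserving delete table key could look like (not the actual Ozone
code: UpdateIdKeySketch, withEpoch and deleteTableKey are hypothetical
names, and the 2-bit epoch width and the hex key layout are assumptions
for illustration only):

```java
final class UpdateIdKeySketch {
    // Assumed layout: the top 2 bits hold the epoch, the rest the transaction index.
    static final int EPOCH_BITS = 2;
    static final int INDEX_BITS = Long.SIZE - EPOCH_BITS;
    static final long INDEX_MASK = (1L << INDEX_BITS) - 1;

    // Combines an epoch and a raw transaction index into one 64-bit update ID.
    static long withEpoch(long epoch, long txIndex) {
        if (epoch < 0 || epoch >= (1L << EPOCH_BITS)) {
            throw new IllegalArgumentException("epoch out of range: " + epoch);
        }
        if (txIndex < 0 || txIndex > INDEX_MASK) {
            throw new IllegalArgumentException("txIndex out of range: " + txIndex);
        }
        return (epoch << INDEX_BITS) | txIndex;
    }

    // Fixed-width hex prefix, so byte-wise key order follows update ID order.
    // %x prints a long as an unsigned value, so IDs with the top bit set still sort last.
    static String deleteTableKey(long updateId, String keyName) {
        return String.format("%016x-%s", updateId, keyName);
    }

    public static void main(String[] args) {
        System.out.println(deleteTableKey(withEpoch(0, 9), "/vol/bucket/key"));
        System.out.println(deleteTableKey(withEpoch(1, 1), "/vol/bucket/key"));
    }
}
```

The point of the epoch prefix is that an ID assigned after the epoch
changes still sorts after every ID from the previous epoch, even if the
raw transaction index starts over from a small number.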

I feel that option 1 is easier but not quite correct, while option 2
is more correct but requires wide-ranging changes. I have updated my
proposal accordingly [2], so please let me know your thoughts on which
to choose. Also, my messy working branch can be found here [3].

P.S. My fix for HDDS-5905 conflicts with and depends on HDDS-5656,
because that one is also about key deletion and overwrite. I would like
to get HDDS-5656 reviewed and merged beforehand. It is a leftover task
from HDDS-5461 and should be merged for 1.3.

[1] https://lists.apache.org/thread/79qgx598rv3qcojmzoxhc9ypkh1jj64y
[2] https://docs.google.com/document/d/1KeyhiE1i5SqRSgLy-pIOGW9X6mUYb8iYEkEoDAEQD9Q/edit#heading=h.nqxuhw78zsv7
[3] https://github.com/kuenishi/ozone/pull/1

-- 
Kota UENISHI, Engineer
