[
https://issues.apache.org/jira/browse/HDDS-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087188#comment-18087188
]
Hui Fei commented on HDDS-15125:
--------------------------------
<<Bulk update>>
Apache Ozone 2.2.0 release is in progress. I'm updating all unresolved jiras
targeting 2.2.0 to retarget 2.3.0.
> MVCC-Style Snapshot Reclaimability Using seqNumMin / seqNumMax
> --------------------------------------------------------------
>
> Key: HDDS-15125
> URL: https://issues.apache.org/jira/browse/HDDS-15125
> Project: Apache Ozone
> Issue Type: Improvement
> Components: OM, Snapshot
> Reporter: Chu Cheng Li
> Assignee: Chu Cheng Li
> Priority: Major
>
> h2. Summary
> Refactor snapshot key reclamation so newly written deleted-key versions can
> be evaluated by MVCC-style visibility intervals instead of opening previous
> snapshot RocksDB instances and comparing key/block state.
> A key version is visible to a snapshot when:
> {code:java}
> seqNumMin <= snapshotCreateSeqNum < seqNumMax
> {code}
> A deleted key version is reclaimable when no active snapshot for that bucket
> has a create sequence number inside that interval.
> This is a phase-1 optimization for {{deletedTable}} key versions only.
> Existing snapshot deletion movement remains unchanged, and legacy entries
> without sequence interval metadata continue using the current
> {{ReclaimableKeyFilter}} fallback.
> h2. Schema And Data Model
> ||Table||Value Type||Change||
> |{{snapshotInfoTable}}|{{SnapshotInfo}}|No new field. Use existing
> {{createTransactionInfo}} as the snapshot create sequence source. Do not
> reuse deprecated {{{}dbTxSequenceNumber{}}}.|
> |{{keyTable}} / {{fileTable}}|{{KeyInfo}} via {{OmKeyInfo}}|Add optional
> {{{}uint64 seqNumMin = 24{}}}. Active key versions do not set
> {{{}seqNumMax{}}}.|
> |{{deletedTable}}|{{RepeatedKeyInfo}} containing {{KeyInfo}}|Store
> {{seqNumMin}} and {{seqNumMax}} on each contained {{{}KeyInfo{}}}. Add
> optional {{{}uint64 seqNumMax = 25{}}}.|
> |{{deletedDirTable}}|{{OmKeyInfo}}|No v1 schema behavior change beyond fields
> naturally existing on {{{}KeyInfo{}}}; directory interval optimization is out
> of v1.|
> |{{snapshotRenamedTable}}|{{String}}|No change in v1. Keep current
> previous-snapshot lookup logic for rename entries.|
> Add nullable boxed fields to {{OmKeyInfo}} and its builder:
> {code:java}
> private Long seqNumMin;
> private Long seqNumMax;
> {code}
> Serialize only when non-null. Missing fields mean “not eligible for interval
> optimization”.
> h2. Fill Rules
> Use OM/Ratis transaction index as the sequence number.
> For active key versions:
> * On new committed key creation, set {{{}seqNumMin =
> currentTransactionIndex{}}}.
> * On overwrite or data-changing commit, set the new key version’s
> {{{}seqNumMin = currentTransactionIndex{}}}.
> * On metadata-only operations such as ACL changes, mtime changes, and
> rename, preserve existing {{{}seqNumMin{}}}.
> * If an existing legacy key has no {{{}seqNumMin{}}}, do not infer one from
> {{{}updateID{}}}; keep it missing so deletion falls back to the old logic.
> For deleted key versions:
> * Centralize deleted-key preparation in
> {{{}OmUtils.prepareKeyForDelete(...){}}}.
> * Preserve the source key’s {{{}seqNumMin{}}}.
> * Set {{{}seqNumMax = currentTransactionIndex{}}}, where the current
> transaction is the delete or overwrite transaction that makes this version no
> longer visible.
> * For pseudo/uncommitted block reclaim records that were never visible to
> snapshots, set {{seqNumMin == seqNumMax == currentTransactionIndex}} or keep
> the existing unconditional reclaim marker path. The interval must be empty.
> * When {{SnapshotDeletingService}} moves deleted entries from a deleted
> snapshot to the next active snapshot or AOS, preserve {{seqNumMin}} and
> {{seqNumMax}} unchanged.
> For snapshots:
> * Continue setting {{SnapshotInfo.createTransactionInfo}} in snapshot create.
> * Extract the transaction index from {{createTransactionInfo}} for active
> snapshot visibility checks.
> * If any active snapshot for a bucket lacks {{{}createTransactionInfo{}}},
> disable interval optimization for that bucket and use the current fallback.
> h2. Reclaim Algorithm
> Build an in-memory sorted {{long[]}} of active snapshot create sequence
> numbers per snapshot path:
> {code:java}
> /volume/bucket -> sorted active snapshot create seq nums
> {code}
> Include only {{SNAPSHOT_ACTIVE}} snapshots. {{SNAPSHOT_DELETED}} snapshots no
> longer protect user visibility.
> For each deleted key version with both fields present:
> {code:java}
> int pos = lowerBound(activeSnapshotSeqNums, seqNumMin);
> boolean referenced = pos < activeSnapshotSeqNums.length
> && activeSnapshotSeqNums[pos] < seqNumMax;
> boolean reclaimable = !referenced;
> {code}
> If {{seqNumMin}} or {{seqNumMax}} is missing, or the bucket’s active snapshot
> sequence array cannot be built exactly, use the current
> {{{}ReclaimableKeyFilter{}}}.
> Keep the existing snapshot-chain race protection:
> * Capture {{expectedPreviousSnapshotId}} before filtering.
> * Validate it before submitting purge.
> * Include it in the purge request.
> * If a snapshot is created between filtering and purge, the purge request
> must be rejected or skipped and retried with the new active snapshot sequence
> array.
> h2. Complexity And Sweet Spot
> The configured default snapshot limit is currently {{{}10000{}}}, but this
> design remains efficient at the discussed maximum of {{{}65535{}}}.
> Current key reclaimability path:
> {code:java}
> O(D * previous snapshot DB lookups + D * block comparison cost)
> {code}
> Proposed phase-1 path for interval-enabled entries:
> {code:java}
> O(S) to build or refresh the active snapshot sequence array
> O(D * log S) to evaluate deleted key versions
> {code}
> At {{{}S = 65535{}}}:
> {code:java}
> log2(S) ~= 16
> {code}
> So each optimized deleted key version needs about 16 long comparisons and no
> previous snapshot RocksDB lookups.
> Use sorted primitive arrays and binary search as the sweet spot. Do not
> introduce a bitmap or persistent reclaim index in v1:
> * A {{long[]}} is exact, simple, and small.
> * 65,535 longs is about 512 KB at the extreme.
> * A bitmap only works cleanly for dense snapshot ordinals, not sparse OM
> transaction indexes.
> * A new persistent index adds migration and write-amplification cost that is
> not needed for the first optimization.
> Snapshot deletion movement remains:
> {code:java}
> O(M)
> {code}
> where {{M}} is the number of deleted/renamed/dir entries moved. Eliminating
> repeated movement requires a later design that stores deleted-version records
> in a global or per-bucket deleted-version table.
> h2. Backward Compatibility And Rollout
> * Adding optional protobuf fields {{seqNumMin = 24}} and {{seqNumMax = 25}}
> to {{KeyInfo}} is wire-compatible.
> * Old DB records will not have the fields and must use fallback behavior.
> * Do not use {{0}} as a sentinel; rely on protobuf field presence.
> * Old OMs may drop unknown fields if they rewrite a key through
> {{{}OmKeyInfo{}}}; this is safe because missing fields trigger fallback.
> * No in-place migration is required for v1.
> * Add metrics for optimized vs fallback reclaimability decisions.
> * Gate interval-based filtering behind a layout feature or feature flag so
> it can be enabled after upgrade confidence. Field writing is safe, but the
> optimization should remain fallback-capable at all times.
> h2. Test Plan
> * Proto round-trip tests for {{OmKeyInfo}} with present and missing
> {{seqNumMin}} / {{{}seqNumMax{}}}.
> * Unit tests for key create, overwrite, delete, and metadata-only update
> fill rules.
> * Unit tests confirming legacy keys without {{seqNumMin}} are not inferred
> from {{{}updateID{}}}.
> * Reclaim filter tests for interval boundaries: before interval, inside
> interval, at {{{}seqNumMax{}}}, empty interval, and no active snapshots.
> * Reclaim filter test with 65,535 active snapshot sequence numbers to
> validate binary-search behavior and memory assumptions.
> * Fallback tests when key interval fields are missing or active snapshot
> create sequence is missing.
> * Snapshot deletion move tests verifying interval fields are preserved when
> deleted entries move to the next active snapshot or AOS.
> * Race test where a snapshot is created after filtering but before purge;
> purge must be rejected/skipped through existing previous-snapshot validation.
> * Integration test with snapshots around create/delete/overwrite operations
> to verify user-visible snapshot reads remain correct and only unreferenced
> blocks are reclaimed.
> h2. Assumptions
> * {{seqNumMin}} and {{seqNumMax}} use OM transaction indexes, not RocksDB
> sequence numbers. The main distinction is that OM txn id is the {*}logical
> visibility clock{*}, while RocksDB sequence number is a {*}local physical
> storage clock{*}. Snapshot key reclamation is about whether a user-visible
> key version existed at a snapshot create operation, so the logical OM
> transaction order is the better fit.
> * {{seqNumMax}} is exclusive.
> * V1 optimizes deleted key reclamation only.
> * Rename entries and directory reclamation continue using existing logic in
> v1.
> * Existing snapshot deletion movement remains in place in v1.
> * Missing sequence interval metadata always means fallback to current
> behavior, never “safe to reclaim”.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]