[ 
https://issues.apache.org/jira/browse/HDDS-15125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui Fei updated HDDS-15125:
---------------------------
    Target Version/s: 2.3.0  (was: 2.2.0)

> MVCC-Style Snapshot Reclaimability Using seqNumMin / seqNumMax
> --------------------------------------------------------------
>
>                 Key: HDDS-15125
>                 URL: https://issues.apache.org/jira/browse/HDDS-15125
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: OM, Snapshot
>            Reporter: Chu Cheng Li
>            Assignee: Chu Cheng Li
>            Priority: Major
>
> h2. Summary
> Refactor snapshot key reclamation so newly written deleted-key versions can 
> be evaluated by MVCC-style visibility intervals instead of opening previous 
> snapshot RocksDB instances and comparing key/block state.
> A key version is visible to a snapshot when:
> {code:java}
> seqNumMin <= snapshotCreateSeqNum < seqNumMax
> {code}
> A deleted key version is reclaimable when no active snapshot for that bucket 
> has a create sequence number inside that interval.
> This is a phase-1 optimization for {{deletedTable}} key versions only. 
> Existing snapshot deletion movement remains unchanged, and legacy entries 
> without sequence interval metadata continue using the current 
> {{ReclaimableKeyFilter}} fallback.
> h2. Schema And Data Model
> ||Table||Value Type||Change||
> |{{snapshotInfoTable}}|{{SnapshotInfo}}|No new field. Use existing 
> {{createTransactionInfo}} as the snapshot create sequence source. Do not 
> reuse deprecated {{{}dbTxSequenceNumber{}}}.|
> |{{keyTable}} / {{fileTable}}|{{KeyInfo}} via {{OmKeyInfo}}|Add optional 
> {{{}uint64 seqNumMin = 24{}}}. Active key versions do not set 
> {{{}seqNumMax{}}}.|
> |{{deletedTable}}|{{RepeatedKeyInfo}} containing {{KeyInfo}}|Store 
> {{seqNumMin}} and {{seqNumMax}} on each contained {{{}KeyInfo{}}}. Add 
> optional {{{}uint64 seqNumMax = 25{}}}.|
> |{{deletedDirTable}}|{{OmKeyInfo}}|No v1 schema behavior change beyond fields 
> naturally existing on {{{}KeyInfo{}}}; directory interval optimization is out 
> of v1.|
> |{{snapshotRenamedTable}}|{{String}}|No change in v1. Keep current 
> previous-snapshot lookup logic for rename entries.|
> Add nullable boxed fields to {{OmKeyInfo}} and its builder:
> {code:java}
> private Long seqNumMin;
> private Long seqNumMax;
> {code}
> Serialize only when non-null. Missing fields mean “not eligible for interval 
> optimization”.
> h2. Fill Rules
> Use OM/Ratis transaction index as the sequence number.
> For active key versions:
>  * On new committed key creation, set {{{}seqNumMin = 
> currentTransactionIndex{}}}.
>  * On overwrite or data-changing commit, set the new key version’s 
> {{{}seqNumMin = currentTransactionIndex{}}}.
>  * On metadata-only operations such as ACL changes, mtime changes, and 
> rename, preserve existing {{{}seqNumMin{}}}.
>  * If an existing legacy key has no {{{}seqNumMin{}}}, do not infer one from 
> {{{}updateID{}}}; keep it missing so deletion falls back to the old logic.
> For deleted key versions:
>  * Centralize deleted-key preparation in 
> {{{}OmUtils.prepareKeyForDelete(...){}}}.
>  * Preserve the source key’s {{{}seqNumMin{}}}.
>  * Set {{{}seqNumMax = currentTransactionIndex{}}}, where the current 
> transaction is the delete or overwrite transaction that makes this version no 
> longer visible.
>  * For pseudo/uncommitted block reclaim records that were never visible to 
> snapshots, set {{seqNumMin == seqNumMax == currentTransactionIndex}} or keep 
> the existing unconditional reclaim marker path. The interval must be empty.
>  * When {{SnapshotDeletingService}} moves deleted entries from a deleted 
> snapshot to the next active snapshot or AOS, preserve {{seqNumMin}} and 
> {{seqNumMax}} unchanged.
> For snapshots:
>  * Continue setting {{SnapshotInfo.createTransactionInfo}} in snapshot create.
>  * Extract the transaction index from {{createTransactionInfo}} for active 
> snapshot visibility checks.
>  * If any active snapshot for a bucket lacks {{{}createTransactionInfo{}}}, 
> disable interval optimization for that bucket and use the current fallback.
> h2. Reclaim Algorithm
> Build an in-memory sorted {{long[]}} of active snapshot create sequence 
> numbers per snapshot path:
> {code:java}
> /volume/bucket -> sorted active snapshot create seq nums
> {code}
> Include only {{SNAPSHOT_ACTIVE}} snapshots. {{SNAPSHOT_DELETED}} snapshots no 
> longer protect user visibility.
> For each deleted key version with both fields present:
> {code:java}
> int pos = lowerBound(activeSnapshotSeqNums, seqNumMin);
> boolean referenced = pos < activeSnapshotSeqNums.length
>     && activeSnapshotSeqNums[pos] < seqNumMax;
> boolean reclaimable = !referenced;
> {code}
> If {{seqNumMin}} or {{seqNumMax}} is missing, or the bucket’s active snapshot 
> sequence array cannot be built exactly, use the current 
> {{{}ReclaimableKeyFilter{}}}.
> Keep the existing snapshot-chain race protection:
>  * Capture {{expectedPreviousSnapshotId}} before filtering.
>  * Validate it before submitting purge.
>  * Include it in the purge request.
>  * If a snapshot is created between filtering and purge, the purge request 
> must be rejected or skipped and retried with the new active snapshot sequence 
> array.
> h2. Complexity And Sweet Spot
> The configured default snapshot limit is currently {{{}10000{}}}, but this 
> design remains efficient at the discussed maximum of {{{}65535{}}}.
> Current key reclaimability path:
> {code:java}
> O(D * previous snapshot DB lookups + D * block comparison cost)
> {code}
> Proposed phase-1 path for interval-enabled entries:
> {code:java}
> O(S) to build or refresh the active snapshot sequence array
> O(D * log S) to evaluate deleted key versions
> {code}
> At {{{}S = 65535{}}}:
> {code:java}
> log2(S) ~= 16
> {code}
> So each optimized deleted key version needs about 16 long comparisons and no 
> previous snapshot RocksDB lookups.
> Use sorted primitive arrays and binary search as the sweet spot. Do not 
> introduce a bitmap or persistent reclaim index in v1:
>  * A {{long[]}} is exact, simple, and small.
>  * 65,535 longs is about 512 KB at the extreme.
>  * A bitmap only works cleanly for dense snapshot ordinals, not sparse OM 
> transaction indexes.
>  * A new persistent index adds migration and write-amplification cost that is 
> not needed for the first optimization.
> Snapshot deletion movement remains:
> {code:java}
> O(M)
> {code}
> where {{M}} is the number of deleted/renamed/dir entries moved. Eliminating 
> repeated movement requires a later design that stores deleted-version records 
> in a global or per-bucket deleted-version table.
> h2. Backward Compatibility And Rollout
>  * Adding optional protobuf fields {{seqNumMin = 24}} and {{seqNumMax = 25}} 
> to {{KeyInfo}} is wire-compatible.
>  * Old DB records will not have the fields and must use fallback behavior.
>  * Do not use {{0}} as a sentinel; rely on protobuf field presence.
>  * Old OMs may drop unknown fields if they rewrite a key through 
> {{{}OmKeyInfo{}}}; this is safe because missing fields trigger fallback.
>  * No in-place migration is required for v1.
>  * Add metrics for optimized vs fallback reclaimability decisions.
>  * Gate interval-based filtering behind a layout feature or feature flag so 
> it can be enabled after upgrade confidence. Field writing is safe, but the 
> optimization should remain fallback-capable at all times.
> h2. Test Plan
>  * Proto round-trip tests for {{OmKeyInfo}} with present and missing 
> {{seqNumMin}} / {{{}seqNumMax{}}}.
>  * Unit tests for key create, overwrite, delete, and metadata-only update 
> fill rules.
>  * Unit tests confirming legacy keys without {{seqNumMin}} are not inferred 
> from {{{}updateID{}}}.
>  * Reclaim filter tests for interval boundaries: before interval, inside 
> interval, at {{{}seqNumMax{}}}, empty interval, and no active snapshots.
>  * Reclaim filter test with 65,535 active snapshot sequence numbers to 
> validate binary-search behavior and memory assumptions.
>  * Fallback tests when key interval fields are missing or active snapshot 
> create sequence is missing.
>  * Snapshot deletion move tests verifying interval fields are preserved when 
> deleted entries move to the next active snapshot or AOS.
>  * Race test where a snapshot is created after filtering but before purge; 
> purge must be rejected/skipped through existing previous-snapshot validation.
>  * Integration test with snapshots around create/delete/overwrite operations 
> to verify user-visible snapshot reads remain correct and only unreferenced 
> blocks are reclaimed.
> h2. Assumptions
>  * {{seqNumMin}} and {{seqNumMax}} use OM transaction indexes, not RocksDB 
> sequence numbers. The main distinction is that OM txn id is the {*}logical 
> visibility clock{*}, while RocksDB sequence number is a {*}local physical 
> storage clock{*}. Snapshot key reclamation is about whether a user-visible 
> key version existed at a snapshot create operation, so the logical OM 
> transaction order is the better fit.
>  * {{seqNumMax}} is exclusive.
>  * V1 optimizes deleted key reclamation only.
>  * Rename entries and directory reclamation continue using existing logic in 
> v1.
>  * Existing snapshot deletion movement remains in place in v1.
>  * Missing sequence interval metadata always means fallback to current 
> behavior, never “safe to reclaim”.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to