[ https://issues.apache.org/jira/browse/HDDS-12682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17938282#comment-17938282 ]

Peter Lee edited comment on HDDS-12682 at 3/27/25 8:19 AM:
-----------------------------------------------------------

[~swamirishi] I recall you mentioning that RocksDB 9.x includes something like 
this proposal, configurable with a few arguments. However, I couldn't find 
anything related in the RocksDB release notes. Do you remember a specific 
keyword that might help me track it down? I noticed that RocksDB has a feature 
called `CompactOnDeletionCollector` (introduced in version 4.x). Could this be 
what you were referring to?
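For reference, the C++ `CompactOnDeletionCollector` flags an SST file for compaction once some sliding window of consecutive table entries contains enough deletion entries. A minimal pure-Java model of that heuristic (my own sketch, independent of the RocksDB binding, just to pin down the semantics being discussed):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Pure-Java model of the CompactOnDeletionCollector heuristic: latch a
 * "needs compaction" flag once any sliding window of {@code windowSize}
 * consecutive entries contains at least {@code deletionTrigger} deletions.
 */
class DeletionWindow {
  private final int windowSize;
  private final int deletionTrigger;
  private final Deque<Boolean> window = new ArrayDeque<>();
  private int deletionsInWindow = 0;
  private boolean needCompaction = false;

  DeletionWindow(int windowSize, int deletionTrigger) {
    this.windowSize = windowSize;
    this.deletionTrigger = deletionTrigger;
  }

  /** Called once per table entry; isDeletion marks tombstones. */
  void add(boolean isDeletion) {
    window.addLast(isDeletion);
    if (isDeletion) {
      deletionsInWindow++;
    }
    // Evict the oldest entry once the window is full.
    if (window.size() > windowSize && window.removeFirst()) {
      deletionsInWindow--;
    }
    if (deletionsInWindow >= deletionTrigger) {
      needCompaction = true;  // latched for the rest of the file
    }
  }

  boolean needCompaction() {
    return needCompaction;
  }
}
```

Unlike the raw `num_deletion` file total, this windowed trigger reacts to locally dense runs of tombstones, which is closer to what range-delete-heavy workloads produce.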

 

I also observed that TiKV continues to maintain [manual compaction on 
ranges|#L30-L52] and is iterating on their 
[strategy|https://github.com/tikv/tikv/commits/master/components/raftstore/src/store/worker/compact.rs].
 This suggests it might still be worth exploring, especially with a 
comprehensive benchmark.
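My reading of the trigger in the linked compact.rs is roughly: estimate the number of deleted entries in a range from SST properties, and schedule a manual compaction only when that estimate exceeds both an absolute and a relative bound. A rough pure-Java rendering (names are mine, not TiKV's, and the thresholds are illustrative):

```java
/**
 * Rough rendering of a TiKV-style per-range tombstone trigger: compact a
 * key range when the estimated number of deleted entries exceeds both an
 * absolute count threshold and a percentage of all entries in the range.
 */
class RangeCompactionCheck {
  private final long tombstoneNumThreshold;
  private final long tombstonePercentThreshold;  // 0..100

  RangeCompactionCheck(long numThreshold, long percentThreshold) {
    this.tombstoneNumThreshold = numThreshold;
    this.tombstonePercentThreshold = percentThreshold;
  }

  /**
   * @param numEntries  total entries (operations) in the range's SST files
   * @param numVersions entries that are still live versions
   */
  boolean needCompact(long numEntries, long numVersions) {
    if (numEntries <= numVersions) {
      return false;
    }
    // Entries minus live versions approximates accumulated tombstones.
    long estimatedDeletes = numEntries - numVersions;
    return estimatedDeletes >= tombstoneNumThreshold
        && estimatedDeletes * 100 >= tombstonePercentThreshold * numEntries;
  }
}
```

The relative bound keeps a huge but mostly-live range from being compacted just because its absolute tombstone count is large.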

I’d love to hear your thoughts on this. Thank you!
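To make the per-bucket paging idea from the issue description below concrete, here is a minimal sketch. `BucketCompactor` is a hypothetical stand-in for the real OM/RocksDB interaction (listing buckets, checking range properties via something like `getPropertiesOfTablesInRange`, and issuing `compactRange` on a bucket prefix):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of paged per-bucket compaction: each background round examines at
 * most {@code budget} buckets and remembers where to resume next round via
 * a {@code nextBucket} cursor, so a single round never scans the whole DB.
 */
class PagedCompactionTask {
  /** Hypothetical stand-in for OM/RocksDB calls. */
  interface BucketCompactor {
    List<String> listBucketsFrom(String startBucket); // sorted, inclusive
    boolean needsCompaction(String bucket);           // e.g. deletion ratio
    void compactBucketRange(String bucket);           // compactRange(prefix)
  }

  private String nextBucket = "";  // paging cursor, persisted across rounds

  /** Runs one round; returns the buckets actually compacted. */
  List<String> runRound(BucketCompactor c, int budget) {
    List<String> compacted = new ArrayList<>();
    int examined = 0;
    for (String bucket : c.listBucketsFrom(nextBucket)) {
      if (examined++ >= budget) {
        nextBucket = bucket;       // stop; resume here next round
        return compacted;
      }
      if (c.needsCompaction(bucket)) {
        c.compactBucketRange(bucket);
        compacted.add(bucket);
      }
    }
    nextBucket = "";               // wrapped around; start over
    return compacted;
  }
}
```

The directory-table variant would add a second cursor (`next_parent_id`) inside oversized buckets, but the round/budget/cursor shape stays the same.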



> Aggressive DB Compaction with Minimal Degradation
> -------------------------------------------------
>
>                 Key: HDDS-12682
>                 URL: https://issues.apache.org/jira/browse/HDDS-12682
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: db, OM
>            Reporter: Peter Lee
>            Assignee: Peter Lee
>            Priority: Major
>
> After researching TiKV and RocksDB compaction, I have some thoughts on the 
> current OM compaction:
> 1. TiKV also runs a background task to perform compaction.
> 2. If we directly compact an entire column family (cf), it seems it would 
> impact online performance (write amplification).
> 3. We can use the built-in `TableProperties` in SST to check the 
> `num_entries` and `num_deletion` of an SST file, but these two metrics only 
> represent the number of operations and don’t deduplicate keys.
> 4. TiKV has implemented a custom `MVCTablePropertiesCollector`, which 
> includes deduplication for more accurate results. However, the Java API 
> currently doesn’t seem to support custom `TablePropertiesCollector` 💩, so 
> we’re forced to make do with the built-in statistical data.
> 5. TiKV logically splits key ranges (table regions, with a default size limit 
> of 256 MB per region), allowing it to gradually scan and compact known ranges.
>    - [Compaction key range 
> paging](https://github.com/tikv/tikv/pull/2631/files#diff-49d2597226cac1291163478f47bee5d4530bd4b9b84d322059e8afaf7dd3dedcR1896-R1938)
>    - [Check if each key range needs 
> compaction](https://github.com/tikv/tikv/pull/2631/files#diff-52d5655c2ce5a05afae67d216f55e98a1d71c971e1869628b7ebe387dda90a37R203-R217)
> If we want to apply this logic to the Ozone Manager (OM): since OM isn't a 
> distributed KV store, it has no native concept of key ranges. The only 
> logical key-range division available is the bucket prefix (file table). 
> Better still, for FSO buckets we can subdivide key ranges further by the 
> directory `parent_id`.
> So, I think for `KeyTable` compaction, we could compact each bucket 
> individually and keep track of a `next_bucket` for paging. For 
> directory-related tables, we could also compact each bucket, but if a bucket 
> turns out to be too large, we could further compact based on the ordered 
> `parent_id` key ranges. This would require two paging keys: `next_bucket` and 
> `next_parent_id`.
> - [TableProperties 
> class](https://github.com/facebook/rocksdb/blob/main/java/src/main/java/org/rocksdb/TableProperties.java#L12)
> - [`public Map<String, TableProperties> getPropertiesOfTablesInRange(final 
> ColumnFamilyHandle columnFamilyHandle, final List<Range> 
> ranges)`](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/RocksDB.java#L4575)
> - [Range 
> Class](https://github.com/facebook/rocksdb/blob/934cf2d40dc77905ec565ffec92bb54689c3199c/java/src/main/java/org/rocksdb/Range.java)
> The Java API already provides enough support to implement the ideas above.


