DCjanus commented on issue #1658:
URL: https://github.com/apache/kvrocks/issues/1658#issuecomment-2877126587
Thanks for your guidance, I'd like to refine `RANDOMKEY` enhancement
proposal based on @git-hulk feedback:
Our approach focuses on using SST properties for a better random starting
point, with a fallback to the current mechanism:
1. **SST Key Sampling**: We'll add a `TablePropertiesCollector` to Metadata
CF SSTs to sample keys (e.g., every 128th key).
2. **SST Selection**: We'll select an SST file weighted by its live key
count, using information from `CompactOnExpiredCollector`.
3. **Iteration**: The sampled key from the SST will be a *starting point*.
We'll then iterate a limited number of steps across the entire DB (not just
within the SST) and pick a random key from this scan. This naturally includes
MemTable data.
4. **Fallback**: If no suitable SSTs or sampling data is found (e.g., only
MemTables exist, or old SSTs), we'll revert to the existing global cursor-based
random walk.
This addresses the MemTable-only scenario and the issue of deleted keys in
SSTs by using live key estimates for SST selection and DB-wide iteration from
the chosen start point.
I believe this offers a more robust and random selection while maintaining a
practical fallback.
Here is some pseudocode for this:
```pseudocode
MAX_CANDIDATES_TARGET = 128 // Example: try to gather up to 128 candidates
function get_random_key():
// 1. Attempt to use the new SST sampling method to get a starting point
start_key_info = try_get_start_key_from_sst_sampling()
if start_key_info.is_valid:
candidate_keys = []
// 2. Initialize an iterator and seek to the determined start_key
main_iter = create_db_iterator()
main_iter.seek(start_key_info.key)
// 3. Backward iteration from the key *before* start_key_info.key
// (Clone or re-seek for backward scan to preserve main_iter's
original seek point for forward scan)
iter_backward = create_db_iterator()
iter_backward.seek_for_prev(start_key_info.key)
// 4. Fill candidate_keys via iter
while main_iter.is_valid() and candidate_keys.size() <
MAX_CANDIDATES_TARGET:
current_key = main_iter.key()
if is_valid_user_key(current_key):
candidate_keys.add(current_key)
main_iter.next()
while iter_backward.is_valid() and candidate_keys.size() <
MAX_CANDIDATES_TARGET:
current_key = iter_backward.key()
if is_valid_user_key(current_key):
candidate_keys.add(current_key)
iter_backward.prev()
// 5. Select from candidates or fallback
if candidate_keys is not empty:
return randomly_select_from(candidate_keys)
else:
// Fallback if scan (both backward and forward) yields no valid keys
return get_random_key_via_global_cursor()
else:
// Fallback if try_get_start_key_from_sst_sampling itself failed
return get_random_key_via_global_cursor()
// function try_get_start_key_from_sst_sampling():
// (This function remains the same as previously defined:
// - Gets Metadata CF SSTs.
// - Weights them by live key estimate from CompactOnExpiredCollector.
// - Selects an SST.
// - Gets sampled keys from its properties via new
TablePropertiesCollector.
// - Returns a random sampled key or INVALID_START_KEY_INFO if any step
fails.)
// Helper functions:
// is_valid_user_key(key): Checks expiration, type, etc.
// get_random_key_via_global_cursor(): Represents the current RANDOMKEY
logic.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]