jadami10 opened a new pull request, #17254: URL: https://github.com/apache/pinot/pull/17254
This adds a new table config option for deduplication that enables near-real-time cleanup of metadata keys. When `realtimeTTLCleanupIntervalSeconds` > 0, we start a background thread per partition that calls `removeExpiredPrimaryKeys` to remove every key that falls outside the `metadataTTL`.

The use case for this feature is when `metadataTTL` << segment flush threshold. With high-throughput topics, we may want just a few minutes of deduplication to handle transient upstream duplicates, without the memory overhead of holding hours of keys.

It is safe to call `removeExpiredPrimaryKeys` repeatedly because deduplication in the ingestion path already respects the TTL, even if the key is still in the dictionary.

We tested this over 12 hours, publishing ~250 random events per second with a 10-minute `metadataTTL` and cleanup every 1 minute. We then republished those same events 3-5 minutes later, and ran `count(*)` repeatedly to show it was < 0.01% behind the number of unique events published.

Monitoring shows a few things:
- the primary key count is capped
- rows dropped start increasing a few minutes in, once republishing starts
- expired key removal starts after ~10-11 minutes, when keys first hit the TTL

<img width="2072" height="309" alt="image" src="https://github.com/user-attachments/assets/a80e33e4-a31c-482b-a284-0ff8fa68e6cf" />
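For readers who want a concrete picture of the mechanism described above, here is a minimal, self-contained sketch. It is not the code in this PR and not Pinot's dedup metadata manager: the class `TtlDedupSketch`, its fields, and `checkAndAdd` are hypothetical names, and the use of the largest seen event time as the TTL watermark is an assumption made for illustration. Only the names `removeExpiredPrimaryKeys`, `metadataTTL`, and `realtimeTTLCleanupIntervalSeconds` come from the description.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of per-partition TTL-based dedup with periodic cleanup.
class TtlDedupSketch implements AutoCloseable {
  private final Map<String, Long> keyToTime = new ConcurrentHashMap<>();
  // Assumption: the TTL window is measured back from the largest event time seen.
  private final AtomicLong largestSeenTime = new AtomicLong(Long.MIN_VALUE);
  private final long metadataTtlMillis;
  private final ScheduledExecutorService cleaner; // null when cleanup is disabled

  TtlDedupSketch(long metadataTtlMillis, long cleanupIntervalSeconds) {
    this.metadataTtlMillis = metadataTtlMillis;
    if (cleanupIntervalSeconds > 0) {
      // One daemon thread for this partition, mirroring the "> 0 enables it" rule.
      cleaner = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "dedup-ttl-cleanup");
        t.setDaemon(true);
        return t;
      });
      cleaner.scheduleWithFixedDelay(this::removeExpiredPrimaryKeys,
          cleanupIntervalSeconds, cleanupIntervalSeconds, TimeUnit.SECONDS);
    } else {
      cleaner = null;
    }
  }

  /** Ingestion path: returns true if the row is a duplicate and should be dropped. */
  boolean checkAndAdd(String primaryKey, long eventTimeMillis) {
    long watermark = largestSeenTime.accumulateAndGet(eventTimeMillis, Math::max);
    long cutoff = watermark - metadataTtlMillis;
    if (eventTimeMillis < cutoff) {
      return false; // row itself is outside the TTL: ingest it without tracking
    }
    boolean[] duplicate = {false};
    keyToTime.compute(primaryKey, (key, previous) -> {
      if (previous != null && previous >= cutoff) {
        duplicate[0] = true; // a live entry exists inside the TTL window
        return previous;
      }
      // Absent, or present but expired: an expired entry is treated as absent,
      // which is why cleanup can lag or repeat without affecting results.
      return eventTimeMillis;
    });
    return duplicate[0];
  }

  /** Idempotent: drops every key whose time is older than the TTL cutoff. */
  void removeExpiredPrimaryKeys() {
    long watermark = largestSeenTime.get();
    if (watermark == Long.MIN_VALUE) {
      return; // nothing ingested yet
    }
    long cutoff = watermark - metadataTtlMillis;
    keyToTime.values().removeIf(time -> time < cutoff);
  }

  int trackedKeyCount() {
    return keyToTime.size();
  }

  @Override
  public void close() {
    if (cleaner != null) {
      cleaner.shutdownNow();
    }
  }
}
```

In this sketch the ingestion check never relies on cleanup having run, so the background thread only bounds memory. That matches the safety argument above: calling `removeExpiredPrimaryKeys` more often changes how many keys are held, not which rows are dropped.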
