jadami10 opened a new pull request, #17254:
URL: https://github.com/apache/pinot/pull/17254

   This adds a new table config option for deduplication that allows near 
realtime cleanup of metadata keys. When `realtimeTTLCleanupIntervalSeconds` > 0, 
we start a background thread per partition that calls 
`removeExpiredPrimaryKeys` at that interval, removing every key that is outside 
of the `metadataTTL`.
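
   As a rough illustration of the mechanism described above, here is a minimal, 
self-contained sketch of a per-partition periodic cleanup. It is not the code in 
this PR: `PartitionDedupState`, `record`, and `startCleanup` are hypothetical 
names, and the watermark rule (largest seen time minus TTL) is an assumption 
made for the example. Only `removeExpiredPrimaryKeys`, `metadataTTL`, and the 
"interval > 0 enables the thread" behavior come from the description.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class TtlCleanupSketch {

  /** Hypothetical stand-in for the dedup metadata of one partition. */
  static class PartitionDedupState {
    private final Map<String, Long> keyToTimeMs = new ConcurrentHashMap<>();
    private final AtomicLong largestSeenTimeMs = new AtomicLong(0L); // epoch millis assumed
    private final long metadataTtlMs;

    PartitionDedupState(long metadataTtlMs) {
      this.metadataTtlMs = metadataTtlMs;
    }

    /** Remembers a primary key and advances the largest seen time. */
    void record(String primaryKey, long timeMs) {
      keyToTimeMs.put(primaryKey, timeMs);
      largestSeenTimeMs.accumulateAndGet(timeMs, Math::max);
    }

    /** Removes keys older than (largest seen time - TTL); returns how many were removed. */
    int removeExpiredPrimaryKeys() {
      long watermark = largestSeenTimeMs.get() - metadataTtlMs;
      int removed = 0;
      Iterator<Map.Entry<String, Long>> it = keyToTimeMs.entrySet().iterator();
      while (it.hasNext()) {
        if (it.next().getValue() < watermark) {
          it.remove();
          removed++;
        }
      }
      return removed;
    }

    int size() {
      return keyToTimeMs.size();
    }
  }

  /** Starts the background cleanup, or returns null when the interval disables it. */
  static ScheduledExecutorService startCleanup(PartitionDedupState state, long intervalSeconds) {
    if (intervalSeconds <= 0) {
      return null; // feature disabled
    }
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "dedup-ttl-cleanup");
      t.setDaemon(true);
      return t;
    });
    executor.scheduleWithFixedDelay(
        state::removeExpiredPrimaryKeys, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    return executor;
  }

  public static void main(String[] args) {
    PartitionDedupState state = new PartitionDedupState(TimeUnit.MINUTES.toMillis(10));
    state.record("a", 0L);
    state.record("b", 700_000L);
    System.out.println("removed=" + state.removeExpiredPrimaryKeys() + " remaining=" + state.size());
  }
}
```

   A fixed delay (rather than a fixed rate) is used here so that a slow cleanup 
pass cannot cause runs to pile up; whether the PR makes the same choice is not 
stated in the description.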
   
   The use case for this feature is when `metadataTTL` << segment flush 
threshold. With high throughput topics, we may want just a few minutes of 
deduplication to handle transient, upstream duplicates, but we do not want the 
memory overhead of hours of keys. It is safe to repeatedly call 
`removeExpiredPrimaryKeys` because deduplication in the ingestion path already 
respects the TTL even if the key is already in the dictionary.
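
   To make the safety argument concrete, here is a small hypothetical model (not 
Pinot's actual ingestion code) in which the duplicate check ignores any stored 
key whose timestamp is already outside the TTL. Under that assumption, running 
the cleanup zero, one, or many times cannot change an ingestion decision. The 
class name, `checkPresentOrUpdate`, and the watermark rule are all illustrative 
assumptions.

```java
import java.util.HashMap;
import java.util.Map;

public class TtlAwareDedupSketch {
  private final Map<String, Long> keyToTimeMs = new HashMap<>();
  private final long ttlMs;
  private long largestSeenTimeMs = 0L; // epoch millis assumed

  TtlAwareDedupSketch(long ttlMs) {
    this.ttlMs = ttlMs;
  }

  /** Returns true if the record should be dropped as a duplicate. */
  boolean checkPresentOrUpdate(String primaryKey, long timeMs) {
    largestSeenTimeMs = Math.max(largestSeenTimeMs, timeMs);
    long watermark = largestSeenTimeMs - ttlMs;
    Long previous = keyToTimeMs.get(primaryKey);
    if (previous != null && previous >= watermark) {
      return true; // key was seen within the TTL window
    }
    // Key is absent, or present but expired: treat the record as new.
    keyToTimeMs.put(primaryKey, timeMs);
    return false;
  }

  /** Drops every key older than the watermark; safe to call repeatedly. */
  void removeExpiredPrimaryKeys() {
    long watermark = largestSeenTimeMs - ttlMs;
    keyToTimeMs.values().removeIf(t -> t < watermark);
  }

  int size() {
    return keyToTimeMs.size();
  }

  public static void main(String[] args) {
    TtlAwareDedupSketch withCleanup = new TtlAwareDedupSketch(600_000L);
    TtlAwareDedupSketch withoutCleanup = new TtlAwareDedupSketch(600_000L);
    for (TtlAwareDedupSketch d : new TtlAwareDedupSketch[] {withCleanup, withoutCleanup}) {
      d.checkPresentOrUpdate("k", 0L);             // first sighting
      d.checkPresentOrUpdate("other", 1_000_000L); // advances the watermark past "k"
    }
    withCleanup.removeExpiredPrimaryKeys();
    withCleanup.removeExpiredPrimaryKeys(); // repeated call is a no-op
    // Both instances make the same decision for the expired key.
    System.out.println(withCleanup.checkPresentOrUpdate("k", 1_000_000L)
        == withoutCleanup.checkPresentOrUpdate("k", 1_000_000L));
  }
}
```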
   
   We tested this for 12 hours, publishing ~250 random events per second with a 
10-minute `metadataTTL` and a cleanup interval of 1 minute. We republished 
those same events 3-5 minutes later, then ran `count(*)` repeatedly and 
confirmed that it stayed less than 0.01% behind the number of unique events 
published.
   
   Monitoring shows a few things:
   - the primary key count is capped
   - rows dropped start to increase a few minutes in, once republishing begins
   - expired key removal starts after ~10-11 minutes, when keys first hit the 
TTL
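
   For intuition about the numbers above, the following back-of-the-envelope 
sketch estimates the key-count ceiling and the first-removal window from the 
test parameters. It assumes a steady 250 unique keys per second across the 
table and that republished events introduce no new keys; `estimateMaxKeys` is a 
hypothetical helper, and the result is an estimate, not a measurement from the 
test.

```java
public class KeyCountEstimate {

  /** Keys alive at once: everything inside the TTL plus at most one cleanup interval of lag. */
  static long estimateMaxKeys(long uniqueKeysPerSecond, long ttlSeconds, long cleanupIntervalSeconds) {
    return uniqueKeysPerSecond * (ttlSeconds + cleanupIntervalSeconds);
  }

  public static void main(String[] args) {
    long ttlSeconds = 10 * 60;    // 10-minute metadataTTL
    long intervalSeconds = 60;    // cleanup every minute
    System.out.println(estimateMaxKeys(250, ttlSeconds, intervalSeconds)); // prints 165000

    // The first removal can only happen once a key is older than the TTL,
    // and the cleanup may lag by up to one interval: between 10 and 11 minutes.
    System.out.println(ttlSeconds / 60 + "-" + (ttlSeconds + intervalSeconds) / 60 + " minutes");
  }
}
```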
   
   <img width="2072" height="309" alt="image" 
src="https://github.com/user-attachments/assets/a80e33e4-a31c-482b-a284-0ff8fa68e6cf"
 />
   



