[GitHub] [pinot] Jackie-Jiang opened a new issue, #11045: Re-design dedup to not reuse upsert mechanism

via GitHub Thu, 06 Jul 2023 13:10:31 -0700


Jackie-Jiang opened a new issue, #11045:
URL: https://github.com/apache/pinot/issues/11045


   Here are some of the main differences between dedup and upsert:
   - Dedup is done when ingesting data from the stream (it applies to the 
consuming segment only), so there is no need to track valid docs. Duplicate 
records are simply dropped.
   - A dedup window (a TTL on the metadata) is a must-have to keep the metadata 
size bounded.
   - There is no need to track the record location in the dedup metadata, but 
we do want to track the timestamp for the dedup window.
   
   One potential solution for the dedup window is to keep 2 rotating maps, each 
storing metadata for one dedup window. Once the old map is completely out of 
the dedup window, clear it and reuse it as the new map.
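
   The two-map rotation described above could be sketched roughly as follows. 
This is only an illustration under stated assumptions, not Pinot's actual code: 
the class name `RotatingDedupMetadata`, the method `checkAndAdd`, the `String` 
primary key, and the millisecond timestamps are all hypothetical, and the sketch 
assumes timestamps arrive in roughly increasing order and that a single thread 
consumes the stream.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of dedup metadata kept in two rotating maps. */
class RotatingDedupMetadata {
  private final long windowMs;
  // Each map holds primaryKey -> first-seen timestamp for one dedup window.
  private Map<String, Long> current = new HashMap<>();
  private Map<String, Long> previous = new HashMap<>();
  private long currentWindowStartMs;

  RotatingDedupMetadata(long windowMs, long startMs) {
    this.windowMs = windowMs;
    this.currentWindowStartMs = startMs;
  }

  /** Returns true if the record should be ingested, false if it is a duplicate to drop. */
  boolean checkAndAdd(String primaryKey, long timestampMs) {
    rotateIfNeeded(timestampMs);
    if (isWithinWindow(current.get(primaryKey), timestampMs)
        || isWithinWindow(previous.get(primaryKey), timestampMs)) {
      return false;
    }
    // No record location is stored, only the timestamp.
    current.put(primaryKey, timestampMs);
    return true;
  }

  private boolean isWithinWindow(Long seenMs, long nowMs) {
    return seenMs != null && nowMs - seenMs < windowMs;
  }

  private void rotateIfNeeded(long timestampMs) {
    long elapsed = timestampMs - currentWindowStartMs;
    if (elapsed < windowMs) {
      return;
    }
    if (elapsed >= 2 * windowMs) {
      // Both maps are entirely outside the dedup window.
      current.clear();
      previous.clear();
      currentWindowStartMs = timestampMs - (elapsed % windowMs);
      return;
    }
    // The old map is completely out of the window: clear it and reuse it as the new map.
    Map<String, Long> recycled = previous;
    recycled.clear();
    previous = current;
    current = recycled;
    currentWindowStartMs += windowMs;
  }

  /** Total number of keys currently tracked across both maps. */
  int size() {
    return current.size() + previous.size();
  }
}
```

   With this layout the metadata never covers more than two windows of keys, 
and expiring a whole window costs one `clear()` call instead of per-key TTL 
bookkeeping. The stored timestamp is what keeps the check exact for keys that 
sit in the older map.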


