[GitHub] [nifi] adamfisher commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

GitBox Mon, 26 Oct 2020 11:03:21 -0700


adamfisher commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-716726587



   The cache key identifier could identify the grouping of the data set and 
also serve as a key prefix. It sounds like it's just a matter of how we store 
record-level hashes. The Bloom filter, stored in a single cache record, is 
necessary, but its size is fixed regardless of the data set size; it is driven 
only by the capacity set on the Bloom filter during initial configuration. We 
could cap that value at a maximum to help protect the user from corruption. 
Storing individual hash sets for each record is obviously already an option 
with this implementation, but it's good to be able to use Bloom filters because 
I think there will be scenarios where people don't care about exactness in 
unique records, since it's primarily used to eliminate a lot of duplicate data 
processing. 
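
   To illustrate the fixed-size point, here is a rough sketch of a Bloom 
filter whose bit count is derived once from the configured capacity and never 
grows as records are added. The class name `SimpleBloomFilter`, the hashing 
scheme, and the parameter names are all hypothetical and are not the 
processor's actual code; a real implementation would more likely rely on an 
existing library.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.zip.CRC32;

/** Hypothetical sketch only; not the DetectDuplicateRecord implementation. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    /** Size is computed once from capacity and target false-positive rate. */
    SimpleBloomFilter(int capacity, double falsePositiveRate) {
        double ln2 = Math.log(2);
        // Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        this.numBits = Math.max(1,
                (int) Math.ceil(-capacity * Math.log(falsePositiveRate) / (ln2 * ln2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / capacity * ln2));
        this.bits = new BitSet(numBits);
    }

    /** Double hashing: derive the i-th bit index from two base hashes. */
    private int index(String recordHash, int i) {
        int h1 = recordHash.hashCode();
        CRC32 crc = new CRC32();
        crc.update(recordHash.getBytes(StandardCharsets.UTF_8));
        int h2 = (int) crc.getValue();
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Marks a record hash as seen. */
    void put(String recordHash) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(recordHash, i));
        }
    }

    /** True means "possibly seen" (false positives can occur); false means "never seen". */
    boolean mightContain(String recordHash) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(recordHash, i))) {
                return false;
            }
        }
        return true;
    }

    /** Bit count of the filter; depends only on the constructor arguments. */
    int sizeInBits() {
        return numBits;
    }
}
```

   Because `sizeInBits()` depends only on the constructor arguments, a maximum 
on the configured capacity bounds the size of the single cache record that 
holds the filter. The trade-off is that `mightContain` can occasionally report 
a record as seen when it was not.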


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
