Adam Fisher created NIFI-6166:
---------------------------------

             Summary: Add `Distributed HashSet Filter` Type to 
DetectDuplicateRecord Processor
                 Key: NIFI-6166
                 URL: https://issues.apache.org/jira/browse/NIFI-6166
             Project: Apache NiFi
          Issue Type: Improvement
          Components: Core Framework
            Reporter: Adam Fisher


Currently the *DetectDuplicateRecord* processor supports *HASH_SET_VALUE* and
*BLOOM_FILTER_VALUE*. Adding *DISTRIBUTED_HASH_SET_VALUE* as a third filter type
would be useful for large datasets that you want to check for duplicates without
loading all the cached entries into memory:

{code:java}
    static final AllowableValue DISTRIBUTED_HASH_SET_VALUE = new AllowableValue("distributed-hash-set", "Distributed HashSet",
            "Exactly matches records seen before with 100% accuracy at the expense of more storage usage. " +
            "Stores one entry per record in the distributed cache, and checks the cache directly rather than " +
            "loading the filter into memory during duplicate detection. " +
            "This filter is preferred when processing large data sets and complete accuracy is required.");
{code}

When the user selects this filter type, the cache entry identifier should
probably be treated as a prefix, so the cache entry keys would look like this:

{code:java}
CacheKey = CacheEntryIdentifier + Hash(RecordPath1 + "~" + RecordPath2 + "~" + RecordPath3 + "~" + ...)
{code}
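A minimal, self-contained sketch of how such a key could be derived. The `cacheKey` helper name and the choice of SHA-256 are illustrative assumptions, not part of the processor:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class CacheKeyExample {

    // Hypothetical helper: prefix the configured cache entry identifier to a
    // hash of the evaluated record path values, joined with "~" as above.
    static String cacheKey(String cacheEntryIdentifier, List<String> recordPathValues) {
        try {
            final String joined = String.join("~", recordPathValues);
            // SHA-256 is an assumed hash choice for the sketch; any stable
            // digest of the joined values would serve the same purpose.
            final MessageDigest digest = MessageDigest.getInstance("SHA-256");
            final byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            final StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return cacheEntryIdentifier + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Identical record path values always map to the same cache key,
        // so a putIfAbsent-style check against the distributed cache is enough
        // to detect a duplicate without loading a filter into memory.
        System.out.println(cacheKey("dedupe-", List.of("user-123", "2019-03-29")));
    }
}
```

Because each record maps to one cache entry, duplicate detection becomes a direct key-existence check against the distributed cache, which is what gives the 100% accuracy at the cost of storage.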



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)