[jira] [Commented] (NIFI-6166) Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor

Adam Fisher (JIRA) Sat, 30 Mar 2019 09:27:06 -0700


    [ 
https://issues.apache.org/jira/browse/NIFI-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16805873#comment-16805873
 ]


Adam Fisher commented on NIFI-6166:
-----------------------------------

I would be open to discussion on whether this _*should*_ be implemented. It is 
a nice-to-have, but I am not sure the extra flexibility it gives users for 
managing memory usage is worth the additional complexity. The new filter type 
would only add another way to store the filtered properties: one entry per 
record in the distributed cache. That may inadvertently shorten the lifetime of 
cache entries if the number of records exceeds what the distributed cache 
server's rotation policy retains, which would affect other processors relying 
on the same cache. After thinking this through, I'm leaning toward saying this 
should not be implemented.
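
To illustrate the eviction concern, here is a minimal sketch. It does not use 
NiFi's cache server; `BoundedCacheSketch` is a hypothetical in-memory stand-in 
that assumes a fixed maximum entry count and oldest-first eviction. It shows 
how one entry per record can push out an unrelated entry that another 
processor stored in the same cache.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BoundedCacheSketch {

    // Hypothetical stand-in for a size-limited shared cache.
    // Assumption: once maxEntries is exceeded, the oldest entry is evicted (FIFO).
    static Map<String, String> newBoundedCache(final int maxEntries) {
        return new LinkedHashMap<String, String>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Stores one entry for another processor, then one entry per record,
    // and reports whether the other processor's entry is still present.
    static boolean otherEntrySurvives(int maxEntries, int recordCount) {
        Map<String, String> cache = newBoundedCache(maxEntries);
        cache.put("other-processor-state", "value");
        for (int i = 0; i < recordCount; i++) {
            cache.put("dedupe-" + i, "");
        }
        return cache.containsKey("other-processor-state");
    }

    public static void main(String[] args) {
        System.out.println(otherEntrySurvives(1000, 500));
        System.out.println(otherEntrySurvives(1000, 5000));
    }
}
```

With a limit of 1000 entries, the unrelated entry survives 500 records but is 
evicted well before 5000 records have been stored.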

> Add `Distributed HashSet Filter` Type to DetectDuplicateRecord Processor
> ------------------------------------------------------------------------
>
>                 Key: NIFI-6166
>                 URL: https://issues.apache.org/jira/browse/NIFI-6166
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>            Reporter: Adam Fisher
>            Priority: Minor
>              Labels: features
>
> Currently the *DetectDuplicateRecord* processor supports *HASH_SET_VALUE* and 
> *BLOOM_FILTER_VALUE*. Adding *DISTRIBUTED_HASH_SET_VALUE* as a third filter 
> type could be useful when you want to check a large dataset for duplicates 
> without loading all the cached entries into memory:
> {code:java}
>     static final AllowableValue DISTRIBUTED_HASH_SET_VALUE = new AllowableValue(
>             "distributed-hash-set", "Distributed HashSet",
>             "Exactly matches records seen before with 100% accuracy at the expense of more storage usage. "
>             + "Stores one entry per record in the distributed cache, and checks the cache directly "
>             + "rather than loading the filter into memory during duplicate detection. "
>             + "This filter is preferred when processing large data sets and complete accuracy is preferred.");
> {code}
> When the user selects this filter type, the cache entry identifier should 
> probably be treated as a prefix, so the keys of entries in the cache would 
> look like this:
> {code:java}
> CacheKey = CacheEntryIdentifier + Hash(RecordPath1 + "~" + RecordPath2 + "~" + RecordPath3 + "~" + ...)
> {code}
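
The key scheme above could be sketched roughly as follows. This is only an 
illustration under assumptions: the issue does not name a hash function, so 
SHA-256 is assumed here, and the class and method names (`CacheKeySketch`, 
`hashValues`, `cacheKey`) are hypothetical, not part of NiFi.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

class CacheKeySketch {

    // Joins the evaluated RecordPath values with "~" and returns the
    // hex-encoded SHA-256 digest (the hash choice is an assumption).
    static String hashValues(List<String> recordPathValues) {
        String joined = String.join("~", recordPathValues);
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] bytes = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Treats the cache entry identifier as a prefix of the per-record key.
    static String cacheKey(String cacheEntryIdentifier, List<String> recordPathValues) {
        return cacheEntryIdentifier + hashValues(recordPathValues);
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        System.out.println(cacheKey("dedupe-", Arrays.asList("Jane", "Doe", "1990-01-01")));
    }
}
```

One caveat with a plain "~" join: values that themselves contain "~" can 
collide (for example "a~b", "c" and "a", "b~c" produce the same joined 
string), so a real implementation might need to escape or length-prefix the 
values.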



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
