[ 
https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831611#comment-16831611
 ] 

Joseph Witt commented on NIFI-6047:
-----------------------------------

[~mike.thomsen] and [~spiglitz] have you two reached consensus on your 
differing PRs?  Want to make sure we progress one of these...

Thanks

> Add DetectDuplicateRecord Processor
> -----------------------------------
>
>                 Key: NIFI-6047
>                 URL: https://issues.apache.org/jira/browse/NIFI-6047
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Adam Fisher
>            Assignee: Adam Fisher
>            Priority: Major
>              Labels: features
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> Add a new standard NiFi processor to supplement the DetectDuplicate 
> processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines if the cached 
> record has already been seen. The name of user-defined properties determines 
> the RecordPath values used to determine if a record is unique. If no 
> user-defined properties are present, the entire record is used as the input 
> to determine uniqueness. All duplicate records are routed to 'duplicate'. If 
> the record is not determined to be a duplicate, the Processor routes the 
> record to 'non-duplicate'.*_
> This processor makes two different filtering data structures available, 
> depending on the level of precision and the number of records the user wishes 
> to process:
>  * A *HashSet* filter type will guarantee 100% duplicate detection at the 
> expense of storing one hash per record.
>  * A *BloomFilter* filter type will use constant space in exchange for 
> probabilistic guarantees. This is useful when processing an extremely large 
> number of records and some false positives are acceptable (i.e. some records 
> may be marked as duplicate even though they have not been seen before).
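
The trade-off between the two filter types can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from either PR: the names DuplicateFilterSketch, SeenFilter, HashSetFilter and BloomFilter are hypothetical, each record is assumed to have already been reduced to a string key (for example, its concatenated RecordPath values), the exact filter stores the key itself rather than a hash of it, and the Bloom filter uses FNV-1a with double hashing purely for demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; none of these names come from the NiFi code base.
class DuplicateFilterSketch {

    /** Returns true if the key was (possibly) seen before, then remembers it. */
    public interface SeenFilter {
        boolean seenBefore(String recordKey);
    }

    /** Exact filter: memory grows with every distinct key, no false positives. */
    public static final class HashSetFilter implements SeenFilter {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean seenBefore(String recordKey) {
            // Set.add returns false when the element was already present.
            return !seen.add(recordKey);
        }
    }

    /** Bloom filter: fixed-size bit array, false positives possible, no false negatives. */
    public static final class BloomFilter implements SeenFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        public BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        @Override
        public boolean seenBefore(String recordKey) {
            byte[] data = recordKey.getBytes(StandardCharsets.UTF_8);
            int h1 = fnv1a(data, 0x811C9DC5);
            int h2 = fnv1a(data, 0x01000193) | 1; // odd step for double hashing
            boolean allBitsSet = true;
            for (int i = 0; i < numHashes; i++) {
                int index = Math.floorMod(h1 + i * h2, numBits);
                if (!bits.get(index)) {
                    allBitsSet = false; // at least one bit was clear: definitely new
                    bits.set(index);
                }
            }
            return allBitsSet;
        }

        /** 32-bit FNV-1a style hash with a caller-supplied starting value. */
        private static int fnv1a(byte[] data, int seed) {
            int hash = seed;
            for (byte b : data) {
                hash ^= (b & 0xff);
                hash *= 0x01000193;
            }
            return hash;
        }
    }

    public static void main(String[] args) {
        SeenFilter[] filters = { new HashSetFilter(), new BloomFilter(1 << 16, 3) };
        String[] keys = { "alice|1", "bob|2", "alice|1" };
        for (SeenFilter filter : filters) {
            for (String key : keys) {
                String route = filter.seenBefore(key) ? "duplicate" : "non-duplicate";
                System.out.println(filter.getClass().getSimpleName() + ": " + key + " -> " + route);
            }
        }
    }
}
```

Both filters always flag a repeated key as a duplicate. The difference is that the Bloom filter may also flag a key it has never seen, and that chance grows as more of its bits are set, while its memory use stays fixed at the size chosen up front.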



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
