[
https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831611#comment-16831611
]
Joseph Witt commented on NIFI-6047:
-----------------------------------
[~mike.thomsen] and [~spiglitz] have you two reached consensus on your
differing PRs? Want to make sure we progress one of these...
Thanks
> Add DetectDuplicateRecord Processor
> -----------------------------------
>
> Key: NIFI-6047
> URL: https://issues.apache.org/jira/browse/NIFI-6047
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Adam Fisher
> Assignee: Adam Fisher
> Priority: Major
> Labels: features
> Time Spent: 3h 50m
> Remaining Estimate: 0h
>
> Add a new standard NiFi processor to supplement the DetectDuplicate
> processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines whether each
> record has already been seen. The names of user-defined properties determine
> the RecordPath values used to decide whether a record is unique. If no
> user-defined properties are present, the entire record is used as the input
> to determine uniqueness. Duplicate records are routed to 'duplicate'; all
> other records are routed to 'non-duplicate'.*_
> This processor makes two different filtering data structures available,
> depending on the level of precision and the number of records the user
> wishes to process:
> * A *HashSet* filter type guarantees 100% duplicate detection at the
> expense of storing one hash per record.
> * A *BloomFilter* filter type uses constant space in exchange for
> probabilistic detection. This is useful when processing an extremely large
> number of records and some false positives are acceptable (i.e., some records
> may be marked as duplicate even though they have not been seen before).
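
To make the trade-off between the two filter types concrete, here is a minimal Java sketch. It is not the code from either PR: the names `DedupSketch`, `SeenFilter`, `HashSetFilter`, `BloomFilter`, and `keyOf` are hypothetical, plain field names stand in for RecordPath expressions, and the exact filter stores the key string itself rather than a hash, purely for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only; not the processor's actual implementation.
final class DedupSketch {

    /** Returns true if the key was (possibly) seen before, then remembers it. */
    interface SeenFilter {
        boolean checkAndAdd(String key);
    }

    /** Exact filter: never wrong, but memory grows with every distinct key. */
    static final class HashSetFilter implements SeenFilter {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean checkAndAdd(String key) {
            return !seen.add(key); // add() returns false if the key was already present
        }
    }

    /** Bloom filter: fixed-size bit array; false positives possible, false negatives not. */
    static final class BloomFilter implements SeenFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        @Override
        public boolean checkAndAdd(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            int h1 = fnv1a(data, 0x811C9DC5);
            int h2 = fnv1a(data, 0x01000193) | 1; // second hash, forced odd
            boolean allSet = true;
            for (int i = 0; i < numHashes; i++) {
                // Double hashing: derive the i-th bit position from two base hashes.
                int idx = Math.floorMod(h1 + i * h2, numBits);
                if (!bits.get(idx)) {
                    allSet = false;
                    bits.set(idx);
                }
            }
            return allSet; // true means "probably seen before"
        }

        /** 32-bit FNV-1a style hash with a caller-supplied starting value. */
        private static int fnv1a(byte[] data, int seed) {
            int h = seed;
            for (byte b : data) {
                h ^= (b & 0xff);
                h *= 0x01000193;
            }
            return h;
        }
    }

    /**
     * Builds the uniqueness key for a record. If no fields are configured,
     * the whole record is used, mirroring the description above.
     */
    static String keyOf(Map<String, Object> record, List<String> fields) {
        Map<String, Object> sorted = new TreeMap<>(record); // stable field order
        if (!fields.isEmpty()) {
            sorted.keySet().retainAll(fields);
        }
        return sorted.toString();
    }
}
```

In this sketch a record whose key makes `checkAndAdd` return true would go to 'duplicate' and everything else to 'non-duplicate'. The Bloom variant never grows beyond `numBits`, which is why it can flag a record it has not actually seen once enough bits are set.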