[ https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503912#comment-17503912 ]
ASF subversion and git services commented on NIFI-6047:
-------------------------------------------------------
Commit 23132fb89f63b8eb1305103934cb5aaed061eefe in nifi's branch
refs/heads/main from Adam
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=23132fb ]
NIFI-6047
Added NiFi DetectDuplicateRecord standard processor.
Adding some documentation and PR review tweaks.
Exposing processor
Documentation updates, exception handling consolidation, added support for
record path field variables.
Added tests.
Build bump.
Migrated cache service to groovy folder.
Moved declarations for properties to @BeforeClass lifecycle method.
Fixed variable type bug.
Fixed mapping of test params to usage.
Fixed potential illegal state exception bug.
> Add DetectDuplicateRecord Processor
> -----------------------------------
>
> Key: NIFI-6047
> URL: https://issues.apache.org/jira/browse/NIFI-6047
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Adam Fisher
> Assignee: Adam Fisher
> Priority: Major
> Labels: features
> Time Spent: 18h 50m
> Remaining Estimate: 0h
>
> Add a new standard NiFi processor to supplement the DetectDuplicate
> processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines whether each
> record has already been seen. The names of user-defined properties determine
> the RecordPath values used to decide whether a record is unique. If no
> user-defined properties are present, the entire record is used as the input
> to determine uniqueness. All duplicate records are routed to 'duplicate'. If
> a record is not determined to be a duplicate, the Processor routes it
> to 'non-duplicate'.*_
> This processor makes two different filtering data structures available,
> depending on the level of precision and the number of records the user wishes
> to process:
> * A *HashSet* filter type guarantees 100% duplicate detection at the
> expense of storing one hash per record.
> * A *BloomFilter* filter type uses efficient, constant space in exchange for
> probabilistic guarantees. This is useful when processing an extremely large
> number of records and some false positives are acceptable (i.e., some records
> may be marked as duplicate even though they have not been seen before).
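
The trade-off between the two filter types can be illustrated with a minimal, self-contained Java sketch. This is not the processor's code: the class and method names (DedupSketch, SeenFilter, HashSetFilter, SimpleBloomFilter, duplicates) are hypothetical, the exact filter stores the key string itself rather than a hash of it, and the Bloom filter is a hand-rolled toy built on java.util.BitSet with FNV-1a style double hashing, all assumed purely for illustration. Each key stands in for whatever value would be derived from a record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {

    /** Hypothetical contract: report whether the key may have been seen, then remember it. */
    interface SeenFilter {
        boolean checkAndAdd(String key);
    }

    /** Exact filter: remembers every key, so memory grows with the number of distinct keys. */
    static final class HashSetFilter implements SeenFilter {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean checkAndAdd(String key) {
            // Set.add returns false when the key was already present.
            return !seen.add(key);
        }
    }

    /** Probabilistic filter: fixed-size bit array, so false positives are possible. */
    static final class SimpleBloomFilter implements SeenFilter {
        private final BitSet bits;
        private final int size;
        private final int hashCount;

        SimpleBloomFilter(int size, int hashCount) {
            if (size <= 0 || hashCount <= 0) {
                throw new IllegalArgumentException("size and hashCount must be positive");
            }
            this.bits = new BitSet(size);
            this.size = size;
            this.hashCount = hashCount;
        }

        @Override
        public boolean checkAndAdd(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            int h1 = fnv1a(data, 0x811C9DC5);
            int h2 = fnv1a(data, 0x01000193);
            boolean allSet = true;
            for (int i = 0; i < hashCount; i++) {
                // Double hashing: derive the i-th bit position from two base hashes.
                int index = Math.floorMod(h1 + i * h2, size);
                if (!bits.get(index)) {
                    allSet = false;
                    bits.set(index);
                }
            }
            // True only if every position was already set: "probably seen before".
            return allSet;
        }

        private static int fnv1a(byte[] data, int seed) {
            int hash = seed;
            for (byte b : data) {
                hash ^= (b & 0xff);
                hash *= 0x01000193;
            }
            return hash;
        }
    }

    /** Returns the keys that the given filter flags as duplicates, in input order. */
    static List<String> duplicates(List<String> keys, SeenFilter filter) {
        List<String> flagged = new ArrayList<>();
        for (String key : keys) {
            if (filter.checkAndAdd(key)) {
                flagged.add(key);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println("exact: " + duplicates(keys, new HashSetFilter()));
        System.out.println("bloom: " + duplicates(keys, new SimpleBloomFilter(1024, 3)));
    }
}
```

In this sketch a repeated key is always flagged by both filters. Only the Bloom filter can additionally flag a key it has never seen, and that becomes more likely as the bit array fills up.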