[jira] [Commented] (NIFI-6047) Add DetectDuplicateRecord Processor

ASF subversion and git services (Jira) Wed, 09 Mar 2022 16:09:09 -0800


    [ https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503913#comment-17503913 ]


ASF subversion and git services commented on NIFI-6047:
-------------------------------------------------------

Commit df00cc6cb576c11ae3ef0f1c6f64454598298936 in nifi's branch 
refs/heads/main from Mike Thomsen
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=df00cc6 ]

NIFI-6047 Cleaned up code to allow tests to run against 1.13.0-snapshot
Removed DMC.
NIFI-6047 Started integrating changes from NIFI-6014.
NIFI-6047 Added DMC tests.
NIFI-6047 Added cache identifier recordpath test.
NIFI-6047 Added additional details.
NIFI-6047 Removed old additional details.
NIFI-6047 made some changes requested in a follow up review.
NIFI-6047 latest.
NIFI-6047 Finished updates
First round of code review cleanup
Latest
Removed EL from the dynamic properties.
Finished code review requested refactoring.
Checkstyle fix.
Removed a Java 11 API
NIFI-6047 Renamed processor to DeduplicateRecord

Signed-off-by: Matthew Burgess <[email protected]>

This closes #4646


> Add DetectDuplicateRecord Processor
> -----------------------------------
>
>                 Key: NIFI-6047
>                 URL: https://issues.apache.org/jira/browse/NIFI-6047
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Adam Fisher
>            Assignee: Adam Fisher
>            Priority: Major
>              Labels: features
>          Time Spent: 18h 50m
>  Remaining Estimate: 0h
>
> Add a new standard NiFi processor to supplement the DetectDuplicate 
> processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines if the cached 
> record has already been seen. The names of user-defined properties determine 
> the RecordPath values used to decide whether a record is unique. If no 
> user-defined properties are present, the entire record is used as the input 
> to determine uniqueness. All duplicate records are routed to 'duplicate'. If 
> the record is not determined to be a duplicate, the Processor routes the 
> record to 'non-duplicate'.*_
> This processor makes two filtering data structures available, depending on 
> the level of precision required and the number of records the user wishes to 
> process:
>  * A *HashSet* filter type guarantees 100% duplicate detection at the 
> expense of storing one hash per record.
>  * A *BloomFilter* filter type uses constant space in exchange for 
> probabilistic detection. This is useful when processing an extremely large 
> number of records and some false positives are acceptable (that is, some 
> records may be marked as duplicate even though they have not been seen before).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
