[ https://issues.apache.org/jira/browse/NIFI-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17503913#comment-17503913 ]
ASF subversion and git services commented on NIFI-6047:
-------------------------------------------------------
Commit df00cc6cb576c11ae3ef0f1c6f64454598298936 in nifi's branch
refs/heads/main from Mike Thomsen
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=df00cc6 ]
NIFI-6047 Cleaned up code to allow tests to run against 1.13.0-snapshot
Removed DMC.
NIFI-6047 Started integrating changes from NIFI-6014.
NIFI-6047 Added DMC tests.
NIFI-6047 Added cache identifier recordpath test.
NIFI-6047 Added additional details.
NIFI-6047 Removed old additional details.
NIFI-6047 made some changes requested in a follow up review.
NIFI-6047 latest.
NIFI-6047 Finished updates
First round of code review cleanup
Latest
Removed EL from the dynamic properties.
Finished code review requested refactoring.
Checkstyle fix.
Removed a Java 11 API
NIFI-6047 Renamed processor to DeduplicateRecord
Signed-off-by: Matthew Burgess
This closes #4646
> Add DetectDuplicateRecord Processor
> -----------------------------------
>
> Key: NIFI-6047
> URL: https://issues.apache.org/jira/browse/NIFI-6047
> Project: Apache NiFi
> Issue Type: New Feature
> Components: Core Framework
> Reporter: Adam Fisher
> Assignee: Adam Fisher
> Priority: Major
> Labels: features
> Time Spent: 18h 50m
> Remaining Estimate: 0h
>
> Add a new standard NiFi processor to supplement the DetectDuplicate
> processor. The difference is that this one works at the record level.
> h3. *DetectDuplicateRecord*
> _*Caches records from each incoming FlowFile and determines if the cached
> record has already been seen. The name of user-defined properties determines
> the RecordPath values used to determine if a record is unique. If no
> user-defined properties are present, the entire record is used as the input
> to determine uniqueness. All duplicate records are routed to 'duplicate'. If
> the record is not determined to be a duplicate, the Processor routes the
> record to 'non-duplicate'.*_
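
As a rough illustration of the behavior described above (not the processor's actual code), the following Java sketch derives a uniqueness key from a list of configured fields, falls back to the whole record when none are configured, and sorts records into 'duplicate' and 'non-duplicate' lists. It makes several assumptions: records are plain maps, plain field names stand in for RecordPath expressions, the '|' separator is a simplification, and the class and method names (RecordKeySketch, keyFor, route) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; not the NiFi processor's implementation.
final class RecordKeySketch {

    // Builds the uniqueness key for one record.
    // keyFields stands in for the user-defined RecordPath properties (assumption).
    static String keyFor(Map<String, Object> record, List<String> keyFields) {
        StringBuilder key = new StringBuilder();
        if (keyFields.isEmpty()) {
            // No fields configured: use the entire record.
            // Sorting by field name keeps the key independent of map ordering.
            for (Map.Entry<String, Object> e : new TreeMap<>(record).entrySet()) {
                key.append(e.getKey()).append('=').append(e.getValue()).append('|');
            }
        } else {
            // Fields configured: only their values contribute to the key.
            for (String field : keyFields) {
                key.append(field).append('=').append(record.get(field)).append('|');
            }
        }
        return key.toString();
    }

    // Splits records into 'non-duplicate' (first sighting of a key) and 'duplicate'.
    static Map<String, List<Map<String, Object>>> route(
            List<Map<String, Object>> records, List<String> keyFields) {
        Set<String> seen = new HashSet<>();
        Map<String, List<Map<String, Object>>> out = new LinkedHashMap<>();
        out.put("non-duplicate", new ArrayList<>());
        out.put("duplicate", new ArrayList<>());
        for (Map<String, Object> record : records) {
            boolean firstSighting = seen.add(keyFor(record, keyFields));
            out.get(firstSighting ? "non-duplicate" : "duplicate").add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> records = List.of(
                Map.<String, Object>of("id", 1, "name", "a"),
                Map.<String, Object>of("id", 1, "name", "b"),
                Map.<String, Object>of("id", 2, "name", "a"));
        // Keyed on "id", the second record counts as a duplicate of the first.
        System.out.println(route(records, List.of("id")));
        // With no key fields, whole records are compared and all three differ.
        System.out.println(route(records, List.<String>of()));
    }
}
```

Keying on a subset of fields makes two records with different payloads collapse into one, which is why the choice of key fields matters as much as the filter type.
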
> This processor makes two different filtering data structures available,
> depending on the level of precision and the number of records the user
> wishes to process:
> * A *HashSet* filter type will guarantee 100% duplicate detection at the
> expense of storing one hash per record.
> * A *BloomFilter* filter type uses a fixed amount of space in exchange for
> probabilistic guarantees. This is useful when processing an extremely large
> number of records and some false positives are acceptable (i.e., some records
> may be marked as duplicate even though they have not been seen before).
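
To make the trade-off between the two filter types concrete, here is a minimal Java sketch under stated assumptions. It is not the processor's code: the interface and class names (SeenFilter, ExactFilter, BloomSketch) are hypothetical, and the Bloom filter is a hand-rolled one built on java.util.BitSet with two seeded FNV-1a hashes combined by double hashing, chosen only to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the NiFi processor's implementation.
final class DuplicateFilterSketch {

    // Returns true the first time a key is offered, false if it looks already seen.
    interface SeenFilter {
        boolean firstSighting(String key);
    }

    // Exact detection: memory grows with the number of distinct keys.
    static final class ExactFilter implements SeenFilter {
        private final Set<String> seen = new HashSet<>();

        @Override
        public boolean firstSighting(String key) {
            return seen.add(key);
        }
    }

    // Probabilistic detection: memory is fixed at numBits, but an unseen key
    // may be reported as seen (false positive). A seen key is never missed.
    static final class BloomSketch implements SeenFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomSketch(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        @Override
        public boolean firstSighting(String key) {
            byte[] data = key.getBytes(StandardCharsets.UTF_8);
            int h1 = fnv1a(data, 0x811C9DC5); // standard FNV-1a offset basis
            int h2 = fnv1a(data, 0x5BD1E995); // arbitrary second seed (assumption)
            boolean allBitsAlreadySet = true;
            for (int i = 0; i < numHashes; i++) {
                // Double hashing: derive the i-th bit index from two base hashes.
                int index = Math.floorMod(h1 + i * h2, numBits);
                if (!bits.get(index)) {
                    allBitsAlreadySet = false;
                    bits.set(index);
                }
            }
            return !allBitsAlreadySet;
        }

        // 32-bit FNV-1a over the bytes, starting from the given seed.
        private static int fnv1a(byte[] data, int seed) {
            int hash = seed;
            for (byte b : data) {
                hash ^= (b & 0xFF);
                hash *= 0x01000193;
            }
            return hash;
        }
    }

    public static void main(String[] args) {
        SeenFilter exact = new ExactFilter();
        SeenFilter bloom = new BloomSketch(1 << 16, 3);
        for (String key : new String[] {"a", "b", "a"}) {
            System.out.println(key + " exact=" + exact.firstSighting(key)
                    + " bloom=" + bloom.firstSighting(key));
        }
    }
}
```

In this sketch both filters always flag a repeated key. They differ only on unseen keys, where the Bloom variant can occasionally answer "already seen"; the rate depends on the bit count, the number of hashes, and how many keys have been inserted.
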