p-kimberley commented on code in PR #7603:
URL: https://github.com/apache/nifi/pull/7603#discussion_r1294476116
##########
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/DeduplicateRecord.java:
##########
@@ -625,40 +609,17 @@ public
DistributedMapCacheClientWrapper(DistributedMapCacheClient client) {
@Override
public boolean contains(String value) {
try {
- return client.containsKey(value, STRING_SERIALIZER);
+ return !client.putIfAbsent(value, "", STRING_SERIALIZER,
STRING_SERIALIZER);
Review Comment:
I posted some reasons for this change on Slack. The pertinent points are:
> If Cache Identifier is unset, the user must have a separate process to
actually put the hashed records to cache, as this processor doesn't currently
do that (it only reads). This is arguably counter-intuitive and inconsistent
with DetectDuplicate. The reliance on a separate process to write hash entries
is problematic as it means:
> Deduplication isn't actually performed on the current FlowFile at all;
> There's a possibility of duplication as we lose atomicity by separating
the cache read/write calls; and
> There don't appear to be any record-oriented processors that can
conveniently hash each record and put its value to a DMC. To me, this processor
should do that, as does DetectDuplicate.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]