p-kimberley commented on code in PR #7603:
URL: https://github.com/apache/nifi/pull/7603#discussion_r1294476116


##########
nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/DeduplicateRecord.java:
##########
@@ -625,40 +609,17 @@ public 
DistributedMapCacheClientWrapper(DistributedMapCacheClient client) {
         @Override
         public boolean contains(String value) {
             try {
-                return client.containsKey(value, STRING_SERIALIZER);
+                return !client.putIfAbsent(value, "", STRING_SERIALIZER, 
STRING_SERIALIZER);

Review Comment:
   I posted some reasons for this change on Slack. The pertinent points are:
   
   > If Cache Identifier is unset, the user must have a separate process to 
actually put the hashed records to cache, as this processor doesn't currently 
do that (it only reads). This is arguably counter-intuitive and inconsistent 
with DetectDuplicate. The reliance on a separate process to write hash entries 
is problematic as it means:
   > Deduplication isn't actually performed on the current FlowFile at all;
   > There's a possibility of duplication as we lose atomicity by separating 
the cache read/write calls; and
   > There don't appear to be any record-oriented processors that can 
conveniently hash each record and put its value to a DMC. To me, this processor 
should do that, as does DetectDuplicate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to