MikeThomsen commented on pull request #3317: URL: https://github.com/apache/nifi/pull/3317#issuecomment-716721764
@adamfisher I've been spending some more time getting back into your contribution, and I have noticed some core issues:

1. It looks like the cache identifier cannot be set at the record level.
2. You're serializing a single very large value (either a bloom filter or a very large hash set) into the cache.
3. Based on the logic I see in how you're handling this, DetectDuplicateRecord as implemented here cannot filter across record sets.

#2 is a potentially big problem for HBase or Redis, as you could be stuffing several MB to dozens of MB of binary data into a single column/kv pair. #3 is also a big issue because a large part of the use case revolves around detecting duplicates across an enterprise. For example, if two users submit data sets with a bunch of overlapping data, it should be possible to use one data set to deduplicate the other.

I am leaning toward reopening my PR, but regardless I'd like to say this was a great first shot even with the aforementioned issues.
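To illustrate the alternative implied by #1 and #2: instead of round-tripping one serialized multi-MB bloom filter or hash set per record set, each record's fingerprint can be its own small cache entry, so the cache answers "seen before?" per record and naturally works across record sets. This is a minimal sketch only; the `HashMap` stands in for a distributed cache client (e.g. Redis or HBase via NiFi's cache service), and `isDuplicate` is a hypothetical helper, not the processor's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class PerRecordDedupSketch {
    // Stand-in for a distributed cache client; in NiFi this would be a
    // cache service backed by Redis, HBase, etc. (assumption for the sketch).
    private final Map<String, Boolean> cache = new HashMap<>();

    // Atomically record the fingerprint and report whether it was already
    // present. Each entry is a few dozen bytes, not a serialized filter.
    public boolean isDuplicate(String recordFingerprint) {
        return cache.putIfAbsent(recordFingerprint, Boolean.TRUE) != null;
    }

    public static void main(String[] args) {
        PerRecordDedupSketch dedup = new PerRecordDedupSketch();
        // First record set inserts the fingerprint; a later, overlapping
        // record set sees it as a duplicate.
        System.out.println(dedup.isDuplicate("user-1|order-42")); // prints false
        System.out.println(dedup.isDuplicate("user-1|order-42")); // prints true
    }
}
```

With a real distributed cache, `putIfAbsent` would map to an atomic set-if-not-exists operation (e.g. Redis `SETNX`), keeping each cache write small regardless of how many records have been seen.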
