MikeThomsen commented on pull request #3317: URL: https://github.com/apache/nifi/pull/3317#issuecomment-716721764
@adamfisher I've been spending some more time getting back into your contribution, and I have noticed some core issues:

1. It looks like the cache identifier cannot be set at the record level.
2. You're serializing a single very large value (either a bloom filter or a very large hash set) into the cache.
3. Based on the logic I see in how you're handling this, DetectDuplicateRecord as implemented here cannot filter across record sets.

#2 is a potentially big problem for HBase or Redis, as you could be stuffing several MB to dozens of MB of binary data into a single column/kv pair. #3 is also a big issue because a large part of the use case revolves around detecting duplicates across an enterprise. For example, if two users submit data sets with a bunch of overlapping data, it should be possible to use one data set to deduplicate the other.

I am leaning toward reopening my PR, but regardless I'd like to say this was a great first shot even with the aforementioned issues.
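To illustrate the alternative implied by #1 and #2: instead of round-tripping one serialized multi-MB bloom filter or hash set per record set, each record's fingerprint can be its own small cache entry, so the cache answers "seen before?" per record and naturally works across record sets. This is a minimal sketch only; the `HashMap` stands in for a distributed cache client (e.g. Redis or HBase via NiFi's cache service), and `isDuplicate` is a hypothetical helper, not the processor's actual API.

```java
import java.util.HashMap;
import java.util.Map;

public class PerRecordDedupSketch {
    // Stand-in for a distributed cache client; in NiFi this would be a
    // cache service backed by Redis, HBase, etc. (assumption for the sketch).
    private final Map<String, Boolean> cache = new HashMap<>();

    // Atomically record the fingerprint and report whether it was already
    // present. Each entry is a few dozen bytes, not a serialized filter.
    public boolean isDuplicate(String recordFingerprint) {
        return cache.putIfAbsent(recordFingerprint, Boolean.TRUE) != null;
    }

    public static void main(String[] args) {
        PerRecordDedupSketch dedup = new PerRecordDedupSketch();
        // First record set inserts the fingerprint; a later, overlapping
        // record set sees it as a duplicate.
        System.out.println(dedup.isDuplicate("user-1|order-42")); // prints false
        System.out.println(dedup.isDuplicate("user-1|order-42")); // prints true
    }
}
```

With a real distributed cache, `putIfAbsent` would map to an atomic set-if-not-exists operation (e.g. Redis `SETNX`), keeping each cache write small regardless of how many records have been seen.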
