MikeThomsen commented on pull request #3317: URL: https://github.com/apache/nifi/pull/3317#issuecomment-716743057
Broadly speaking, we have two use cases that don't overlap much: deduplication within a single file and deduplication across a data lake. Since NiFi follows a Unix-like philosophy of "simple tools that chain well together," I think the solution we're headed toward may be two processors. This processor could work out if we pare it back to in-memory deduplication of a single record set, so users won't have to turn a lot of knobs and dials to configure it. Combined with my submission, which requires a DMC because it is focused on the entire data lake, I think we could hit both use cases, each with a targeted tool that is fairly intuitive. Thoughts?
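
For reference, a rough sketch of what "in-memory deduplication of a single record set" could look like. This is plain Java, not NiFi API and not the code in this PR; `InMemoryDedup`, `dedupe`, and the key-extractor argument are hypothetical names used only for illustration. It assumes the whole record set fits in memory and that a record's identity can be reduced to a single key.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper, not part of NiFi: deduplicates one record set in memory.
public class InMemoryDedup {

    // Keeps the first record seen for each key and preserves input order.
    // The "seen" set exists only for the duration of this call, so nothing
    // is shared across record sets (no external cache involved).
    public static <R, K> List<R> dedupe(List<R> records, Function<R, K> keyOf) {
        Set<K> seen = new HashSet<>();
        List<R> unique = new ArrayList<>();
        for (R record : records) {
            // Set.add returns false when the key was already present.
            if (seen.add(keyOf.apply(record))) {
                unique.add(record);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Illustrative data only: three records, two sharing the same "id".
        List<Map<String, String>> records = List.of(
                Map.of("id", "1", "name", "a"),
                Map.of("id", "2", "name", "b"),
                Map.of("id", "1", "name", "c"));
        List<Map<String, String>> unique = dedupe(records, r -> r.get("id"));
        System.out.println(unique.size() + " unique records");
    }
}
```

The only thing a user would have to choose here is how the key is derived from a record, which is the "few knobs" trade-off. Deduplicating across many files would need state that outlives a single call, which this sketch deliberately leaves out.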
