[GitHub] [nifi] MikeThomsen commented on pull request #3317: NIFI-6047 Add DetectDuplicateRecord Processor

GitBox Mon, 26 Oct 2020 11:31:22 -0700


MikeThomsen commented on pull request #3317:
URL: https://github.com/apache/nifi/pull/3317#issuecomment-716743057



   Broadly speaking, we have two use cases that don't overlap much: 
deduplication within a single file vs. across a data lake. Given that NiFi 
follows a Unix-like philosophy of "simple tools that chain well together," I 
think the solution we're headed toward may be two processors.
   
   I think this processor could work if we pare it back to in-memory 
deduplication of a single record set. That way users won't have to turn a lot 
of knobs and dials to configure it. My submission, which requires a DMC because 
it is focused on the entire data lake, would cover the other case. Together, I 
think we could hit both use cases with targeted tools that are each fairly 
intuitive.
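   
   To make the pared-back case concrete, here is a rough sketch of in-memory 
deduplication over a single record set. It is not code from this PR and does 
not use the NiFi record API: the class name `InMemoryDedupSketch`, the 
`partition` method, and the use of plain maps as records are all hypothetical, 
chosen only to illustrate the idea. The "seen" set lives only for the duration 
of one call, so no DMC or other external state is involved.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; not the processor's actual implementation.
public class InMemoryDedupSketch {

    // Holds the two outputs: first occurrences and later repeats.
    public static final class Result<T> {
        public final List<T> unique = new ArrayList<>();
        public final List<T> duplicates = new ArrayList<>();
    }

    // Splits one record set into unique and duplicate records.
    // keyOf derives the value that decides whether two records are "the same".
    public static <T, K> Result<T> partition(Iterable<T> records, Function<T, K> keyOf) {
        Set<K> seen = new HashSet<>(); // scoped to this single record set
        Result<T> result = new Result<>();
        for (T record : records) {
            // Set.add returns false when the key was already present.
            if (seen.add(keyOf.apply(record))) {
                result.unique.add(record);
            } else {
                result.duplicates.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, String>> records = List.of(
                Map.of("id", "1", "name", "a"),
                Map.of("id", "2", "name", "b"),
                Map.of("id", "1", "name", "c"));
        Result<Map<String, String>> result = partition(records, r -> r.get("id"));
        System.out.println("unique=" + result.unique.size()
                + " duplicates=" + result.duplicates.size());
    }
}
```

   The only thing a user would need to configure in this shape is how the key 
is derived from a record, which is the "few knobs" point above. Memory use 
grows with the number of distinct keys in the record set, which is why this 
approach fits one file rather than a whole data lake.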
   
   Thoughts?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
