adamfisher commented on pull request #3317: URL: https://github.com/apache/nifi/pull/3317#issuecomment-716745290
Yeah, I like where you're headed with that. Are you thinking the hashing implementation would handle deduplication across data sets? Then we would have a separate DetectDuplicateRecord processor for the Bloom filter implementation, which would be the in-memory one. My time to work on this is limited now, since the use case for it has come and gone. I was really hoping to get this pushed through last year, but Git jiu-jitsu tripped me up and I was never able to get it into the main line.

On Mon, Oct 26, 2020, 2:31 PM Mike <[email protected]> wrote:

> Broadly speaking, we have two use cases that don't overlap that much:
> deduplication over one file vs. over a data lake. Given that NiFi
> follows a Unix-like philosophy of "simple tools that chain well together,"
> I think the solution we're headed toward may be two processors.
>
> I think this processor could work out if we pare it back to in-memory
> deduplication of a single record set. That way users won't have to turn a
> lot of knobs and dials to configure it. Combined with my submission,
> which would require a DMC since it is focused on the entire data lake, I
> think we could hit both use cases with targeted tools that are fairly
> intuitive.
>
> Thoughts?
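To make the in-memory option concrete: the Bloom filter approach being discussed could look something like the sketch below. This is a hypothetical, minimal illustration only, not the PR's actual processor code; the class name, sizing, and hash scheme (double hashing over a SHA-256 digest) are all assumptions for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.BitSet;

// Hypothetical sketch of in-memory, single-record-set deduplication
// with a Bloom filter (not the actual NiFi processor implementation).
public class RecordBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public RecordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive k bit indexes from two base hashes taken out of one
    // SHA-256 digest (Kirsch-Mitzenmacher double hashing).
    private int[] indexes(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int h1 = bytesToInt(digest, 0);
            int h2 = bytesToInt(digest, 4);
            int[] idx = new int[hashCount];
            for (int i = 0; i < hashCount; i++) {
                idx[i] = Math.floorMod(h1 + i * h2, size);
            }
            return idx;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    private static int bytesToInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Returns true if the key was probably seen before (possible false
    // positives, never false negatives); otherwise records it.
    public boolean seenAndAdd(String key) {
        boolean seen = true;
        for (int i : indexes(key)) {
            if (!bits.get(i)) {
                seen = false;
                bits.set(i);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        RecordBloomFilter filter = new RecordBloomFilter(1 << 16, 4);
        System.out.println(filter.seenAndAdd("record-1")); // false: first time seen
        System.out.println(filter.seenAndAdd("record-1")); // true: duplicate
    }
}
```

Because the filter's state lives only for the life of one record set, there is nothing to configure beyond size and hash count, which fits the "few knobs and dials" goal; the DMC-backed processor would cover the data-lake case where state must persist across flows.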
