adamfisher commented on pull request #3317: URL: https://github.com/apache/nifi/pull/3317#issuecomment-716745290
Yeah, I like where you're headed with that. Are you thinking the hashing implementation would handle deduplication across data sets? Then we would have a separate DetectDuplicateRecord processor for the Bloom filter implementation, which would be the in-memory one. My time to work on this is limited now, since the use case for it has come and gone. I was really hoping to get this pushed through last year, but Git jiu-jitsu tripped me up and I was never able to get it into the main line.

On Mon, Oct 26, 2020, 2:31 PM Mike <[email protected]> wrote:

> Broadly speaking, we have two use cases that don't overlap that much:
> deduplication over one file vs. over a data lake. Given that NiFi
> follows a Unix-like philosophy of "simple tools that chain well together,"
> I think the solution we're headed toward may be two processors.
>
> I think this processor could work out if we pare it back to in-memory
> deduplication of a single record set. That way users won't have to turn a
> lot of knobs and dials to configure it. Combined with my submission,
> which would require a DMC since it is focused on the entire data lake, I
> think we could hit both use cases with targeted tools that are fairly
> intuitive.
>
> Thoughts?
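To make the in-memory option concrete: the Bloom filter approach being discussed could look something like the sketch below. This is a hypothetical, minimal illustration only, not the PR's actual processor code; the class name, sizing, and hash scheme (double hashing over a SHA-256 digest) are all assumptions for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.BitSet;

// Hypothetical sketch of in-memory, single-record-set deduplication
// with a Bloom filter (not the actual NiFi processor implementation).
public class RecordBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    public RecordBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive k bit indexes from two base hashes taken out of one
    // SHA-256 digest (Kirsch-Mitzenmacher double hashing).
    private int[] indexes(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int h1 = bytesToInt(digest, 0);
            int h2 = bytesToInt(digest, 4);
            int[] idx = new int[hashCount];
            for (int i = 0; i < hashCount; i++) {
                idx[i] = Math.floorMod(h1 + i * h2, size);
            }
            return idx;
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    private static int bytesToInt(byte[] b, int off) {
        return ((b[off] & 0xff) << 24) | ((b[off + 1] & 0xff) << 16)
             | ((b[off + 2] & 0xff) << 8) | (b[off + 3] & 0xff);
    }

    // Returns true if the key was probably seen before (possible false
    // positives, never false negatives); otherwise records it.
    public boolean seenAndAdd(String key) {
        boolean seen = true;
        for (int i : indexes(key)) {
            if (!bits.get(i)) {
                seen = false;
                bits.set(i);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        RecordBloomFilter filter = new RecordBloomFilter(1 << 16, 4);
        System.out.println(filter.seenAndAdd("record-1")); // false: first time seen
        System.out.println(filter.seenAndAdd("record-1")); // true: duplicate
    }
}
```

Because the filter's state lives only for the life of one record set, there is nothing to configure beyond size and hash count, which fits the "few knobs and dials" goal; the DMC-backed processor would cover the data-lake case where state must persist across flows.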
