I'll have to look at Adam's code in more depth, but I think one reason we might need two is that I didn't see any ability to just check an existing record path against the cache and call it a day. For teams using a standard UUID scheme, that's all we'd need or want. Could be wrong about that, and Adam, please let me know if I am.
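Roughly, the check I have in mind looks like this (a Python sketch, not NiFi code; the namespace, key fields, and the plain set standing in for the DistributedMapCacheClient are all illustrative):

```python
import uuid

def record_uuid(record: dict, key_fields: list) -> str:
    """Derive a deterministic UUIDv5 from the record's key fields
    (standing in for a RecordPath lookup). Same fields -> same UUID."""
    name = "|".join(str(record[f]) for f in key_fields)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, name))

def is_duplicate(record: dict, key_fields: list, cache: set) -> bool:
    """Check the derived ID against the cache; register it when unseen.
    The set stands in for a distributed map cache client."""
    rid = record_uuid(record, key_fields)
    if rid in cache:
        return True
    cache.add(rid)
    return False

cache = set()
rec = {"id": 42, "name": "alpha"}
print(is_duplicate(rec, ["id", "name"], cache))  # False: first time seen
print(is_duplicate(rec, ["id", "name"], cache))  # True: same key fields again
```

That is the whole behavior for the UUID-scheme case: derive the key, test membership, mark it seen.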
On Tue, Feb 19, 2019 at 7:28 AM Joe Witt <[email protected]> wrote:

> Mike, Adam,
>
> It appears the distinction of interest here between the two general approaches is less about in-mem vs map cache and instead is more about approximate/fast detection vs certain/depending-on-size-of-cache approaches.
>
> I'm not sure if this is quite right or if the distinction warrants two processors, but this is my first impression.
>
> But it is probably best if the two of you, as contributors to this problem, discuss and find consensus.
>
> Thanks
>
> On Sat, Feb 16, 2019 at 9:33 PM Mike Thomsen <[email protected]> wrote:
>
> > Thanks, Adam. The use case I had, in stereotypical agile fashion, could be summarized like this:
> >
> > "As a NiFi user, I want to be able to generate UUIDv5 IDs for all of my record sets and then have a downstream processor check each generated UUID against the existing ingested data to see if there is an existing row with that UUID."
> >
> > For us, at least, false positives are something that we would need to be fairly aggressive in preventing.
> >
> > One possibility here is that we split the difference, with your contribution being an in-memory deduplicator and mine going purely against a distributed map cache client. I think there might be enough ground to cover that we might want to have two approaches to this problem that specialize, rather than a one-size-fits-most single solution.
> >
> > Thanks,
> >
> > Mike
> >
> > On Sat, Feb 16, 2019 at 9:18 PM Adam Fisher <[email protected]> wrote:
> >
> > > Hello NiFi developers! I'm new to NiFi and decided to create a *DetectDuplicateRecord* processor. Mike Thomsen also created an implementation about the same time. It was suggested we open this up for discussion with the community to identify use cases.
> > >
> > > Below are the two implementations, each with their respective properties.
> > >
> > > - https://issues.apache.org/jira/browse/NIFI-6014
> > >   - *Record Reader*
> > >   - *Record Writer*
> > >   - *Cache Service*
> > >   - *Lookup Record Path:* The record path operation to use for generating the lookup key for each record.
> > >   - *Cache Value Strategy:* This determines what will be written to the cache from the record. It can be either a literal value or the result of a record path operation.
> > >   - *Cache Value:* This is the value that will be written to the cache at the appropriate record and record key if it does not exist.
> > >   - *Don't Send Empty Record Sets:* Same as "Include Zero Record FlowFiles" below.
> > >
> > > - https://issues.apache.org/jira/browse/NIFI-6047
> > >   - *Record Reader*
> > >   - *Record Writer*
> > >   - *Include Zero Record FlowFiles*
> > >   - *Cache The Entry Identifier:* Similar to DetectDuplicate.
> > >   - *Distributed Cache Service:* Similar to DetectDuplicate.
> > >   - *Age Off Duration:* Similar to DetectDuplicate.
> > >   - *Record Hashing Algorithm:* The algorithm used to hash the combined result of RecordPath values in the cache.
> > >   - *Filter Type:* The filter used to determine whether a record has been seen before, based on the matching RecordPath criteria defined by user-defined properties. Current options are *HashSet* or *BloomFilter*.
> > >   - *Filter Capacity Hint:* An estimation of the total number of unique records to be processed.
> > >   - *BloomFilter Probability:* The desired false positive probability when using the BloomFilter filter type.
> > >   - *<User Defined Properties>:* The name of the property is a record path. All record paths are resolved on each record to determine the unique value for a record. The value of the user-defined property is ignored.
> > >     The initial thought, however, was to make the value expose field variables, sort of how UpdateRecord does (i.e. ${field.value}).
> > >
> > > There are many ways duplicate records could be detected. Offering the user the ability to:
> > >
> > > - *Specify the cache identifier* means users can use the same identifier in different DetectDuplicateRecord blocks in different process groups. Conversely, specifying a unique name based on the file name, for example, will isolate the uniqueness check to just the daily load of a specific file.
> > > - *Set a cache expiration* lets users do things like set it to last for 24 hours, so we only store unique cache information from one day to the next. This is useful when you are doing a daily file load and only want to process the new records or the records that changed.
> > > - *Select a filter type* allows you to optimize for memory usage. I need to process multi-GB files, and keeping a hash of each of those records is going to get expensive with a HashSet in memory. A BloomFilter is acceptable, especially when you are doing database operations downstream and don't care about some false positives; it will still reduce the number of attempted duplicate inserts/updates you perform.
> > >
> > > Here's to hoping this finds you all warm and well. I love this software!
> > >
> > > Adam
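To make the HashSet-vs-BloomFilter trade-off Adam describes concrete, here is a minimal self-contained sketch (not NiFi code; the bit-array size, probe count, and record keys are illustrative). The exact set never errs but stores every key, while the Bloom filter uses a fixed-size bit array and may report a false positive, but never a false negative:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: fixed-size bit array, k hash probes per item.
    May report false positives, never false negatives."""
    def __init__(self, m_bits: int = 8192, k: int = 3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)  # fixed memory, regardless of item count

    def _probes(self, item: str):
        # Derive k bit positions by salting the item with the probe index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._probes(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(item))

seen_exact = set()           # HashSet strategy: exact, memory grows per record
seen_approx = BloomFilter()  # BloomFilter strategy: fixed memory, small FP rate

for key in ["row-1", "row-2", "row-1"]:
    exact_dup = key in seen_exact
    approx_dup = seen_approx.might_contain(key)
    seen_exact.add(key)
    seen_approx.add(key)
    # On the second pass over "row-1", both strategies flag it as a duplicate;
    # for unseen keys the Bloom filter can (rarely) flag a false duplicate.
    print(key, exact_dup, approx_dup)
```

The downstream consequence is exactly the one Adam notes: a Bloom-filter false positive only means an occasional record is wrongly skipped or re-checked, which is tolerable when the database would reject the duplicate insert anyway.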
