Our use case is scraping a website for new files to download. We use DetectDuplicate keyed on the URL to avoid downloading the same file multiple times, backed by the HBase DistributedMapCache because the built-in one doesn't work properly in a cluster, and we don't really have a bounded set of keys anyway.
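
In case it helps anyone wiring up the same check in a custom processor instead of DetectDuplicate, the core of it boils down to a single putIfAbsent against the DistributedMapCacheClient. Rough sketch below; the class name, the "seen" value and the lambda serializer are just illustrative, only the putIfAbsent call is the real client API:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.nifi.distributed.cache.client.DistributedMapCacheClient;
    import org.apache.nifi.distributed.cache.client.Serializer;

    public class UrlDuplicateCheck {

        // Cache keys and values are written as UTF-8 bytes.
        private static final Serializer<String> STRING_SERIALIZER =
                (value, output) -> output.write(value.getBytes(StandardCharsets.UTF_8));

        /**
         * Records the URL in the cache if it hasn't been seen before.
         * Returns true for a new URL, false for a duplicate.
         */
        public static boolean recordIfNew(DistributedMapCacheClient cache, String url)
                throws IOException {
            // putIfAbsent is atomic on the cache service, so two nodes in a
            // cluster can't both claim the same URL.
            return cache.putIfAbsent(url, "seen", STRING_SERIALIZER, STRING_SERIALIZER);
        }
    }
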
On Sat, 15 Dec 2018, 14:21 Mike Thomsen <[email protected]> wrote:
> Sounds perfect.
>
> On Sat, Dec 15, 2018 at 9:11 AM Mark Payne <[email protected]> wrote:
> >
> > Mike,
> >
> > There is a DetectDuplicate processor. It gives you the ability to provide
> > an attribute to use for identification (for example, using a SHA256 hash or
> > looking at an identifier in the data or a filename, etc). It uses a
> > DistributedMapCacheClient to track this so it could be backed by Redis or
> > whatever other implementations we have available. Would that give you what
> > you need?
> >
> > Thanks
> > -Mark
> >
> > Sent from my iPhone
> >
> > > On Dec 15, 2018, at 8:52 AM, Mike Thomsen <[email protected]> wrote:
> > >
> > > We are getting a lot of independent submissions of data from various and
> > > sundry teams that work with our client, and our client may need a processor
> > > that roughly does this story:
> > >
> > > "as a NiFi user, I would like to be able to detect whether a file has been
> > > seen before and processed based on feedback from a RDBMS/HBase/Elastic and
> > > then be able to choose whether to reprocess it or drop it."
> > >
> > > Want to make sure that I'm not reinventing the wheel before writing such a
> > > processor.
> > >
> > > Thanks,
> > >
> > > Mike
