Adam Fisher created NIFI-6166:
---------------------------------
Summary: Add `Distributed HashSet Filter` Type to
DetectDuplicateRecord Processor
Key: NIFI-6166
URL: https://issues.apache.org/jira/browse/NIFI-6166
Project: Apache NiFi
Issue Type: Improvement
Components: Core Framework
Reporter: Adam Fisher
Currently the *DetectDuplicateRecord* processor supports the *HASH_SET_VALUE* and
*BLOOM_FILTER_VALUE* filter types. Adding *DISTRIBUTED_HASH_SET_VALUE* as a third
option would be useful when you have large datasets you want to check for
duplicates without loading all of the cached entries into memory:
{code:java}
static final AllowableValue DISTRIBUTED_HASH_SET_VALUE = new AllowableValue(
    "distributed-hash-set", "Distributed HashSet",
    "Exactly matches records seen before with 100% accuracy at the expense of more storage usage. " +
    "Stores one entry per record in the distributed cache, and checks the cache directly rather than " +
    "loading the filter into memory during duplicate detection. " +
    "This filter is preferred when processing large data sets and complete accuracy is required.");
{code}
When the user selects this filter type, the cache entry identifier should
probably be treated as a prefix, so the keys of entries put into the cache would
look like this:
{code:java}
CacheKey = CacheEntryIdentifier + Hash(RecordPath1 + "~" + RecordPath2 + "~" +
RecordPath3 + "~" + ...)
{code}
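A minimal sketch of how such a key could be built. This is not the processor's actual implementation; the helper name {{cacheKey}}, the SHA-256 hash choice, and the sample prefix are all assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class CacheKeyExample {

    // Hypothetical helper: prefixes the configured cache entry identifier to a
    // hash of the "~"-joined record-path values, mirroring the scheme above.
    static String cacheKey(String cacheEntryIdentifier, List<String> recordPathValues) {
        String joined = String.join("~", recordPathValues);
        try {
            // SHA-256 is an assumed hash choice for this sketch.
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(joined.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return cacheEntryIdentifier + hex;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should always be available", e);
        }
    }

    public static void main(String[] args) {
        // Sample prefix and record-path values, for illustration only.
        System.out.println(cacheKey("my-flow-", List.of("user@example.com", "2019-03-29")));
    }
}
```

Because each record gets its own entry under a shared prefix, the processor could check and store keys directly against the distributed cache without ever materializing the full set of seen records in memory.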
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)