adamfisher commented on pull request #3317: URL: https://github.com/apache/nifi/pull/3317#issuecomment-716726587
The cache key identifier could be used to identify the grouping of the data set and serve as a key prefix. It sounds like it's just a matter of how we store record-level hashes. The Bloom filter does have to be stored in a single cache record, but its size is fixed regardless of the data set size: it is driven only by the capacity set on the Bloom filter during initial configuration. We could cap that value to help protect the user from corruption. Storing individual hash sets for each record is obviously already an option with this implementation, but it's good to be able to use Bloom filters too, because I think there will be scenarios where people don't care about exactness in unique records and mainly want to eliminate a lot of duplicate data processing.
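To illustrate the fixed-size point, here is a minimal sketch of a Bloom filter whose storage is determined entirely by the configured capacity and false-positive rate. It is not the code in this PR: the class name `FixedSizeBloomFilter`, its methods, and the FNV-1a double-hashing scheme are all assumptions made for the example, and it uses the standard sizing formula m = -n ln(p) / (ln 2)^2.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the implementation used in the PR.
final class FixedSizeBloomFilter {
    private final long[] bits;   // backing storage, allocated once
    private final int numBits;   // m: number of bits in the filter
    private final int numHashes; // k: number of probes per element

    FixedSizeBloomFilter(int capacity, double falsePositiveRate) {
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        // Assumes the result fits in an int, which holds for modest capacities.
        double ln2 = Math.log(2);
        this.numBits = (int) Math.ceil(-capacity * Math.log(falsePositiveRate) / (ln2 * ln2));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / capacity * ln2));
        this.bits = new long[(numBits + 63) / 64];
    }

    // 64-bit FNV-1a hash, chosen only to keep the example dependency-free.
    private static long fnv1a64(byte[] data) {
        long hash = 0xcbf29ce484222325L;
        for (byte b : data) {
            hash ^= (b & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Derives the i-th probe position from two 32-bit halves (double hashing).
    private int index(long hash, int i) {
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32) | 1; // force odd so probes are not all identical
        return (int) Math.floorMod((long) h1 + (long) i * h2, (long) numBits);
    }

    void put(String recordHash) {
        long hash = fnv1a64(recordHash.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < numHashes; i++) {
            int idx = index(hash, i);
            bits[idx >>> 6] |= 1L << (idx & 63);
        }
    }

    // May return true for a value never added (false positive),
    // but never returns false for a value that was added.
    boolean mightContain(String recordHash) {
        long hash = fnv1a64(recordHash.getBytes(StandardCharsets.UTF_8));
        for (int i = 0; i < numHashes; i++) {
            int idx = index(hash, i);
            if ((bits[idx >>> 6] & (1L << (idx & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    // Size of the payload that would be written to a single cache record.
    int sizeInBytes() {
        return bits.length * Long.BYTES;
    }

    public static void main(String[] args) {
        FixedSizeBloomFilter filter = new FixedSizeBloomFilter(1000, 0.01);
        int before = filter.sizeInBytes();
        for (int i = 0; i < 500; i++) {
            filter.put("record-" + i);
        }
        System.out.println("bytes before=" + before + ", after=" + filter.sizeInBytes());
    }
}
```

The byte count is the same before and after the inserts, so the cache record holding the filter does not grow with the data set. The trade-off is that exceeding the configured capacity raises the false-positive rate rather than the storage cost.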
