[ https://issues.apache.org/jira/browse/APEXMALHAR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886659#comment-15886659 ]
bright chen commented on APEXMALHAR-2366:
-----------------------------------------

Hi [~bhupesh]
The only difference, as far as I can tell, is that this BloomFilter implementation uses SerializationBuffer to avoid some copying and garbage collection. I am not sure how much that affects performance.
Another point is that Chaitanya's BloomFilter is in Megh. It would at least need to be moved into the Malhar library before we can use it, and I am not sure whether there is any license issue either.

> Apply BloomFilter to Bucket
> ---------------------------
>
>                 Key: APEXMALHAR-2366
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2366
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>            Reporter: bright chen
>            Assignee: bright chen
>   Original Estimate: 192h
>  Remaining Estimate: 192h
>
> The bucket get() checks the cache first and, if the entry is not in the
> cache, then checks the stored files. Checking the files is a heavy
> operation because of the file seeks.
> The chance of having to check the files is very high if the key range is large.
> I suggest applying a BloomFilter to the bucket to reduce the chance of reading from file.
> If the buckets are managed by ManagedStateImpl, the number of entries in a bucket
> can become very large and the BloomFilter may no longer be useful after a while. But if the
> buckets are managed by ManagedTimeUnifiedStateImpl, each bucket keeps a bounded
> number of entries and the BloomFilter would be very useful.
> For implementation:
> Guava already has a BloomFilter, and its interface is simple and
> fits our case. But Guava 11 is not compatible with Guava 14 (Guava 11 uses
> Sink while Guava 14 uses PrimitiveSink).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
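
For illustration only, here is a rough sketch of the lookup path the issue describes: check the cache, then consult a Bloom filter, and touch the stored files only when the filter says the key may be present. Everything in it is an assumption made for the example: SimpleBloomFilter and BucketSketch are hypothetical names, the "file" is just an in-memory map, and the hashing is a simple double-hashing scheme. It is not the Malhar Bucket code, nor the Guava or Megh BloomFilter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal Bloom filter (not Guava's or Megh's implementation).
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a, used as the second base hash.
    private static int fnv1a(byte[] data) {
        int h = 0x811c9dc5;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    private int index(byte[] key, int i) {
        int h1 = Arrays.hashCode(key);
        int h2 = fnv1a(key);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void put(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false: key was definitely never added; true: key may have been added.
    boolean mightContain(byte[] key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}

// Hypothetical stand-in for a bucket: a cache plus a map that plays the stored files.
class BucketSketch {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> file = new HashMap<>();
    private final SimpleBloomFilter filter;
    int fileReads = 0; // counts how often the expensive path is taken

    BucketSketch(int numBits, int numHashes) {
        this.filter = new SimpleBloomFilter(numBits, numHashes);
    }

    void put(String key, String value) {
        cache.put(key, value);
        filter.put(key.getBytes(StandardCharsets.UTF_8));
    }

    // Simulates persisting all cached entries and evicting them from the cache.
    void flushToFile() {
        file.putAll(cache);
        cache.clear();
    }

    String get(String key) {
        String value = cache.get(key);
        if (value != null) {
            return value;
        }
        if (!filter.mightContain(key.getBytes(StandardCharsets.UTF_8))) {
            return null; // definitely absent, so the file seek is skipped
        }
        fileReads++; // possible hit (or false positive): pay for the file lookup
        return file.get(key);
    }
}
```

In this sketch a key that was never put into the bucket usually returns without touching the file map, while a false positive only costs one extra lookup. The filter never grows, which mirrors the concern above: if a bucket keeps accumulating entries, the fixed-size bit set fills up and the filter stops rejecting anything.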