kishoreg opened a new pull request #3528: Adding support for bloom filter
URL: https://github.com/apache/incubator-pinot/pull/3528

Functional but needs clean up and test cases (WIP).

Bloom filters can be very effective in pruning segments. This PR generates the bloom filter dynamically based on tableconfig->indexingconfig->bloomFilterColumns, and enhances the ColumnValueSegmentPruner to apply the bloom filter if one exists.

Sample stats

Without bloom filter:
  "numSegmentsProcessed": 136, "numSegmentsWithNoMatch": 128

With bloom filter:
  "numSegmentsProcessed": 14, "numSegmentsWithNoMatch": 6

The number of segments processed drops from 136 to 14. This, of course, comes with the additional overhead of creating and evaluating the bloom filter.

The current implementation loads the bloom filter on heap, and the bloom filter can be quite big. For example, the size of the bloom filter for real-time segments ranges from 300 to 700 KB. I am thinking of two options:

1. Limit the size of the bloom filter to 1 MB and sacrifice accuracy.
2. Off-heap implementation of the bloom filter.

For now, we will start with option 1 and add support for option 2 later. That should not be hard: we just need an off-heap bitset hooked into the bloom filter.

I compared the ClearSpring bloom filter with Guava (String and Integer). Guava was slightly better in terms of size, but the ClearSpring API was much simpler. I will try to add some additional metadata to the serialized data so that we can switch the format later.
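
To make the pruning idea and the size cap from option 1 concrete, here is a minimal sketch. It is not the code in this PR and it uses neither the ClearSpring nor the Guava filter: `BloomPruningSketch`, `SimpleBloomFilter`, `Segment`, `cappedNumBits`, and `segmentsToProcess` are hypothetical names, and the filter is a plain `java.util.BitSet` with double hashing. The only property it relies on is that a Bloom filter never reports a false negative, so a segment whose filter says "absent" can be skipped safely.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Illustrative sketch only: every name here is hypothetical, not Pinot's API.
final class BloomPruningSketch {

  /** Bits for a target false-positive rate, capped at maxBytes (option 1 above). */
  static int cappedNumBits(long expectedEntries, double fpp, long maxBytes) {
    // Standard sizing formula: m = -n * ln(p) / (ln 2)^2
    double optimal = Math.ceil(-expectedEntries * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    // Capping below the optimal size raises the real false-positive rate.
    return (int) Math.max(1, Math.min(optimal, (double) (maxBytes * 8)));
  }

  /** Minimal on-heap Bloom filter backed by a BitSet. */
  static final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    void add(String value) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(index(value, i));
      }
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContain(String value) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(index(value, i))) {
          return false;
        }
      }
      return true;
    }

    // Double hashing: derive the i-th probe from two base hashes.
    private int index(String value, int i) {
      return Math.floorMod(value.hashCode() + i * fnv1a(value), numBits);
    }

    private static int fnv1a(String value) {
      int hash = 0x811C9DC5;
      for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
        hash ^= (b & 0xFF);
        hash *= 16777619;
      }
      return hash;
    }
  }

  /** A segment name plus an optional filter for one column. */
  static final class Segment {
    final String name;
    final SimpleBloomFilter filter; // null when no bloom filter was built

    Segment(String name, SimpleBloomFilter filter) {
      this.name = name;
      this.filter = filter;
    }
  }

  /** Keeps segments that have no filter or whose filter may contain the value. */
  static List<String> segmentsToProcess(List<Segment> segments, String value) {
    List<String> kept = new ArrayList<>();
    for (Segment segment : segments) {
      if (segment.filter == null || segment.filter.mightContain(value)) {
        kept.add(segment.name);
      }
    }
    return kept;
  }
}
```

With a 1% target false-positive rate, the sizing formula asks for roughly 9.6 bits per entry, so in this sketch a column with one million distinct values would exceed a 1 MB cap and be truncated to 8,388,608 bits. That truncation is the accuracy trade-off described in option 1.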
