kishoreg opened a new pull request #3528: Adding support for bloom filter
URL: https://github.com/apache/incubator-pinot/pull/3528
 
 
   Functional, but needs cleanup and test cases (WIP).
   Bloom filters can be very effective at pruning segments. This PR generates 
the bloom filter dynamically for the columns listed in 
tableconfig->indexingconfig->bloomFilterColumns. 
   It also enhances ColumnValueSegmentPruner to apply the bloom filter when one exists.
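   To illustrate the pruning idea, here is a minimal sketch in Python. It is not the code in this PR (Pinot is Java); `SimpleBloomFilter`, `prune_segments`, and the segment layout are hypothetical names, and the sketch assumes an equality predicate on a single column. A segment is skipped only when its filter says the value is definitely absent; segments without a filter are always kept.

```python
import hashlib


class SimpleBloomFilter:
    """Toy bloom filter (illustrative only, not Pinot's implementation)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(value)
        )


def prune_segments(segments, column, value):
    """Return names of segments that may contain `value` in `column`.

    `segments` is a list of (name, {column: SimpleBloomFilter}) pairs
    (a hypothetical layout chosen for this sketch).
    """
    kept = []
    for name, filters in segments:
        bloom = filters.get(column)
        if bloom is None or bloom.might_contain(value):
            kept.append(name)
    return kept


if __name__ == "__main__":
    seg_a = SimpleBloomFilter(1024, 3)
    seg_a.add("u1")
    seg_empty = SimpleBloomFilter(1024, 3)  # nothing added
    segments = [
        ("seg_a", {"userId": seg_a}),
        ("seg_empty", {"userId": seg_empty}),
        ("seg_nofilter", {}),
    ]
    print(prune_segments(segments, "userId", "u1"))
    # prints ['seg_a', 'seg_nofilter']
```

   Because a bloom filter never returns a false negative, pruning this way cannot drop a segment that actually contains the value; it can only fail to prune one.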
   
   Sample stats
   Without bloom filter
       "numSegmentsProcessed": 136,
       "numSegmentsWithNoMatch": 128
   With bloom filter
       "numSegmentsProcessed": 14,
       "numSegmentsWithNoMatch": 6
   The number of segments processed drops from 136 to 14.
   
   This, of course, comes with the additional overhead of creating and 
evaluating the bloom filter. The current implementation loads the bloom filter 
on heap, and the filter can be quite big. For example, the bloom filters for 
real-time segments range from 300 to 700 KB.
   
   I am thinking of two options:
   1. Limit the size of the bloom filter to 1 MB and sacrifice accuracy.
   2. Implement an off-heap bloom filter. 
   
   For now, we will start with 1 and add support for 2 later. This should not be 
hard; we just need an off-heap bitset hooked into the bloom filter.
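   To make the accuracy trade-off of option 1 concrete, here is a rough sketch in Python using the standard bloom filter sizing formulas. The function names, the 1 MB constant, and the example cardinalities are assumptions for illustration, not values taken from this PR.

```python
import math

MAX_BYTES = 1024 * 1024  # the 1 MB cap from option 1


def optimal_num_bits(n, p):
    """Bits needed for n distinct values at false-positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_num_hashes(m, n):
    """Number of hash functions that minimizes the rate for m bits, n values."""
    return max(1, round(m / n * math.log(2)))


def expected_fpp(m, n, k):
    """Approximate false-positive rate for m bits, n values, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k


def capped_filter(n, target_fpp, max_bytes=MAX_BYTES):
    """Size a filter for target_fpp, but never exceed max_bytes."""
    m = min(optimal_num_bits(n, target_fpp), max_bytes * 8)
    k = optimal_num_hashes(m, n)
    return m, k, expected_fpp(m, n, k)


if __name__ == "__main__":
    # Hypothetical column cardinalities, chosen only to show both regimes.
    for n in (100_000, 10_000_000):
        m, k, fpp = capped_filter(n, 0.01)
        print(f"n={n} bytes={m // 8} hashes={k} fpp={fpp:.4f}")
```

   When the uncapped size fits under the limit, the target rate is met. Once the cap kicks in, the false-positive rate rises with cardinality, so fewer segments are pruned, but results stay correct because a bloom filter never produces false negatives.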
   
   I compared the ClearSpring bloom filter with Guava's (String and Integer). 
Guava was slightly better in terms of size, but the ClearSpring API was much 
simpler. I will try to add some additional metadata to the serialized data so 
that we can switch the format later.
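   One way the extra metadata could look is a small fixed header in front of the filter bytes. The sketch below is in Python and purely illustrative: the magic value, field layout, and type ids are assumptions, not the format this PR writes.

```python
import struct

# Hypothetical header: magic, format version, implementation type id,
# number of hash functions, number of bits. Big-endian, no padding.
MAGIC = b"BLMF"
HEADER = struct.Struct(">4sHHIQ")

# Hypothetical type ids so a reader can tell which library wrote the payload.
TYPE_CLEARSPRING = 1
TYPE_GUAVA = 2


def serialize(type_id, version, num_hashes, num_bits, payload):
    """Prefix the raw filter bytes with a self-describing header."""
    header = HEADER.pack(MAGIC, version, type_id, num_hashes, num_bits)
    return header + bytes(payload)


def deserialize(data):
    """Parse the header and return the metadata plus the raw payload."""
    magic, version, type_id, num_hashes, num_bits = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a bloom filter blob")
    return {
        "version": version,
        "type_id": type_id,
        "num_hashes": num_hashes,
        "num_bits": num_bits,
        "payload": data[HEADER.size:],
    }
```

   With a version and type id up front, a reader can dispatch to the matching deserializer, so switching libraries later would not require regenerating existing segments.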
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
