ankitsultana commented on PR #16364:
URL: https://github.com/apache/pinot/pull/16364#issuecomment-3084910778

   @chenboat: great to see that we are looking to add better support for n-gram 
indexes in Pinot. I ran an internal PoC on this a couple of years ago with a 
similar approach, and one of the main learnings was that the n-gram index can 
get very large with an approach like this.
   
   The issue is that this approach stores a bitmap for each unique n-gram. Even 
for a moderately high number of unique n-grams, the corresponding bitmaps can 
be quite dense and random, so the cumulative size of all bitmaps can very 
quickly exceed hundreds of MB per segment (i.e. close to, or much larger than, 
the size of the rest of the segment).
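
   To make the size concern concrete, here is a minimal Python sketch of the 
general "one posting list per unique n-gram" layout. It is an illustration 
only, not the code in this PR or Pinot's implementation: plain Python sets 
stand in for bitmaps, the helper names (`ngrams`, `build_ngram_index`, 
`total_postings`) are hypothetical, and the posting count is only a rough 
proxy for on-disk size.

```python
from collections import defaultdict


def ngrams(value, n):
    """Return the distinct character n-grams of a string."""
    return {value[i:i + n] for i in range(len(value) - n + 1)}


def build_ngram_index(values, n):
    """Map each unique n-gram to the set of doc ids that contain it.

    Assumption: a Python set stands in for the per-n-gram bitmap.
    """
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        for gram in ngrams(value, n):
            index[gram].add(doc_id)
    return dict(index)


def total_postings(index):
    """Total number of (n-gram, doc id) pairs: a rough proxy for index size."""
    return sum(len(doc_ids) for doc_ids in index.values())


values = ["pinot", "point", "paint"]
index = build_ngram_index(values, 2)
print(len(index), "unique 2-grams,", total_postings(index), "postings")
```

   Every document adds one posting for each distinct n-gram it contains, so 
the total grows with both the number of documents and the length of the 
values, independent of how the individual bitmaps are encoded.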
   
   I shelved the PoC at the time because I thought we would need to add 
several knobs to the index to support keeping only the top n-grams once the 
size exceeds a certain threshold.
   
   Have you already run any experiments around this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

