ankitsultana commented on PR #16364: URL: https://github.com/apache/pinot/pull/16364#issuecomment-3084910778
@chenboat: Great to see that we are looking to add better support for n-gram indexes in Pinot. I ran an internal PoC on this a couple of years ago with a similar approach, and one of the main learnings was that the n-gram index can get very large. The issue is that this approach stores one bitmap per unique n-gram. Even for a moderately high number of unique n-grams, the corresponding bitmaps can be quite dense and random, so their cumulative size can quickly exceed hundreds of MB per segment (i.e., nearly equal to or much larger than the rest of the segment). I shelved the PoC at the time because I thought we would need to add several knobs to the index, such as keeping only the top n-grams once the size exceeds a certain threshold. Have you already run any experiments around this?
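
To make the size concern concrete, here is a minimal sketch of a "one posting set per unique n-gram" index. It is not Pinot's implementation: the function names `build_ngram_index` and `posting_count`, the sample values, and the use of plain Python sets in place of compressed bitmaps are all assumptions made for illustration.

```python
from collections import defaultdict


def build_ngram_index(values, n=3):
    """Map each unique n-gram to the set of doc IDs whose value contains it.

    Hypothetical helper: a Python set stands in for the per-n-gram bitmap.
    """
    index = defaultdict(set)
    for doc_id, value in enumerate(values):
        # Slide a window of length n over the value; short values yield nothing.
        for i in range(len(value) - n + 1):
            index[value[i:i + n]].add(doc_id)
    return index


def posting_count(index):
    """Total number of (n-gram, doc ID) entries across all posting sets."""
    return sum(len(doc_ids) for doc_ids in index.values())


# Three short values already produce 8 unique trigrams and 9 postings.
values = ["pinot", "point", "paint"]
index = build_ngram_index(values, n=3)
print(len(index), posting_count(index))
```

Each unique n-gram gets its own structure, so the index grows with both the number of distinct n-grams and the number of documents each one matches. Real bitmaps are compressed, but dense and randomly distributed doc IDs leave little for compression to exploit.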
