jadami10 opened a new issue, #8932:
URL: https://github.com/apache/pinot/issues/8932

   This issue is getting us into the realm of other database functionality. 
Because we set up so many Pinot tables and queries, we try to standardize how we 
select indices on all of our tables. The current logic is that all dimension 
columns get an inverted index, the time column gets a range index, and a 
designated "key column" also gets a bloom filter.
   
   While we expected this to be a clear trade-off between cost (generating and 
storing indices) and better latency, there are several cases where we see 
10x worse performance due to this approach. Specifically, when we have a 
table with tens of millions of documents, filtering by the "key column" gets 
us down to thousands of documents. It turns out that using the inverted index 
for the remaining filters actually makes queries return much more slowly than 
just looking up and aggregating those documents directly.
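
   To make the effect concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model, not Pinot's actual planner: `Predicate`, `index_cost`, `scan_cost`, and `choose_plan` are hypothetical names, and the cost formulas are simplifying assumptions (index work grows with the number of matching entries read, scan work grows with the number of surviving documents).

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    """Hypothetical filter predicate with an estimated match count."""
    column: str
    matching_docs: int  # documents in the table that satisfy this predicate


def index_cost(predicates):
    # Toy assumption: intersecting index entries costs work proportional
    # to the total number of matching entries read across all predicates.
    return sum(p.matching_docs for p in predicates)


def scan_cost(candidate_docs, remaining_predicates):
    # Toy assumption: scanning evaluates each remaining predicate once
    # per document that survived the most selective filter.
    return candidate_docs * len(remaining_predicates)


def choose_plan(predicates):
    """Return ("index" or "scan", cost) for the cheaper toy plan."""
    ordered = sorted(predicates, key=lambda p: p.matching_docs)
    most_selective, rest = ordered[0], ordered[1:]
    all_index = index_cost(ordered)
    # Alternative: index only the most selective predicate, scan the rest.
    hybrid = index_cost([most_selective]) + scan_cost(
        most_selective.matching_docs, rest
    )
    return ("scan", hybrid) if hybrid < all_index else ("index", all_index)


# Illustrative numbers only: the key filter leaves 5,000 documents, while
# the other two filters each match millions of documents.
plan, cost = choose_plan([
    Predicate("key_column", 5_000),
    Predicate("dim_a", 10_000_000),
    Predicate("dim_b", 15_000_000),
])
print(plan, cost)  # scan 15000
```

   Under these assumptions, using every index reads about 25 million entries, while filtering on the key column and then checking the remaining predicates on the 5,000 survivors costs about 15 thousand units of work.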
   
   Some ideas here:
   - Have Pinot automatically recognize these cases and stop using the index.
   - Give users some control over which indices are used (we've had similar 
examples where Pinot uses a star-tree index even though we know there are not 
that many records, or a range index even when we know it won't be useful for 
that query).
   - Provide information on which indices were used in the query response. 
Without this, you have to constantly infer it from other query stats, such as 
rows scanned in filter vs. post filter.

