renatocron commented on issue #14914: URL: https://github.com/apache/druid/issues/14914#issuecomment-1699028147
Hello! 2 years is potentially a lot of segments that need to be scanned, so you are missing one of the main advantage of having a primary index based on time, if you ask the database to scan all the segments all the time. There are a few options to consider: You can try turn cache on on the historical nodes (no on the broker, caching on the broker would not matter for this query), that could help if you don't have any late arriving data and reserve a some RAM for caching, both on historicals node RAM as well on the historical: https://druid.apache.org/docs/latest/configuration/#historical-caching But I'm afraid druid can't otimize `longLast`, because it may need to scan all the matched segments to send to the broker to compare which is the overall latest. You can consider make a external Service / Application: This service could be configured to read data from the same ingestion sources and update a separate, faster storage system (like Redis, DuckDB or an in-memory database). Here's a simple outline of how you could design it: - The service reads from the same ingestion source that Druid uses. - It updates a faster storage (like Redis). - The structures stored in this faster database mirror those in Druid. - You query the faster storage directly for the most recent transactions. - If this fails, you fall back to querying Druid. - This system would also rely on Druid as a source of truth, when your application boots up or if the in-memory database fails. This approach frees you from sending each query to Druid and allows you to retrieve the most recent transactions much more quickly. It does, however, require this additional service to be maintained. 
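For reference, enabling the historical cache might look roughly like this in the historical's `runtime.properties` (a sketch; the size is a placeholder you would tune to your hardware):

```
# historical runtime.properties (sketch; tune sizes for your deployment)
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
# On-heap cache size in bytes (example: 1 GiB); account for it in the heap budget
druid.cache.sizeInBytes=1073741824
```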
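The outline above could be sketched like this. This is only an illustration: `LatestTransactionService` and `druid_fallback` are hypothetical names, the dict stands in for Redis/DuckDB, and the Druid query itself is assumed to be wrapped behind the fallback callable:

```python
# Sketch of a read-through "latest transaction" store with Druid as fallback
# and source of truth. The dict stands in for Redis/DuckDB; `druid_fallback`
# is a hypothetical callable wrapping a Druid query (e.g. one using longLast).
from typing import Callable, Dict, Optional


class LatestTransactionService:
    def __init__(self, druid_fallback: Callable[[str], Optional[dict]]):
        self._store: Dict[str, dict] = {}  # key -> latest row seen so far
        self._druid_fallback = druid_fallback

    def on_ingest(self, key: str, row: dict) -> None:
        """Called for every event read from the same source Druid ingests from;
        keeps only the row with the greatest timestamp per key."""
        current = self._store.get(key)
        if current is None or row["__time"] >= current["__time"]:
            self._store[key] = row

    def latest(self, key: str) -> Optional[dict]:
        """Serve from the fast store; on a miss, fall back to Druid."""
        row = self._store.get(key)
        if row is not None:
            return row
        row = self._druid_fallback(key)  # query Druid, the source of truth
        if row is not None:
            self._store[key] = row  # warm the store from Druid's answer
        return row
```

Usage would look like feeding `on_ingest` from your ingestion stream and serving reads from `latest`, so only cold keys (or a rebuild after a restart) ever hit Druid.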
Nevertheless, you could also check whether segment compaction is optimized for your needs.

**Segment optimization (secondary partitioning):** if your data has some natural order or frequently queried fields, you can specify secondary partitioning using autocompaction with range partitioning, for example: [https://druid.apache.org/docs/latest/ingestion/native-batch#benefits-of-range-partitioning](https://druid.apache.org/docs/latest/ingestion/native-batch#benefits-of-range-partitioning). Although time-based partitioning is the primary indexing method in Druid, secondary partitioning can improve query latency by reducing the amount of data that needs to be scanned. For example, you could group rows with `triggerState=COMPLETED` into their own segments, or create a new datasource just for them.
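As a sketch, an autocompaction config partitioning by `triggerState` might look like this (the datasource name and row target are placeholders):

```json
{
  "dataSource": "transactions",
  "tuningConfig": {
    "partitionsSpec": {
      "type": "range",
      "partitionDimensions": ["triggerState"],
      "targetRowsPerSegment": 5000000
    }
  }
}
```

With range partitioning on `triggerState`, queries filtering on that dimension can prune whole segments instead of scanning everything in the time range.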
