churromorales commented on issue #15412: URL: https://github.com/apache/lucene/issues/15412#issuecomment-3503947234
Thank you for the pointer. This approach works better for us than `TieredMergePolicy`. However, we have a very large index (1 PB+) in Solr, so we inevitably have far more data on disk than fits in RAM, which causes frequent paging in and out. We retain customer data indefinitely (10+ years), but most queries cover only the last 3 years, and a significant portion focus on just the past 7 days.

Our time dimension is a `PointField`, and from reviewing the code, it seems that Solr performs segment pruning when the query's time range doesn't overlap a segment's time range (given that they are point fields). This pruning provides a substantial performance improvement because segments known not to contain matching data are never paged in.

I've been considering a configuration that defines exponential time windows between the Unix epoch and now. These windows wouldn't slide over time; instead, as time progresses, new windows would be added while older ones merge into larger intervals. There would also be a setting to stop compacting our oldest segments altogether. I worked on this type of compaction in HBase, which is also an LSM-tree design, and it worked out quite well for us: it considerably reduced the I/O needed for range queries between now and some point in the past.

I believe the main performance gain comes from avoiding the need to page segments in and out during recent range queries. While filtering itself is fast, it still incurs some memory overhead (which affects other queries), and nothing beats simply not having to read unnecessary data at all. So while merging adjacent segments works, it might not lay out the data on disk the way we would like.

Also, please correct me if I'm mistaken here; I'm still relatively new to Lucene/Solr/OpenSearch.
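To make the window idea concrete, here is a minimal Python sketch of epoch-aligned windows that grow exponentially with age. It is only an illustration under stated assumptions: `time_windows`, `windows_touched`, and the parameters `base`, `factor`, and `max_age` are hypothetical names, not Lucene, Solr, or HBase APIs, and the promotion rule is a simplification of the date-tiered idea rather than the actual HBase implementation.

```python
def time_windows(now, base, factor, max_age):
    """Return epoch-aligned [start, end) windows, newest first.

    now, base, max_age are in the same time unit (e.g. seconds).
    base    -- size of the newest windows (hypothetical setting)
    factor  -- how much larger each older tier is (hypothetical setting)
    max_age -- data older than this is left alone, i.e. never merged again
    """
    windows = []
    size = base
    start = (now // size) * size  # align to the epoch so windows never slide
    end = start + size
    oldest = now - max_age
    while end > oldest and end > 0:
        windows.append((start, end))
        # Move to a larger window only on a boundary of the next tier,
        # so every window stays aligned to a multiple of its own size.
        if start % (size * factor) == 0:
            size *= factor
        end = start
        start = end - size
    return windows


def windows_touched(windows, q_start, q_end):
    """Windows whose [start, end) range overlaps the inclusive query range."""
    return [(s, e) for (s, e) in windows if s <= q_end and e > q_start]


if __name__ == "__main__":
    ws = time_windows(now=95, base=10, factor=2, max_age=95)
    print(ws)
    print(windows_touched(ws, 88, 95))
```

Because the boundaries are fixed multiples of the window size, a recent range query overlaps only the few small, new windows, and everything in the large, old windows can be skipped on its time range alone.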
