churromorales commented on issue #15412:
URL: https://github.com/apache/lucene/issues/15412#issuecomment-3503947234

   Thank you for the pointer. This approach works better for us than 
`TieredMergePolicy`. However, our Solr index is very large (1 PB+), so we 
inevitably have far more data on disk than fits in RAM, which causes frequent 
paging in and out.
   
   We retain customer data indefinitely (10+ years), but most queries cover 
only the last 3 years, and a significant portion focus on just the past 7 
days. Our time dimension is a `PointField`, and from reviewing the code, it 
seems that Solr prunes a segment when the query's time range doesn't overlap 
the segment's time range (which is possible because these are point fields). 
This pruning provides a substantial performance improvement, since segments 
known not to contain matching data are never paged in.
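
   To make the pruning idea concrete, here is a minimal sketch of the overlap 
check as I understand it. It is not Lucene's or Solr's actual code path: 
`SegmentPruning`, `Segment`, `mayMatch`, and `segmentsToRead` are hypothetical 
names, and the only assumption is that each segment knows the smallest and 
largest timestamp it contains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from Lucene or Solr.
class SegmentPruning {
    // Smallest and largest timestamp stored in one segment (both inclusive).
    static final class Segment {
        final String name;
        final long minTime;
        final long maxTime;

        Segment(String name, long minTime, long maxTime) {
            this.name = name;
            this.minTime = minTime;
            this.maxTime = maxTime;
        }
    }

    // True if the inclusive query range [queryMin, queryMax] overlaps the segment's range.
    static boolean mayMatch(Segment s, long queryMin, long queryMax) {
        return s.minTime <= queryMax && s.maxTime >= queryMin;
    }

    // Names of the segments that still have to be read for the query.
    static List<String> segmentsToRead(List<Segment> segments, long queryMin, long queryMax) {
        List<String> out = new ArrayList<>();
        for (Segment s : segments) {
            if (mayMatch(s, queryMin, queryMax)) {
                out.add(s.name);
            }
        }
        return out;
    }
}
```

   The narrower each segment's time range, the more segments a "last 7 days" 
query can skip, which is why the on-disk layout produced by merging matters.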
   
   I’ve been considering a configuration that defines exponential time windows 
between the Unix epoch and now. These windows wouldn’t slide over time; 
instead, as time progresses, new windows would be added while older ones merge 
into larger intervals. I would also add a setting to stop compacting our 
oldest segments altogether. I worked on this type of compaction in HBase, which 
is also an LSM-tree design, and it worked out quite well for us: it 
considerably reduced the I/O needed for range queries between now and some 
point in the past.
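
   A rough sketch of how such windows could be computed, loosely modeled on 
the date-tiered idea I described. This is an assumption-laden illustration, not 
an existing Lucene or HBase API: `ExponentialWindows`, `Window`, `baseSize`, 
`growth`, and `maxWindows` are hypothetical names, and the cutoff for very old 
segments is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and parameters are illustrative, not a Lucene or HBase API.
class ExponentialWindows {
    // Half-open time window [start, end), in the same unit as the timestamps.
    static final class Window {
        final long start;
        final long end;

        Window(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // Lists windows from newest to oldest, stopping at the epoch or after maxWindows.
    // Assumes now >= 0, baseSize > 0, growth >= 2 and maxWindows >= 1.
    static List<Window> windows(long now, long baseSize, int growth, int maxWindows) {
        List<Window> out = new ArrayList<>();
        long size = baseSize;
        long start = (now / size) * size; // start of the base window containing now
        out.add(new Window(start, start + size));
        while (start > 0 && out.size() < maxWindows) {
            long bigger = size * growth;
            // Promote to the next tier only on a boundary of the larger size, so every
            // window stays aligned to a multiple of its own size counted from the epoch.
            if (start % bigger == 0) {
                size = bigger;
            }
            start -= size;
            out.add(new Window(start, start + size));
        }
        return out;
    }
}
```

   Because every boundary is a multiple of the window size counted from the 
epoch, the boundaries do not move as `now` advances; a window only disappears 
when it is absorbed into a larger aligned window. A merge policy built on this 
would merge only segments whose time ranges fall inside the same window.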
   
   I believe the main performance gain comes from avoiding the need to page 
segments in and out during recent range queries. While filtering itself is 
fast, it still incurs some memory overhead (which affects other queries) — and 
nothing beats simply not having to read unnecessary data at all.  So while 
merging adjacent segments works, it might not lay out the data on disk as we 
would desire.
   
   Also please correct me if I’m mistaken here — I’m still relatively new to 
Lucene/Solr/OpenSearch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

