[
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921796#comment-16921796
]
David Smiley commented on LUCENE-8962:
--------------------------------------
At Salesforce I worked on a custom merge policy that handles small segments
better than TieredMergePolicy (TMP) does. What's disappointing about TMP is
that it insists on merging getSegmentsPerTier() (10) segments at a time, _even
when they are small_ (below getFloorSegmentMB()). Instead we wanted some
"cheap merges" of a smaller number of segments (as few as 3 for us) that
consist solely of small segments. This cut our average segment count in
half, although it cost us more I/O, a trade-off we were happy with. I'd like to
open-source this, perhaps as a direct change to TMP with defaults that do a
similar amount of I/O but average fewer segments. The difficult part is
running simulations to prove out the theories.
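
To make the "cheap merge" idea concrete, here is a minimal, Lucene-free sketch of the selection step only. It is not the Salesforce implementation and does not use the real MergePolicy API; the class name, method name, and parameters (floorMB, minSegments, maxSegments) are hypothetical, and segments are modeled as plain sizes in MB.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: segments are represented only by their size in MB.
final class CheapMergeSketch {

    /**
     * Groups segments smaller than floorMB into "cheap" merges.
     * A merge is proposed only if at least minSegments small segments
     * remain; each merge takes at most maxSegments of them.
     * Segments at or above floorMB are never included.
     */
    static List<List<Double>> selectCheapMerges(
            List<Double> segmentSizesMB, double floorMB,
            int minSegments, int maxSegments) {
        // Collect only the small segments.
        List<Double> small = new ArrayList<>();
        for (double size : segmentSizesMB) {
            if (size < floorMB) {
                small.add(size);
            }
        }
        // Smallest first, so each merge combines similarly sized segments.
        small.sort(Comparator.naturalOrder());

        List<List<Double>> merges = new ArrayList<>();
        for (int start = 0; small.size() - start >= minSegments; start += maxSegments) {
            int end = Math.min(start + maxSegments, small.size());
            merges.add(new ArrayList<>(small.subList(start, end)));
        }
        return merges;
    }
}
```

With a floor of 2 MB and minSegments of 3, four small segments alongside one 500 MB segment yield a single merge of just the four small ones, whereas a policy that waits for 10 segments would leave them unmerged.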
Additionally, I worked on a custom MergeScheduler that executed those "cheap
merges" synchronously (directly in the calling thread) while passing all
other, regular merges through to the concurrent scheduler. The rationale wasn't
tied to NRT, but I could see NRT benefiting from this if getting an NRT searcher
calls out to the merge code (I don't know if it does).
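
The routing described above can be sketched without Lucene as follows. This is only an illustration of the dispatch decision, not the actual scheduler: PendingMerge, HybridSchedulerSketch, and the "all segments below floorMB" cheapness test are assumptions made for the example, and a plain java.util.concurrent.Executor stands in for the concurrent scheduler.

```java
import java.util.List;
import java.util.concurrent.Executor;

// Hypothetical sketch: runs cheap merges inline, hands the rest to a background executor.
final class HybridSchedulerSketch {

    /** A merge to run: the sizes of its input segments plus the work itself. */
    static final class PendingMerge {
        final List<Double> segmentSizesMB;
        final Runnable work;

        PendingMerge(List<Double> segmentSizesMB, Runnable work) {
            this.segmentSizesMB = segmentSizesMB;
            this.work = work;
        }
    }

    private final Executor background; // stand-in for the concurrent scheduler
    private final double floorMB;      // assumed cheapness threshold

    HybridSchedulerSketch(Executor background, double floorMB) {
        this.background = background;
        this.floorMB = floorMB;
    }

    /** Assumption for this sketch: a merge is cheap if every input segment is below the floor. */
    boolean isCheap(PendingMerge merge) {
        for (double size : merge.segmentSizesMB) {
            if (size >= floorMB) {
                return false;
            }
        }
        return true;
    }

    void schedule(PendingMerge merge) {
        if (isCheap(merge)) {
            merge.work.run();               // synchronous, in the calling thread
        } else {
            background.execute(merge.work); // regular merge, off the calling thread
        }
    }
}
```

The trade-off is that the calling thread pays the latency of each cheap merge, which is acceptable only when those merges stay small.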
Perhaps your use-case could benefit from this as well. Unlike what you propose
in the description, it doesn't involve changes/features to Lucene itself. WDYT?
> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
> Key: LUCENE-8962
> URL: https://issues.apache.org/jira/browse/LUCENE-8962
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael McCandless
> Priority: Major
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory
> segments to disk and open an {{IndexReader}} to search them, and this is
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}}
> will write many small segments during {{refresh}} and this then
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if
> given a little time ... so, could we somehow improve {{IndexWriter}}'s
> refresh to optionally kick off merge policy to merge segments below some
> threshold before opening the near-real-time reader? It'd be a bit tricky
> because while we are waiting for merges, indexing may continue, and new
> segments may be flushed, but those new segments shouldn't be included in the
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy,
> and some hackity logic to have the merge policy target small segments just
> written by refresh, but it's tricky to then open a near-real-time reader,
> excluding newly flushed but including newly merged segments since the refresh
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for
> discussion!