[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053640#comment-17053640
 ] 

Michael Froh commented on LUCENE-8962:
--------------------------------------

bq. With a slightly refactored IW we can share the merge logic and let the 
reader re-write itself since we are talking about very small segments the 
overhead is very small. This would in turn mean that we are doing the work 
twice ie. the IW would do its normal work and might merge later etc.

Just to provide a bit more context, for the case where my team uses this 
change, we're replicating the index (think Solr master/slave) from "writers" to 
many "searchers", so we're avoiding doing the work many times.

An earlier (less invasive) approach I tried to address the small flushed 
segments problem was roughly: call commit on writer, hard link the commit files 
to another filesystem directory to "clone" the index, open an IW on that 
directory, merge small segments on the clone, let searchers replicate from the 
clone. That approach does mean that the merging work happens twice (since the 
"real" index doesn't benefit from the merge on the clone), but it doesn't 
involve any changes in Lucene.
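
To make the hard-link step concrete, here is a minimal sketch using only the JDK. The class {{CommitCloner}} and its method are hypothetical names for illustration, not part of Lucene. In a real setup the file names would presumably come from {{IndexCommit.getFileNames()}} for the commit just made, and the merge itself would be done by opening a second {{IndexWriter}} on the clone directory (not shown here). It assumes the clone directory is fresh and lives on the same volume as the index, since hard links cannot span filesystems.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;

/** Hypothetical helper: "clones" a commit by hard-linking its files into another directory. */
public class CommitCloner {

    /**
     * Hard-links each named file from sourceDir into cloneDir.
     * Assumes cloneDir contains none of these files yet; Files.createLink
     * throws FileAlreadyExistsException otherwise.
     *
     * @return the number of links created
     */
    public static int hardLinkFiles(Path sourceDir, Path cloneDir, Collection<String> fileNames)
            throws IOException {
        Files.createDirectories(cloneDir);
        int linked = 0;
        for (String name : fileNames) {
            // createLink(link, existing): the new entry shares the same underlying file data,
            // so no bytes are copied.
            Files.createLink(cloneDir.resolve(name), sourceDir.resolve(name));
            linked++;
        }
        return linked;
    }
}
```

Because index files are written once and never modified in place, linking rather than copying keeps the clone cheap; files that a later merge on the clone creates exist only in the clone directory.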

Maybe that less-invasive approach is a better way to address this. It's 
certainly more consistent with [~simonw]'s suggestion above.

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 9.5h
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}} and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!


