[ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17053231#comment-17053231
 ] 

Simon Willnauer commented on LUCENE-8962:
-----------------------------------------

I read through this issue and I want to share some of my thoughts. First, I 
understand the need for this and the motivation, yet every time we add 
something like this to the IndexWriter to do something _as part of_ another 
method, it triggers an alarm on my end. I have spent hours and days thinking 
about how IW can be simpler, and the biggest issue I see is that the 
primitives on IW, like commit or openReader, are doing too much. Just look at 
openReader: it's pretty involved, and improving the bus factor or making it 
easier to understand is hard. Adding stuff like _wait for merge_ with 
something like a timeout is not what I think we should do, either to 
_openReader_ or to _commit_.
That said, I think we can make the same things happen, but we should think in 
primitives rather than changing method behavior with configuration. Let me 
explain what I mean:

Let's say we keep _commit_ and _openReader_ the way they are and instead 
allow an existing reader, NRT or not, to _optimize_ itself (yeah, I said 
that - it might be a good name after all). With a slightly refactored IW we 
can share the merge logic and let the reader rewrite itself; since we are 
talking about very small segments, the overhead is very small. This would in 
turn mean that we are doing the work twice, i.e. the IW would do its normal 
work and might merge later etc. We might even merge this stuff into heap 
space or so if we have enough; I haven't thought too much about that. This 
way we can potentially clean up IW and add a very nice optimization that 
works for commit as well as NRT. We should strive for making IW simpler, not 
have it do more.
I hope I wasn't too discouraging.

> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.5
>
>         Attachments: LUCENE-8962_demo.png
>
>          Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}} and this then adds 
> search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!



