Re: Can we change forceMerge to not need as much disk space?

Erick Erickson Fri, 13 Dec 2019 14:06:44 -0800

Coming back to this after a while.

Opening a new searcher is a sticky wicket. Say you’re merging segments 1, 2, 3 
into 4. Open readers hold handles to segments 1, 2, and 3. Even after 4 is 
created and 1, 2, and 3 are deleted, the disk files still hang around until the 
searcher is closed.
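
To make that concrete, here is a toy model of the bookkeeping in plain Java. It 
is a sketch only, not Lucene code: the class name, the method names, and the 
segment names (_1 through _4) are all made up for illustration. The one idea it 
is meant to show is that a merged-away segment's files are reclaimed only when 
no commit references it _and_ no open reader still pins it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (NOT Lucene's implementation): files survive while a reader pins them. */
final class SegmentRefCountSketch {
    // Hypothetical bookkeeping: segment name -> number of open readers using it.
    private final Map<String, Integer> readerRefs = new HashMap<>();
    // Segments referenced by the current commit point.
    private final Set<String> live = new TreeSet<>();
    // Segments whose files still occupy disk space.
    private final Set<String> onDisk = new TreeSet<>();

    void addSegment(String name) {
        live.add(name);
        onDisk.add(name);
    }

    /** Opens a reader over the current live segments; returns the segments it pins. */
    List<String> openReader() {
        List<String> pinned = List.copyOf(live);
        for (String s : pinned) {
            readerRefs.merge(s, 1, Integer::sum);
        }
        return pinned;
    }

    /** Closes a reader: drop its pins, then reclaim anything no longer needed. */
    void closeReader(List<String> pinned) {
        for (String s : pinned) {
            // Returning null from the remapping function removes the entry.
            readerRefs.computeIfPresent(s, (k, v) -> v == 1 ? null : v - 1);
        }
        reclaim();
    }

    /** Merges sources into target: target becomes live, sources leave the commit. */
    void merge(List<String> sources, String target) {
        live.removeAll(sources);
        addSegment(target);
        reclaim();
    }

    // A segment's files go away only if it is neither live nor pinned by a reader.
    private void reclaim() {
        onDisk.removeIf(s -> !live.contains(s) && !readerRefs.containsKey(s));
    }

    Set<String> onDisk() {
        return new TreeSet<>(onDisk);
    }

    public static void main(String[] args) {
        SegmentRefCountSketch index = new SegmentRefCountSketch();
        index.addSegment("_1");
        index.addSegment("_2");
        index.addSegment("_3");

        List<String> searcher = index.openReader();
        index.merge(List.of("_1", "_2", "_3"), "_4");
        System.out.println(index.onDisk()); // prints [_1, _2, _3, _4]

        index.closeReader(searcher);
        System.out.println(index.onDisk()); // prints [_4]
    }
}
```

In the toy run, the disk holds both the old segments and the merged one for as 
long as the old searcher stays open, which is the extra-space problem in 
miniature.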


That said, suppose we start the forceMerge at time T. As it stands, it only 
operates on already-closed segments. So theoretically, opening and closing 
readers as new segments are created would change nothing in terms of search 
results, _assuming_ that no indexing is happening. All the docs that were 
visible in 1, 2, and 3 will also be the only ones visible in 4.

The only case where users would be surprised is if indexing were ongoing and 
they weren’t opening searchers, just committing with openSearcher=false.

Thinking about this some more, I think it’s reasonable to say “If you 
forceMerge while indexing is happening, new documents will appear even if you 
don’t do an explicit commit”. From Solr’s perspective, it’s something of an 
anti-pattern to index for a long time without opening a new searcher, because 
the internal structures that support Real Time Get keep growing until a new 
searcher is opened.

And since we discourage forceMerge in the first place, I could live with that.

FWIW.

> On Sep 13, 2019, at 3:58 PM, Shawn Heisey <[email protected]> wrote:
> 
> On 9/2/2019 9:19 AM, Erick Erickson wrote:
>> Anyway, it occurred to me that once a max-sized segment is created, _if_ we 
>> write the segments_n file out with the current state of the index, we could 
>> freely delete the segments that were merged into the new one. With 300G 
>> indexes (which I see regularly in the field, even multiple ones per node 
>> that size), this could result in substantial disk savings.
> 
> <snip>
> 
>> Off the top of my head, I can see some concerns:
>> 1> we’d have to open new searchers every time we wrote the segments_n file 
>> to release file handles on the old segments
> 
> How would that interact with user applications that normally handle opening 
> new searchers (such as Solr)?  When users want there to be no new searchers 
> until they issue an explicit commit, I think they're going to be a little 
> irritated if Lucene decides to open a new searcher on its own.  Maybe we'd 
> need to advise people to turn off their indexing anytime they're doing a 
> forceMerge/optimize.  That's generally a good idea anyway, and pretty much 
> required if deleteByQuery is being used.
> 
>> 2> coordinating multiple merge threads
> 
> I would think the scheduler already handles that ... thinking about all this 
> makes my brain hurt ... if I have to think about the scheduler too, there 
> might be implosions. :)
> 
>> 3> maxMergeAtOnceExplicit could mean unnecessary thrashing/opening searchers 
>> (could this be deprecated?)
> 
> It has always bothered me that when I looked for info about changing the 
> policy settings, and set the two "main" parts of the policy to 35 (instead of 
> the default 10), that the info I was finding never mentioned 
> maxMergeAtOnceExplicit.  I also needed to set this value (to 105) to have an 
> optimize work like I expected.  Without it, a lot more merging occurred than 
> was necessary when I did an optimize.  This was on a really old version of 
> Solr, either 1.4.x or 3.2.x, back when it was relatively new.
> 
> The maxMergeAtOnceExplicit setting is not even mentioned in the Solr ref 
> guide page about IndexConfig.  I got the information for that setting from 
> solr-user, when I asked why an optimize with values increased from 10 to 35 
> was doing more merge passes than I thought it needed.  I think that either 
> that parameter needs to go away or docs need improvement.
> 
>> 4> Don’t quite know what to do if maxSegments is 1 (or other very low 
>> number).
> 
> I don't think anything can be done about disk usage for that.  Just the 
> nature of the beast.
> 
>> Something like this would also pave the way for “background optimizing”. 
>> Instead of a monolithic forceMerge, I can envision a process whereby we 
>> created a low-level task that merged one max-sized segment at a time, came 
>> up for air and reopened searchers then went back in and merged the next one. 
>> With its own problems about coordinating ongoing updates, but that’s another 
>> discussion ;).
> 
> As mentioned above, I worry about low-level code opening new searchers 
> because lots of users want to have that be completely under their control.  
> Maybe TMP needs another setting to tell it whether or not it's allowed to 
> open searchers, with documentation saying that less disk space might be 
> required if it is allowed.
> 
> It would be awesome to eliminate the huge forceMerge disk requirement for 
> most users, so I think it's worth exploring.  Can the stuff with readers that 
> Mike mentioned happen without opening a new searcher at the app level?  My 
> knowledge of Lucene internals is unfortunately too vague to answer my own 
> question.
> 
> Thanks,
> Shawn
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 


