It’s always bothered me that optimize/forceMerge needs free disk space equal to 
100% of the index size. I’ve recently been wondering whether that’s absolutely 
necessary, especially now that forceMerge respects the max segment size.

I HAVE NOT looked at the code closely, so this is mostly theory for someone to 
shoot down before diving in at all.

I’ve seen some situations where optimizing makes a radical difference. For 
instance, the time it takes for the Terms component to return is essentially 
linear in the number of segments. An artificially bad case to be sure, but 
still. We’re talking the difference between 17 seconds and sub-second here. A 
large index to be sure…

Anyway, it occurred to me that once a max-sized segment is created, _if_ we 
write the segments_n file out with the current state of the index, we could 
freely delete the segments that were merged into the new one. With 300G indexes 
(which I see regularly in the field, even multiple ones per node that size), 
this could result in substantial disk savings.
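
To put rough numbers on that, here is a back-of-the-envelope model, not Lucene 
code. The function name, the 1G input segments and the 5G max segment size are 
all assumptions for illustration, and the model ignores space reclaimed from 
deleted docs as well as files held open by old searchers.

```python
def peak_disk_usage(merge_groups, delete_inputs_after_each_merge):
    """Return the peak space used while merging each group into one segment.

    merge_groups: list of lists of segment sizes (any unit, e.g. GB).
    Assumption: a merged segment is as large as the sum of its inputs.
    """
    on_disk = sum(sum(group) for group in merge_groups)
    peak = on_disk
    for group in merge_groups:
        merged = sum(group)
        on_disk += merged            # new segment is written next to its inputs
        peak = max(peak, on_disk)
        if delete_inputs_after_each_merge:
            on_disk -= merged        # inputs dropped once a new commit point exists
    return peak


# Hypothetical 300G index: 300 segments of 1G, merged five at a time into 5G segments.
groups = [[1] * 5 for _ in range(60)]
print(peak_disk_usage(groups, delete_inputs_after_each_merge=False))  # 600
print(peak_disk_usage(groups, delete_inputs_after_each_merge=True))   # 305

# One giant output segment (the maxSegments=1 case): no savings either way.
print(peak_disk_usage([[1] * 300], delete_inputs_after_each_merge=True))  # 600
```

Under those assumptions the extra space needed drops from the full index size 
to roughly one max-sized segment, except when everything has to land in a 
single segment.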

Off the top of my head, I can see some concerns:
1> we’d have to open new searchers every time we wrote the segments_n file to 
release file handles on the old segments
2> coordinating multiple merge threads
3> maxMergeAtOnceExplicit could mean unnecessary thrashing/opening searchers 
(could this be deprecated?)
4> I don’t quite know what to do if maxSegments is 1 (or another very low number).

Something like this would also pave the way for “background optimizing”. 
Instead of a monolithic forceMerge, I can envision a low-level task that merges 
one max-sized segment at a time, comes up for air and reopens searchers, then 
goes back in and merges the next one. That has its own problems around 
coordinating with ongoing updates, but that’s another discussion ;).
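
A toy sketch of that loop, again as a model rather than anything resembling the 
real IndexWriter or merge policy code. Both function names and the greedy 
packing are assumptions for illustration, and ongoing updates are not modeled 
at all.

```python
def plan_merge_groups(segment_sizes, max_segment_size):
    """Greedily pack segments into groups whose total stays within the max size."""
    groups, current, total = [], [], 0
    for size in sorted(segment_sizes, reverse=True):
        if current and total + size > max_segment_size:
            groups.append(current)
            current, total = [], 0
        current.append(size)
        total += size
    if current:
        groups.append(current)
    return groups


def background_optimize(segment_sizes, max_segment_size):
    """Yield the segment list after each single merge step.

    Each yield stands in for "write segments_N, reopen searchers, come up for
    air"; the caller decides when to go back in for the next merge.
    """
    index = list(segment_sizes)
    for group in plan_merge_groups(index, max_segment_size):
        if len(group) < 2:
            continue                 # a lone segment has nothing to merge with
        for size in group:
            index.remove(size)       # inputs become deletable after the commit
        index.append(sum(group))     # the newly merged segment
        yield list(index)


for state in background_optimize([3, 3, 2, 2, 1, 1], max_segment_size=5):
    print(state)
```

Because each step is a separate iteration, the caller can pause between merges, 
which is the "come up for air" part of the idea.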

There are lots of details to work out; I’m throwing this out for discussion. I 
can raise a JIRA if people think the idea has legs.

Erick


