[ https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16213266#comment-16213266 ]

Michael McCandless commented on LUCENE-7976:
--------------------------------------------

 bq. This can cause jitter in results where the ordering will depend on which 
shard answered a query because the frequencies are off significantly enough. 

Segment-based replication 
(http://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
 would improve this situation, in that the jitter no longer varies by shard 
since all replicas search identical point-in-time views of the index.  It's 
also quite a bit more efficient if you need many replicas.

bq. I suspect that the current behavior, where a segment that's 20 times larger 
than the configured max segment size is ineligible for automatic merging until 
97.5 percent deleted docs, was not actually what was desired.

Right!  The designer didn't think about this case because he didn't call 
{{forceMerge}} so frequently :)

bq. Max segment sizes are a target, not a hard guarantee... Lucene doesn't know 
exactly how big the segment will be before it actually completes the merge, and 
it can end up going over the limit.

Right, it's only an estimate, but in my experience it's conservative, i.e. the 
resulting merged segment is usually smaller than the max segment size, but you 
cannot count on that.

bq. The downside to a max segment size is that one can start getting many more 
segments than anticipated or desired (and can impact performance in 
unpredictable ways, depending on the exact usage).

Right, but the proposed solution (TMP always respects the max segment size) 
would work well for such users: they just need to increase their max segment 
size if they need to get a 10 TB index down to 20 segments.

bq. So 50% deleted documents consumes a lot of resources, both disk and RAM 
when considered in aggregate at that scale.

Well, disks are cheap and getting cheaper.  And 50% is the worst case -- TMP 
merges those segments away once they hit 50%, so that the net across the index 
is less than 50% deletions.  Users already must have a lot of free disk space 
to accommodate running merges, pending refreshes, pending commits, etc.

Erick, are these timestamped documents?  It's better to index those into 
indices that roll over with time (see how Elasticsearch recommends it: 
https://www.elastic.co/blog/managing-time-based-indices-efficiently), where 
it's far more efficient to drop whole indices than to delete documents in one 
index.
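
As a rough illustration only (the helper class and directory layout below are 
hypothetical, not part of Lucene or Elasticsearch), expiring a whole time 
bucket then becomes a directory delete rather than a flood of per-document 
deletes that the merge policy later has to reclaim:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Hypothetical helper: one physical Lucene index per day under a common root.
public class TimeBucketedIndices {

  private final Path root;

  public TimeBucketedIndices(Path root) {
    this.root = root;
  }

  // Open (or create) the index for a given day, e.g. "2017-10-20".
  public IndexWriter writerFor(String day) throws IOException {
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    return new IndexWriter(FSDirectory.open(root.resolve(day)), cfg);
  }

  // Expiring a day is just removing its directory: no deleted documents ever
  // accumulate, so there is nothing for the merge policy to reclaim.
  public void dropDay(String day) throws IOException {
    try (Stream<Path> paths = Files.walk(root.resolve(day))) {
      paths.sorted(Comparator.reverseOrder())
           .forEach(p -> p.toFile().delete());
    }
  }
}
{code}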

Still, I think it's OK to relax TMP so that max-sized segments with less than 
50% deletions become eligible for merging, and users can tune the deletions 
weight to force TMP to aggressively merge such segments.  This would be a tiny 
change in the loop that computes {{tooBigCount}}.
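
For illustration, here is a rough sketch of that relaxation (this is not the 
actual TieredMergePolicy source; the helper method, its parameters, and the 
maxAllowedPctDeletedInBigSegments setting are all hypothetical):

{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.SegmentCommitInfo;

// Sketch only; not the actual TieredMergePolicy code.
public final class RelaxedTooBigSketch {

  // Given segments sorted by decreasing (live-doc-adjusted) size, decide which
  // of them remain eligible for natural merging.
  static List<SegmentCommitInfo> eligibleForNaturalMerging(
      List<SegmentCommitInfo> infosSorted,
      long[] segBytes,                 // adjusted sizes, parallel to infosSorted
      long maxMergedSegmentBytes,
      double maxAllowedPctDeletedInBigSegments) {
    List<SegmentCommitInfo> eligible = new ArrayList<>();
    for (int i = 0; i < infosSorted.size(); i++) {
      SegmentCommitInfo info = infosSorted.get(i);
      double delPct = 100.0 * info.getDelCount() / info.info.maxDoc();
      boolean overMaxSize = segBytes[i] >= maxMergedSegmentBytes / 2.0;
      if (overMaxSize && delPct < maxAllowedPctDeletedInBigSegments) {
        // Today this is the unconditional branch counted by tooBigCount: the
        // segment is graced out of natural merging no matter how many deletes
        // it carries.
        continue;
      }
      // The relaxation: a max-sized segment with enough deletions stays
      // eligible, so a merge can rewrite it and reclaim the deleted docs.
      eligible.add(info);
    }
    return eligible;
  }
}
{code}

With that in place, raising the deletes weight (e.g. via 
{{TieredMergePolicy.setReclaimDeletesWeight}}) would then bias TMP toward 
actually selecting those big, delete-heavy segments for merging.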

bq. The root cause of the problem here seems to be that we have only one 
variable (maxSegmentSize) and multiple use-cases we're forcing on it:

But how can that work?

If you have two different max sizes, then how can natural merging cope with 
the too-large segments left in the index by a past {{forceMerge}}?  It cannot 
merge them and produce a small enough segment until enough (too many) deletes 
accumulate on them.

Or, if we had two settings, we could insist that the 
{{maxForcedMergeSegmentSize}} is <= the {{maxSegmentSize}}, but then what's the 
point :)

The problem here is {{forceMerge}} today sets up an index structure that 
natural merging is unable to cope with; having {{forceMerge}} respect the max 
segment size would fix that nicely.  Users can simply increase that size if 
they want massive segments.
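
From the user side, that could look something like the sketch below (the sizes 
and index path are arbitrary, and this assumes {{forceMerge}} is changed to 
respect {{maxMergedSegmentMB}}, which is the proposal, not current behavior):

{code:java}
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeWithLargerMax {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Raise the ceiling if you really do want very large segments; natural
    // merging will honor the same limit later, so it can still cope with
    // whatever forceMerge produces.
    tmp.setMaxMergedSegmentMB(20 * 1024);  // e.g. 20 GB instead of the 5 GB default

    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    cfg.setMergePolicy(tmp);

    try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get(args[0])), cfg)) {
      // Under the proposed behavior this produces segments of at most ~20 GB
      // each (possibly more than one), not a single arbitrarily large segment.
      writer.forceMerge(1);
      writer.commit();
    }
  }
}
{code}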

> Add a parameter to TieredMergePolicy to merge segments that have more than X 
> percent deleted documents
> ------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:  
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate (think many 
> TB), solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, suggestions 
> welcome) which would default to 100 (or the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO 
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with 
> >> smaller segments to bring the resulting segment up to 5G. If no smaller 
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize). 
> >> It would be rewritten into a single segment removing all deleted docs no 
> >> matter how big it is to start. The 100G example above would be rewritten 
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of  shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.


