[jira] [Commented] (LUCENE-7976) Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments

Erick Erickson (JIRA) Sat, 21 Apr 2018 17:05:16 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447046#comment-16447046
 ]


Erick Erickson commented on LUCENE-7976:
----------------------------------------

[~mikemccand] Thanks for looking!

About removing {{@lucene.experimental}}: yes, that was deliberate. TMP has been 
around for a very long time and it seemed to me that it's now mainstream. I 
have no problem with putting it back; let me know if that's your preference. Is 
the point of putting it back to preserve back-compat? Or, actually, so we don't 
_have_ to maintain back-compat?

bq. Can we do this change in two parts? First part is the nice refactoring to 
have all the methods share a common scoring loop, which should show no behavior 
change I think?

Maybe I got the block quote thing right this time, thanks!

What's the purpose here? Mechanically it's simple and I'll be glad to do it, 
I'd just like to know what the goal is. My guess is that it gives us a clear 
distinction between behavior changes in NATURAL merging and pure refactoring.

When you say "no change in behavior", you are referring to NATURAL merging, 
correct? Not FORCE_MERGE or FORCE_MERGE_DELETES. Those will behave quite 
differently.

bq. can you name it NATURAL, FORCE_MERGE and FORCE_MERGE_DELETES?

done.

{quote}
Hmm what is the note/NOTD? Can you change to As of Lucene 7.4
{quote}
What can I say? I spend 99% of my life in Solr, _everything_ is Solr, right? As 
for the rest, typos late at night.

Done.

Finally, can you comment on this nocommit?
{quote}
Should you be using writer.numDeletesToMerge rather than the info.getDelDocs 
other places
{quote}

I see both of these in the code, and writer.numDeletesToMerge seems 
considerably more expensive. Is there a reason to prefer one over the other?

Thanks again!

> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of 
> very large segments
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7976
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7976
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch, 
> LUCENE-7976.patch
>
>
> We're seeing situations "in the wild" where there are very large indexes (on 
> disk) handled quite easily in a single Lucene index. This is particularly 
> true as features like docValues move data into MMapDirectory space. The 
> current TMP algorithm allows on the order of 50% deleted documents as per a 
> dev list conversation with Mike McCandless (and his blog here:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate (think many 
> TB), solutions like "you need to distribute your collection over more shards" 
> become very costly. Additionally, the tempting "optimize" button exacerbates 
> the issue since once you form, say, a 100G segment (by 
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the 
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like 
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name, 
> suggestions welcome) which would default to 100 (or the same behavior we have 
> now).
> So if I set this parameter to, say, 20%, and the max segment size stays at 
> 5G, the following would happen when segments were selected for merging:
> - Any segment with > 20% deleted documents would be merged or rewritten NO 
>   MATTER HOW LARGE. There are two cases:
>   - The segment has < 5G "live" docs. In that case it would be merged with 
>     smaller segments to bring the resulting segment up to 5G. If no smaller 
>     segments exist, it would just be rewritten.
>   - The segment has > 5G "live" docs (the result of a forceMerge or 
>     optimize). It would be rewritten into a single segment removing all 
>     deleted docs no matter how big it is to start. The 100G example above 
>     would be rewritten to an 80G segment, for instance.
> Of course this would lead to potentially much more I/O which is why the 
> default would be the same behavior we see now. As it stands now, though, 
> there's no way to recover from an optimize/forceMerge except to re-index from 
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the 
> wild" with 10s of shards replicated 3 or more times. And that doesn't even 
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A 
> new merge policy is certainly an alternative.
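
The arithmetic and the selection rule in the description above can be sketched in plain Java. This is only an illustration of the proposal as worded, not TieredMergePolicy code: the class, enum, and method names are hypothetical, sizes are in GB, and the "eligible once live data drops below half the max segment size" rule is an assumption inferred from the 100G / 97.5G / 5G numbers quoted above. A segment with exactly 5G of live docs is not covered by the description; the sketch treats it like the larger case.

```java
/** Hypothetical sketch of the proposal; none of these names are Lucene API. */
final class DeletePctSketch {

    /** What the proposed rule would do with one segment. */
    enum Action { LEAVE_ALONE, MERGE_WITH_SMALLER, REWRITE_ALONE }

    /** A segment described by its on-disk size (GB) and percentage of deleted docs. */
    static final class Segment {
        final double totalGb;
        final double pctDeleted; // 0..100

        Segment(double totalGb, double pctDeleted) {
            this.totalGb = totalGb;
            this.pctDeleted = pctDeleted;
        }

        /** Approximate size of the live (non-deleted) data. */
        double liveGb() {
            return totalGb * (1.0 - pctDeleted / 100.0);
        }
    }

    /**
     * Current behavior as described above, assuming a segment becomes
     * merge-eligible once its live data is under half the max segment size.
     */
    static double gbDeletedBeforeEligible(double totalGb, double maxSegmentGb) {
        return Math.max(0.0, totalGb - maxSegmentGb / 2.0);
    }

    /**
     * Proposed rule. maxAllowedPctDeleted stands in for the placeholder
     * parameter named in the description; 100 reproduces today's behavior
     * because no segment can exceed 100% deleted.
     */
    static Action classify(Segment s, double maxAllowedPctDeleted,
                           double maxSegmentGb, boolean smallerSegmentsExist) {
        if (s.pctDeleted <= maxAllowedPctDeleted) {
            return Action.LEAVE_ALONE; // not forced by this rule
        }
        if (s.liveGb() < maxSegmentGb && smallerSegmentsExist) {
            return Action.MERGE_WITH_SMALLER; // top the result up toward the max size
        }
        return Action.REWRITE_ALONE; // singleton merge that drops the deleted docs
    }

    public static void main(String[] args) {
        System.out.println(gbDeletedBeforeEligible(100.0, 5.0));
        System.out.println(classify(new Segment(100.0, 25.0), 20.0, 5.0, true));
        System.out.println(classify(new Segment(4.0, 30.0), 20.0, 5.0, true));
        System.out.println(classify(new Segment(100.0, 99.0), 100.0, 5.0, true));
    }
}
```

With the default of 100 every segment falls through to LEAVE_ALONE, which matches the intent that the extra I/O is strictly opt-in.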



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

