[
https://issues.apache.org/jira/browse/LUCENE-7976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16450104#comment-16450104
]
Michael McCandless commented on LUCENE-7976:
--------------------------------------------
{quote}Right, but that has quite a few consequences when comparing old .vs. new
behavior for FORCE_MERGE and FORCE_MERGE_DELETES for several reasons, mostly
stemming from having these two operations respect maxSegmentBytes:
{quote}
OK I see ... I think it still makes sense to try to break these changes into a
couple of issues. This one (just refactoring to share the scoring approach, with
the corresponding change in behavior) is going to be big enough!
Hmm, I see some more failing tests, e.g.:
{quote}[junit4] Suite:
org.apache.lucene.search.TestTopFieldCollectorEarlyTermination
[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestTopFieldCollectorEarlyTermination
-Dtests.method=testEarlyTermination -Dtests.seed=355D07976851D85A
-Dtests.badapples=true -Dtests.locale=nn-NO
-Dtests.timezone=America/Cambridge_Bay -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
[junit4] ERROR 869s J3 |
TestTopFieldCollectorEarlyTermination.testEarlyTermination <<<
[junit4] > Throwable #1: java.lang.OutOfMemoryError: GC overhead limit exceeded
[junit4] > at
__randomizedtesting.SeedInfo.seed([355D07976851D85A:FACA46C8503D4859]:0)
[junit4] > at java.util.Arrays.copyOf(Arrays.java:3332)
[junit4] > at
java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
[junit4] > at
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
[junit4] > at java.lang.StringBuilder.append(StringBuilder.java:136)
[junit4] > at
org.apache.lucene.store.MockIndexInputWrapper.toString(MockIndexInputWrapper.java:224)
[junit4] > at java.lang.String.valueOf(String.java:2994)
[junit4] > at java.lang.StringBuilder.append(StringBuilder.java:131)
[junit4] > at
org.apache.lucene.store.BufferedChecksumIndexInput.<init>(BufferedChecksumIndexInput.java:34)
[junit4] > at
org.apache.lucene.store.Directory.openChecksumInput(Directory.java:119)
[junit4] > at
org.apache.lucene.store.MockDirectoryWrapper.openChecksumInput(MockDirectoryWrapper.java:1072)
[junit4] > at
org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.readEntries(Lucene50CompoundReader.java:105)
[junit4] > at
org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.<init>(Lucene50CompoundReader.java:69)
[junit4] > at
org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:70)
[junit4] > at
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:100)
[junit4] > at
org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:78)
[junit4] > at
org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:202)
[junit4] > at
org.apache.lucene.index.ReadersAndUpdates.getReaderForMerge(ReadersAndUpdates.java:782)
[junit4] > at
org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4221)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3910)
[junit4] > at
org.apache.lucene.index.SerialMergeScheduler.merge(SerialMergeScheduler.java:40)
[junit4] > at
org.apache.lucene.index.IndexWriter.maybeMerge(IndexWriter.java:2077)
[junit4] > at
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1910)
[junit4] > at
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1861)
[junit4] > at
org.apache.lucene.index.RandomIndexWriter.forceMerge(RandomIndexWriter.java:454)
[junit4] > at
org.apache.lucene.search.TestTopFieldCollectorEarlyTermination.createRandomIndex(TestTopFieldCollectorEarlyTermination.java:96)
[junit4] > at
org.apache.lucene.search.TestTopFieldCollectorEarlyTermination.doTestEarlyTermination(TestTopFieldCollectorEarlyTermination.java:123)
[junit4] > at
org.apache.lucene.search.TestTopFieldCollectorEarlyTermination.testEarlyTermination(TestTopFieldCollectorEarlyTermination.java:113)
{quote}
and
{quote}[junit4] 2> NOTE: reproduce with: ant test
-Dtestcase=TestIndexWriterDelete
-Dtests.method=testOnlyDeletesTriggersMergeOnClose
-Dtests.seed=355D07976851D85A -Dtests.badapples=true -Dtests.locale=en-IE
-Dtests.timezone=Australia/Perth -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
[junit4] ERROR 0.05s J0 |
TestIndexWriterDelete.testOnlyDeletesTriggersMergeOnClose <<<
[junit4] > Throwable #1:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught
exception in thread: Thread[id=660, name=Lucene Merge Thread #6,
state=RUNNABLE, group=TGRP-TestIndexWriterDelete]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException:
java.lang.RuntimeException: segments must include at least one segment
[junit4] > at __randomizedtesting.SeedInfo.seed([355D07976851D85A]:0)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
[junit4] > Caused by: java.lang.RuntimeException: segments must include at
least one segment
[junit4] > at
org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] > at
org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] > at
org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] > Throwable #2:
com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught
exception in thread: Thread[id=661, name=Lucene Merge Thread #7,
state=RUNNABLE, group=TGRP-TestIndexWriterDelete]
[junit4] > Caused by: org.apache.lucene.index.MergePolicy$MergeException:
java.lang.IllegalStateException: this writer hit an unrecoverable error; cannot
merge
[junit4] > at __randomizedtesting.SeedInfo.seed([355D07976851D85A]:0)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
[junit4] > Caused by: java.lang.IllegalStateException: this writer hit an
unrecoverable error; cannot merge
[junit4] > at
org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:4072)
[junit4] > at
org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:4052)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3904)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] > Caused by: java.lang.RuntimeException: segments must include at
least one segment
[junit4] > at
org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] > at
org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] > at
org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] > ... 2 more
{quote}
and
{quote}
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestIndexWriterDelete
-Dtests.method=testDeleteAllSlowly -Dtests.seed=355D07976851D85A
-Dtests.badapples=true -Dtests.locale=en-IE
-Dtests.timezone=Australia/Perth -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 0.21s J0 | TestIndexWriterDelete.testDeleteAllSlowly <<<
[junit4] > Throwable #1: java.lang.IllegalStateException: this writer hit an
unrecoverable error; cannot complete forceMerge
[junit4] > at
__randomizedtesting.SeedInfo.seed([355D07976851D85A:C651573F1DF18CA2]:0)
[junit4] > at
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1917)
[junit4] > at
org.apache.lucene.index.IndexWriter.forceMerge(IndexWriter.java:1861)
[junit4] > at
org.apache.lucene.index.RandomIndexWriter.doRandomForceMerge(RandomIndexWriter.java:371)
[junit4] > at
org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:386)
[junit4] > at
org.apache.lucene.index.RandomIndexWriter.getReader(RandomIndexWriter.java:332)
[junit4] > at
org.apache.lucene.index.TestIndexWriterDelete.testDeleteAllSlowly(TestIndexWriterDelete.java:984)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] > Caused by: java.lang.RuntimeException: segments must include at
least one segment
[junit4] > at
org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] > at
org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] > at
org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] > at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] > at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] 2> Apr 24, 2018 9:27:54 PM
com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler
uncaughtException
[junit4] 2> WARNING: Uncaught exception in thread: Thread[Lucene Merge Thread
#6,5,TGRP-TestIndexWriterDelete]
[junit4] 2> org.apache.lucene.index.MergePolicy$MergeException:
java.lang.RuntimeException: segments must include at least one segment
[junit4] 2> at __randomizedtesting.SeedInfo.seed([355D07976851D85A]:0)
[junit4] 2> at
org.apache.lucene.index.ConcurrentMergeScheduler.handleMergeException(ConcurrentMergeScheduler.java:704)
[junit4] 2> at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:684)
[junit4] 2> Caused by: java.lang.RuntimeException: segments must include at
least one segment
[junit4] 2> at
org.apache.lucene.index.MergePolicy$OneMerge.<init>(MergePolicy.java:228)
[junit4] 2> at
org.apache.lucene.index.TieredMergePolicy.findForcedMerges(TieredMergePolicy.java:701)
[junit4] 2> at
org.apache.lucene.index.IndexWriter.updatePendingMerges(IndexWriter.java:2103)
[junit4] 2> at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3929)
[junit4] 2> at
org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625)
[junit4] 2> at
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662)
[junit4] 2>
{quote}
Can we make these ints, and cast to double only when we need to divide them?
{quote}+ double totalDelDocs = 0;
+ double totalMaxDocs = 0;
{quote}
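For instance (a hypothetical sketch, not the patch itself; the method and array names are made up for illustration), accumulating the counts as ints and casting only at the division:

```java
public class DeletePctSketch {
    // Accumulate doc counts as ints; cast to double only at the division
    // so the ratio is not truncated by integer arithmetic.
    static double deletePct(int[] delDocs, int[] maxDocs) {
        int totalDelDocs = 0;
        int totalMaxDocs = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            totalDelDocs += delDocs[i];
            totalMaxDocs += maxDocs[i];
        }
        return totalMaxDocs == 0 ? 0.0 : 100.0 * (double) totalDelDocs / totalMaxDocs;
    }

    public static void main(String[] args) {
        // 30 deleted out of 200 docs -> 15.0 percent
        System.out.println(deletePct(new int[] {10, 20}, new int[] {100, 100}));
    }
}
```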
Hmm that {{50/100}} integer division will just be zero:
{quote}cutoffSize = (long) ((double) maxMergeSegmentBytesThisMerge * (1.0 -
(50/100)));
{quote}
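A small self-contained demonstration of the trap (variable names mirror the quoted patch line, but this is just a sketch): in Java, {{50/100}} is evaluated as integer division before the subtraction, so the whole factor collapses to {{1.0}}.

```java
public class CutoffSketch {
    // Buggy version: 50/100 is integer division and evaluates to 0,
    // so the cutoff stays at the full segment size.
    static long buggyCutoff(long maxMergeSegmentBytes) {
        return (long) ((double) maxMergeSegmentBytes * (1.0 - (50 / 100)));
    }

    // Fixed version: a double literal keeps the ratio at 0.5.
    static long fixedCutoff(long maxMergeSegmentBytes) {
        return (long) (maxMergeSegmentBytes * (1.0 - 50 / 100.0));
    }

    public static void main(String[] args) {
        long fiveGB = 5L << 30;
        System.out.println(buggyCutoff(fiveGB)); // the full 5 GB: cutoff never shrinks
        System.out.println(fixedCutoff(fiveGB)); // 2.5 GB, as presumably intended
    }
}
```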
Hmm this left me hanging (in {{findForcedMerges}}):
{quote}// First condition is that
{quote}
We define this:
{quote}int totalEligibleSegs = eligible.size();
{quote}
But we do not decrement it when we remove segments from {{eligible}} in the loop
that follows?
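To illustrate the concern (a hypothetical sketch, not the patch code; only the variable name {{totalEligibleSegs}} is taken from it): a count snapshotted before the loop goes stale unless it is decremented alongside each removal.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class EligibleCountSketch {
    // The snapshot count must shrink in the same loop that removes
    // segments, or later size checks use a stale value.
    static int eligibleCount(List<String> eligible, String ineligible) {
        int totalEligibleSegs = eligible.size(); // snapshot taken up front
        for (Iterator<String> it = eligible.iterator(); it.hasNext(); ) {
            if (it.next().equals(ineligible)) {
                it.remove();
                totalEligibleSegs--; // without this line the count is wrong
            }
        }
        return totalEligibleSegs;
    }

    public static void main(String[] args) {
        List<String> segs = new ArrayList<>(List.of("_0", "_1", "_2", "_3"));
        // Removing "_1" leaves 3 eligible segments, matching segs.size().
        System.out.println(eligibleCount(segs, "_1"));
    }
}
```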
In {{findForcedMerges}}, since we pre-compute the per-segment sizes using
{{getSegmentSizes}}, can you use that map instead of calling {{size(info,
writer)}} again?
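Roughly what that reuse could look like (a hypothetical sketch: {{Seg}} stands in for {{SegmentCommitInfo}}, and this {{getSegmentSizes}} is only an analogue of the one in the patch):

```java
import java.util.HashMap;
import java.util.Map;

public class SizeMapSketch {
    // Stand-in for a segment; the real code uses SegmentCommitInfo.
    record Seg(String name, long bytes) {}

    // Analogue of getSegmentSizes: compute each segment's size exactly once.
    static Map<Seg, Long> getSegmentSizes(Seg[] segs) {
        Map<Seg, Long> sizes = new HashMap<>();
        for (Seg s : segs) {
            sizes.put(s, s.bytes); // the real code would call size(info, writer)
        }
        return sizes;
    }

    public static void main(String[] args) {
        Seg[] infos = { new Seg("_0", 100), new Seg("_1", 250) };
        Map<Seg, Long> sizes = getSegmentSizes(infos);
        // Later passes look the size up instead of recomputing it.
        long total = 0;
        for (Seg info : infos) {
            total += sizes.get(info);
        }
        System.out.println(total); // 350
    }
}
```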
> Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of
> very large segments
> -------------------------------------------------------------------------------------------------
>
> Key: LUCENE-7976
> URL: https://issues.apache.org/jira/browse/LUCENE-7976
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Erick Erickson
> Assignee: Erick Erickson
> Priority: Major
> Attachments: LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch,
> LUCENE-7976.patch, LUCENE-7976.patch, LUCENE-7976.patch
>
>
> We're seeing situations "in the wild" where there are very large indexes (on
> disk) handled quite easily in a single Lucene index. This is particularly
> true as features like docValues move data into MMapDirectory space. The
> current TMP algorithm allows on the order of 50% deleted documents as per a
> dev list conversation with Mike McCandless (and his blog here:
> https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).
> Especially in the current era of very large indexes in aggregate, (think many
> TB) solutions like "you need to distribute your collection over more shards"
> become very costly. Additionally, the tempting "optimize" button exacerbates
> the issue since once you form, say, a 100G segment (by
> optimizing/forceMerging) it is not eligible for merging until 97.5G of the
> docs in it are deleted (current default 5G max segment size).
> The proposal here would be to add a new parameter to TMP, something like
> <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name; suggestions
> welcome) which would default to 100 (i.e. the same behavior we have now).
> So if I set this parameter to, say, 20%, and the max segment size stays at
> 5G, the following would happen when segments were selected for merging:
> > any segment with > 20% deleted documents would be merged or rewritten NO
> > MATTER HOW LARGE. There are two cases,
> >> the segment has < 5G "live" docs. In that case it would be merged with
> >> smaller segments to bring the resulting segment up to 5G. If no smaller
> >> segments exist, it would just be rewritten
> >> The segment has > 5G "live" docs (the result of a forceMerge or optimize).
> >> It would be rewritten into a single segment removing all deleted docs no
> >> matter how big it is to start. The 100G example above would be rewritten
> >> to an 80G segment for instance.
> Of course this would lead to potentially much more I/O which is why the
> default would be the same behavior we see now. As it stands now, though,
> there's no way to recover from an optimize/forceMerge except to re-index from
> scratch. We routinely see 200G-300G Lucene indexes at this point "in the
> wild" with 10s of shards replicated 3 or more times. And that doesn't even
> include having these over HDFS.
> Alternatives welcome! Something like the above seems minimally invasive. A
> new merge policy is certainly an alternative.
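The two-case rule in the description above could be sketched roughly as follows (hypothetical names throughout; the 5G max size and 20% threshold are the values from the example, and the real decision would live inside TieredMergePolicy, not a standalone classifier):

```java
public class LargeSegmentPolicySketch {
    enum Action { SKIP, MERGE_WITH_SMALLER, SINGLETON_REWRITE }

    // Sketch of the proposed rule: any segment over the deleted-docs
    // threshold becomes merge-eligible no matter how large it is.
    static Action classify(long liveBytes, double pctDeleted,
                           double maxAllowedPctDeleted, long maxSegmentBytes) {
        if (pctDeleted <= maxAllowedPctDeleted) {
            return Action.SKIP; // under the threshold: current behavior
        }
        if (liveBytes < maxSegmentBytes) {
            // Case 1: merge with smaller segments up to the max size
            // (or rewrite alone if no smaller segments exist).
            return Action.MERGE_WITH_SMALLER;
        }
        // Case 2: still over the max after deletes are dropped
        // (e.g. a prior forceMerge): rewrite as a singleton merge.
        return Action.SINGLETON_REWRITE;
    }

    public static void main(String[] args) {
        long fiveGB = 5L << 30;
        System.out.println(classify(80L << 30, 15.0, 20.0, fiveGB)); // SKIP
        System.out.println(classify(80L << 30, 25.0, 20.0, fiveGB)); // SINGLETON_REWRITE
        System.out.println(classify(2L << 30, 25.0, 20.0, fiveGB));  // MERGE_WITH_SMALLER
    }
}
```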
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)