[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?

Simon Willnauer (Jira) Mon, 22 Jun 2020 13:24:25 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142389#comment-17142389
 ]


Simon Willnauer commented on LUCENE-8962:
-----------------------------------------

This would also explain the failure I am looking into right now on elastic CI:


{noformat}
12:20:39    [junit4] Suite: org.apache.lucene.index.TestIndexingSequenceNumbers
12:20:39    [junit4]   2> NOTE: reproduce with: ant test  
-Dtestcase=TestIndexingSequenceNumbers 
-Dtests.method=testStressConcurrentCommit -Dtests.seed=FA92259FC8239E7 
-Dtests.nightly=true -Dtests.slow=true -Dtests.badapples=true 
-Dtests.locale=en-TC -Dtests.timezone=Pacific/Pago_Pago -Dtests.asserts=true 
-Dtests.file.encoding=UTF8
12:20:39    [junit4] FAILURE 85.1s J2 | 
TestIndexingSequenceNumbers.testStressConcurrentCommit <<<
12:20:39    [junit4]    > Throwable #1: java.lang.AssertionError: expected:<1> 
but was:<0>
12:20:39    [junit4]    >       at 
__randomizedtesting.SeedInfo.seed([FA92259FC8239E7:743DB8E67E421850]:0)
12:20:39    [junit4]    >       at 
org.apache.lucene.index.TestIndexingSequenceNumbers.testStressConcurrentCommit(TestIndexingSequenceNumbers.java:273)
12:20:39    [junit4]    >       at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
12:20:39    [junit4]    >       at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
12:20:39    [junit4]    >       at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
12:20:39    [junit4]    >       at 
java.base/java.lang.reflect.Method.invoke(Method.java:566)
12:20:39    [junit4]    >       at 
java.base/java.lang.Thread.run(Thread.java:834)
12:20:39    [junit4]   2> NOTE: leaving temporary files on disk at: 
/var/lib/jenkins/workspace/apache+lucene-solr+nightly+branch_8x/lucene/build/core/test/J2/temp/lucene.index.TestIndexingSequenceNumbers_FA92259FC8239E7-001
12:20:39    [junit4]   2> NOTE: test params are: codec=Asserting(Lucene86): 
{id=Lucene84}, docValues:{thread=DocValuesFormat(name=Asserting), 
___soft_deletes=DocValuesFormat(name=Asserting)}, maxPointsInLeafNode=279, 
maxMBSortInHeap=6.3565394726088424, 
sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@1c08a55f),
 locale=en-TC, timezone=Pacific/Pago_Pago
12:20:39    [junit4]   2> NOTE: Linux 4.18.0-193.6.3.el8_2.x86_64 amd64/Oracle 
Corporation 11.0.2 (64-bit)/cpus=32,threads=1,free=339006432,total=400556032
12:20:39    [junit4]   2> NOTE: All tests run in this JVM: 
[TestMultiTermConstantScore, TestSparseFixedBitSet, TestSpanBoostQuery, 
TestOfflineSorter, TestRadixSelector, TestBagOfPositions, 
TestPhrasePrefixQuery, TestAxiomaticF2LOG, TestLucene50CompoundFormat, 
TestLongBitSet, TestBinaryDocument, TestBooleanOr, 
TestComplexExplanationsOfNonMatches, TestTransactions, TestTermQuery, 
TestUTF32ToUTF8, TestWildcard, TestIndexWriterForceMerge, TestMinShouldMatch2, 
TestCodecs, TestBytesStore, Test2BNumericDocValues, TestCharFilter, 
TestXYMultiPolygonShapeQueries, TestFeatureDoubleValues, TestPackedInts, 
TestGraphTokenizers, TestFilterCodecReader, TestSetOnce, TestLatLonPoint, 
TestLongRange, TestQueryRescorer, TestNRTThreads, TestMergedIterator, 
TestLucene86SegmentInfoFormat, TestPackedTokenAttributeImpl, TestBasicModelIne, 
TestDocCount, TestLMDirichletSimilarity, TestAttributeSource, 
TestPositiveScoresOnlyCollector, TestXYPointQueries, TestElevationComparator, 
TestIndexWriterMergePolicy, TestUpgradeIndexMergePolicy, 
TestIntRangeFieldQueries, TestIndexingSequenceNumbers]
12:20:39    [junit4] Completed [369/567 (1!)] on J2 in 168.09s, 8 tests, 1 
failure <<< FAILURES!
{noformat}

It's missing a document, which is pretty scary, but it explains what we see here.


> Can we merge small segments during refresh, for faster searching?
> -----------------------------------------------------------------
>
>                 Key: LUCENE-8962
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8962
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index
>            Reporter: Michael McCandless
>            Priority: Major
>             Fix For: 8.6
>
>         Attachments: LUCENE-8962_demo.png, failed-tests.patch, failure_log.txt
>
>          Time Spent: 19h 40m
>  Remaining Estimate: 0h
>
> With near-real-time search we ask {{IndexWriter}} to write all in-memory 
> segments to disk and open an {{IndexReader}} to search them, and this is 
> typically a quick operation.
> However, when you use many threads for concurrent indexing, {{IndexWriter}} 
> will write many small segments during {{refresh}}, and this then 
> adds search-time cost as searching must visit all of these tiny segments.
> The merge policy would normally quickly coalesce these small segments if 
> given a little time ... so, could we somehow improve {{IndexWriter}}'s 
> refresh to optionally kick off the merge policy to merge segments below some 
> threshold before opening the near-real-time reader?  It'd be a bit tricky 
> because while we are waiting for merges, indexing may continue, and new 
> segments may be flushed, but those new segments shouldn't be included in the 
> point-in-time segments returned by refresh ...
> One could almost do this on top of Lucene today, with a custom merge policy, 
> and some hackity logic to have the merge policy target small segments just 
> written by refresh, but it's tricky to then open a near-real-time reader, 
> excluding newly flushed but including newly merged segments since the refresh 
> originally finished ...
> I'm not yet sure how best to solve this, so I wanted to open an issue for 
> discussion!
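
To make the point-in-time requirement in the description concrete, here is a rough sketch in plain Java. It assumes hypothetical names throughout (`RefreshMergeSketch`, `Segment`, `refreshView`, `maxSmallDocs`); none of these are Lucene classes or methods, and this is not how {{IndexWriter}} implements anything. It only models the bookkeeping: small segments in the refresh snapshot are coalesced, segments flushed after the snapshot are excluded, and the document count of the snapshot should be preserved.

```java
import java.util.ArrayList;
import java.util.List;

final class RefreshMergeSketch {

    /** Hypothetical stand-in for a flushed segment; not a Lucene class. */
    static final class Segment {
        final String name;
        final int docCount;

        Segment(String name, int docCount) {
            this.name = name;
            this.docCount = docCount;
        }
    }

    /**
     * Builds the segment list a refresh would expose, given a point-in-time
     * snapshot. Segments with fewer than maxSmallDocs documents are coalesced
     * into one merged segment; larger segments pass through unchanged.
     */
    static List<Segment> refreshView(List<Segment> snapshot, int maxSmallDocs) {
        List<Segment> view = new ArrayList<>();
        Segment lastSmall = null;
        int smallCount = 0;
        int smallDocs = 0;
        for (Segment s : snapshot) {
            if (s.docCount < maxSmallDocs) {
                lastSmall = s;
                smallCount++;
                smallDocs += s.docCount;
            } else {
                view.add(s);
            }
        }
        if (smallCount == 1) {
            view.add(lastSmall); // a single small segment has nothing to merge with
        } else if (smallCount > 1) {
            view.add(new Segment("merged", smallDocs));
        }
        return view;
    }

    /** Sums document counts; used to check that a view lost no documents. */
    static int totalDocs(List<Segment> segments) {
        int total = 0;
        for (Segment s : segments) {
            total += s.docCount;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Segment> live = new ArrayList<>();
        live.add(new Segment("_0", 1000));
        live.add(new Segment("_1", 3));
        live.add(new Segment("_2", 5));

        // Refresh starts: copy the current segment list as the point-in-time snapshot.
        List<Segment> snapshot = new ArrayList<>(live);

        // Indexing continues while the merge runs; this segment is not in the snapshot.
        live.add(new Segment("_3", 7));

        List<Segment> view = refreshView(snapshot, 100);
        System.out.println("segments in view: " + view.size());
        System.out.println("docs in view: " + totalDocs(view)
            + ", docs in snapshot: " + totalDocs(snapshot));
    }
}
```

The check that `totalDocs(view)` equals `totalDocs(snapshot)` is the kind of invariant a test would want to hold for any reader opened this way; the sketch makes no claim about where the real implementation might violate it.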



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
