[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

Uwe Schindler (JIRA) Fri, 23 Jan 2009 07:56:26 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666574#action_12666574 ]


Uwe Schindler commented on LUCENE-1483:
---------------------------------------

Hi Michael & Mike,
great work. I patched my tree and compiled, no problems. One test failed with 
the following exception:
{code}
<testcase classname="org.apache.lucene.benchmark.quality.TestQualityRun" name="testTrecQuality" time="64.656">
  <error type="java.lang.NullPointerException">java.lang.NullPointerException
        at org.apache.lucene.benchmark.byTask.feeds.ReutersDocMaker.getNextDocData(ReutersDocMaker.java:112)
        at org.apache.lucene.benchmark.byTask.feeds.BasicDocMaker.makeDocument(BasicDocMaker.java:98)
        at org.apache.lucene.benchmark.byTask.tasks.AddDocTask.setup(AddDocTask.java:61)
        at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:89)
        at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:141)
        at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:122)
        at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:92)
        at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:141)
        at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:122)
        at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:92)
        at org.apache.lucene.benchmark.byTask.utils.Algorithm.execute(Algorithm.java:246)
        at org.apache.lucene.benchmark.byTask.Benchmark.execute(Benchmark.java:73)
        at org.apache.lucene.benchmark.byTask.TestPerfTasksLogic.execBenchmark(TestPerfTasksLogic.java:455)
        at org.apache.lucene.benchmark.quality.TestQualityRun.createReutersIndex(TestQualityRun.java:173)
        at org.apache.lucene.benchmark.quality.TestQualityRun.testTrecQuality(TestQualityRun.java:56)
  </error>
</testcase>
{code}
I am not sure whether this failing test has anything to do with the patch.

I then tested the resulting lucene-core.jar as a drop-in replacement for my big 
portal (www.pangaea.de), as we have been waiting for months for the optimized 
reopen with sorted search results. In my test environment I got no errors or 
exceptions. After testing, I put it online on my production system (I will 
watch the webserver's error logs for exceptions).

The speed increase of a sorted search after a reopen of the index (600,000 
docs) that only added some documents in a new cfs segment is incredible. In 
the past, the warmup for filling the field cache took about 3 seconds - now 
results show up with no noticeable delay. So no warmup after reopen is needed 
anymore.
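
To illustrate why the warmup disappears, here is a minimal sketch of a sort-value 
cache keyed per segment rather than per index. It is a toy model, not the 
patch's code: the class PerSegmentCacheSketch, the methods valuesFor() and 
warm(), and the segment names are all hypothetical, and the int[] stands in for 
whatever the real FieldCache loads.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model: sort values cached per segment instead of per whole index. */
public class PerSegmentCacheSketch {
    // Hypothetical stand-in for a FieldCache keyed on individual segments.
    private final Map<String, int[]> cache = new HashMap<>();
    private int loads = 0; // number of segments that actually had to be loaded

    /** Returns the cached values for a segment, loading them on first access only. */
    public int[] valuesFor(String segmentName, int docCount) {
        return cache.computeIfAbsent(segmentName, name -> {
            loads++;
            return new int[docCount]; // a real cache would read the field's terms here
        });
    }

    /** Touches every segment of a (re)opened index; returns how many were newly loaded. */
    public int warm(Map<String, Integer> segments) {
        int before = loads;
        segments.forEach(this::valuesFor);
        return loads - before;
    }

    public static void main(String[] args) {
        PerSegmentCacheSketch fieldCache = new PerSegmentCacheSketch();
        Map<String, Integer> segments = new LinkedHashMap<>();
        segments.put("_big", 600000);          // existing large segment
        System.out.println("first open, segments loaded: " + fieldCache.warm(segments));
        segments.put("_new", 25);              // reopen after adding a few documents
        System.out.println("after reopen, segments loaded: " + fieldCache.warm(segments));
    }
}
```

After the reopen only the small new segment is loaded; the entry for the large 
segment is reused, which matches the behaviour I see.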

One thing I noticed when compiling my code against the new lucene-core.jar:
Top(Field)Docs deprecates the method getMaxScore(). Why? To display search 
results with a score normalized to 1.0, you need to divide by the maximum 
score. The docs say you should implement your own HitCollector (why, when I 
want to use TopDocs?), but why does TopDocs deprecate the maximum score? For 
relevance-sorted TopDocs this is no problem, as the maximum score is in 
ScoreDoc[0], but for docs sorted by fields (returned as TopFieldDocs, which 
extends TopDocs) you need the max score. Generic search code that does not 
distinguish between TopDocs and TopFieldDocs has to divide by 
TopDocs.getMaxScore().
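
As a rough sketch of the workaround without getMaxScore(): scan the returned 
scores for the largest one and divide by it. The class and method names here 
are hypothetical and the code works on a plain float[] rather than on Lucene's 
ScoreDoc[]. Note that this only sees the hits that were returned; with a field 
sort the best-scoring document may not be among them, so the result can differ 
from dividing by the true maximum.

```java
import java.util.Arrays;

/** Normalizes hit scores to the range [0, 1] without relying on getMaxScore(). */
public class ScoreNormalizationSketch {

    /** Divides every score by the largest score among the given hits. */
    public static float[] normalize(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            max = Math.max(max, s); // with a field sort the maximum can be at any position
        }
        float[] normalized = new float[scores.length];
        if (max <= 0f) {
            return normalized; // no positive score: avoid dividing by zero
        }
        for (int i = 0; i < scores.length; i++) {
            normalized[i] = scores[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        // Scores in field-sort order, so the highest score is not first.
        float[] scores = {0.5f, 2.0f, 1.0f};
        System.out.println(Arrays.toString(normalize(scores)));
    }
}
```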

I am happy, +1 for committing. But let getMaxScore live after Lucene 3.0!

> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1483
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms for every segment are 
> enumerated against each segment - because of the likely logarithmic change in 
> terms per segment, this can be very wasteful. Searching individual segments 
> avoids this cost. The term/document statistics from the multireader are used 
> to score results for each segment.
> When sorting, it's more difficult to use a single HitCollector for each sub 
> searcher. Ordinals are not comparable across segments. To account for this, a 
> new field sort enabled HitCollector is introduced that is able to collect and 
> sort across segments (because of its ability to compare ordinals across 
> segments). This TopFieldCollector class will collect the values/ordinals for 
> a given segment, and upon moving to the next segment, translate any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All in all, the switch seems to provide numerous performance benefits, in 
> both sorted and non-sorted search. We were seeing a good loss on indices with 
> lots of segments (1000?) and certain queue sizes / queries, but the latest 
> results seem to show that's been mostly taken care of (you shouldn't be using 
> such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
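
The central idea of the description above, one collector shared across all 
segments, can be sketched roughly as follows. This is a simplified model and 
not the patch's API: GlobalDocCollector, setNextSegment() and search() are 
hypothetical names, and matching documents are passed in as plain arrays 
instead of coming from a query. Each segment numbers its documents from 0, so 
the collector is told the segment's starting offset before that segment is 
searched.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of searching segments one at a time with a single shared collector. */
public class SharedCollectorSketch {

    /** Hypothetical collector that rebases segment-local doc ids to index-wide ids. */
    public static class GlobalDocCollector {
        private int docBase;
        private final List<Integer> hits = new ArrayList<>();

        /** Called once before each segment is searched. */
        public void setNextSegment(int docBase) {
            this.docBase = docBase;
        }

        /** Called for every matching document, with its segment-local id. */
        public void collect(int segmentDoc) {
            hits.add(docBase + segmentDoc);
        }

        public List<Integer> hits() {
            return hits;
        }
    }

    /** Searches each segment in turn, reusing the same collector instance. */
    public static List<Integer> search(int[] segmentSizes, int[][] matchesPerSegment) {
        GlobalDocCollector collector = new GlobalDocCollector();
        int docBase = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            collector.setNextSegment(docBase);
            for (int doc : matchesPerSegment[seg]) {
                collector.collect(doc);
            }
            docBase += segmentSizes[seg]; // next segment starts after this one
        }
        return collector.hits();
    }

    public static void main(String[] args) {
        int[] sizes = {5, 3};               // two segments with 5 and 3 documents
        int[][] matches = {{0, 4}, {1}};    // segment-local ids of matching docs
        System.out.println(search(sizes, matches));
    }
}
```

Because only the offset changes between segments, any per-segment state (cached 
sort values, filters) can stay attached to the segment it was built for.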

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


