[jira] Updated: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

Michael McCandless (JIRA) Wed, 21 Jan 2009 12:14:23 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael McCandless updated LUCENE-1483:
---------------------------------------

    Attachment: LUCENE-1483.patch


OK, I made a bunch of small fixes.

With this patch there is one back-compat test failing
(TestScorerPerf), because it seems to assume that a searcher on an
index with 0 segments will call Scorer.collect(HitCollector).  We used
to do so, but no longer as of this patch, and I don't think it's a
reasonable assumption, so I plan to commit a small fix to the test on
the back compat branch (and trunk) to add a single empty doc to the
index.

One thing worries me: we changed TopDocCollector to subclass
MultiReaderHitCollector instead of HitCollector, which means it
(TopDocCollector) now tracks docBase.  But this means subclasses of
TopDocCollector will suddenly see un-rebased (segment-relative) docIDs
passed to their collect methods, resulting in a hard-to-detect bug for
the app.

Maybe we should leave TopDocCollector how it was, and add a new
top-docs collector class (any name suggestions?) that subclasses
MultiReaderHitCollector?
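
To make the docBase concern concrete, here is a minimal Java sketch. It
is an illustration only: the classes are simplified stand-ins with
hypothetical names (SketchTopDocCollector, LegacyAppCollector,
DocBaseDemo), not the real Lucene classes or signatures. It assumes the
searcher announces each segment's docBase before collecting that
segment's hits, and that an app subclass written before this change
expects index-wide docIDs in collect.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration; NOT the real Lucene API.
abstract class SketchHitCollector {
    abstract void collect(int doc, float score);
}

abstract class SketchMultiReaderHitCollector extends SketchHitCollector {
    // Assumed hook: called before each segment is searched.
    abstract void setNextReader(int docBase);
}

// Stand-in for a top-docs collector that now tracks docBase itself.
class SketchTopDocCollector extends SketchMultiReaderHitCollector {
    protected int docBase;
    final List<Integer> collected = new ArrayList<>();

    @Override
    void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    @Override
    void collect(int doc, float score) {
        collected.add(docBase + doc); // rebased to an index-wide docID
    }
}

// Hypothetical app subclass written when collect() received index-wide docIDs.
class LegacyAppCollector extends SketchTopDocCollector {
    final List<Integer> seenByApp = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        seenByApp.add(doc); // silently records the segment-relative docID
        super.collect(doc, score);
    }
}

class DocBaseDemo {
    // Simulates a per-segment search over two segments of 3 and 2 docs.
    static void run(SketchTopDocCollector collector) {
        int[] segmentSizes = {3, 2};
        int[][] hitsPerSegment = {{0, 2}, {1}};
        int docBase = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            collector.setNextReader(docBase);
            for (int doc : hitsPerSegment[seg]) {
                collector.collect(doc, 1.0f);
            }
            docBase += segmentSizes[seg];
        }
    }
}
```

In this sketch the base class ends up with index-wide IDs 0, 2, 4, while
the subclass has recorded 0, 2, 1 for the same hits, with no error
raised anywhere.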

Changes:

  * Renamed StringOrdValOnDem2Comparator --> StringOrdValComparator;
    commented out the other String*Comparator except for StringVal

  * Switched over to sorted sub-readers for all searching in
    IndexSearcher

  * Moved sub-reader sorting to IndexSearcher's ctor

  * Also removed SortField.STRING* except for STRING (uses
    StringOrdValComparator) and STRING_VAL (uses StringValComparator)

  * Changed FieldComparatorSource from interface --> abstract class;
    removed Serializable

  * Moved the "set legacy sort when using legacy custom sort" logic
    into SortField out of IndexSearcher

  * Fixed TimeLimitedCollector, ParallelMultiSearcher, MultiSearcher
    to "pass down" MultiReaderHitCollector if that's what they were
    passed in, else wrap the HitCollector

  * Added test in TestSort to test FieldComparatorSource (custom
    sort)

  * Addressed/removed all nocommits, removed dead code

  * Added some javadocs; removed unused imports

  * Fixed whitespace

  * Added entries to CHANGES.txt
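
The "pass down, else wrap" item above can be sketched roughly as
follows. This is a hedged illustration, not the code in the patch: the
class names (HitCollectorSketch, MultiReaderHitCollectorSketch,
RebasingWrapper, CollectorAdapters) and the setNextReader(int docBase)
hook are assumptions made up for the example.

```java
// Simplified stand-ins for illustration; NOT the real Lucene API.
abstract class HitCollectorSketch {
    abstract void collect(int doc, float score);
}

abstract class MultiReaderHitCollectorSketch extends HitCollectorSketch {
    // Assumed hook: called before each segment is searched.
    abstract void setNextReader(int docBase);
}

// Wraps a plain collector so it keeps receiving index-wide docIDs.
final class RebasingWrapper extends MultiReaderHitCollectorSketch {
    private final HitCollectorSketch delegate;
    private int docBase;

    RebasingWrapper(HitCollectorSketch delegate) {
        this.delegate = delegate;
    }

    @Override
    void setNextReader(int docBase) {
        this.docBase = docBase;
    }

    @Override
    void collect(int doc, float score) {
        delegate.collect(docBase + doc, score); // rebase before delegating
    }
}

final class CollectorAdapters {
    private CollectorAdapters() {}

    // Pass a segment-aware collector straight through; wrap anything else.
    static MultiReaderHitCollectorSketch asMultiReader(HitCollectorSketch c) {
        if (c instanceof MultiReaderHitCollectorSketch) {
            return (MultiReaderHitCollectorSketch) c;
        }
        return new RebasingWrapper(c);
    }
}
```

The point of passing the collector through unchanged is that a
segment-aware collector gets the real docBase notifications instead of
being hidden behind a wrapper that rebases for it.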


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1483
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, sortBench.py, sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms for every segment are 
> enumerated against each segment - because of the likely logarithmic change in 
> terms per segment, this can be very wasteful. Searching individual segments 
> avoids this cost. The term/document statistics from the multireader are used 
> to score results for each segment.
> When sorting, it's more difficult to use a single HitCollector for each sub 
> searcher. Ordinals are not comparable across segments. To account for this, a 
> new field sort enabled HitCollector is introduced that is able to collect and 
> sort across segments (because of its ability to compare ordinals across 
> segments). This TopFieldCollector class will collect the values/ordinals for 
> a given segment, and upon moving to the next segment, translate any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All in all, the switch seems to provide numerous performance benefits, in 
> both sorted and non-sorted search. We were seeing a good loss on indices with 
> lots of segments (1000?) and certain queue sizes / queries, but the latest 
> results seem to show that's been mostly taken care of (you shouldn't be using 
> such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
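
As an illustration of the ordinal translation described in the quoted
summary above, here is a small Java sketch. It is not the
TopFieldCollector or FieldComparator code from the patch: SegmentTerms
and MinValueAcrossSegments are hypothetical names, the "queue" holds
only the single smallest value, and the translation is done eagerly on
each segment switch rather than lazily. The idea it shows is that an
ordinal only has meaning inside its own segment, so the retained entry
is kept as a value and re-located in the new segment's term order
before any ordinal comparison.

```java
import java.util.Arrays;

// Hypothetical stand-in for one segment's field data.
final class SegmentTerms {
    final String[] sortedTerms; // ordinal -> term, sorted ascending, unique
    final int[] docToOrd;       // segment-local docID -> ordinal

    SegmentTerms(String[] sortedTerms, int[] docToOrd) {
        this.sortedTerms = sortedTerms;
        this.docToOrd = docToOrd;
    }
}

// Tracks the smallest field value seen across segments (a queue of size 1).
final class MinValueAcrossSegments {
    private SegmentTerms current;
    private String bestValue;  // null until the first hit
    private int bestOrd;       // position of bestValue in the CURRENT segment's ordinals
    private boolean bestExact; // true if bestValue occurs in the current segment

    void setNextSegment(SegmentTerms segment) {
        current = segment;
        if (bestValue == null) {
            return;
        }
        // Translate the retained value into the new segment's ordinal space.
        int idx = Arrays.binarySearch(segment.sortedTerms, bestValue);
        if (idx >= 0) {
            bestOrd = idx;
            bestExact = true;
        } else {
            // -idx - 1 is the insertion point; one less is the largest
            // ordinal whose term sorts before bestValue (or -1 if none).
            bestOrd = -idx - 2;
            bestExact = false;
        }
    }

    void collect(int doc) {
        int ord = current.docToOrd[doc];
        boolean better = bestValue == null
                || ord < bestOrd
                || (!bestExact && ord == bestOrd);
        if (better) {
            bestValue = current.sortedTerms[ord];
            bestOrd = ord;
            bestExact = true;
        }
    }

    String best() {
        return bestValue;
    }
}
```

Within a segment every comparison is a plain int comparison; the string
work happens once per retained entry per segment switch, which is where
the saving over value-by-value comparison would come from.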

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
