[ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1483:
---------------------------------------

    Attachment: LUCENE-1483.patch

OK I made a bunch of small fixes.

With this patch there is one back-compat test failing (TestScorerPerf), because it seems to assume that a searcher on an index with 0 segments will call Scorer.collect(HitCollector). We used to do so, but no longer as of this patch, and I don't think it's a reasonable assumption, so I plan to commit a small fix to the test on the back compat branch (and trunk) to add a single empty doc to the index.

One thing worries me: we changed TopDocCollector to subclass MultiReaderHitCollector instead of HitCollector, which means it (TopDocCollector) now tracks docBase. But this means subclasses of TopDocCollector will suddenly see the not-rebased docID passed to their collect methods, resulting in a hard-to-detect bug for the app. Maybe we should leave TopDocCollector how it was, and make a new TopDocCollector (any name suggestion?) that subclasses from MultiReaderHitCollector?

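To make the docBase concern concrete, here is a minimal sketch in plain Java. It does not use the Lucene classes: SegmentAwareCollector, NaiveIdRecorder, RebasingIdRecorder and DocBaseDemo are hypothetical names made up for illustration, and setNextSegment is an assumed hook, not necessarily the method the patch defines. It only shows how a subclass written against the old "index-wide docID" contract silently records the wrong IDs once collect() receives segment-relative IDs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a collector that is told about segment boundaries.
abstract class SegmentAwareCollector {
    // Offset of the current segment's first document within the whole index.
    protected int docBase;

    // Assumed hook: called by the search loop before each new segment.
    void setNextSegment(int docBase) {
        this.docBase = docBase;
    }

    // doc is relative to the current segment, not to the whole index.
    abstract void collect(int doc, float score);
}

// Written against the old contract: treats doc as an index-wide ID.
class NaiveIdRecorder extends SegmentAwareCollector {
    final List<Integer> ids = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        ids.add(doc); // bug: the segment offset is ignored
    }
}

// Written against the new contract: rebases before recording.
class RebasingIdRecorder extends SegmentAwareCollector {
    final List<Integer> ids = new ArrayList<>();

    @Override
    void collect(int doc, float score) {
        ids.add(docBase + doc);
    }
}

class DocBaseDemo {
    // Simulates a per-segment search that hits the first doc of every segment.
    static void search(SegmentAwareCollector collector, int[] segmentSizes) {
        int base = 0;
        for (int size : segmentSizes) {
            collector.setNextSegment(base);
            collector.collect(0, 1.0f);
            base += size;
        }
    }

    public static void main(String[] args) {
        int[] segmentSizes = {3, 5, 2};

        NaiveIdRecorder naive = new NaiveIdRecorder();
        search(naive, segmentSizes);
        System.out.println("naive:   " + naive.ids);

        RebasingIdRecorder rebasing = new RebasingIdRecorder();
        search(rebasing, segmentSizes);
        System.out.println("rebased: " + rebasing.ids);
    }
}
```

Both subclasses compile and run without any error, which is why the bug would be hard to detect: only the recorded IDs differ.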
Changes:

  * Renamed StringOrdValOnDem2Comparator --> StringOrdValComparator; commented out the other String*Comparator except for StringVal

  * Switched over to sorted sub-readers for all searching in IndexSearcher

  * Moved sub-reader sorting to IndexSearcher's ctor

  * Also removed SortField.STRING* except for STRING (uses StringOrdValComparator) and STRING_VAL (uses StringValComparator)

  * Changed FieldComparatorSource from interface --> abstract class; removed Serializable

  * Moved the "set legacy sort when using legacy custom sort" logic into SortField out of IndexSearcher

  * Fixed TimeLimitedCollector, ParallelMultiSearcher, MultiSearcher to "pass down" MultiReaderHitCollector if that's what they were passed in, else wrap the HitCollector

  * Added test in TestSort to test FieldComparatorSource (custom sort)

  * Addressed/removed all nocommits, removed dead code

  * Added some javadocs; removed unused imports

  * Fixed whitespace

  * Added entries to CHANGES.txt

> Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1483
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-1483-partial.patch, LUCENE-1483.patch (multiple revisions attached under the same name), sortBench.py, sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The current method of searching multiple segments is to use a MultiSegmentReader and treat all of the segments as one. This causes filters and FieldCaches to be keyed to the MultiReader and makes reopen expensive. If only a few segments change, the FieldCache is still loaded for all of them.
>
> This patch changes things by searching each individual segment one at a time, but sharing the HitCollector used across each segment. This allows FieldCaches and Filters to be keyed on individual SegmentReaders, making reopen much cheaper. FieldCache loading over multiple segments can be much faster as well - with the old method, all unique terms for every segment are enumerated against each segment - because of the likely logarithmic change in terms per segment, this can be very wasteful. Searching individual segments avoids this cost. The term/document statistics from the multireader are used to score results for each segment.
>
> When sorting, it's more difficult to use a single HitCollector for each sub searcher. Ordinals are not comparable across segments. To account for this, a new field-sort-enabled HitCollector is introduced that is able to collect and sort across segments (because of its ability to compare ordinals across segments). This TopFieldCollector class will collect the values/ordinals for a given segment, and upon moving to the next segment, translate any ordinals/values so that they can be compared against the values for the new segment. This is done lazily.
>
> All in all, the switch seems to provide numerous performance benefits, in both sorted and non-sorted search. We were seeing a good loss on indices with lots of segments (1000?) and certain queue sizes / queries, but the latest results seem to show that's been mostly taken care of (you shouldn't be using such a large queue on such a segmented index anyway).
>
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
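
The ordinal translation described in the issue summary above can be illustrated with a small plain-Java sketch. This is not the TopFieldCollector code from the patch: OrdinalTranslationSketch, Segment and smallestValue are hypothetical names, the "queue" is reduced to a single best entry, and the translation is done eagerly at each segment switch rather than lazily. The point it shows is that ordinals are only meaningful inside one segment, so the currently held value must be re-expressed in the next segment's ordinal space before cheap int comparisons can resume.

```java
import java.util.Arrays;

class OrdinalTranslationSketch {

    // Hypothetical per-segment view of one string field.
    static final class Segment {
        final String[] sortedTerms; // unique terms in sorted order; array index = ordinal
        final int[] docOrds;        // docOrds[doc] = ordinal of that doc's term

        Segment(String[] sortedTerms, int[] docOrds) {
            this.sortedTerms = sortedTerms;
            this.docOrds = docOrds;
        }
    }

    // Returns the smallest field value over all segments (null if there are no docs).
    // Within a segment, candidates are compared by ordinal only.
    static String smallestValue(Segment[] segments) {
        String best = null;
        for (Segment seg : segments) {
            // "bound" is the number of terms in this segment that sort strictly
            // before the current best value. A doc beats the best iff ord < bound.
            int bound;
            if (best == null) {
                bound = seg.sortedTerms.length; // nothing held yet: any doc wins
            } else {
                // Translate the held value into this segment's ordinal space.
                int pos = Arrays.binarySearch(seg.sortedTerms, best);
                bound = pos >= 0 ? pos : -(pos + 1); // insertion point if absent
            }
            for (int ord : seg.docOrds) {
                if (ord < bound) {          // int comparison, no string compare
                    bound = ord;
                    best = seg.sortedTerms[ord];
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Segment[] segments = {
            new Segment(new String[] {"kiwi", "pear"}, new int[] {1, 0}),
            new Segment(new String[] {"apple", "mango", "zebra"}, new int[] {2, 1, 0}),
            new Segment(new String[] {"banana"}, new int[] {0}),
        };
        System.out.println(smallestValue(segments));
    }
}
```

Note that ordinal 0 means "kiwi" in the first segment, "apple" in the second and "banana" in the third, which is why the raw ordinals cannot be compared directly across segments.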