[jira] Commented: (LUCENE-1483) Change IndexSearcher multisegment searches to search each individual segment using a single HitCollector

Yonik Seeley (JIRA) Fri, 30 Jan 2009 07:13:24 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668891#action_12668891 ]


Yonik Seeley commented on LUCENE-1483:
--------------------------------------

My previous comment:
{quote}
I tracked down how this patch was causing Solr failures:

ExternalFileField in Solr maps from a uniqueKey to a float value from a 
separate file.
There is a cache that is essentially keyed by (IndexReader,field) that gives 
back a float[].

Any change in the index used to cause all values to be updated (cache miss 
because the MultiReader was a different instance). Now, since it's called 
segment-at-a-time, only new segments are reloaded from the file, leaving older 
segments with stale values.

It's certainly a gray area... but perhaps Solr won't be the only one 
affected by this - maybe apps that implement security filters, etc.?
{quote}
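
To make the staleness concrete, here is a minimal sketch. It does not use Solr or Lucene classes: Reader, externalFileValue, valuesFor and demo are hypothetical stand-ins invented for illustration, and the field part of the cache key is left out. It only shows that a cache keyed by reader identity misses for a new top-level instance but hits for an unchanged segment.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Map;

public class StaleSegmentCacheSketch {
    /** Hypothetical stand-in for an index reader; only its identity matters here. */
    static final class Reader {
        final int docCount;
        Reader(int docCount) { this.docCount = docCount; }
    }

    /** Stand-in for the external file: every document gets this value when loaded. */
    static float externalFileValue;

    /** Cache keyed by reader identity (the field part of the key is omitted). */
    static final Map<Reader, float[]> CACHE = new IdentityHashMap<>();

    static float[] valuesFor(Reader reader) {
        float[] cached = CACHE.get(reader);
        if (cached == null) {                 // cache miss: read the "file"
            cached = new float[reader.docCount];
            Arrays.fill(cached, externalFileValue);
            CACHE.put(reader, cached);
        }
        return cached;
    }

    /** Returns { top-level value, old-segment value, new-segment value } after a reopen. */
    static float[] demo() {
        CACHE.clear();
        Reader seg0 = new Reader(2);
        Reader top1 = new Reader(2);          // top-level reader over seg0

        externalFileValue = 1.0f;
        valuesFor(top1);                      // old style: keyed by top-level reader
        valuesFor(seg0);                      // new style: keyed by segment

        externalFileValue = 2.0f;             // the external file is rewritten
        Reader seg1 = new Reader(1);          // reopen adds one new segment
        Reader top2 = new Reader(3);          // and yields a new top-level instance

        return new float[] {
            valuesFor(top2)[0],               // miss, so reloaded from the "file"
            valuesFor(seg0)[0],               // hit, so still the value from before the rewrite
            valuesFor(seg1)[0]                // miss, so reloaded from the "file"
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [2.0, 1.0, 2.0]
    }
}
```

The middle value is the stale one: the old segment keeps the array that was loaded before the file changed.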

bq. Yonik, why was the failure so intermittent? 

It failed for others but not for me because of a Solr bug that prevented 
IndexReader.reopen() from being used on Windows.
As for why it reportedly worked for Mark when he built Lucene himself... 
<shrug>... at this point, perhaps a testing error.

[...]
{quote}
Lucene implicitly assumes that a FieldCache's arrays do not change for
a given segment; this is normally safe since the arrays are derived
from the postings in the field (which are write once).

But it sounds like Solr changed that assumption, and the values in the
(Solr-subclass of) FieldCache's arrays are now derived from something
external, which is no longer write once.
{quote}
Right... it used to hold in Solr because nothing really operated below the 
MultiReader level.
The intention is that the entire file is read at the time a new IndexReader 
is opened.
This patch changes that, because the cache is now consulted segment-at-a-time.

{quote}
How do you plan to fix it with Solr? It seems like, since you are
maintaining a private cache, you could forcefully evict entries from
the cache for all SegmentReaders whenever the external file has
changed (or a new MultiSegmentReader had been opened)?
{quote}

It's not so easy... the same segment could be associated with two different 
active MultiReaders (with a different set of values for each).  When the scorer 
is created, only the SegmentReader is passed with no other context.
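
A rough sketch of the conflict, under stated assumptions: Segment, BY_SEGMENT, scorerLookup and demo are hypothetical names, not Solr or Lucene API, and the "evict and reload" step stands in for the eviction idea quoted above. With only the segment available as a key, there is a single slot for a segment shared by two active top-level readers.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.Map;

public class SharedSegmentSketch {
    /** Hypothetical stand-in for a segment reader. */
    static final class Segment { }

    /** One array per segment: the only key available when just the segment is passed in. */
    static final Map<Segment, float[]> BY_SEGMENT = new IdentityHashMap<>();

    /** What a scorer could do when it is given the segment and nothing else. */
    static float scorerLookup(Segment segment, int doc) {
        return BY_SEGMENT.get(segment)[doc];
    }

    /** Returns { value seen through reader A before the reopen, value seen through A after it }. */
    static float[] demo() {
        BY_SEGMENT.clear();
        Segment shared = new Segment();

        // Top-level reader A is opened while the external file says 1.0.
        BY_SEGMENT.put(shared, new float[] {1.0f});
        float seenByA = scorerLookup(shared, 0);

        // Top-level reader B is opened after the file changes to 2.0. Evicting and
        // reloading replaces the only slot the shared segment has.
        BY_SEGMENT.remove(shared);
        BY_SEGMENT.put(shared, new float[] {2.0f});

        // A is still active, but a later search through A now gets B's value.
        float seenByAAfterReopen = scorerLookup(shared, 0);
        return new float[] {seenByA, seenByAAfterReopen};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [1.0, 2.0]
    }
}
```

Telling the two apart would need the top-level reader as part of the key, and that is exactly the context the scorer does not receive.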


> Change IndexSearcher multisegment searches to search each individual segment 
> using a single HitCollector
> --------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1483
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9
>            Reporter: Mark Miller
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1483-backcompat.patch, LUCENE-1483-partial.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch, LUCENE-1483.patch, sortBench.py, sortCollate.py
>
>
> This issue changes how an IndexSearcher searches over multiple segments. The 
> current method of searching multiple segments is to use a MultiSegmentReader 
> and treat all of the segments as one. This causes filters and FieldCaches to 
> be keyed to the MultiReader and makes reopen expensive. If only a few 
> segments change, the FieldCache is still loaded for all of them.
> This patch changes things by searching each individual segment one at a time, 
> but sharing the HitCollector used across each segment. This allows 
> FieldCaches and Filters to be keyed on individual SegmentReaders, making 
> reopen much cheaper. FieldCache loading over multiple segments can be much 
> faster as well - with the old method, all unique terms for every segment are 
> enumerated against each segment - because of the likely logarithmic change in 
> terms per segment, this can be very wasteful. Searching individual segments 
> avoids this cost. The term/document statistics from the multireader are used 
> to score results for each segment.
> When sorting, it's more difficult to use a single HitCollector for each sub 
> searcher. Ordinals are not comparable across segments. To account for this, a 
> new field-sort-enabled HitCollector is introduced that is able to collect and 
> sort across segments (because of its ability to compare ordinals across 
> segments). This TopFieldCollector class will collect the values/ordinals for 
> a given segment, and upon moving to the next segment, translate any 
> ordinals/values so that they can be compared against the values for the new 
> segment. This is done lazily.
> All in all, the switch seems to provide numerous performance benefits, in 
> both sorted and non-sorted search. We were seeing a good loss on indices with 
> lots of segments (1000?) and certain queue sizes / queries, but the latest 
> results seem to show that's been mostly taken care of (you shouldn't be using 
> such a large queue on such a segmented index anyway).
> * Introduces
> ** MultiReaderHitCollector - a HitCollector that can collect across multiple 
> IndexReaders. Old HitCollectors are wrapped to support multiple IndexReaders.
> ** TopFieldCollector - a HitCollector that can compare values/ordinals across 
> IndexReaders and sort on fields.
> ** FieldValueHitQueue - a Priority queue that is part of the 
> TopFieldCollector implementation.
> ** FieldComparator - a new Comparator class that works across IndexReaders. 
> Part of the TopFieldCollector implementation.
> ** FieldComparatorSource - new class to allow for custom Comparators.
> * Alters
> ** IndexSearcher uses a single HitCollector to collect hits against each 
> individual SegmentReader. All the other changes stem from this ;)
> * Deprecates
> ** TopFieldDocCollector
> ** FieldSortedHitQueue
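
The issue description's point that ordinals are not comparable across segments can be illustrated with a small sketch. This is not the TopFieldCollector or FieldComparator code from the patch: the class, the term arrays and the method names are made up for illustration, and translating through a binary search is just one plausible way to re-express a value in another segment's ordinal space.

```java
import java.util.Arrays;

public class SegmentOrdinalSketch {
    // Sorted unique terms per segment; a term's index is its ordinal in that segment.
    static final String[] SEGMENT_A = {"apple", "cherry"};
    static final String[] SEGMENT_B = {"banana", "cherry", "date"};

    /** Naive comparison: treats ordinals from different segments as if they shared one space. */
    static int compareByOrdinal(int ordInA, int ordInB) {
        return Integer.compare(ordInA, ordInB);
    }

    /** Correct comparison: resolve each ordinal to its term in its own segment first. */
    static int compareByValue(int ordInA, int ordInB) {
        return SEGMENT_A[ordInA].compareTo(SEGMENT_B[ordInB]);
    }

    /**
     * Re-expresses a term in another segment's ordinal space. A non-negative result is
     * the term's ordinal there; a negative result encodes -(insertionPoint) - 1, meaning
     * the term is absent and would sort just before the ordinal at insertionPoint.
     */
    static int translate(String term, String[] newSegmentTerms) {
        return Arrays.binarySearch(newSegmentTerms, term);
    }

    public static void main(String[] args) {
        // "cherry" is ordinal 1 in A, "banana" is ordinal 1 in B.
        System.out.println(compareByOrdinal(1, 1));       // 0: looks like a tie
        System.out.println(compareByValue(1, 1) > 0);     // true: "cherry" sorts after "banana"
        System.out.println(translate("cherry", SEGMENT_B)); // 1
        System.out.println(translate("apple", SEGMENT_B));  // -1: absent, insertion point 0
    }
}
```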

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


