[
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746263#action_12746263
]
Tim Smith commented on LUCENE-1821:
-----------------------------------
I started integrating the per-segment searching (and removed my hack that was
doing the searching on a MultiReader).
In order to get my query implementations to work, I had to hold onto my
Searcher in the Weight constructor and add a getIndexReaderBase() method to my
IndexSearcher implementation; this seems to be working well.
I had three query implementations that were affected:
* One used a cache that will be easy to create per segment (I will have it use
a per-segment cache as soon as I can); see the sketch after this list.
* One used an int[] ord index (the underlying cache cannot be made per
segment).
* One used a cached DocIdSet created over the top-level MultiReader. It should
be possible to have a DocIdSet per segment reader here, but this will take
some more thought (the source of the matching docids is a separate index). I
will also need to know which sub-DocIdSet to use based on which IndexReader is
passed to scorer(), but that shouldn't be a big deal.
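For the per-segment pieces, the plan is to key the cached structure off the
IndexReader instance that gets passed to scorer(). A minimal sketch of that
(buildDocIdSetFor() is a placeholder for the actual computation):
{code}
// Sketch only: one cached DocIdSet per segment reader, built lazily.
// Weak keys let an entry go away once its IndexReader is closed
// (roughly the same pattern FieldCache uses internally).
private final Map perSegment = Collections.synchronizedMap(new WeakHashMap());

DocIdSet getDocIdSet(IndexReader reader) throws IOException {
  DocIdSet docs = (DocIdSet) perSegment.get(reader);
  if (docs == null) {
    docs = buildDocIdSetFor(reader); // placeholder: computes this segment's matches
    perSegment.put(reader, docs);
  }
  return docs;
}
{code}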
I'm a bit concerned that I may not be testing "multi-segment" searching quite
properly right now, though, since I think most of the indexes being tested
only have one segment.
On that topic: if I create a subclass of LogByteSizeMergePolicy and return
null from findMerges() and findMergesToExpungeDeletes(), will that guarantee
that segments are only merged when I explicitly optimize? In that case, I can
just pepper in some commits as I add documents to guarantee that I have more
than one segment. Something like the sketch below is what I have in mind.
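A sketch of that subclass (assuming IndexWriter treats a null
MergeSpecification as "nothing to merge"; the constructor follows the 2.9-era
API, where a MergePolicy is constructed with its IndexWriter):
{code}
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.SegmentInfos;

// Sketch only: never merge segments unless optimize() is called explicitly.
public class ExplicitOnlyMergePolicy extends LogByteSizeMergePolicy {
  public ExplicitOnlyMergePolicy(IndexWriter writer) {
    super(writer); // 2.9-era API: the policy is constructed with its writer
  }

  public MergeSpecification findMerges(SegmentInfos infos) throws IOException {
    return null; // no background merges
  }

  public MergeSpecification findMergesToExpungeDeletes(SegmentInfos infos)
      throws IOException {
    return null; // no merges to expunge deletes either
  }

  // findMergesForOptimize() is inherited unchanged, so an explicit
  // optimize() still merges all segments.
}
{code}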
Overall, I am really liking the per-segment stuff, and the Collector API in
general.
It has already made it possible to optimize a good deal of things away (like
calling Scorer.score() for docs that end up getting filtered out). However, I
hit some deoptimization due to some of the crazy stuff I had to do to make
those three query implementations work, but this should really be isolated to
one of the implementations (and I can hopefully re-optimize those cases
anyway).
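The score-skipping bit looks roughly like this (a sketch; getFilteredDocs()
and collectHit() are placeholders for the actual cache lookup and hit
handling):
{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.util.OpenBitSet;

// Sketch only: check a per-segment filter before asking the Scorer for a
// score, so filtered-out docs never pay for Scorer.score().
public abstract class FilteredCollector extends Collector {
  private Scorer scorer;
  private int docBase;
  private OpenBitSet accept; // docs in the current segment that pass the filter

  public void setScorer(Scorer scorer) {
    this.scorer = scorer;
  }

  public void setNextReader(IndexReader reader, int docBase) throws IOException {
    this.docBase = docBase;
    this.accept = getFilteredDocs(reader); // placeholder per-segment lookup
  }

  public void collect(int doc) throws IOException {
    if (accept.get(doc)) {               // cheap test first...
      float score = scorer.score();      // ...score only the survivors
      collectHit(docBase + doc, score);  // placeholder: record the hit
    }
  }

  public boolean acceptsDocsOutOfOrder() {
    return true;
  }

  // placeholders for the real filter lookup and hit handling
  protected abstract OpenBitSet getFilteredDocs(IndexReader reader) throws IOException;
  protected abstract void collectHit(int doc, float score) throws IOException;
}
{code}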
I would still like to see IndexSearcher passed to Weight.scorer(), and the
getIndexReaderBase() method added to IndexSearcher.
This would clean up my current "hacks" to map docids, which look roughly like
the sketch below.
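(A sketch of the hack, inside a custom Weight; MyScorer and MyIndexSearcher
are placeholder names, and the rest of the Weight is omitted.)
{code}
// Sketch only: the Weight grabs the custom searcher at construction time
// so scorer() can rebase per-segment docids into top-level docids.
private final MyIndexSearcher searcher; // IndexSearcher subclass with getIndexReaderBase()

public MyWeight(Searcher searcher) {
  this.searcher = (MyIndexSearcher) searcher; // breaks if any other Searcher is used
}

public Scorer scorer(IndexReader reader, boolean scoreDocsInOrder, boolean topScorer)
    throws IOException {
  // offset of this segment's doc 0 within the top-level reader
  int docBase = searcher.getIndexReaderBase(reader);
  return new MyScorer(reader, docBase); // placeholder Scorer that adds docBase to its hits
}
{code}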
> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
> Key: LUCENE-1821
> URL: https://issues.apache.org/jira/browse/LUCENE-1821
> Project: Lucene - Java
> Issue Type: Bug
> Components: Search
> Affects Versions: 2.9
> Reporter: Tim Smith
> Fix For: 2.9
>
> Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per-segment basis, there is no way for a
> Scorer to know the "actual" doc id for the documents it matches (only the
> relative doc offset into the segment).
> If you use caches in your Scorer that are based on the "entire" index (all
> segments), there is now no way to index into them properly from inside a
> Scorer, because the Scorer is not passed the offset needed to calculate the
> "real" docid.
> Suggest having the Weight.scorer() method also take an integer for the doc
> offset.
> The abstract Weight class should have a constructor that takes this offset,
> as well as a method to get the offset.
> All Weights that have "sub" weights must pass this offset down to the
> created "sub" weights.
> Details on the workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add an "int getIndexReaderBase(IndexReader)" method to your subclass
> * During Weight creation, the Weight must hold onto a reference to the
> passed-in Searcher (cast to your subclass)
> * During Scorer creation, the Scorer must be passed the result of
> YourSearcher.getIndexReaderBase(reader)
> * The Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: a more efficient implementation is possible if you cache the result
> // of gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader) iter.next();
>       if (r == reader) {
>         return maxDoc;
>       }
>       maxDoc += r.maxDoc();
>     }
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight
> implementation