[jira] Issue Comment Edited: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Mark Miller (JIRA) Thu, 20 Aug 2009 18:16:05 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745756#action_12745756
 ]


Mark Miller edited comment on LUCENE-1821 at 8/20/09 6:12 PM:
--------------------------------------------------------------

{quote}
However, this method is actually also a bit broken now with per-segment 
searching: what if reader is a MultiReader?
{quote}

Right - there are many places where this could be the case. You're still free 
to use multi-readers, though we encourage you to switch. We provide a cache 
sanity checker to help you find these cases and evaluate whether or not you 
can make the switch. *edit* I know this doesn't help with filters. I think 
there was an issue, worked on by Hoss and Mike McCandless, that helped address 
that, but I'm not sure whether it helps here or whether this case was 
overlooked. I'll have to go skim that issue again. *edit*

If you just pass 0, many times it will be wrong. Why shouldn't this have access 
to a doc id cache as well? We always ask for everything in the context of the 
Reader given. I think that's the issue. Lucene just never officially supported 
this use case - we can't with MultiSearcher, Searchable, Remote - the API 
doesn't work with the idea that you can count on all the doc ids from a Reader. 
You were taking advantage of the implementation and your limited use of the 
full API - but it's never been part of the API IMHO.
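
To make the "pass 0" problem concrete, here is a minimal sketch in plain Java. 
It is not Lucene code: the names DocBaseSketch, docBases and toGlobal are 
hypothetical, and each segment is represented only by an assumed maxDoc value. 
It shows that a segment-relative doc id only equals the top-level doc id for 
the first segment, which is why an offset of 0 is wrong for every other one.

```java
import java.util.Arrays;

/** Hypothetical illustration only; none of these names are Lucene API. */
public class DocBaseSketch {

    /** Returns the starting top-level doc id of each segment, given each segment's maxDoc. */
    static int[] docBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = base;            // this segment starts where the previous ones end
            base += segmentMaxDocs[i];
        }
        return bases;
    }

    /** Rebases a segment-relative doc id to a doc id in the top-level reader. */
    static int toGlobal(int[] bases, int segment, int localDoc) {
        return bases[segment] + localDoc;
    }

    public static void main(String[] args) {
        // Assumed index layout: three segments holding 100, 50 and 25 documents.
        int[] bases = docBases(new int[] {100, 50, 25});
        System.out.println(Arrays.toString(bases)); // [0, 100, 150]
        System.out.println(toGlobal(bases, 0, 7));  // 7: offset 0 is only right here
        System.out.println(toGlobal(bases, 1, 7));  // 107
    }
}
```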

Perhaps we could one day change things - RMI hasn't really worked out at large 
scale in comparison to other methods (supposedly it is very chatty, though I 
have been told very large installations have been built with it) - and we have 
already factored it into contrib. But this still doesn't fit the current 
model/API, and if we address it, it will take longer than 2.9 to do right IMO.

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per-segment basis, there is no way for a 
> Scorer to know the "actual" doc id for the documents it matches (only the 
> relative doc offset into the segment).
> If using caches in your scorer that are based on the "entire" index (all 
> segments), there is now no way to index into them properly from inside a 
> Scorer because the scorer is not passed the needed offset to calculate the 
> "real" docid.
> Suggest having the Weight.scorer() method also take an integer for the doc 
> offset.
> The abstract Weight class should have a constructor that takes this offset as 
> well as a method to get the offset.
> All Weights that have "sub" weights must pass this offset down to created 
> "sub" weights.
> Details on workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add "int getIndexReaderBase(IndexReader)" method to your subclass
> * During Weight creation, the Weight must hold onto a reference to the 
> passed-in Searcher (cast to your subclass)
> * During Scorer creation, the Scorer must be passed the result of 
> YourSearcher.getIndexReaderBase(reader)
> * Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: a more efficient implementation is possible if you cache the result
> // of gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader)iter.next();
>       if (r == reader) {
>         return maxDoc;
>       } 
>       maxDoc += r.maxDoc();
>     } 
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround makes it so you cannot serialize your custom Weight 
> implementation
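
The NOTE in the example above suggests caching the sub reader list in the 
constructor. A minimal sketch of that idea follows, in plain Java so that it 
runs on its own. It is an assumption-laden illustration, not Lucene code: 
ReaderBaseCache and baseOf are hypothetical names, the type parameter R stands 
in for IndexReader, and the maxDocOf function stands in for a call to 
r.maxDoc(). Readers are compared by identity, as in the original loop.

```java
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical sketch; R stands in for IndexReader. Not Lucene API. */
public class ReaderBaseCache<R> {

    // Identity-based map, mirroring the "r == reader" comparison in the workaround.
    private final Map<R, Integer> baseByReader = new IdentityHashMap<>();

    /** Computes every sub reader's doc base once, up front. */
    public ReaderBaseCache(R topLevel, List<R> subReaders, ToIntFunction<R> maxDocOf) {
        baseByReader.put(topLevel, 0);          // the top-level reader has base 0
        int base = 0;
        for (R r : subReaders) {
            baseByReader.put(r, base);
            base += maxDocOf.applyAsInt(r);     // stand-in for r.maxDoc()
        }
    }

    /** Returns the cached doc base, or -1 if the reader is not in this searcher. */
    public int baseOf(R reader) {
        Integer base = baseByReader.get(reader);
        return base == null ? -1 : base;
    }

    public static void main(String[] args) {
        // Arrays act as stand-in "readers"; their length plays the role of maxDoc.
        int[] seg0 = new int[10];
        int[] seg1 = new int[5];
        int[] top = new int[15];
        ReaderBaseCache<int[]> cache =
            new ReaderBaseCache<>(top, Arrays.asList(seg0, seg1), a -> a.length);
        System.out.println(cache.baseOf(seg1));       // 10
        System.out.println(cache.baseOf(new int[5])); // -1
    }
}
```

Each lookup is then a single map access instead of a walk over all sub 
readers, at the cost of the cache going stale if the underlying reader is 
reopened.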
