[ https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745941#action_12745941 ]
Tim Smith commented on LUCENE-1821:
-----------------------------------

I'm OK with having to jump through some hoops in order to get back to the "full index" context, but it would be nice if this were better facilitated by Lucene's API. (IMO, this would be best handled by adding a Searcher as the first arg to Weight.scorer(); then a Weight would not need to hold onto the Searcher itself, which breaks Serializable.)

There are definitely plenty of use cases that take advantage of the "whole" index (the one created by IndexWriter), so this ability should not be removed. I have at least three in my application alone, and they are all very important.

You get tradeoffs working per-segment vs. per-MultiReader when it comes to caching in general. Going per-segment means caches load faster and reload less frequently; however, it makes algorithms that work against those caches slower (depending on the algorithm and cache type):
* For static boosting from a field value (ValueSource), it makes no difference.
* For numeric sorting, it makes no difference.
* For string sorting, it makes a big difference: you now have to do a bunch of String.equals() calls where you didn't have to in 2.4 (which just used the ord index).

Given this, you should really be able to do string sorting two ways:
* using a per-segment field cache (commit time/first query faster, sort time slower)
* using a multi-reader field cache (commit time/first query slower, sort time faster)

The same argument applies to features like faceting (not provided by Lucene, but provided by applications like Solr and my own). Using a per-segment cache causes a significant performance loss when faceting, since the facets must be computed for each segment and then merged; this incurs a good deal of extra object overhead, memory overhead, and extra work that faceting on the MultiReader does not (a rough sketch of the difference follows at the end of this comment).

In the end, it should be up to the application developer to choose the strategy that works best for them and their application (fast commits/fast cache loading may take a back seat to fast query execution). In general, I find there is a tradeoff between commit time and query time: the more you speed up commit time, the slower query time gets, and vice versa. I just want/need the ability to choose.
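To make the faceting overhead concrete, here is a minimal, self-contained sketch (plain Java, not Lucene or Solr code; all names are invented for illustration) of the per-segment count-then-merge step versus a single pass over one multi-reader-wide value array:

{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: per-segment faceting counts each segment
// separately and then merges the partial counts, while a multi-reader
// cache allows a single counting pass with no merge step.
public class FacetMergeSketch {

    // Per-segment: one counting pass per segment, plus a merge pass that
    // allocates a local map per segment (the extra object/memory overhead
    // described above).
    static Map<String, Integer> perSegment(String[][] segmentValues) {
        Map<String, Integer> merged = new HashMap<String, Integer>();
        for (String[] segment : segmentValues) {
            Map<String, Integer> local = new HashMap<String, Integer>();
            for (String value : segment) {
                Integer c = local.get(value);
                local.put(value, c == null ? 1 : c + 1);
            }
            // Merge step: extra hash lookups and boxing per segment.
            for (Map.Entry<String, Integer> e : local.entrySet()) {
                Integer c = merged.get(e.getKey());
                merged.put(e.getKey(), c == null ? e.getValue() : c + e.getValue());
            }
        }
        return merged;
    }

    // Multi-reader: a single pass over one global value array, no merge.
    static Map<String, Integer> multiReader(String[] allValues) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String value : allValues) {
            Integer c = counts.get(value);
            counts.put(value, c == null ? 1 : c + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] segments = { { "a", "b", "a" }, { "b", "c" } };
        System.out.println(perSegment(segments));   // {a=2, b=2, c=1}
        System.out.println(multiReader(new String[] { "a", "b", "a", "b", "c" }));
    }
}
{code}

Both paths produce the same counts; the per-segment path simply does more allocation and hashing work per query, which is the tradeoff being weighed against faster cache loading.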

> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>             Fix For: 2.9
>
>         Attachments: LUCENE-1821.patch
>
>
> Now that searching is done on a per-segment basis, there is no way for a Scorer to know the "actual" doc id for the documents it matches (only the relative doc offset into the segment).
> If you use caches in your Scorer that are based on the "entire" index (all segments), there is now no way to index into them properly from inside a Scorer, because the Scorer is not passed the offset needed to calculate the "real" docid.
> I suggest having the Weight.scorer() method also take an integer for the doc offset.
> The abstract Weight class should have a constructor that takes this offset, as well as a method to get the offset.
> All Weights that have "sub" weights must pass this offset down to the "sub" weights they create.
> Details on the workaround:
> In order to work around this, you must do the following:
> * Subclass IndexSearcher
> * Add an "int getIndexReaderBase(IndexReader)" method to your subclass
> * During Weight creation, the Weight must hold onto a reference to the passed-in Searcher (cast to your subclass)
> * During Scorer creation, the Scorer must be passed the result of YourSearcher.getIndexReaderBase(reader)
> * The Scorer can now rebase any collected docids using this offset
> Example implementation of getIndexReaderBase():
> {code}
> // NOTE: a more efficient implementation is possible if you cache the
> // result of gatherSubReaders in your constructor
> public int getIndexReaderBase(IndexReader reader) {
>   if (reader == getReader()) {
>     return 0;
>   } else {
>     List readers = new ArrayList();
>     gatherSubReaders(readers);
>     Iterator iter = readers.iterator();
>     int maxDoc = 0;
>     while (iter.hasNext()) {
>       IndexReader r = (IndexReader) iter.next();
>       if (r == reader) {
>         return maxDoc;
>       }
>       maxDoc += r.maxDoc();
>     }
>   }
>   return -1; // reader not in searcher
> }
> {code}
> Notes:
> * This workaround means you cannot serialize your custom Weight implementation
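For completeness, here is a minimal sketch of how the pieces of the quoted workaround fit together. This is illustrative only: ReaderBaseSearcher stands in for the IndexSearcher subclass described above, globalCache is a hypothetical cache keyed by full-index docid, and neither class extends the real Weight/Scorer (whose 2.9 signatures carry additional parameters).

{code}
import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Stand-in for the IndexSearcher subclass from the workaround above.
interface ReaderBaseSearcher {
    int getIndexReaderBase(IndexReader reader);
}

// Sketch of a Weight that holds onto its searcher (this reference is
// what breaks Serializable) so its Scorer can rebase per-segment doc ids.
class RebasingWeight {
    private final ReaderBaseSearcher searcher;
    private final float[] globalCache; // hypothetical full-index cache

    RebasingWeight(ReaderBaseSearcher searcher, float[] globalCache) {
        this.searcher = searcher;
        this.globalCache = globalCache;
    }

    // Signature abbreviated relative to the real Weight.scorer().
    RebasingScorer scorer(IndexReader reader) throws IOException {
        // Recover this segment's offset into the full index.
        int docBase = searcher.getIndexReaderBase(reader);
        return new RebasingScorer(docBase, globalCache);
    }
}

// Sketch of a Scorer that rebases its per-segment doc id before
// indexing into a cache built over the entire index.
class RebasingScorer {
    private final int docBase;
    private final float[] globalCache;
    private int doc = -1; // current per-segment doc id (advanced elsewhere)

    RebasingScorer(int docBase, float[] globalCache) {
        this.docBase = docBase;
        this.globalCache = globalCache;
    }

    float score() {
        // docBase + doc yields the "real" docid in full-index coordinates.
        return globalCache[docBase + doc];
    }
}
{code}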