[jira] Commented: (LUCENE-1821) Weight.scorer() not passed doc offset for "sub reader"

Tim Smith (JIRA) Tue, 18 Aug 2009 17:12:38 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744793#action_12744793
 ]


Tim Smith commented on LUCENE-1821:
-----------------------------------

My current plan of attack for this use case is to:
* pull the cache using the MultiReader at createWeight() time (the index into the 
cache will be the MultiReader docid)
* pull the base offset for the IndexReader at scorer() creation time (I will need 
to add a getIndexReaderBase() method to my searcher to do so)
* when the scorer needs to hit the cache, add the base to the scorer's 
docid to get the key for the cache lookup

I should be able to do this easily enough with a customized IndexSearcher 
subclass.
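
A minimal sketch of the three steps above, in plain Java with no Lucene 
dependency. The names BaseOffsetSketch, docBases and SegmentScorer are 
hypothetical stand-ins, and the float[] cache is an assumed shape for a cache 
keyed by MultiReader docid; a real version would live in the IndexSearcher 
subclass and Scorer mentioned above.

```java
final class BaseOffsetSketch {

    /** Starting top-level docid of each segment, derived from per-segment maxDoc values. */
    static int[] docBases(int[] segmentMaxDocs) {
        int[] bases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            bases[i] = base;            // the first doc of segment i maps to this top-level docid
            base += segmentMaxDocs[i];
        }
        return bases;
    }

    /** Hypothetical per-segment scorer that reads a cache keyed by top-level docid. */
    static final class SegmentScorer {
        private final float[] topLevelCache; // pulled once, at createWeight() time
        private final int docBase;           // pulled at scorer() creation time

        SegmentScorer(float[] topLevelCache, int docBase) {
            this.topLevelCache = topLevelCache;
            this.docBase = docBase;
        }

        /** Adds the base to the segment-relative docid to get the cache key. */
        float cachedValue(int segmentDocId) {
            return topLevelCache[docBase + segmentDocId];
        }
    }

    public static void main(String[] args) {
        int[] bases = docBases(new int[] {3, 2, 4});
        float[] cache = {0f, 1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        SegmentScorer scorer = new SegmentScorer(cache, bases[1]);
        System.out.println(scorer.cachedValue(1)); // segment 1, doc 1 -> top-level doc 4
    }
}
```

The scorer only ever sees segment-relative docids, so the base has to be 
captured when the scorer is created rather than at lookup time.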

There are use cases where documents from one segment need to be aware of 
documents from other segments.
Sorting is one such use case (this is done only at the Collector level, where 
there are more hooks for the needed base offset handling).
Duplicate removal is another (only return the first document among 
docs sharing a field value).

Both of these use cases can be handled at the Collector level. However, 
duplicate removal could also be done at the Query level, so that it can be 
applied at any point in query matching.
Also, efficient duplicate removal on a String field would require the int[] 
ord index in order to reduce overall memory requirements.
Using the int[] ord index allows a BitSet to serve as the set that marks 
whether a document with a given value has already been encountered (it would 
need a HashSet<String> otherwise (ugh)).
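
A rough sketch of the BitSet idea, assuming an int[] that maps each docid to 
the ordinal of its field value, with ordinals assigned over the whole index 
so that equal values in different segments share one ordinal. The class and 
method names are hypothetical, and docs are visited in docid order purely for 
illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class OrdDedupSketch {

    /**
     * Returns the docids that are the first occurrence of their field value.
     * ords[doc] is the ordinal of the doc's value; one bit per unique value
     * replaces a HashSet<String> of the values themselves.
     */
    static List<Integer> firstDocPerValue(int[] ords, int numUniqueValues) {
        BitSet seen = new BitSet(numUniqueValues);
        List<Integer> kept = new ArrayList<>();
        for (int doc = 0; doc < ords.length; doc++) {
            int ord = ords[doc];
            if (!seen.get(ord)) {   // value not encountered yet: keep this doc
                seen.set(ord);
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int[] ords = {2, 0, 2, 1, 0};
        System.out.println(firstDocPerValue(ords, 3)); // prints [0, 1, 3]
    }
}
```

The memory cost of the "seen" set is one bit per unique value, which is the 
saving described above.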

My particular use case must be done at the query level in order to have full 
boolean query support and the ability to layer multiple queries with all 
combinations of AND/OR/NOT and any other query operators. Sadly, I have yet 
to come up with any way to create a cache at the per-segment level (without 
creating the cache at the MultiReader level).


> Weight.scorer() not passed doc offset for "sub reader"
> ------------------------------------------------------
>
>                 Key: LUCENE-1821
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1821
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Search
>    Affects Versions: 2.9
>            Reporter: Tim Smith
>
> Now that searching is done on a per-segment basis, there is no way for a 
> Scorer to know the "actual" doc id for the documents it matches (only the 
> relative doc offset into the segment).
> If using caches in your scorer that are based on the "entire" index (all 
> segments), there is now no way to index into them properly from inside a 
> Scorer, because the scorer is not passed the needed offset to calculate the 
> "real" docid.
> Suggest having the Weight.scorer() method also take an integer for the doc offset.
> The abstract Weight class should have a constructor that takes this offset, as 
> well as a method to get the offset.
> All Weights that have "sub" weights must pass this offset down to the created 
> "sub" weights.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
