[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Mike Sokolov (JIRA) Tue, 28 Jun 2011 20:14:00 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056967#comment-13056967
 ]


Mike Sokolov commented on LUCENE-2878:
--------------------------------------

I've been fiddling with highlighter performance, and this looks like a great
step towards being able to do highlighting in an integrated way that doesn't
require a lot of post-hoc recalculation.  I worked up a hackish highlighter
that uses it as a proof of concept, partly just as a way of understanding what
you've done, but this could eventually become usable.

Here are a few comments:

I found it convenient to add:
{{boolean Collector.needsPositions() and needsPayloads()}}
and modified
{{IndexSearcher.search(AtomicReaderContext[] leaves, Weight weight, Filter 
filter, Collector collector)}}
to set up the ScorerContext accordingly.
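
To make the idea concrete, here is a minimal sketch of how a collector could
declare what it needs and how the searcher could turn that into a context.
All class names here (CollectorSketch, ScorerContextSketch, SearcherSketch,
HighlightingCollectorSketch) are hypothetical stand-ins, not the classes in
the patch; only needsPositions() and needsPayloads() come from the text above.

```java
// Hypothetical stand-in for ScorerContext: records what the consumer asked for.
final class ScorerContextSketch {
    final boolean needsPositions;
    final boolean needsPayloads;

    ScorerContextSketch(boolean needsPositions, boolean needsPayloads) {
        this.needsPositions = needsPositions;
        this.needsPayloads = needsPayloads;
    }
}

// Hypothetical stand-in for Collector with the two proposed methods.
// Both default to false so existing collectors pay nothing extra.
abstract class CollectorSketch {
    boolean needsPositions() { return false; }
    boolean needsPayloads() { return false; }
}

// A highlighter-style collector that wants positions but not payloads.
final class HighlightingCollectorSketch extends CollectorSketch {
    @Override
    boolean needsPositions() { return true; }
}

// Hypothetical stand-in for the step inside IndexSearcher.search(...):
// derive the context from the collector before creating scorers.
final class SearcherSketch {
    static ScorerContextSketch contextFor(CollectorSketch collector) {
        return new ScorerContextSketch(
            collector.needsPositions(), collector.needsPayloads());
    }
}
```

With this shape, a plain collector would get a context with both flags off,
so no positions or payloads enums would need to be created for it.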

And then I am accessing the scorer.positions() from Collector.collect(), which 
I think is a very natural use of this API?  At least it was intuitive for me, 
and I am pretty new to all this. 
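
Roughly, the usage I have in mind looks like the sketch below. The types
(PositionsSketch, PositionScorerSketch, ArrayPositions,
PositionCollectingCollector) are made up for illustration and are not the
patch's classes; the only thing taken from the patch is the idea of calling
scorer.positions() from inside collect().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for whatever scorer.positions() returns.
interface PositionsSketch {
    boolean next();   // advance to the next position; false when exhausted
    int position();   // the current term position
}

// Hypothetical stand-in for a Scorer that exposes positions.
interface PositionScorerSketch {
    PositionsSketch positions();
}

// Array-backed positions, present only to make the sketch runnable.
final class ArrayPositions implements PositionsSketch {
    private final int[] positions;
    private int idx = -1;

    ArrayPositions(int... positions) { this.positions = positions; }

    @Override
    public boolean next() {
        if (idx < positions.length) idx++;
        return idx < positions.length;
    }

    @Override
    public int position() { return positions[idx]; }
}

// Collector-like class that records {doc, position} pairs for each hit.
final class PositionCollectingCollector {
    private PositionScorerSketch scorer;
    final List<int[]> hits = new ArrayList<>();

    void setScorer(PositionScorerSketch scorer) { this.scorer = scorer; }

    void collect(int doc) {
        // Pull the positions for the document the scorer is currently on.
        PositionsSketch positions = scorer.positions();
        while (positions.next()) {
            hits.add(new int[] {doc, positions.position()});
        }
    }
}
```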

I think that when it comes to traversing the tree of 
PositionIntervalIterators, the API you propose above might have some issues.  
What would the status of the returned iterators be? Would they have to be 
copies of some sort in order to preserve the state of the iteration (so scoring 
isn't impacted by some other consumer of position intervals)?  The iterators 
that are currently in flight shouldn't be advanced by the caller usually 
(ever?), or else the state of the dependent iterator (the parent) won't be 
updated correctly, I think?  I wonder if (1) you could add 
{{PositionInterval PositionIntervalIterator.current()}} and (2) return from 
subs() and nextSubIntervals() some unmodifiable wrappers - maybe a superclass 
of PositionIntervalIterator that would only provide current() and subs(), but 
not allow advancing the iterator.
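
A rough sketch of that read-only idea follows. Everything in it is
hypothetical (IntervalSketch, IntervalView, IntervalIteratorSketch,
ReadOnlyView, TermIntervals, PairIntervals), and the pairing logic is a toy,
not real interval matching. The point is only that subs() hands out views
that can report current() but have no way to advance the underlying iterator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for PositionInterval.
final class IntervalSketch {
    final int begin, end;
    IntervalSketch(int begin, int end) { this.begin = begin; this.end = end; }
}

// Read-only superclass: inspect the current interval and the sub-views only.
abstract class IntervalView {
    abstract IntervalSketch current();   // null if not positioned
    abstract List<IntervalView> subs();
}

// The full iterator adds the ability to advance.
abstract class IntervalIteratorSketch extends IntervalView {
    abstract boolean next();
}

// Wrapper that exposes an iterator as a view without exposing next().
final class ReadOnlyView extends IntervalView {
    private final IntervalIteratorSketch delegate;
    ReadOnlyView(IntervalIteratorSketch delegate) { this.delegate = delegate; }
    @Override IntervalSketch current() { return delegate.current(); }
    @Override List<IntervalView> subs() { return delegate.subs(); }
}

// Leaf iterator over a fixed list of term positions.
final class TermIntervals extends IntervalIteratorSketch {
    private final int[] positions;
    private int idx = -1;
    TermIntervals(int... positions) { this.positions = positions; }

    @Override boolean next() {
        if (idx < positions.length) idx++;
        return idx < positions.length;
    }
    @Override IntervalSketch current() {
        boolean positioned = idx >= 0 && idx < positions.length;
        return positioned ? new IntervalSketch(positions[idx], positions[idx]) : null;
    }
    @Override List<IntervalView> subs() { return Collections.emptyList(); }
}

// Toy parent: advances both children in lockstep and spans their intervals.
final class PairIntervals extends IntervalIteratorSketch {
    private final IntervalIteratorSketch left, right;
    private boolean positioned;
    PairIntervals(IntervalIteratorSketch left, IntervalIteratorSketch right) {
        this.left = left;
        this.right = right;
    }

    @Override boolean next() {
        positioned = left.next() && right.next();
        return positioned;
    }
    @Override IntervalSketch current() {
        if (!positioned) return null;
        IntervalSketch l = left.current(), r = right.current();
        return new IntervalSketch(Math.min(l.begin, r.begin), Math.max(l.end, r.end));
    }
    @Override List<IntervalView> subs() {
        // Children are only handed out wrapped, so callers cannot advance them.
        return Collections.unmodifiableList(Arrays.<IntervalView>asList(
            new ReadOnlyView(left), new ReadOnlyView(right)));
    }
}
```

Only the parent ever calls next() on its children, so its own state stays
consistent with theirs, and no copies of the iterators are needed.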

I hope you'll be able to pick it up again soon, Simon! 

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch
>
>
> Currently we have two somewhat separate types of queries; one of them can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they are duplicating a lot of code all over Lucene. Span*Queries 
> are also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity since scorers don't 
> expose any positional information. All those problems have bugged me for a 
> while now, so I started working on that using the bulkpostings API. I would 
> have done that first cut on trunk, but TermScorer is working on BlockReader 
> that does not expose positions while the one in this branch does. I started 
> adding a new Positions class which users can pull from a scorer; to prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorer#needsPayloads to create the corresponding enum on demand. 
> Yet, currently only TermQuery / TermScorer implements this API and others 
> simply return null instead. 
> To show that the API really works and our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementation got some exercise, and it now :) works 
> with positions, while Payloads for bulkreading are kind of experimental in 
> the patch and only work with Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)

