[jira] Commented: (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Simon Willnauer (JIRA) Thu, 03 Feb 2011 10:28:53 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990217#comment-12990217
 ]


Simon Willnauer commented on LUCENE-2878:
-----------------------------------------

{quote}
How does/should scoring work? EG do the SpanQueries score
according to the details of which position intervals match?
{quote}

I haven't paid any attention to scoring yet. IMO scoring should be left to the 
query that uses the positions, so a higher-level query like a NearQuery 
could just put a custom scorer on top of a boolean conjunction and apply its 
own proximity-based score. This should be done once we have the infrastructure 
to do it. I think that opens up some nice scoring improvements. I am not sure 
whether we should add proximity scoring to existing queries; I would rather aim 
at making it easy to customize.
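
To make the "custom scorer on top of a conjunction" idea a bit more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: ProximityScoreSketch, minWindow and score are hypothetical names that are not part of Lucene or of the attached patch, positions are passed as plain sorted int arrays (one per term) rather than pulled from a Scorer, and the boost formula is arbitrary.

```java
/** Hypothetical helper for illustration only; not part of Lucene or the patch. */
final class ProximityScoreSketch {

  /**
   * Width of the smallest window that contains one position from every term,
   * or -1 if there are no terms or some term has no positions.
   * Each inner array must be sorted in ascending order.
   */
  static int minWindow(int[][] positionsPerTerm) {
    int n = positionsPerTerm.length;
    if (n == 0) {
      return -1;
    }
    int[] idx = new int[n]; // current index into each term's positions
    int best = Integer.MAX_VALUE;
    while (true) {
      int min = Integer.MAX_VALUE;
      int max = Integer.MIN_VALUE;
      int minTerm = -1;
      for (int t = 0; t < n; t++) {
        if (idx[t] >= positionsPerTerm[t].length) {
          // One term is exhausted: no further window can cover all terms.
          return best == Integer.MAX_VALUE ? -1 : best;
        }
        int pos = positionsPerTerm[t][idx[t]];
        if (pos < min) {
          min = pos;
          minTerm = t;
        }
        if (pos > max) {
          max = pos;
        }
      }
      best = Math.min(best, max - min + 1);
      idx[minTerm]++; // only advancing the smallest position can shrink the window
    }
  }

  /** Boosts the conjunction's score; the formula is an arbitrary example. */
  static float score(float conjunctionScore, int[][] positionsPerTerm) {
    int width = minWindow(positionsPerTerm);
    if (width <= 0) {
      return conjunctionScore; // no positional match, keep the plain score
    }
    return conjunctionScore * (1f + 1f / width);
  }
}
```

A NearQuery-style scorer could call something like score(...) once per matching document; the point is only that the proximity logic lives in the higher-level query and not in the conjunction itself.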

{quote}
The part I'm wondering about is what API we should use for
communicating positions of the sub scorers in a BooleanQuery to
consumers like position filters (for matching) or eg Highlighter
(which really should be a core functionality that works w/ any query).
Multiplying out ("denormalizing") all combinations (into a flat stream
of PositionIntervals) is going to be too costly in general, I think?
{quote}

I thought about that for a while and I think we should enrich the 
PositionIntervalIterator API to enable the caller to pull the actual 
sub-intervals instead of a single interval from the next method. Something 
like this:

{code}
import java.io.IOException;
import java.io.Serializable;

public abstract class PositionIntervalIterator implements Serializable {

  /** Returns the next accepted interval. */
  public abstract PositionInterval next() throws IOException;

  /** Returns all sub-intervals for the next accepted interval. */
  public abstract PositionIntervalIterator nextSubIntervals() throws IOException;

  /** Returns the sub-iterators of this iterator. */
  public abstract PositionIntervalIterator[] subs(boolean inOrder);
}
{code}

so that if you are interested in the plain positions of each term, e.g. for 
highlighting, you can pull them per match occurrence. That way you have 
positional matching and you can still iterate the subs.
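
As a rough illustration of how a highlighter-style consumer might use such an API, here is a self-contained toy sketch. It is an assumption throughout: Interval, IntervalIterator, Leaf and Matches are simplified, hypothetical stand-ins (no IOException, no subs(boolean), matches precomputed in memory), and returning null on exhaustion is a convention chosen for this sketch, not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the proposed API; all names here are hypothetical. */
final class SubIntervalSketch {

  /** Simplified stand-in for PositionInterval (inclusive begin and end). */
  static final class Interval {
    final int begin;
    final int end;

    Interval(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }
  }

  /** Simplified stand-in for PositionIntervalIterator. */
  interface IntervalIterator {
    /** Next accepted interval, or null when exhausted. */
    Interval next();

    /** Sub-intervals of the next accepted interval, or null when exhausted. */
    IntervalIterator nextSubIntervals();
  }

  /** Leaf iterator: each interval is its own single sub-interval. */
  static final class Leaf implements IntervalIterator {
    private final List<Interval> intervals;
    private int upto = 0;

    Leaf(List<Interval> intervals) {
      this.intervals = intervals;
    }

    public Interval next() {
      return upto < intervals.size() ? intervals.get(upto++) : null;
    }

    public IntervalIterator nextSubIntervals() {
      Interval interval = next();
      if (interval == null) {
        return null;
      }
      List<Interval> single = new ArrayList<>();
      single.add(interval);
      return new Leaf(single);
    }
  }

  /** Composite iterator: every match is a group of per-term intervals. */
  static final class Matches implements IntervalIterator {
    private final List<List<Interval>> matches;
    private int upto = 0;

    Matches(List<List<Interval>> matches) {
      this.matches = matches;
    }

    public Interval next() {
      if (upto >= matches.size()) {
        return null;
      }
      // The accepted interval is the span enclosing all of the match's subs.
      int begin = Integer.MAX_VALUE;
      int end = Integer.MIN_VALUE;
      for (Interval sub : matches.get(upto++)) {
        begin = Math.min(begin, sub.begin);
        end = Math.max(end, sub.end);
      }
      return new Interval(begin, end);
    }

    public IntervalIterator nextSubIntervals() {
      return upto < matches.size() ? new Leaf(matches.get(upto++)) : null;
    }
  }

  /** Highlighter-style consumer: begin position of every term in every match. */
  static List<Integer> termPositions(IntervalIterator it) {
    List<Integer> out = new ArrayList<>();
    for (IntervalIterator subs = it.nextSubIntervals(); subs != null;
        subs = it.nextSubIntervals()) {
      for (Interval sub = subs.next(); sub != null; sub = subs.next()) {
        out.add(sub.begin);
      }
    }
    return out;
  }
}
```

A caller that only needs matching would keep calling next(), while a highlighter would call nextSubIntervals() and never has to see a flattened stream of all combinations.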

{quote}
Maybe, instead of the denormalized stream, we could present a
UnionPositionsIntervalIterator, which has multiple subs, where each
sub is its own PositionIntervalIterator? This way eg a NEAR query
could filter these subs in parallel (like a merge sort) looking for a
match, and (I think) then presenting its own union iterator to whoever
consumes it? Ie it'd only let through those positions of each sub
that satisfied the NEAR constraint.
{quote}

I don't get that entirely ;)


bq. Does it make sense that we could just want AttributeSources as we go here?
You mean that instead of extending Scorer we add an AttributeSource to it? I 
think this is really a core API and should be supported directly.

 

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>         Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch
>
>
> Currently we have two somewhat separate types of queries: those which can 
> make use of positions (mainly spans) and payloads (spans), and those which 
> cannot. Yet Span*Query doesn't really do scoring comparable to what other 
> queries do, and at the end of the day they duplicate a lot of code all over 
> Lucene. Span*Queries are also limited to other Span*Query instances, so you 
> cannot use a TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity because scorers don't 
> expose any positional information. All those problems have bugged me for a 
> while now, so I started working on that using the bulkpostings API. I would 
> have done that first cut on trunk, but TermScorer there works on a 
> BlockReader that does not expose positions while the one in this branch 
> does. I started adding a new 
> Positions class which users can pull from a scorer, to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorere#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API and others simply 
> return null instead. 
> To show that the API really works and that our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise and now :) works with 
> positions throughout, while Payloads for bulk reading are kind of experimental 
> in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
