[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Alan Woodward (JIRA) Fri, 12 Dec 2014 02:22:46 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243949#comment-14243949
 ]


Alan Woodward commented on LUCENE-2878:
---------------------------------------

The remaining nocommits are:
* implement nextPosition() etc on TermAutomatonScorer
** this should be fairly simple, and I'm working on it now
* implement nextPosition() etc on SpanScorer
** this is not so simple, as Spans.next() will move to the next document if 
it's exhausted the positions on the current doc, a completely different API.  
I'm tempted to just not implement these, seeing as they already don't work with 
'normal' queries.  A followup JIRA will remove Span queries entirely from trunk 
and deprecate them in 5.0 anyway.
* should PositionFilteredScorer enforce that all its subqueries should be on 
the same field?
** again, I'm tempted not to.  1) this means adding getField() to all queries, 
which is a far more invasive API change, and 2) we already have 
FieldMaskingSpanQuery to get around this limitation in SpanQueries, which 
suggests that it's not necessarily a good thing anyway.  On the other hand, 
most of the time you want phrases or proximity queries to be on the same field.
* how to deal with Payloads on intervals.
** At the moment we just return the payload at the starting position.  This is 
probably not what we want.  Spans has a different API, returning a 
Collection<byte[]> from getPayload rather than a BytesRef, which may be closer 
to what should happen.  I think this can be changed from a nocommit to a TODO, 
however, and dealt with in a separate issue.

I also want to have a look at the MultiTermQuery rewrite API, as at the moment 
it defaults to CONSTANT_SCORE_FILTER_REWRITE, which of course doesn't support 
positions.  It would be nice to have some way of telling the query whether 
positions are needed or not at rewrite time.  But again, that's for another 
issue.

So I think this is close.  I'll get TermAutomatonScorer sorted today.


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Robert Muir
>              Labels: gsoc2014
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878-vs-trunk.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the one which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do and at the end 
> of the day they are duplicating lot of code all over lucene. Span*Queries are 
> also limited to other Span*Query instances such that you can not use a 
> TermQuery or a BooleanQuery with SpanNear or anthing like that. 
> Beside of the Span*Query limitation other queries lacking a quiet interesting 
> feature since they can not score based on term proximity since scores doesn't 
> expose any positional information. All those problems bugged me for a while 
> now so I stared working on that using the bulkpostings API. I would have done 
> that first cut on trunk but TermScorer is working on BlockReader that do not 
> expose positions while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer, to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorere#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API and other simply 
> return null instead. 
> To show that the API really works and our BulkPostings work fine too with 
> positions I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice sideeffect of this was that the Position 
> BulkReading implementation got some exercise which now :) work all with 
> positions while Payloads for bulkreading are kind of experimental in the 
> patch and those only work with Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go one with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Reply via email to