[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Simon Willnauer (JIRA) Thu, 07 Jul 2011 08:35:43 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061391#comment-13061391 ]


Simon Willnauer commented on LUCENE-2878:
-----------------------------------------

Hey Mike, I applied all your patches and walked through them; this looks great. 
The whole thing is still far from committable, but I think we should take it 
further and open a branch for it. I want to commit both your latest patch and 
the highlighter prototype and work from there.

{quote}So after working with this a bit more (and reading the paper), I see now 
that it's really not necessary to cache positions in the iterators. So never 
mind all that! In the end, for some uses like highlighting I think somebody 
needs to cache positions (I put it in a ScorePosDoc created by the 
PosCollector), but I agree that doesn't belong in the "lower level" 
iterator.{quote}

After looking into your patch I think I now understand what is needed to enable 
low-level stuff like highlighting. What is missing here is a positions 
collector interface that you can pass in and that collects positions on the 
lowest levels, e.g. for phrases or simple terms. The PositionIterator itself 
(btw. I think we should call it Positions or something along those lines - try 
not to introduce spans in the name :) ) should accept this collector and simply 
call back with each low-level position if needed. For highlighting I think we 
should also take a two-stage approach: the first stage does the matching (with 
or without positions) and the second stage takes the first stage's results and 
does the highlighting. That way we don't slow down the query, and the second 
stage can even choose a different rewrite method (for MTQ this is needed as we 
don't have positions on filters).
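
To make the callback idea concrete, here is a rough sketch of what I mean. All 
names below (PositionCollector, TermPositions, RecordingCollector) are made up 
for illustration and are not part of the patch or of any Lucene API; the 
"iterator" is just a toy over an int array. The point is only that the 
lowest-level iterator calls back for each position when a collector was passed 
in, and does nothing extra otherwise.

```java
import java.util.ArrayList;
import java.util.List;

/** Tiny driver: iterate one term's positions and let a collector record them. */
final class PositionCollectorDemo {
    static List<Integer> run() {
        RecordingCollector collector = new RecordingCollector();
        TermPositions positions =
            new TermPositions(7, "lucene", new int[] {3, 17, 42}, collector);
        while (positions.nextPosition() != -1) {
            // matching logic would go here; collection happens via the callback
        }
        return collector.positions;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

/** Hypothetical callback interface; an assumption, not an existing Lucene type. */
interface PositionCollector {
    void collect(int doc, String term, int position);
}

/** Toy stand-in for a lowest-level (single term) positions iterator. */
final class TermPositions {
    private final int doc;
    private final String term;
    private final int[] positions;
    private final PositionCollector collector; // null means nobody is listening
    private int next = 0;

    TermPositions(int doc, String term, int[] positions, PositionCollector collector) {
        this.doc = doc;
        this.term = term;
        this.positions = positions;
        this.collector = collector;
    }

    /** Advances to the next position; returns -1 when exhausted. */
    int nextPosition() {
        if (next >= positions.length) {
            return -1;
        }
        int pos = positions[next++];
        if (collector != null) {
            collector.collect(doc, term, pos); // call back only if requested
        }
        return pos;
    }
}

/** A second (highlighting) stage could remember positions with a collector like this. */
final class RecordingCollector implements PositionCollector {
    final List<Integer> positions = new ArrayList<>();

    @Override
    public void collect(int doc, String term, int position) {
        positions.add(position);
    }
}
```

With a null collector the iterator does no extra work, which is why plain 
matching in the first stage would not be slowed down.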

{quote}
As I'm learning more, I am beginning to see this is going to require sweeping 
updates. Basically everywhere we currently create a DocsEnum, we might now want 
to create a DocsAndPositionsEnum, and then the options (needs 
positions/payloads) have to be threaded through all the surrounding APIs. I 
wonder if it wouldn't make sense to encapsulate those options 
(needsPositions/needsPayloads) in some kind of EnumConfig object. Just in case, 
down the line, there is some other information that gets stored in the index, 
and wants to be made available during scoring, then the required change would 
be much less painful to implement.
{quote}

What do you mean by sweeping updates? For the enum config, I think we only have 
2 or 3 places where we need to make the decision: 1. TermScorer 2. PhraseScorer 
(maybe 2. goes away anyway), so I think this is not needed for now.
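
For reference, the kind of options object being suggested could be as small as 
the sketch below. This is purely hypothetical: EnumConfig and its methods are 
invented names, not something in the patch, and the strings returned by pick() 
only stand for the decision between a DocsEnum and a DocsAndPositionsEnum. 
Payloads are stored per position, so asking for payloads implies positions.

```java
/** Tiny driver showing the single decision point an options object would feed. */
final class EnumConfigDemo {
    /** Stand-in for the place (e.g. a scorer) that picks which enum to create. */
    static String pick(EnumConfig config) {
        return config.needsPositions() ? "DocsAndPositionsEnum" : "DocsEnum";
    }

    public static void main(String[] args) {
        System.out.println(pick(EnumConfig.DOCS_ONLY));
        System.out.println(pick(EnumConfig.DOCS_ONLY.withPayloads()));
    }
}

/** Hypothetical immutable holder for the needsPositions/needsPayloads options. */
final class EnumConfig {
    static final EnumConfig DOCS_ONLY = new EnumConfig(false, false);

    private final boolean needsPositions;
    private final boolean needsPayloads;

    private EnumConfig(boolean needsPositions, boolean needsPayloads) {
        this.needsPositions = needsPositions;
        this.needsPayloads = needsPayloads;
    }

    EnumConfig withPositions() {
        return new EnumConfig(true, needsPayloads);
    }

    /** Payloads hang off positions, so this turns positions on as well. */
    EnumConfig withPayloads() {
        return new EnumConfig(true, true);
    }

    boolean needsPositions() {
        return needsPositions;
    }

    boolean needsPayloads() {
        return needsPayloads;
    }
}
```

A new option would then mean one more field here rather than another parameter 
threaded through every API, which is the trade-off against the 2 or 3 call 
sites we have today.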
{quote}
I'm thinking for example (Robert M's idea), that it might be nice to have a 
positions->offsets map in the index (this would be better for highlighting than 
term vectors). Maybe this would just be part of payload, but maybe not? And it 
seems possible there could be other things like that we don't know about yet?
{quote}

yeah this would be awesome... next step :)



> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, 
> PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: those which can 
> make use of positions (mainly spans) and payloads (spans), and those which 
> cannot. Yet Span*Query doesn't really do scoring comparable to what other 
> queries do, and at the end of the day they duplicate a lot of code all over 
> Lucene. Span*Queries are also limited to other Span*Query instances, such 
> that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything 
> like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems have bugged me for a 
> while now, so I started working on this using the bulkpostings API. I would 
> have done that first cut on trunk, but TermScorer there works on a 
> BlockReader that does not expose positions, while the one in this branch 
> does. I started adding a new Positions class which users can pull from a 
> scorer. To prevent unnecessary positions enums I added 
> ScorerContext#needsPositions and eventually Scorer#needsPayloads to create 
> the corresponding enum on demand. Yet currently only TermQuery / TermScorer 
> implements this API; the others simply return null instead. 
> To show that the API really works and that our BulkPostings work fine with 
> positions too, I cut over SpanTermQuery to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementations got some exercise, and these now :) all 
> work with positions, while Payloads for bulk reading are kind of 
> experimental in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)
