[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Simon Willnauer (JIRA) Fri, 08 Jul 2011 02:45:50 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061872#comment-13061872
 ]


Simon Willnauer commented on LUCENE-2878:
-----------------------------------------

{quote}
I think I agree. The only possible trade-off that goes the other way is in the 
case where you have the positions available already during initial 
search/scoring, and there is not too much turnover in the TopDocs priority 
queue during hit collection. Then a Highlighter might save some time by not 
re-scoring and re-iterating the positions if it accumulated them up front (even 
for docs that were eventually dropped off the queue). I think it should be 
possible to test out both approaches given the right API here though?
{quote}

Yes, I think we should go and provide both possibilities here.

{quote}

The callback idea sounds appealing, but I still think we should also consider 
enabling the top-down approach: especially if this is going to run in two 
passes, why not let the highlighter drive the iteration? Keep in mind that 
positions consumers (like highlighters) may possibly be interested in more than 
just the lowest-level positions (they may want to see phrases, eg, and 
near-clauses - trying to avoid the s-word).
{quote}

I am not sure I understand this correctly. I think the collector should be 
some kind of visitor that walks down the query/scorer tree; each scorer 
can then ask whether it should pass its current positions to the collector, 
something like this: 
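{code}
class PositionCollector {

  /**
   * Called once per scorer while walking the query/scorer tree.
   * Returns true if the scorer should report its positions.
   */
  public boolean register(Scorer scorer) {
    if (interestedInScorer(scorer)) {
      // store info about the scorer
      return true;
    }
    return false;
  }

  /**
   * Called by a registered scorer for each position change.
   */
  public void nextPosition(Scorer scorer) {
    // collect positions for the current scorer
  }

  // Placeholder: decide which scorers this collector cares about.
  private boolean interestedInScorer(Scorer scorer) {
    return true;
  }
}
{code}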
That way the iteration process is still driven by the top-level consumer, but 
if you need information about intermediate positions you can collect them.
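
To make the register/notify flow concrete, here is a minimal self-contained sketch. It does not use the real Scorer API: PositionSource, FixedPositionSource, RecordingPositionCollector and drive() are hypothetical names invented for illustration, and the positions are hard-coded instead of being read from an index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a scorer that can report positions (not the Lucene API).
interface PositionSource {
    String name();
    int currentPosition();
}

// Collector that records positions only from the sources it cares about.
class RecordingPositionCollector {
    private final Set<String> wantedNames;
    private final Set<PositionSource> registered = new HashSet<>();
    private final List<String> events = new ArrayList<>();

    RecordingPositionCollector(Set<String> wantedNames) {
        this.wantedNames = wantedNames;
    }

    // Called once per source; returns true if the source should report positions.
    boolean register(PositionSource source) {
        if (wantedNames.contains(source.name())) {
            registered.add(source);
            return true;
        }
        return false;
    }

    // Called by a registered source for each position change.
    void nextPosition(PositionSource source) {
        if (registered.contains(source)) {
            events.add(source.name() + "@" + source.currentPosition());
        }
    }

    List<String> events() {
        return events;
    }
}

// Toy source that walks a fixed array of positions.
class FixedPositionSource implements PositionSource {
    private final String name;
    private final int[] positions;
    private int index = 0;

    FixedPositionSource(String name, int[] positions) {
        this.name = name;
        this.positions = positions;
    }

    public String name() {
        return name;
    }

    public int currentPosition() {
        return positions[index];
    }

    // Advances through all positions, notifying the collector only if it registered interest.
    void drive(RecordingPositionCollector collector) {
        boolean interested = collector.register(this);
        for (index = 0; index < positions.length; index++) {
            if (interested) {
                collector.nextPosition(this);
            }
        }
    }
}
```

In this sketch a collector created with the single wanted name "phrase" would ignore a source named "term" and record every position of a source named "phrase", so a consumer only pays for the intermediate positions it asked for.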

{quote}
Another consideration is ordering. I think that positions are retrieved from 
the index in document order. This could be a natural order for many cases, but 
score order will also be useful. I'm not sure whose responsibility the sorting 
should be. Highlighters will want to be able to optimize their work (esp for 
very large documents) by terminating after considering only the first N 
matches, where the ordering could either be score or document-order.
{quote}

So the order here depends on the first collector, I figure. The usual case is 
that you do your search and retrieve the top N documents (those are also the 
top N you want to highlight, right?), then you pass in your top N and do the 
highlighting collection based on those top N. In that collection you are not 
interested in all matches but only in the top N from the previous collection. 
The simplest, yet maybe not the best, way to do this is to use a simple filter 
that is built from the top N docs.
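
A rough sketch of such a filter, under the assumption that the first pass hands back plain integer doc IDs. TopDocsFilter, accept() and restrict() are hypothetical names used only for illustration; a real implementation would plug into whatever filter mechanism the search API provides.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical filter built from the doc IDs returned by the first (scoring) pass.
class TopDocsFilter {
    private final BitSet accepted = new BitSet();

    // topDocIds may arrive in score order; the bit set does not care about order.
    TopDocsFilter(int[] topDocIds) {
        for (int docId : topDocIds) {
            accepted.set(docId);
        }
    }

    // True if the document was among the top N of the first pass.
    boolean accept(int docId) {
        return accepted.get(docId);
    }

    // Second pass: walk the matches in document order and keep only the top-N docs.
    List<Integer> restrict(int[] matchingDocIdsInDocOrder) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : matchingDocIdsInDocOrder) {
            if (accept(docId)) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

The second pass still visits matches in document order, so any score ordering needed for display would have to be reapplied afterwards from the first pass's results.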

I will go ahead and create the branch now.


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, 
> PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: those which can 
> make use of positions (mainly spans) and payloads (spans), and those which 
> cannot. Yet Span*Query doesn't really do scoring comparable to what other 
> queries do, and at the end of the day they duplicate a lot of code all over 
> Lucene. Span*Queries are also limited to other Span*Query instances, such 
> that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything 
> like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems have bugged me for a 
> while now, so I started working on that using the bulkpostings API. I would 
> have done that first cut on trunk, but TermScorer there works on a 
> BlockReader that does not expose positions, while the one in this branch 
> does. I started adding a new Positions class which users can pull from a 
> scorer; to prevent unnecessary positions enums I added 
> ScorerContext#needsPositions and eventually Scorer#needsPayloads to create 
> the corresponding enum on demand. Yet, currently only TermQuery / 
> TermScorer implements this API and others simply return null instead. 
> To show that the API really works and that our BulkPostings work fine with 
> positions too, I cut over SpanTermQuery to use a TermScorer under the hood 
> and nuked TermSpans entirely. A nice side effect of this was that the 
> Position BulkReading implementation got some exercise, and it now :) all 
> works with positions, while Payloads for bulkreading are kind of 
> experimental in the patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API 
> and on this first cut before I go on with it. I will upload the 
> corresponding patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
> look into the MemoryIndex BulkPostings API yet).


        
