[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Sokolov updated LUCENE-2878:
---------------------------------

    Attachment: LUCENE-2878.patch

I made some progress with a Collector-based positions API; my tests (and 
existing tests) are passing now.  I am getting all the right positions back for 
highlighting from a variety of queries, so I think the basic idea is workable. 
But there are a bunch of things I don't like still (and more queries/scorers to 
instrument); needs more work, for sure.

The basic idea is to extend Collector so it can accept positions.  Collectors 
simply return true from Collector.needsPositions() and implement 
collectPositions(Scorer, PositionIntervalIterator).  The way of hooking this up 
is the easiest possible for now; I added Scorer.setPositionCollector(Collector);
Collectors would call it in Collector.setScorer().  This approach forced one 
change that probably isn't good for performance, so we might want to revisit 
this, and "hook up" the Scorer.positionCollector during construction, say with 
the register() idea you mentioned.  The problem currently is that in 
DisjunctionSumScorer, there is some initialization that happens (advancing all 
the iterators) during the constructor, before setScorer() is ever called.  I've 
changed it to happen lazily, but this introduces an extra if(initialized) check 
into every call to advance() and nextDoc().

Another problem with this callback approach (from the performance perspective) 
is that you end up paying the cost to check whether there is a 
positionCollector registered in a fairly tight inner loop (while 
matching/scoring), I think, which you'd rather not do if there is no interest 
in positions. It would probably be better to introduce some kind of Scorer 
wrapper that calls collectPositions() and only gets created if needed. Another 
thought I had was to push the callback into PositionIntervalIterator, but 
somehow someone has to know to actually iterate over those iterators; if it's 
the Scorer, you have the same problem as now, but the Collector can't do it (in 
collect(), say) because it is often too late then; the Scorer will have 
advanced its internal state (and thus its positions() iterator) to the next 
document.

The core problem solved here is how to report positions that are not consumed 
during scoring, and also those that are,
but not to report positions that don't contribute to the query match (terms 
outside their phrase).  In order to do this, different Scorers do different 
things: the Boolean family of scorers (and wrapping scorers like 
ConstantScorer) simply pass any positionCollector down to their child Scorers, 
since they effectively want to report all positions found.  PositionTermScorer 
reports its positions, naturally.  The interesting case is 
PositionFilterScorer, which filters its child Scorers.  I added 
PositionIntervalIterator.getTermPositions() to enable this; this walks the tree 
of position iterators and returns a snapshot of their current state (as another 
iterator) so the consumer can retrieve all the term positions as filtered by 
intermediate iterators without advancing them.

There are a few (11) failing tests with this branch+patch (ran lucene tests 
only), but they seem unrelated (TestFlushByRamOrCountsPolicy has 5, eg) I am 
ignoring?

Simon - thanks for setting up the positions branch - I would really appreciate 
it if you have the time to review this because I think there is headway, but 
I'm also sure there is lots of room for improvement.


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the one which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do and at the end 
> of the day they are duplicating lot of code all over lucene. Span*Queries are 
> also limited to other Span*Query instances such that you can not use a 
> TermQuery or a BooleanQuery with SpanNear or anthing like that. 
> Beside of the Span*Query limitation other queries lacking a quiet interesting 
> feature since they can not score based on term proximity since scores doesn't 
> expose any positional information. All those problems bugged me for a while 
> now so I stared working on that using the bulkpostings API. I would have done 
> that first cut on trunk but TermScorer is working on BlockReader that do not 
> expose positions while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer, to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> Scorere#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API and other simply 
> return null instead. 
> To show that the API really works and our BulkPostings work fine too with 
> positions I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice sideeffect of this was that the Position 
> BulkReading implementation got some exercise which now :) work all with 
> positions while Payloads for bulkreading are kind of experimental in the 
> patch and those only work with Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go one with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to