[ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485685#comment-13485685
 ] 

Simon Willnauer commented on LUCENE-2878:
-----------------------------------------

Thanks Mike for taking a look at this. It still has its rough edges, so every 
review is very valuable.

{quote}Instead of PostingFeatures.isProximityFeature, can we just use
X.compareTo(PostingsFeatures.POSITIONS) >= 0? (We do this for
IndexOptions).{quote}

Sure. I was actually thinking about this for a while, though. After a day of 
playing with different ways of doing it, I really asked myself why we have two 
different docs enums at all, and why not just one with a single set of features 
/ flags; that would make a lot of things easier. But that's a different 
discussion / progress over perfection.
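For reference, a minimal sketch of the compareTo-based check, mirroring the 
IndexOptions trick (PostingFeatures / POSITIONS are the names from the quote 
above; the helper name requiresPositions is purely illustrative):

{code:java}
// Illustrative only: relies on the declaration order of the PostingFeatures
// enum, the same trick trunk uses for FieldInfo.IndexOptions.
static boolean requiresPositions(PostingFeatures flags) {
  // true for POSITIONS and any "stronger" feature declared after it
  return flags.compareTo(PostingFeatures.POSITIONS) >= 0;
}
{code}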

{quote}Should we move PostingFeatures to its own source instead of hiding it
in Weight.java?{quote}

Along the same lines, sure, let's move it out.


{quote}Can we put back single imports (not wildcard, eg "import
org.apache.lucene.index.*")?{quote}

Yeah, I noticed that when I created the diff. I will go over it and bring the 
single imports back.

{quote}PostingFeatures is very similar to FieldInfo.IndexOptions (except the
latter does not cover payloads) ... would be nice if we could somehow
combine them ...{quote}

I agree, it would be nice to unify all of this. Let's open another issue; we 
have a good set of use cases now.

{quote} I'm confused on how one uses IntervalIterator along with the Scorer it
"belongs" to. Say I want to visit all Intervals for a given TermQuery
... do I first get the TermScorer and then call .intervals, up front?
And then call TermScorer.nextDoc(), but then how to iterate over all
intervals for that one document? EG, is the caller supposed to call
IntervalIterator.scorerAdvanced for each next'd doc?{quote}

My major goal here was to make this totally detached, optional and lazy, i.e. 
no additional code in the scorer except for IntervalIterator creation on demand. 
Once you have a scorer you can call intervals() and get an iterator. This 
instance can and should be reused while docs are collected / scored / matched 
on a given reader. For each doc whose intervals I need to iterate, I call 
scorerAdvanced and update the internal structures; this prevents any additional 
work if it is not really needed, e.g. on a complex query/scorer tree. Once the 
iterator is set up (scorerAdvanced has been called) you can just call next() on 
it in a loop --> while ((interval = iter.next()) != null) and get all the 
intervals. Makes sense?
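Roughly, the intended call pattern is the following sketch (method names follow 
this description; the exact signatures in the patch may differ slightly):

{code:java}
// Sketch of the pattern described above; signatures approximated.
IntervalIterator intervals = scorer.intervals(); // created once, reused on this reader

int doc;
while ((doc = scorer.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
  // lazily position the reused iterator on the current doc; no interval
  // work happens for docs where intervals are never requested
  intervals.scorerAdvanced(doc);
  Interval interval;
  while ((interval = intervals.next()) != null) {
    // consume the interval (e.g. its begin/end positions)
  }
}
{code}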

{quote}Or ... am I supposed to call .intervals() after each .nextDoc() (but
that looks rather costly/wasteful since it's a newly alloc'd
TermIntervalIterator each time).{quote}

No, that is not what you should do. I think the Scorer#intervals() javadoc 
makes this clear, no? If not, I should make it clearer!

{quote} I'm also confused why TermIntervalIterator.collect only collects one
interval (I think?). Shouldn't it collect all intervals for the
current doc?{quote}

The collect method is special. It's an interface that allows collecting the 
"current" interval, or all "current" intervals that contributed to a 
higher-level interval. For each next() call you should call collect if you need 
all the subtree's intervals or just the leaves. One use case where we do this 
right now is highlighting: you can highlight based on phrases, i.e. collect on 
a BQ, or on individual terms, i.e. collect the leaves. Makes sense?
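As an illustration only (the collector type and the exact collect signature are 
approximations of what's in the patch):

{code:java}
// Hedged sketch: after positioning the iterator on a doc, call collect()
// per next() if you want the contributing sub-intervals (or just the
// leaves, e.g. the individual terms of a phrase) pushed to a collector,
// e.g. for highlighting.
intervals.scorerAdvanced(doc);
Interval interval;
while ((interval = intervals.next()) != null) {
  intervals.collect(collector); // gathers the "current" intervals
}
{code}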
                
> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Positions Branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, gsoc2012, lucene-gsoc-11, lucene-gsoc-12, 
> mentor
>             Fix For: Positions Branch
>
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> LUCENE-2878-vs-trunk.patch, PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: the ones which can 
> make use of positions and payloads (mainly spans), and the rest. Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, so you cannot use a TermQuery 
> or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems have bugged me for a while 
> now, so I started working on this using the bulkpostings API. I would have done 
> that first cut on trunk, but there TermScorer works on a BlockReader that does 
> not expose positions, while the one in this branch does. I started adding a new 
> Positions class which users can pull from a scorer; to prevent unnecessary 
> positions enums I added ScorerContext#needsPositions and eventually 
> ScorerContext#needsPayloads to create the corresponding enum on demand. Yet, 
> currently only TermQuery / TermScorer implements this API and others simply 
> return null instead. 
> To show that the API really works, and that our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise, and it now :) works entirely with 
> positions, while Payloads for bulk reading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer (I truly hate spans since today), 
> including the ones that need Payloads (StandardCodec ONLY)!! I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails, but I didn't 
> look into the MemoryIndex BulkPostings API yet).

