Re: Best Practices for getting Strings from a position range

Peter Keegan Fri, 10 Aug 2007 08:04:30 -0700

Grant,

I'm afraid I don't understand how to use this mapper in the context of a
SpanQuery. It seems like I would have to modify SpanScorer to fetch payload
data and provide a new method to access the payloads while iterating through
the documents. If this can be accomplished without modifying Spans, could
you provide a bit more detail?


Thanks,
Peter

On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> Hi Grant,
>
> I'm hoping to check this out soon.
>
> Thanks,
> Peter
>
> On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> >
> > Hi Peter,
> >
> > Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
> > provides a TermVectorMapper that loads by position.
> >
> > Still not ideally what you want, but I haven't had time to scope
> > that one out yet.
> >
> > -Grant
> >
> > On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >
> > > Hi Grant,
> > >
> > > No problem - I know you are very busy.  I just wanted to get a
> > > sense for the
> > > timing because I'd like to use this for a release this Fall. If I
> > > can get a
> > > prototype working in the coming weeks AND the performance is
> > > great :) , this
> > > would be terrific. If not, I'll have to fall back on a more complex
> > > design
> > > that handles the query outside of Lucene :(
> > >
> > > In the meantime, I'll try playing with LUCENE-868.
> > >
> > > Thanks for the update.
> > > Peter
> > >
> > > On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
> > >>
> > >> Sorry, Peter, I haven't had a chance to work on it.  I don't see it
> > >> happening this week, but maybe next.
> > >>
> > >> I do think the Mapper approach via TermVectors will work.  It will
> > >> require implementing a new mapper that orders by position, but I
> > >> don't think that is too hard.   I started on one on the LUCENE-868
> > >> patch (version 4) but it is not complete.  Maybe you want to pick
> > >> it up?
> > >>
> > >> With this approach, you would iterate your spans, when you come to a
> > >> new doc, you would load the term vector using the PositionMapper, and
> > >> then you could index into the positions for the matches in the
> > >> document.
> > >>
> > >> I realize this does not cover the just wanting to get the Payload at
> > >> the match issue.  Maybe next week...
> > >>
> > >> Cheers,
> > >> Grant
> > >>
> > >> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> > >>
> > >> > Any idea on when this might be available (days, weeks...)?
> > >> >
> > >> > Peter
> > >> >
> > >> > On 7/16/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote:
> > >> >>
> > >> >>
> > >> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> > >> >>
> > >> >> >
> > >> >> > : Do we have a best practice for going from, say a SpanQuery
> > >> doc/
> > >> >> > : position information and retrieving the actual range of
> > >> >> positions of
> > >> >> > : content from the Document?  Is it just to reanalyze the
> > >> Document
> > >> >> > : using the appropriate Analyzer and start recording once you
> > >> >> hit the
> > >> >> > : positions you are interested in?    Seems like Term Vectors
> > >> >> _could_
> > >> >> > : help, but even my new Mapper approach patch (LUCENE-868)
> > >> doesn't
> > >> >> > : really help, because they are stored in a term-centric
> > >> manner.  I
> > >> >> > : guess what I am after is a position centric approach.  That
> > >> >> is, give
> > >> >> >
> > >> >> > this is kind of what i was suggesting in the last message i sent
> > >> >> > to the java-user thread about payloads and SpanQueries (which i'm
> > >> >> > guessing is what prompted this thread as well)...
> > >> >> >
> > >> >> > http://www.nabble.com/Payloads-and-PhraseQuery-
> > >> >> > tf3988826.html#a11551628
> > >> >>
> > >> >>
> > >> >> This is one use case, the other is related to the new patch I
> > >> >> submitted for LUCENE-960.  In this case, I have a SpanQueryFilter
> > >> >> that identifies a bunch of docs and positions ahead of time.  Then
> > >> >> the user enters a new Span Query and I want to relate the matches
> > >> >> from the user query with the positions of matches in the filter
> > >> >> and then show that window.
> > >> >>
> > >> >> >
> > >> >> > my point was that currently, to retrieve a payload you need a
> > >> >> > TermPositions instance, which is designed for iterating in the
> > >> >> > order of...
> > >> >> >     seek(term)
> > >> >> >       skipTo(doc)
> > >> >> >          nextPosition()
> > >> >> >             getPayload()
> > >> >> > ...which is great for getting the payload of every instance
> > >> >> > (ie:position) of a specific term in a given document (or in
> > >> every
> > >> >> > document) but without serious changes to the Spans API, the
> > >> ideal
> > >> >> > payload
> > >> >> > API would let you say...
> > >> >> >     skipTo(doc)
> > >> >> >        advance(startPosition)
> > >> >> >          getPayload()
> > >> >> >        while (nextPosition() < endPosition)
> > >> >> >          getPosition()
> > >> >> >
> > >> >> > but this seems like a nearly impossible API to implement given
> > >> >> > the nature of the inverted index and the fact that terms aren't
> > >> >> > ever stored in position order.
> > >> >> >
> > >> >> > there's a lot i really don't know/understand about the lucene
> > >> >> > term position internals ... but as i recall, the datastructure
> > >> >> > written to disk isn't actually a tree structure inverted index,
> > >> >> > it's a long sequence of tuples, correct?  so in theory you could
> > >> >> > scan along the tuples until you find the doc you are interested
> > >> >> > in, ignoring all of the term info along the way, then whatever
> > >> >> > term you happen to be on at the moment, you could scan along all
> > >> >> > of the positions until you find one in the range you are
> > >> >> > interested in -- assuming you do, then you record the current
> > >> >> > Term (and read your payload data if interested)
> > >> >>
> > >> >> I think the main issue in both the payloads and the matching case
> > >> >> above is that they require a document centric approach.  Then, for
> > >> >> each Document, you ideally want to be able to just index into an
> > >> >> array so that you can go directly to the position that is needed
> > >> >> based on Span.getStart()
> > >> >> >
> > >> >> > if i remember correctly, the first part of this is easy, and
> > >> >> > relatively fast -- i think skipTo(doc) on a TermDocs or
> > >> >> > TermPositions will happily scan for the first <term,doc> pair
> > >> >> > with the correct docId, regardless of the term ... the only
> > >> >> > thing i'm not sure about is how efficient it is to loop over
> > >> >> > nextPosition() for every term you find to see if any of them
> > >> >> > are in your range ... the best case scenario is that the first
> > >> >> > position returned is above the high end of your range, in which
> > >> >> > case you can stop immediately and seek to the next term -- but
> > >> >> > the worst case is that you call nextPosition() over and over a
> > >> >> > lot of times before you get a position in (or above) your range
> > >> >> > .... an advancePosition(pos) that worked like seek or skipTo
> > >> >> > might be helpful here.
> > >> >> >
> > >> >> > : I feel like I am missing something obvious.  I would
> > >> suspect the
> > >> >> > : highlighter needs to do this, but it seems to take the
> > >> reanalyze
> > >> >> > : approach as well (I admit, though, that I have little
> > >> experience
> > >> >> > with
> > >> >> > : the highlighter.)
> > >> >> >
> > >> >> > as i understand it the default case is to reanalyze, but if you
> > >> >> > have TermFreqVector info stored with positions (ie: a
> > >> >> > TermPositionVector) then it can use that to construct a
> > >> >> > TokenStream by iterating over all terms and writing them into a
> > >> >> > big array in position order (see the TokenSources class in the
> > >> >> > highlighter)
> > >> >>
> > >> >>
> > >> >> Ah, I see that now.  Thanks.
> > >> >> >
> > >> >> > this makes sense when highlighting because it doesn't know what
> > >> >> > kind of fragmenter is going to be used so it needs the whole
> > >> >> > TokenStream, but it seems less than ideal when you are only
> > >> >> > interested in a small number of position ranges that you know
> > >> >> > in advance.
> > >> >> >
> > >> >> > : I am wondering if it would be useful to have an alternative
> > >> Term
> > >> >> > : Vector storage mechanism that was position centric.
> > >> Because we
> > >> >> > : couldn't take advantage of the lexicographic compression, it
> > >> >> would
> > >> >> > : take up more disk space, but it would be a lot faster for
> > >> these
> > >> >> > kinds
> > >> >> >
> > >> >> > i'm not sure if it's really necessary to store the data in a
> > >> >> > position centric manner, assuming we have a way to "seek" by
> > >> >> > position like i described above -- but then again i don't really
> > >> >> > know that what i described above is all that
> > >> >> > possible/practical/performant.
> > >> >> >
> > >> >>
> > >> >> I suppose I could use my Mapper approach to organize things in a
> > >> >> position centric way now that I think about it more.  Just
> > >> means some
> > >> >> unpacking and repacking.  Still, probably would perform well
> > >> enough
> > >> >> since I can setup the correct structure on the fly.  I will
> > >> give this
> > >> >> a try.  Maybe even add a Mapper to do this.
> > >> >>
> > >> >>
> > >> >> -Grant
> > >> >>
> > >> >>
> > >> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >> >> For additional commands, e-mail: [EMAIL PROTECTED]
> > >> >>
> > >> >>
> > >>
> > >> ------------------------------------------------------
> > >> Grant Ingersoll
> > >> http://www.grantingersoll.com/
> > >> http://lucene.grantingersoll.com
> > >> http://www.paperoftheweek.com/
> > >>
> > >>
> > >>
> > >>
> > >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ
> >
> >
> >
> >
> >
>
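
Editorial sketch of the position-centric idea discussed in the thread above: take term-centric data (each term with its list of positions, as a term vector with positions provides) and invert it into an array indexed by position, so that a span's start and end can be used to index directly into it. This is only an illustration in plain Java, not the LUCENE-868 or LUCENE-975 code. The class PositionOrderedTerms and its methods are hypothetical names, the input map stands in for whatever a real mapper would receive, and the end position is assumed to be exclusive.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: inverts a term-centric view
// (term -> positions) into a position-centric array (position -> term).
class PositionOrderedTerms {

    // Build an array indexed by token position. Positions that carry no
    // term (for example, gaps left by removed stop words) stay null.
    // If several terms share a position, the last one written wins;
    // a fuller version would keep a list per position.
    static String[] invert(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        String[] byPosition = new String[max + 1];
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                byPosition[p] = e.getKey();
            }
        }
        return byPosition;
    }

    // Return the terms in [start, end); end is assumed to be exclusive.
    static String[] window(String[] byPosition, int start, int end) {
        int from = Math.max(0, start);
        int to = Math.min(byPosition.length, end);
        if (from >= to) {
            return new String[0];
        }
        return Arrays.copyOfRange(byPosition, from, to);
    }

    public static void main(String[] args) {
        // Term-centric input, in no particular position order.
        Map<String, int[]> termVector = new LinkedHashMap<>();
        termVector.put("brown", new int[] {2});
        termVector.put("fox", new int[] {3});
        termVector.put("quick", new int[] {1});
        termVector.put("the", new int[] {0});

        String[] byPosition = invert(termVector);
        // Terms covered by a match spanning positions 1 (inclusive) to 3 (exclusive).
        System.out.println(Arrays.toString(window(byPosition, 1, 3)));
    }
}
```

The inversion costs one pass over all positions in the document, which matches the "unpacking and repacking" trade-off mentioned in the thread: extra work per document in exchange for direct lookup by position afterwards.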

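Editorial sketch of the advancePosition(pos) idea raised in the thread above: if the positions of one term within one document are available as a sorted in-memory array, a binary search can jump to the first position at or above the start of the range instead of calling nextPosition() repeatedly. This is an assumption-laden illustration only. PositionSeek, advance and occursInRange are hypothetical names, and the sketch says nothing about how feasible such a seek is against Lucene's on-disk position data.

```java
import java.util.Arrays;

// Hypothetical helper, not a Lucene API: seeks within the sorted,
// duplicate-free positions of a single term in a single document.
class PositionSeek {

    // Index of the first position >= target, or positions.length if none.
    static int advance(int[] sortedPositions, int target) {
        int i = Arrays.binarySearch(sortedPositions, target);
        // A negative result encodes the insertion point as -(point) - 1.
        return i >= 0 ? i : -(i + 1);
    }

    // True if the term occurs at some position in [start, end).
    static boolean occursInRange(int[] sortedPositions, int start, int end) {
        int i = advance(sortedPositions, start);
        return i < sortedPositions.length && sortedPositions[i] < end;
    }

    public static void main(String[] args) {
        int[] positions = {1, 5, 9, 20};
        System.out.println(occursInRange(positions, 6, 10));
        System.out.println(occursInRange(positions, 10, 20));
    }
}
```

With this shape, the best case from the thread (the first candidate position is already past the end of the range) and the worst case (many positions below the range) both cost a single logarithmic search per term.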