Re: Best Practices for getting Strings from a position range

Grant Ingersoll Fri, 10 Aug 2007 10:40:29 -0700

Sorry for the confusion. I thought you just wanted access to the term info per position. I think we will have to add something to the Spans like we talked about before.


-Grant


On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:

Grant,

I'm afraid I don't understand how to use this mapper in the context of a SpanQuery. It seems like I would have to modify SpanScorer to fetch payload data and provide a new method to access the payloads while iterating through the documents. If this can be accomplished without modifying Spans, could you provide a bit more detail?

Thanks,
Peter

On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:


Hi Grant,

I'm hoping to check this out soon.

Thanks,
Peter

On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:


Hi Peter,

Give https://issues.apache.org/jira/browse/LUCENE-975 a try.  It
provides a TermVectorMapper that loads by position.

Still not ideally what you want, but I haven't had time to scope
that one out yet.

-Grant

On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:

Hi Grant,

No problem - I know you are very busy.  I just wanted to get a sense
for the timing because I'd like to use this for a release this Fall.
If I can get a prototype working in the coming weeks AND the
performance is great :) , this would be terrific. If not, I'll have to
fall back on a more complex design that handles the query outside of
Lucene :(

In the meantime, I'll try playing with LUCENE-868.

Thanks for the update.
Peter

On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:

Sorry, Peter, I haven't had a chance to work on it. I don't see it
happening this week, but maybe next.

I do think the Mapper approach via TermVectors will work.  It will
require implementing a new mapper that orders by position, but I
don't think that is too hard.   I started on one on the LUCENE-868
patch (version 4) but it is not complete.  Maybe you want to pick
it up?

With this approach, you would iterate your spans; when you come to a
new doc, you would load the term vector using the PositionMapper, and
then you could index into the positions for the matches in the
document.
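
A minimal sketch of the position-ordered idea described above, in plain Java with no Lucene dependency. The class and method names (PositionOrderedTerms, byPosition, window) are hypothetical and are not taken from the LUCENE-868 or LUCENE-975 patches. It assumes the term vector for one document is available as a term -> positions map, and that a span match is reported as a half-open range [start, end).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a position-ordered mapper; not the actual patch code. */
public class PositionOrderedTerms {

    /**
     * Inverts a term-centric view (term -> positions) into an array indexed by
     * position. If several terms share a position (e.g. synonyms), the one
     * visited last wins; a real mapper would need a list per slot.
     */
    public static String[] byPosition(Map<String, int[]> termVector) {
        int max = -1;
        for (int[] positions : termVector.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        String[] slots = new String[max + 1]; // gaps (e.g. removed stop words) stay null
        for (Map.Entry<String, int[]> e : termVector.entrySet()) {
            for (int p : e.getValue()) {
                slots[p] = e.getKey();
            }
        }
        return slots;
    }

    /** Returns the terms found at positions in [start, end), skipping empty slots. */
    public static List<String> window(String[] slots, int start, int end) {
        List<String> out = new ArrayList<>();
        int from = Math.max(start, 0);
        int to = Math.min(end, slots.length);
        for (int p = from; p < to; p++) {
            if (slots[p] != null) {
                out.add(slots[p]);
            }
        }
        return out;
    }
}
```

The array would be built once per document while iterating the spans, and then each match in that document becomes a direct index lookup via window(slots, start, end).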

I realize this does not cover the issue of just wanting to get the
Payload at the match.  Maybe next week...

Cheers,
Grant

On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:

Any idea on when this might be available (days, weeks...)?

Peter

On 7/16/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote:



On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:


: Do we have a best practice for going from, say a SpanQuery doc/
: position information and retrieving the actual range of positions of
: content from the Document?  Is it just to reanalyze the Document
: using the appropriate Analyzer and start recording once you hit the
: positions you are interested in?    Seems like Term Vectors _could_
: help, but even my new Mapper approach patch (LUCENE-868) doesn't
: really help, because they are stored in a term-centric manner.  I
: guess what I am after is a position centric approach.  That is, give

this is kind of what i was suggesting in the last message i sent to
the java-user thread about payloads and SpanQueries (which i'm
guessing is what prompted this thread as well)...

http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628

This is one use case, the other is related to the new patch I
submitted for LUCENE-960. In this case, I have a SpanQueryFilter that
identifies a bunch of docs and positions ahead of time. Then the user
enters a new Span Query and I want to relate the matches from the user
query with the positions of matches in the filter and then show that
window.


my point was that currently, to retrieve a payload you need a
TermPositions instance, which is designed for iterating in the
order of...
    seek(term)
      skipTo(doc)
         nextPosition()
            getPayload()
...which is great for getting the payload of every instance (ie:
position) of a specific term in a given document (or in every
document) but without serious changes to the Spans API, the ideal
payload API would let you say...
    skipTo(doc)
       advance(startPosition)
         getPayload()
       while (nextPosition() < endPosition)
         getPosition()

but this seems like a nearly impossible API to implement given the
nature of the inverted index and the fact that terms aren't ever
stored in position order.

there's a lot i really don't know/understand about the lucene term
position internals ... but as i recall, the datastructure written to
disk isn't actually a tree structure inverted index, it's a long
sequence of tuples correct?  so in theory you could scan along the
tuples until you find the doc you are interested in, ignoring all of
the term info along the way, then whatever term you happen to be on at
the moment, you could scan along all of the positions until you find
one in the range you are interested in -- assuming you do, then you
record the current Term (and read your payload data if interested)


I think the main issue I see in both the payloads and the matching
case above is that they require a document centric approach. And
then, for each Document, you ideally want to be able to just index
into an array so that you can go directly to the position that is
needed based on Span.getStart()


if i remember correctly, the first part of this is easy, and
relatively fast -- i think skipTo(doc) on a TermDocs or TermPositions
will happily scan for the first <term,doc> pair with the correct
docId, regardless of the term ... the only thing i'm not sure about is
how efficient it is to loop over nextPosition() for every term you
find to see if any of them are in your range ... the best case
scenario is that the first position returned is above the high end of
your range, in which case you can stop immediately and seek to the
next term -- but the worst case is that you call nextPosition() over
and over a lot of times before you get a position in (or above) your
range .... an advancePosition(pos) that worked like seek or skipTo
might be helpful here.
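
The scan described above can be sketched in plain Java, under stated assumptions: the postings for a single document are modeled as an in-memory term -> sorted positions map rather than read through TermPositions, and advancePosition is a hypothetical helper (no such method exists in the Lucene API discussed here) implemented with a binary search. The names PositionRangeScan and termsInRange are likewise illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: models one document's postings as term -> sorted positions. */
public class PositionRangeScan {

    /**
     * Hypothetical advancePosition: index of the first position >= target in a
     * sorted array, or positions.length if every position is smaller.
     */
    public static int advancePosition(int[] positions, int target) {
        int i = Arrays.binarySearch(positions, target);
        return i >= 0 ? i : -(i + 1); // not found: convert to the insertion point
    }

    /**
     * Visits every term of the document and collects those with a position in
     * [start, end). Each term is abandoned as soon as a position >= end is seen,
     * which is the "best case" early exit; results come back in position order.
     */
    public static List<String> termsInRange(Map<String, int[]> docPostings, int start, int end) {
        TreeMap<Integer, String> hits = new TreeMap<>();
        for (Map.Entry<String, int[]> e : docPostings.entrySet()) {
            int[] positions = e.getValue();
            for (int i = advancePosition(positions, start);
                 i < positions.length && positions[i] < end; i++) {
                hits.put(positions[i], e.getKey());
            }
        }
        return new ArrayList<>(hits.values());
    }
}
```

Even with the skip, every term in the document is still visited once, so the cost grows with the number of distinct terms rather than with the width of the range.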

: I feel like I am missing something obvious.  I would suspect the
: highlighter needs to do this, but it seems to take the reanalyze
: approach as well (I admit, though, that I have little experience with
: the highlighter.)

as i understand it the default case is to reanalyze, but if you have
TermFreqVector info stored with positions (ie: a TermPositionVector)
then it can use that to construct a TokenStream by iterating over all
terms and writing them into a big array in position order (see the
TokenSources class in the highlighter)



Ah, I see that now.  Thanks.


this makes sense when highlighting because it doesn't know what kind
of fragmenter is going to be used so it needs the whole TokenStream,
but it seems less than ideal when you are only interested in a small
number of position ranges that you know in advance.

: I am wondering if it would be useful to have an alternative Term
: Vector storage mechanism that was position centric.  Because we
: couldn't take advantage of the lexicographic compression, it would
: take up more disk space, but it would be a lot faster for these kinds

i'm not sure if it's really necessary to store the data in a position
centric manner, assuming we have a way to "seek" by position like i
described above -- but then again i don't really know that what i
described above is all that possible/practical/performant.


I suppose I could use my Mapper approach to organize things in a
position centric way now that I think about it more.  Just means some
unpacking and repacking.  Still, probably would perform well enough
since I can setup the correct structure on the fly.  I will give this
a try.  Maybe even add a Mapper to do this.


-Grant

---------------------------------------------------------------------

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
