Sorry for the confusion. I thought you just wanted access to the
term info per position. I think we will have to add something to
the Spans like we talked about before.
-Grant
On Aug 10, 2007, at 11:03 AM, Peter Keegan wrote:
Grant,
I'm afraid I don't understand how to use this mapper in the context
of a SpanQuery. It seems like I would have to modify SpanScorer to
fetch payload data and provide a new method to access the payloads
while iterating through the documents. If this can be accomplished
without modifying Spans, could you provide a bit more detail?
Thanks,
Peter
On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
Hi Grant,
I'm hoping to check this out soon.
Thanks,
Peter
On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
Hi Peter,
Give https://issues.apache.org/jira/browse/LUCENE-975 a try. It
provides a TermVectorMapper that loads by position.
Still not ideally what you want, but I haven't had time to scope
that one out yet.
-Grant
On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
Hi Grant,
No problem - I know you are very busy. I just wanted to get a sense
for the timing because I'd like to use this for a release this Fall.
If I can get a prototype working in the coming weeks AND the
performance is great :) , this would be terrific. If not, I'll have
to fall back on a more complex design that handles the query outside
of Lucene :(
In the meantime, I'll try playing with LUCENE-868.
Thanks for the update.
Peter
On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED] > wrote:
Sorry, Peter, I haven't had a chance to work on it. I don't see it
happening this week, but maybe next.
I do think the Mapper approach via TermVectors will work. It will
require implementing a new mapper that orders by position, but I
don't think that is too hard. I started on one on the LUCENE-868
patch (version 4) but it is not complete. Maybe you want to pick
it up?
With this approach, you would iterate your spans, when you come to a
new doc, you would load the term vector using the PositionMapper, and
then you could index into the positions for the matches in the
document.
I realize this does not cover the issue of just wanting to get the
Payload at the match. Maybe next week...
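A minimal sketch of the position-ordered idea described above, written
without any Lucene dependency so that it does not presume the final
mapper API in LUCENE-868 or LUCENE-975. The class name PositionIndex and
its window method are hypothetical. The input is a plain term-to-positions
map standing in for what a term vector with positions would provide for
one document, and the range is assumed to be half-open (start inclusive,
end exclusive), as with a span match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Lucene: re-orders term-centric data by position. */
public class PositionIndex {
    // terms[p] holds the term at position p, or null if nothing was indexed there.
    private final String[] terms;

    /** termPositions maps each term to the positions where it occurs in one document. */
    public PositionIndex(Map<String, int[]> termPositions) {
        int max = -1;
        for (int[] positions : termPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        terms = new String[max + 1];
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            for (int p : e.getValue()) {
                // If two terms share a position (e.g. synonyms), the last one written wins.
                terms[p] = e.getKey();
            }
        }
    }

    /** Returns the terms at positions in [start, end), in position order. */
    public List<String> window(int start, int end) {
        List<String> out = new ArrayList<>();
        int from = Math.max(0, start);
        int to = Math.min(end, terms.length);
        for (int p = from; p < to; p++) {
            if (terms[p] != null) {
                out.add(terms[p]);
            }
        }
        return out;
    }
}
```

With one such structure built per matching document, each span match
becomes a direct array lookup rather than a scan over every term.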
Cheers,
Grant
On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
Any idea on when this might be available (days, weeks...)?
Peter
On 7/16/07, Grant Ingersoll < [EMAIL PROTECTED]> wrote:
On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
: Do we have a best practice for going from, say a SpanQuery doc/
: position information and retrieving the actual range of positions of
: content from the Document? Is it just to reanalyze the Document
: using the appropriate Analyzer and start recording once you hit the
: positions you are interested in? Seems like Term Vectors _could_
: help, but even my new Mapper approach patch (LUCENE-868) doesn't
: really help, because they are stored in a term-centric manner. I
: guess what I am after is a position centric approach. That is, give
this is kind of what i was suggesting in the last message i sent
to the java-user thread about payloads and SpanQueries (which i'm
guessing is what prompted this thread as well)...
http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
This is one use case, the other is related to the new patch I
submitted for LUCENE-960. In this case, I have a SpanQueryFilter
that identifies a bunch of docs and positions ahead of time. Then
the user enters a new Span Query and I want to relate the matches
from the user query with the positions of matches in the filter and
then show that window.
my point was that currently, to retrieve a payload you need a
TermPositions instance, which is designed for iterating in the
order of...
seek(term)
skipTo(doc)
nextPosition()
getPayload()
...which is great for getting the payload of every instance
(ie:position) of a specific term in a given document (or in every
document) but without serious changes to the Spans API, the ideal
payload API would let you say...
skipTo(doc)
advance(startPosition)
getPayload()
while (nextPosition() < endPosition)
getPosition()
but this seems like a nearly impossible API to implement given the
nature of the inverted index and the fact that terms aren't ever
stored in position order.
there's a lot i really don't know/understand about the lucene term
position internals ... but as i recall, the datastructure written to
disk isn't actually a tree structure inverted index, it's a long
sequence of tuples correct? so in theory you could scan along the
tuples until you find the doc you are interested in, ignoring all of
the term info along the way, then whatever term you happen to be on
at the moment, you could scan along all of the positions until you
find one in the range you are interested in -- assuming you do, then
you record the current Term (and read your payload data if interested)
I think the main issue in both the payloads case and the matching
case above is that they require a document centric approach. And
then, for each Document, you ideally want to be able to just index
into an array so that you can go directly to the position that is
needed based on Span.getStart()
if i remember correctly, the first part of this is easy, and
relatively fast -- i think skipTo(doc) on a TermDoc or TermPositions
will happily scan for the first <term,doc> pair with the correct
docId, regardless of the term ... the only thing i'm not sure about
is how efficient it is to loop over nextPosition() for every term you
find to see if any of them are in your range ... the best case
scenario is that the first position returned is above the high end of
your range, in which case you can stop immediately and seek to the
next term -- but the worst case is that you call nextPosition() over
and over before you get a position in (or above) your range .... an
advancePosition(pos) that worked like seek or skipTo might be helpful
here.
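The scan described above can be sketched over plain in-memory data. This
is an illustration only, not Lucene code: PositionRangeScan, termsInRange
and this advancePosition are hypothetical names, and each term's positions
are assumed to sit in a sorted int array, which makes a binary-search
jump possible. Positions in the on-disk index are delta-encoded, so a
real advancePosition would probably need skip data or a linear decode
rather than the binary search used here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch, not Lucene code: find the terms of one document inside a position range. */
public class PositionRangeScan {

    /** Returns the first index i with sorted[i] >= target, or sorted.length if there is none. */
    public static int advancePosition(int[] sorted, int target) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Returns position -> term for every occurrence whose position lies in [start, end). */
    public static TreeMap<Integer, String> termsInRange(
            Map<String, int[]> termPositions, int start, int end) {
        TreeMap<Integer, String> hits = new TreeMap<>();
        for (Map.Entry<String, int[]> e : termPositions.entrySet()) {
            int[] positions = e.getValue(); // assumed sorted ascending
            // Best case: the first position is already past the range, so move to the next term.
            if (positions.length == 0 || positions[0] >= end) {
                continue;
            }
            // Jump to the first candidate instead of stepping through every earlier position.
            for (int i = advancePosition(positions, start);
                    i < positions.length && positions[i] < end; i++) {
                hits.put(positions[i], e.getKey());
            }
        }
        return hits;
    }
}
```

Even with the jump, every term of the document is still visited once, so
the cost grows with the number of distinct terms rather than with the
size of the range.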
: I feel like I am missing something obvious. I would suspect the
: highlighter needs to do this, but it seems to take the reanalyze
: approach as well (I admit, though, that I have little experience with
: the highlighter.)
as i understand it the default case is to reanalyze, but if you have
TermFreqVector info stored with positions (ie: a TermPositionVector)
then it can use that to construct a TokenStream by iterating over all
terms and writing them into a big array in position order (see the
TokenSources class in the highlighter)
Ah, I see that now. Thanks.
this makes sense when highlighting because it doesn't know what kind
of fragmenter is going to be used so it needs the whole TokenStream,
but it seems less than ideal when you are only interested in a small
number of position ranges that you know in advance.
: I am wondering if it would be useful to have an alternative Term
: Vector storage mechanism that was position centric. Because we
: couldn't take advantage of the lexicographic compression, it would
: take up more disk space, but it would be a lot faster for these kinds
i'm not sure if it's really necessary to store the data in a position
centric manner, assuming we have a way to "seek" by position like i
described above -- but then again i don't really know that what i
described above is all that possible/practical/performant.
I suppose I could use my Mapper approach to organize things in a
position centric way now that I think about it more. Just
means some
unpacking and repacking. Still, probably would perform well
enough
since I can setup the correct structure on the fly. I will
give this
a try. Maybe even add a Mapper to do this.
-Grant
------------------------------------------------------------------
---
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ