Grant,

I'm afraid I don't understand how to use this mapper in the context of a SpanQuery. It seems like I would have to modify SpanScorer to fetch payload data and provide a new method to access the payloads while iterating through the documents. If this can be accomplished without modifying Spans, could you provide a bit more detail?
Thanks,
Peter

On 8/9/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> Hi Grant,
>
> I'm hoping to check this out soon.
>
> Thanks,
> Peter
>
> On 8/7/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> > Hi Peter,
> >
> > Give https://issues.apache.org/jira/browse/LUCENE-975 a try. It
> > provides a TermVectorMapper that loads by position.
> >
> > Still not ideally what you want, but I haven't had time to scope
> > that one out yet.
> >
> > -Grant
> >
> > On Jul 24, 2007, at 6:02 PM, Peter Keegan wrote:
> >
> > > Hi Grant,
> > >
> > > No problem - I know you are very busy. I just wanted to get a
> > > sense for the timing because I'd like to use this for a release
> > > this Fall. If I can get a prototype working in the coming weeks
> > > AND the performance is great :) , this would be terrific. If not,
> > > I'll have to fall back on a more complex design that handles the
> > > query outside of Lucene :(
> > >
> > > In the meantime, I'll try playing with LUCENE-868.
> > >
> > > Thanks for the update.
> > > Peter
> > >
> > > On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > >>
> > >> Sorry, Peter, I haven't had a chance to work on it. I don't see it
> > >> happening this week, but maybe next.
> > >>
> > >> I do think the Mapper approach via TermVectors will work. It will
> > >> require implementing a new mapper that orders by position, but I
> > >> don't think that is too hard. I started on one on the LUCENE-868
> > >> patch (version 4) but it is not complete. Maybe you want to pick
> > >> it up?
> > >>
> > >> With this approach, you would iterate your spans; when you come to a
> > >> new doc, you would load the term vector using the PositionMapper, and
> > >> then you could index into the positions for the matches in the
> > >> document.
> > >>
> > >> I realize this does not cover the case of just wanting to get the
> > >> Payload at the match. Maybe next week...
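The span-driven lookup described just above (iterate the spans, load a position-ordered view when a new doc comes up, then index into it) could be sketched roughly like this. It is only an illustration in plain Java: PositionView, SpanWindows, and the {doc, start, end} match triples are hypothetical stand-ins, not Lucene classes, and the term vector is modelled as a simple term-to-positions map. It assumes span ends are exclusive and that at most one term occupies each position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical position-ordered view of one document's term vector. */
final class PositionView {
    // termAt[p] is the term at position p, or null where there is a gap.
    private final String[] termAt;

    /** Repack a term-centric vector (term -> positions) into position order. */
    PositionView(Map<String, int[]> termToPositions) {
        int max = -1;
        for (int[] positions : termToPositions.values()) {
            for (int p : positions) {
                max = Math.max(max, p);
            }
        }
        termAt = new String[max + 1];
        for (Map.Entry<String, int[]> e : termToPositions.entrySet()) {
            for (int p : e.getValue()) {
                termAt[p] = e.getKey(); // simplification: last term wins on a shared position
            }
        }
    }

    /** Terms covered by a match; start inclusive, end exclusive (an assumption). */
    List<String> window(int start, int end) {
        List<String> out = new ArrayList<>();
        for (int p = Math.max(start, 0); p < Math.min(end, termAt.length); p++) {
            if (termAt[p] != null) {
                out.add(termAt[p]);
            }
        }
        return out;
    }
}

/** Hypothetical driver: walks matches in doc order and reuses the view per doc. */
final class SpanWindows {
    /** Each match is {doc, start, end}; vectors maps doc id to its term vector. */
    static List<List<String>> windows(int[][] matches,
                                      Map<Integer, Map<String, int[]>> vectors) {
        List<List<String>> out = new ArrayList<>();
        int currentDoc = -1;
        PositionView view = null;
        for (int[] m : matches) {
            if (view == null || m[0] != currentDoc) {
                // New doc: load and repack its term vector once.
                currentDoc = m[0];
                Map<String, int[]> tv = vectors.getOrDefault(
                        currentDoc, Collections.<String, int[]>emptyMap());
                view = new PositionView(tv);
            }
            out.add(view.window(m[1], m[2]));
        }
        return out;
    }
}
```

The repacking cost is paid once per matching document rather than once per match, which is the point of loading the view only when the doc changes.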
> > >>
> > >> Cheers,
> > >> Grant
> > >>
> > >> On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:
> > >>
> > >> > Any idea on when this might be available (days, weeks...)?
> > >> >
> > >> > Peter
> > >> >
> > >> > On 7/16/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > >> >>
> > >> >> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
> > >> >>
> > >> >> >
> > >> >> > : Do we have a best practice for going from, say a SpanQuery
> > >> >> > : doc/position information and retrieving the actual range of
> > >> >> > : positions of content from the Document? Is it just to reanalyze
> > >> >> > : the Document using the appropriate Analyzer and start recording
> > >> >> > : once you hit the positions you are interested in? Seems like
> > >> >> > : Term Vectors _could_ help, but even my new Mapper approach patch
> > >> >> > : (LUCENE-868) doesn't really help, because they are stored in a
> > >> >> > : term-centric manner. I guess what I am after is a position
> > >> >> > : centric approach. That is, give
> > >> >> >
> > >> >> > this is kind of what i was suggesting in the last message i sent
> > >> >> > to the java-user thread about payloads and SpanQueries (which i'm
> > >> >> > guessing is what prompted this thread as well)...
> > >> >> >
> > >> >> > http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
> > >> >>
> > >> >> This is one use case, the other is related to the new patch I
> > >> >> submitted for LUCENE-960. In this case, I have a SpanQueryFilter
> > >> >> that identifies a bunch of docs and positions ahead of time. Then
> > >> >> the user enters a new Span Query and I want to relate the matches
> > >> >> from the user query with the positions of matches in the filter
> > >> >> and then show that window.
> > >> >>
> > >> >> >
> > >> >> > my point was that currently, to retrieve a payload you need a
> > >> >> > TermPositions instance, which is designed for iterating in the
> > >> >> > order of...
> > >> >> >    seek(term)
> > >> >> >    skipTo(doc)
> > >> >> >    nextPosition()
> > >> >> >    getPayload()
> > >> >> > ...which is great for getting the payload of every instance
> > >> >> > (ie: position) of a specific term in a given document (or in
> > >> >> > every document) but without serious changes to the Spans API,
> > >> >> > the ideal payload API would let you say...
> > >> >> >    skipTo(doc)
> > >> >> >    advance(startPosition)
> > >> >> >    getPayload()
> > >> >> >    while (nextPosition() < endPosition)
> > >> >> >    getPosition()
> > >> >> >
> > >> >> > but this seems like a nearly impossible API to implement given
> > >> >> > the nature of the inverted index and the fact that terms aren't
> > >> >> > ever stored in position order.
> > >> >> >
> > >> >> > there's a lot i really don't know/understand about the lucene
> > >> >> > term position internals ... but as i recall, the datastructure
> > >> >> > written to disk isn't actually a tree structure inverted index,
> > >> >> > it's a long sequence of tuples correct? so in theory you could
> > >> >> > scan along the tuples until you find the doc you are interested
> > >> >> > in, ignoring all of the term info along the way, then whatever
> > >> >> > term you happen to be on at the moment, you could scan along all
> > >> >> > of the positions until you find one in the range you are
> > >> >> > interested in -- assuming you do, then you record the current
> > >> >> > Term (and read your payload data if interested)
> > >> >>
> > >> >> I think the main issue I see in both the payloads and the matching
> > >> >> case above is that they require a document centric approach.
> > >> >> And then, for each Document,
> > >> >> you ideally want to be able to just index into an array so that
> > >> >> you can go directly to the position that is needed based on
> > >> >> Span.getStart()
> > >> >>
> > >> >> >
> > >> >> > if i remember correctly, the first part of this is easy, and
> > >> >> > relatively fast -- i think skipTo(doc) on a TermDocs or
> > >> >> > TermPositions will happily scan for the first <term,doc> pair
> > >> >> > with the correct docId, regardless of the term ... the only
> > >> >> > thing i'm not sure about is how efficient it is to loop over
> > >> >> > nextPosition() for every term you find to see if any of them are
> > >> >> > in your range ... the best case scenario is that the first
> > >> >> > position returned is above the high end of your range, in which
> > >> >> > case you can stop immediately and seek to the next term -- but
> > >> >> > the worst case is that you call nextPosition() over and over a
> > >> >> > lot of times before you get a position in (or above) your range
> > >> >> > .... an advancePosition(pos) that worked like seek or skipTo
> > >> >> > might be helpful here.
> > >> >> >
> > >> >> > : I feel like I am missing something obvious. I would suspect the
> > >> >> > : highlighter needs to do this, but it seems to take the reanalyze
> > >> >> > : approach as well (I admit, though, that I have little experience
> > >> >> > : with the highlighter.)
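The advancePosition(pos) idea floated above might look something like the following sketch. PositionCursor is hypothetical and works over an in-memory sorted int[]; none of its methods are existing Lucene calls. It assumes one term's positions within one document are strictly increasing, which is what lets a binary search replace the repeated nextPosition() loop.

```java
import java.util.Arrays;

/** Hypothetical cursor over one term's positions in one document. */
final class PositionCursor {
    private final int[] positions; // assumed sorted ascending, no duplicates
    private int index = -1;        // -1 means "before the first position"

    PositionCursor(int[] sortedPositions) {
        this.positions = sortedPositions;
    }

    /** Linear step, in the spirit of nextPosition(); returns -1 when exhausted. */
    int next() {
        if (index < positions.length) {
            index++;
        }
        return index < positions.length ? positions[index] : -1;
    }

    /**
     * Jump to the first position >= target at or after the current one,
     * using a binary search instead of repeated next() calls.
     * Returns -1 when no such position exists.
     */
    int advance(int target) {
        int from = Math.max(index, 0);
        int i = Arrays.binarySearch(positions, from, positions.length, target);
        index = i >= 0 ? i : -(i + 1); // a miss encodes the insertion point
        return index < positions.length ? positions[index] : -1;
    }

    /** True if the term occurs somewhere in [start, end). */
    static boolean occursIn(int[] sortedPositions, int start, int end) {
        int p = new PositionCursor(sortedPositions).advance(start);
        return p != -1 && p < end;
    }
}
```

With this shape, the range check per term costs one logarithmic search plus a comparison, rather than a walk through every position below the start of the range.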
> > >> >> >
> > >> >> > as i understand it the default case is to reanalyze, but if you
> > >> >> > have TermFreqVector info stored with positions (ie: a
> > >> >> > TermPositionVector) then it can use that to construct a
> > >> >> > TokenStream by iterating over all terms and writing them into a
> > >> >> > big array in position order (see the TokenSources class in the
> > >> >> > highlighter)
> > >> >>
> > >> >> Ah, I see that now. Thanks.
> > >> >>
> > >> >> >
> > >> >> > this makes sense when highlighting because it doesn't know what
> > >> >> > kind of fragmenter is going to be used so it needs the whole
> > >> >> > TokenStream, but it seems less than ideal when you are only
> > >> >> > interested in a small number of position ranges that you know in
> > >> >> > advance.
> > >> >> >
> > >> >> > : I am wondering if it would be useful to have an alternative Term
> > >> >> > : Vector storage mechanism that was position centric. Because we
> > >> >> > : couldn't take advantage of the lexicographic compression, it
> > >> >> > : would take up more disk space, but it would be a lot faster for
> > >> >> > : these kinds
> > >> >> >
> > >> >> > i'm not sure if it's really necessary to store the data in a
> > >> >> > position centric manner, assuming we have a way to "seek" by
> > >> >> > position like i described above -- but then again i don't really
> > >> >> > know that what i described above is all that
> > >> >> > possible/practical/performant.
> > >> >>
> > >> >> I suppose I could use my Mapper approach to organize things in a
> > >> >> position centric way now that I think about it more. Just means
> > >> >> some unpacking and repacking. Still, probably would perform well
> > >> >> enough since I can set up the correct structure on the fly. I will
> > >> >> give this a try. Maybe even add a Mapper to do this.
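The "unpacking and repacking" could be done as the terms stream past a callback, roughly as in the sketch below. TermCallback and PositionOrderedCollector are hypothetical names chosen for illustration; the callback only loosely imitates the mapper idea and is not Lucene's interface. A sorted map is used so that several terms sharing a position (for example, injected synonyms) are all kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical per-term callback; receives one term and all of its positions. */
interface TermCallback {
    void map(String term, int[] positions);
}

/** Collects term-centric callbacks into a position-ordered structure. */
final class PositionOrderedCollector implements TermCallback {
    private final SortedMap<Integer, List<String>> byPosition = new TreeMap<>();

    @Override
    public void map(String term, int[] positions) {
        for (int p : positions) {
            // Several terms may share a position, so each slot holds a list.
            byPosition.computeIfAbsent(p, k -> new ArrayList<>()).add(term);
        }
    }

    /** View of the terms at positions in [start, end), in position order. */
    SortedMap<Integer, List<String>> range(int start, int end) {
        return byPosition.subMap(start, end);
    }
}
```

A dense array indexed by position would be faster to look up; the sorted map trades some speed for tolerating gaps and shared positions without knowing the document length up front.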
> > >> >>
> > >> >> -Grant
> > >> >>
> > >> >> ---------------------------------------------------------------------
> > >> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >> >> For additional commands, e-mail: [EMAIL PROTECTED]
> > >>
> > >> ------------------------------------------------------
> > >> Grant Ingersoll
> > >> http://www.grantingersoll.com/
> > >> http://lucene.grantingersoll.com
> > >> http://www.paperoftheweek.com/
> >
> > --------------------------
> > Grant Ingersoll
> > http://lucene.grantingersoll.com
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ