Hi Grant,

No problem - I know you are very busy. I just wanted to get a sense for the timing because I'd like to use this for a release this Fall. If I can get a prototype working in the coming weeks AND the performance is great :) , this would be terrific. If not, I'll have to fall back on a more complex design that handles the query outside of Lucene :(

In the meantime, I'll try playing with LUCENE-868.

Thanks for the update.

Peter

On 7/24/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
Sorry, Peter, I haven't had a chance to work on it. I don't see it happening this week, but maybe next.

I do think the Mapper approach via TermVectors will work. It will require implementing a new mapper that orders by position, but I don't think that is too hard. I started on one in the LUCENE-868 patch (version 4) but it is not complete. Maybe you want to pick it up?

With this approach, you would iterate your spans; when you come to a new doc, you would load the term vector using the PositionMapper, and then you could index into the positions for the matches in the document.

I realize this does not cover the case of just wanting to get the Payload at the match. Maybe next week...

Cheers,
Grant

On Jul 23, 2007, at 8:51 AM, Peter Keegan wrote:

> Any idea on when this might be available (days, weeks...)?
>
> Peter
>
> On 7/16/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>
>> On Jul 16, 2007, at 1:06 AM, Chris Hostetter wrote:
>>
>> > : Do we have a best practice for going from, say a SpanQuery doc/position
>> > : information and retrieving the actual range of positions of
>> > : content from the Document? Is it just to reanalyze the Document
>> > : using the appropriate Analyzer and start recording once you hit the
>> > : positions you are interested in? Seems like Term Vectors _could_
>> > : help, but even my new Mapper approach patch (LUCENE-868) doesn't
>> > : really help, because they are stored in a term-centric manner. I
>> > : guess what I am after is a position centric approach. That is, give
>> >
>> > this is kind of what i was suggesting in the last message i sent
>> > to the java-user thread about payloads and SpanQueries (which i'm
>> > guessing is what prompted this thread as well)...
>> >
>> > http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628
>>
>> This is one use case, the other is related to the new patch I
>> submitted for LUCENE-960.
>> In this case, I have a SpanQueryFilter
>> that identifies a bunch of docs and positions ahead of time. Then
>> the user enters a new SpanQuery and I want to relate the matches from
>> the user query with the positions of matches in the filter and then
>> show that window.
>>
>> > my point was that currently, to retrieve a payload you need a
>> > TermPositions instance, which is designed for iterating in the
>> > order of...
>> >     seek(term)
>> >     skipTo(doc)
>> >     nextPosition()
>> >     getPayload()
>> > ...which is great for getting the payload of every instance
>> > (ie: position) of a specific term in a given document (or in every
>> > document) but without serious changes to the Spans API, the ideal
>> > payload API would let you say...
>> >     skipTo(doc)
>> >     advance(startPosition)
>> >     getPayload()
>> >     while (nextPosition() < endPosition)
>> >     getPosition()
>> >
>> > but this seems like a nearly impossible API to implement given the
>> > nature of the inverted index and the fact that terms aren't ever
>> > stored in position order.
>> >
>> > there's a lot i really don't know/understand about the lucene term
>> > position internals ... but as i recall, the datastructure written
>> > to disk isn't actually a tree structure inverted index, it's a long
>> > sequence of tuples, correct? so in theory you could scan along the
>> > tuples until you find the doc you are interested in, ignoring all of
>> > the term info along the way, then whatever term you happen to be on
>> > at the moment, you could scan along all of the positions until you
>> > find one in the range you are interested in -- assuming you do, then
>> > you record the current Term (and read your payload data if
>> > interested)
>>
>> I think the main issue in both the payloads case and the matching
>> case above is that they require a document centric approach.
>> And then, for each Document,
>> you ideally want to be able to just index into an array so that you
>> can go directly to the position that is needed based on
>> Span.getStart()
>>
>> > if i remember correctly, the first part of this is easy, and
>> > relatively fast -- i think skipTo(doc) on a TermDocs or
>> > TermPositions will happily scan for the first <term,doc> pair with
>> > the correct docId, regardless of the term ... the only thing i'm not
>> > sure about is how efficient it is to loop over nextPosition() for
>> > every term you find to see if any of them are in your range ... the
>> > best case scenario is that the first position returned is above the
>> > high end of your range, in which case you can stop immediately and
>> > seek to the next term -- but the worst case is that you call
>> > nextPosition() over and over a lot of times before you get a
>> > position in (or above) your range .... an advancePosition(pos) that
>> > worked like seek or skipTo might be helpful here.
>> >
>> > : I feel like I am missing something obvious. I would suspect the
>> > : highlighter needs to do this, but it seems to take the reanalyze
>> > : approach as well (I admit, though, that I have little experience with
>> > : the highlighter.)
>> >
>> > as i understand it the default case is to reanalyze, but if you have
>> > TermFreqVector info stored with positions (ie: a
>> > TermPositionVector) then it can use that to construct a TokenStream
>> > by iterating over all terms and writing them into a big array in
>> > position order (see the TokenSources class in the highlighter)
>>
>> Ah, I see that now. Thanks.
>>
>> > this makes sense when highlighting because it doesn't know what
>> > kind of fragmenter is going to be used so it needs the whole
>> > TokenStream, but it seems less than ideal when you are only
>> > interested in a small number of position ranges that you know in
>> > advance.
>> >
>> > : I am wondering if it would be useful to have an alternative Term
>> > : Vector storage mechanism that was position centric. Because we
>> > : couldn't take advantage of the lexicographic compression, it would
>> > : take up more disk space, but it would be a lot faster for these kinds
>> >
>> > i'm not sure if it's really necessary to store the data in a
>> > position centric manner, assuming we have a way to "seek" by
>> > position like i described above -- but then again i don't really
>> > know that what i described above is all that
>> > possible/practical/performant.
>>
>> I suppose I could use my Mapper approach to organize things in a
>> position centric way now that I think about it more. Just means some
>> unpacking and repacking. Still, probably would perform well enough
>> since I can set up the correct structure on the fly. I will give this
>> a try. Maybe even add a Mapper to do this.
>>
>> -Grant
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/
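
For readers following the thread, here is a rough sketch of the "unpack and repack" idea: term-centric data (each term with its list of positions) is reorganized on the fly so that it can be looked up by position range, for example the start and end of a span match. It is only an illustration under stated assumptions. The class name PositionOrderedMapper and the methods map() and termsInRange() are hypothetical, it does not use or reproduce the LUCENE-868 API, and it ignores offsets and payloads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/**
 * Hypothetical sketch, not the LUCENE-868 API: receives term-centric
 * callbacks (one term plus all of its positions) and repacks them so
 * that terms can be retrieved by position range.
 */
public class PositionOrderedMapper {
    // position -> terms at that position (several terms can share a
    // position, e.g. synonyms indexed with a position increment of 0)
    private final TreeMap<Integer, List<String>> byPosition = new TreeMap<>();

    /** Called once per term with every position at which the term occurs. */
    public void map(String term, int[] positions) {
        for (int pos : positions) {
            byPosition.computeIfAbsent(pos, k -> new ArrayList<>()).add(term);
        }
    }

    /** Returns the terms at positions in [start, end), in position order. */
    public List<String> termsInRange(int start, int end) {
        List<String> out = new ArrayList<>();
        for (List<String> terms : byPosition.subMap(start, end).values()) {
            out.addAll(terms);
        }
        return out;
    }

    public static void main(String[] args) {
        PositionOrderedMapper mapper = new PositionOrderedMapper();
        // Term-centric input in term order, roughly as a term vector stores it.
        mapper.map("brown", new int[] {2});
        mapper.map("fox", new int[] {3});
        mapper.map("quick", new int[] {1, 5});
        mapper.map("the", new int[] {0, 4});
        // A match covering positions 1 through 3 (end is exclusive).
        System.out.println(mapper.termsInRange(1, 4)); // prints [quick, brown, fox]
    }
}
```

A sorted map keeps the sketch short. Since positions within a document are dense small integers, a plain array indexed by position would give the direct "index into an array" lookup described in the thread, at the cost of knowing the maximum position up front.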