: Do we have a best practice for going from, say a SpanQuery doc/
: position information and retrieving the actual range of positions of
: content from the Document?  Is it just to reanalyze the Document
: using the appropriate Analyzer and start recording once you hit the
: positions you are interested in?  Seems like Term Vectors _could_
: help, but even my new Mapper approach patch (LUCENE-868) doesn't
: really help, because they are stored in a term-centric manner.  I
: guess what I am after is a position centric approach.  That is, give

this is kind of what i was suggesting in the last message i sent to the
java-user thread about payloads and SpanQueries (which i'm guessing is
what prompted this thread as well)...

http://www.nabble.com/Payloads-and-PhraseQuery-tf3988826.html#a11551628

my point was that currently, to retrieve a payload you need a
TermPositions instance, which is designed for iterating in the order of...

    seek(term)
      skipTo(doc)
        nextPosition()
          getPayload()

...which is great for getting the payload of every instance (ie: position)
of a specific term in a given document (or in every document), but short
of serious changes to the Spans API, the ideal payload API would let you
say...

    skipTo(doc)
      advance(startPosition)
        getPayload()
      while (nextPosition() < endPosition)
        getPosition()

but this seems like a nearly impossible API to implement given the nature
of the inverted index and the fact that terms aren't ever stored in
position order.

there's a lot i really don't know/understand about the lucene term
position internals ... but as i recall, the data structure written to disk
isn't actually a tree-structured inverted index, it's a long sequence of
tuples, correct?  so in theory you could scan along the tuples until you
find the doc you are interested in, ignoring all of the term info along
the way; then, for whatever term you happen to be on at the moment, you
could scan along all of its positions until you find one in the range you
are interested in -- assuming you do, you record the current Term (and
read your payload data if interested).

if i remember correctly, the first part of this is easy, and relatively
fast -- i think skipTo(doc) on a TermDocs or TermPositions will happily
scan for the first <term,doc> pair with the correct docId, regardless of
the term ... the only thing i'm not sure about is how efficient it is to
loop over nextPosition() for every term you find to see if any of them are
in your range ...
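
To make the scan described above concrete, here is a rough Java sketch run against a toy in-memory structure rather than a real index. Everything in it is a hypothetical stand-in and not a Lucene API: the `PositionRangeScan` class, its `add` and `scan` methods, and the nested-map layout. It also assumes that each term's positions within a document come back in increasing order, and it treats the range as start-inclusive, end-exclusive to match the `nextPosition() < endPosition` loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for a term-ordered postings structure (NOT the Lucene on-disk format). */
class PositionRangeScan {

    /** term -> (docId -> positions in increasing order); TreeMap keeps terms sorted. */
    private final TreeMap<String, TreeMap<Integer, int[]>> postings = new TreeMap<>();

    /** Hypothetical helper: record the positions of one term in one document. */
    void add(String term, int docId, int... positions) {
        postings.computeIfAbsent(term, t -> new TreeMap<>()).put(docId, positions);
    }

    /**
     * Visit every term, jump to the requested document, and collect the terms
     * whose positions fall in [start, end). The result maps position -> terms.
     */
    TreeMap<Integer, List<String>> scan(int docId, int start, int end) {
        TreeMap<Integer, List<String>> hits = new TreeMap<>();
        for (Map.Entry<String, TreeMap<Integer, int[]>> entry : postings.entrySet()) {
            int[] positions = entry.getValue().get(docId); // plays the role of skipTo(doc)
            if (positions == null) {
                continue; // this term does not occur in the document
            }
            for (int pos : positions) { // plays the role of nextPosition()
                if (pos >= end) {
                    break; // positions are increasing, so nothing later can match
                }
                if (pos >= start) {
                    hits.computeIfAbsent(pos, p -> new ArrayList<>()).add(entry.getKey());
                }
            }
        }
        return hits;
    }
}
```

With "the" at positions 0 and 6, "quick" at 1 and 7, "brown" at 2, and "fox" at 3 in document 1, `scan(1, 1, 4)` returns the terms at positions 1 through 3 in position order. Note that the outer loop still touches every term in the structure, which is the cost the message is worried about.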

the best-case scenario is that the first position returned is above the
high end of your range, in which case you can stop immediately and seek to
the next term -- but the worst case is that you call nextPosition() over
and over before you get a position in (or above) your range ... an
advancePosition(pos) that worked like seek or skipTo might be helpful
here.

: I feel like I am missing something obvious.  I would suspect the
: highlighter needs to do this, but it seems to take the reanalyze
: approach as well (I admit, though, that I have little experience with
: the highlighter.)

as i understand it the default case is to reanalyze, but if you have
TermFreqVector info stored with positions (ie: a TermPositionVector) then
it can use that to construct a TokenStream by iterating over all terms and
writing them into a big array in position order (see the TokenSources
class in the highlighter).

this makes sense when highlighting, because the highlighter doesn't know
what kind of fragmenter is going to be used, so it needs the whole
TokenStream; but it seems less than ideal when you are only interested in
a small number of position ranges that you know in advance.

: I am wondering if it would be useful to have an alternative Term
: Vector storage mechanism that was position centric.  Because we
: couldn't take advantage of the lexicographic compression, it would
: take up more disk space, but it would be a lot faster for these kinds

i'm not sure if it's really necessary to store the data in a position
centric manner, assuming we have a way to "seek" by position like i
described above -- but then again, i don't really know whether what i
described above is all that possible/practical/performant.

-Hoss