Hi Cedric,

I think I'm beginning to get the point of the [10/5/2], and why you called that requirement a bit strange; see below.

To use both normal position info and paragraph position info you'll need two separate fields: one normal, and one for the paragraphs. To use the normal field to determine the matches, and the paragraph field to determine the weightings of these matches, the TermPositions of both fields will have to be advanced completely in sync. That is possible, but not really nice to do.

If Lucene had multiple positions for an indexed term, it would be straightforward. But as long as that is not the case, you'll either have to advance the two TermPositions in sync, or use payloads with the paragraph numbers.

Or you could relax the paragraph numbering requirement into a positional requirement, and use the modified SpanFirstQuery. That could be done by using an average paragraph length to determine the weight at the matching position. As this is easy to implement, I'd first implement this and try to sell it to the users :)

At that marketing moment you might as well ask the users what they think of matches that cross paragraph borders. Do you already have a firm requirement for that case? SpanNotQuery can be used to prevent matches over paragraph borders when these are indexed as such, but I would not expect that you would need those, given the fuzziness of the [10/5/2].

Regards,
Paul Elschot

On Friday 15 February 2008 09:45:58, Cedric Ho wrote:
> Hi Paul,
>
> Do you mean the following?
>
> e.g. to index this: "first second third <paragraphBorder> forth fifth six"
>
> originally it would be indexed as:
> (first,0) (second,1) (third,2) (forth,3) (fifth,4) (six,5)
>
> now it will be:
> (first,0) (second,0) (third,0) (forth,1) (fifth,1) (six,1)
>
> Then those Query classes that depend on the positional information
> (PhraseQuery, SpanQueries) won't work then? Unfortunately I'll need
> those Query classes as well.
>
> Cedric
>
> > For each word in the input stream make sure that the position
> > at which it is indexed in an extra field is the same as the paragraph
> > number.
> > That will involve only allowing a position increment at
> > a paragraph border during indexing.
> > Call this extra field the paragraph field if you will.
> >
> > Then, during search, search for a Term in the paragraph field, and
> > use the position from that field, i.e. the paragraph number,
> > to find a weight for the found term.
> > Have a look at PhraseQuery on how to use term positions during
> > search. It computes relative positions, but it works on the absolute
> > positions that it gets from the index.
> >
> > SpanFirstQuery also allows to do that, it's a bit more involved, but
> > in the end it works from the same absolute positions from the index.
> > The version at the jira issue will even allow to use the length of the
> > matching spans as the absolute paragraph number, which, in turn,
> > allows the use of a Similarity for the paragraph weights [10/5/2].
> >
> > There is nothing special about indexed term positions; any term can
> > be indexed at any position in a field. Lucene will take advantage of
> > the incremental nature of positions by storing only compressed
> > differences of positions in the index, but during search the original
> > positions are directly available. You can do the same with payloads,
> > but why reimplement something that is already available?
> >
> > Payloads have better uses than positional info, for one they are
> > great to avoid disjunctions. For example for verbs, one could
> > index only the stem and use a payload for the actual inflected
> > form (singular/plural, past/present, first/second/third person, etc).
> >
> > Regards,
> > Paul Elschot
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
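
P.S. For illustration, here is a minimal Java sketch of the two position numberings from the quoted example, plus the relaxed variant that estimates a paragraph from the normal position and an average paragraph length. It deliberately does not use the Lucene API. The class ParagraphPositions, its method names, and the reading of [10/5/2] as weights for the first three paragraphs (with later paragraphs reusing the last weight) are all assumptions made for this sketch, not something the thread defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only mimics the position
// bookkeeping discussed in this thread.
final class ParagraphPositions {
    // Marker token for a paragraph border, as written in the quoted example.
    static final String BORDER = "<paragraphBorder>";

    // Normal field: every indexed term advances the position by one.
    static List<Integer> normalPositions(List<String> tokens) {
        List<Integer> positions = new ArrayList<Integer>();
        int position = 0;
        for (String token : tokens) {
            if (BORDER.equals(token)) {
                continue; // the border itself is not indexed as a term here
            }
            positions.add(position++);
        }
        return positions;
    }

    // Paragraph field: the position only advances at a paragraph border,
    // so each term's position equals its paragraph number.
    static List<Integer> paragraphPositions(List<String> tokens) {
        List<Integer> positions = new ArrayList<Integer>();
        int paragraph = 0;
        for (String token : tokens) {
            if (BORDER.equals(token)) {
                paragraph++;
                continue;
            }
            positions.add(paragraph);
        }
        return positions;
    }

    // Relaxed variant: estimate the paragraph number from a normal position
    // and an average paragraph length measured in terms.
    static int estimatedParagraph(int position, int averageParagraphLength) {
        if (position < 0 || averageParagraphLength <= 0) {
            throw new IllegalArgumentException("need position >= 0 and length > 0");
        }
        return position / averageParagraphLength;
    }

    // ASSUMPTION: weights[i] is the weight of paragraph i, and paragraphs
    // beyond the end of the array reuse the last weight.
    static int weight(int paragraph, int[] weights) {
        return weights[Math.min(paragraph, weights.length - 1)];
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "first", "second", "third", BORDER, "forth", "fifth", "six");
        int[] weights = {10, 5, 2};
        System.out.println(normalPositions(tokens));
        System.out.println(paragraphPositions(tokens));
        System.out.println(weight(estimatedParagraph(4, 3), weights));
    }
}
```

With the example tokens, normalPositions gives 0 through 5 and paragraphPositions gives 0, 0, 0, 1, 1, 1. With an assumed average paragraph length of 3 terms, position 4 is estimated to lie in paragraph 1 and gets weight 5. The estimate is only as good as the average, which is the fuzziness mentioned in the reply.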