See https://issues.apache.org/jira/browse/LUCENE-1224

Do people have an opinion on what positions n-grams should be output at? For instance, given 1-grams on "abc fgh", these are currently output as a, b, c, f, g, h, each with a position increment of 1. That seems somewhat reasonable, but it has a tradeoff: you can't query for something like "a f" without some amount of slop, and querying that way strikes me as a reasonable thing to want to do (though I don't have an actual use case for it at the moment). An alternative would be to output a, b, c all at the same position, then increment the position for f, and put g and h at the same position as f.
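To make the two schemes concrete, here is a rough sketch in plain Python. It is not Lucene code and does not use the NGram token streams; the function names (unigram_increments, absolute_positions, exact_phrase_match) and the same_position flag are made up for illustration, and the phrase check is a simplified stand-in for an exact (zero slop) phrase match.

```python
def unigram_increments(text, same_position=False):
    """Return (term, position_increment) pairs for the 1-grams of each
    whitespace-separated token in text.

    same_position=False: every gram gets an increment of 1 (current behavior).
    same_position=True:  grams from one token share a position; only the first
                         gram of each token advances the position (hypothetical option).
    """
    pairs = []
    for token in text.split():
        for i, gram in enumerate(token):
            increment = 0 if (same_position and i > 0) else 1
            pairs.append((gram, increment))
    return pairs


def absolute_positions(pairs):
    """Turn position increments into absolute positions, starting from -1
    so that a first increment of 1 lands on position 0."""
    position = -1
    result = []
    for term, increment in pairs:
        position += increment
        result.append((term, position))
    return result


def exact_phrase_match(positions, first, second):
    """Simplified zero-slop phrase check: does `second` occur exactly one
    position after some occurrence of `first`?"""
    first_pos = [p for term, p in positions if term == first]
    second_pos = {p for term, p in positions if term == second}
    return any(p + 1 in second_pos for p in first_pos)


successive = absolute_positions(unigram_increments("abc fgh"))
stacked = absolute_positions(unigram_increments("abc fgh", same_position=True))

print(successive)
print(stacked)
print(exact_phrase_match(successive, "a", "f"))
print(exact_phrase_match(stacked, "a", "f"))
```

With successive positions, a and f end up three positions apart, so "a f" only matches with slop; with the grams of each token stacked at one position, f sits exactly one position after a and the phrase matches as-is.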

I am _wondering_ whether it makes more sense to add an option to the NGram token streams so that we could choose between outputting the n-grams within a "token" at the same position or at successive positions (the latter being the current, back-compatible behavior). It isn't clear to me which is correct, or whether there is even a notion of correctness here, inasmuch as either one is correct if it is the functionality you want in your application. As DM Smith noted, if Lucene supported the notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that capability doesn't exist in Lucene right now, AFAIK.

-Grant
