NGrams and positions

Grant Ingersoll Thu, 15 May 2008 05:48:21 -0700

See https://issues.apache.org/jira/browse/LUCENE-1224

Do people have an opinion on what positions ngrams should be output at? For instance, given 1-grams on "abc fgh", these are currently output as: a, b, c, f, g, h, all with a position increment of 1. That seems somewhat reasonable, but it has tradeoffs, namely you can't query for something like: "a f" without some amount of slop, which I think is a reasonable thing to do (but don't have an actual use case for at the moment.) An alternative way might be to output a, b, c all at the same position, then increment for f and then put g and h at the same position.

I am _wondering_ whether it makes more sense to add an option to the NGram token streams such that we could have the choice of either outputting the n-grams within a "token" at the same position or at successive positions (to be back-compatible.) It isn't clear to me which is correct, or if there is even a notion of correctness here, in so much as they are both correct if that is the functionality you want in your application. As DM Smith noted, if Lucene supported the notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that capability doesn't exist in Lucene right now, AFAIK.
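
To make the two schemes concrete, here is a rough toy model in plain Python. It is not Lucene code and does not use the Lucene API: the function names, the same_position flag, and the whitespace splitting into "tokens" are all assumptions made for illustration. It emits (term, position increment) pairs either way and checks whether "a f" would match as an exact phrase with zero slop.

```python
def ngram_tokens(text, n=1, same_position=False):
    """Return (term, position_increment) pairs for the character n-grams
    of each whitespace-separated word. Hypothetical helper, not Lucene."""
    tokens = []
    for word in text.split():
        grams = [word[i:i + n] for i in range(len(word) - n + 1)]
        for k, gram in enumerate(grams):
            # In same-position mode only the first gram of a word advances
            # the position; the remaining grams get an increment of 0.
            incr = 0 if (same_position and k > 0) else 1
            tokens.append((gram, incr))
    return tokens


def absolute_positions(tokens):
    """Turn position increments into absolute positions, starting at 0."""
    pos = -1
    out = []
    for term, incr in tokens:
        pos += incr
        out.append((term, pos))
    return out


def exact_phrase_match(positions, first, second):
    """True if `second` occurs exactly one position after `first`,
    i.e. a two-term phrase query with no slop would match."""
    firsts = {p for t, p in positions if t == first}
    seconds = {p for t, p in positions if t == second}
    return any(p + 1 in seconds for p in firsts)


# Current behaviour: every gram gets a position increment of 1.
successive = absolute_positions(ngram_tokens("abc fgh"))
# Alternative: grams from the same word share one position.
stacked = absolute_positions(ngram_tokens("abc fgh", same_position=True))

print(successive)
print(stacked)
print(exact_phrase_match(successive, "a", "f"))
print(exact_phrase_match(stacked, "a", "f"))
```

With successive positions, a and f end up three positions apart, so "a f" only matches with slop; with shared positions they are adjacent. The flip side is that "a b" stops matching as an exact phrase in the shared-position scheme, which is the tradeoff in the other direction.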


-Grant

