The conventional use of n-grams when searching is to treat them as a
sequence, not as a set. Thus, for "foola" you could index the sequence
["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase
["oo", "ol"] to find all occurrences of "ool". This is useful in
languages that use logograms without spaces, like Japanese and Chinese,
and in other cases (e.g., Nutch uses word n-grams to optimize searches
for phrases containing very common terms).
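
To make the sequence-versus-set distinction concrete, here is a minimal
sketch in plain Python. It is not Lucene code: the helper names
char_bigrams and contains_phrase are made up for illustration, and the
"_" padding simply mirrors the example above.

```python
def char_bigrams(word, pad="_"):
    """Return the ordered character bigrams of `word`, padded at both ends."""
    padded = f"{pad}{word}{pad}"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def contains_phrase(sequence, phrase):
    """True if `phrase` occurs as a contiguous, in-order run in `sequence`."""
    n = len(phrase)
    return any(sequence[i:i + n] == phrase
               for i in range(len(sequence) - n + 1))


indexed = char_bigrams("foola")

# Sequence semantics: order and adjacency matter.
in_order = contains_phrase(indexed, ["oo", "ol"])   # matches "ool"
reversed_order = contains_phrase(indexed, ["ol", "oo"])

# Set semantics: only membership matters, so the reversed query also "matches".
as_set = {"ol", "oo"} <= set(indexed)
```

With sequence semantics the reversed query does not match, while with
set semantics it does, which is the practical difference between the two
treatments.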
Do you have a use-case for the alternative, where n-grams are treated as
a set, rather than a sequence?
Doug
Grant Ingersoll wrote:
See https://issues.apache.org/jira/browse/LUCENE-1224
Do people have an opinion on what positions ngrams should be output at?
For instance, given 1-grams on "abc fgh", these are currently output as
a, b, c, f, g, h, each with a position increment of 1. That seems somewhat
reasonable, but it has a tradeoff: you can't query for something
like "a f" without some amount of slop, which I think is a reasonable
thing to want (though I don't have an actual use case for it at the moment). An
alternative way might be to output a, b, c all at the same position,
then increment for f and then put g and h at the same position.
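
The two position schemes can be compared with a small sketch in plain
Python. This is not the Lucene token stream API: the function names and
the same_position_within_token flag are hypothetical, and "slop" here is
simplified to the position gap beyond adjacency between two in-order
grams.

```python
def unigram_positions(text, same_position_within_token=False):
    """Return (gram, position) pairs for the 1-grams of each whitespace token.

    With the flag off, every gram advances the position by 1.
    With the flag on, all grams of one token share a position.
    """
    out = []
    position = 0
    for token in text.split():
        if same_position_within_token:
            position += 1
            out.extend((ch, position) for ch in token)
        else:
            for ch in token:
                position += 1
                out.append((ch, position))
    return out


def slop_needed(pairs, first, second):
    """Simplified slop: gap beyond adjacency between first occurrences."""
    first_seen = {}
    for gram, pos in pairs:
        first_seen.setdefault(gram, pos)
    return first_seen[second] - first_seen[first] - 1


successive = unigram_positions("abc fgh")
stacked = unigram_positions("abc fgh", same_position_within_token=True)

slop_successive = slop_needed(successive, "a", "f")
slop_stacked = slop_needed(stacked, "a", "f")
```

Under these assumptions the query "a f" needs a slop of 2 with
successive positions and no slop when the grams of a token share a
position.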
I am _wondering_ whether it makes more sense to add an option to the
NGram token streams such that we could have the choice of either
outputting the n-grams within a "token" at the same position or at
successive positions (to be back-compatible). It isn't clear to me
which is correct, or if there is even a notion of correctness here,
insofar as both are correct if that is the functionality you want
in your application. As DM Smith noted, if Lucene supported the notion
of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for
the example above, but that capability doesn't exist in Lucene right
now, AFAIK.
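
For illustration only, the "sub" position idea could be modeled as a
(token position, sub position) pair per gram. The sketch below is plain
Python with a made-up function name; it does not correspond to any
existing Lucene capability, and the pair (1, 0) stands in for what is
written as 1.a above.

```python
def unigram_sub_positions(text):
    """Return (gram, (token_position, sub_position)) for each 1-gram.

    token_position counts whitespace-separated tokens from 1;
    sub_position counts grams within a token from 0.
    """
    out = []
    for token_pos, token in enumerate(text.split(), start=1):
        for sub_pos, ch in enumerate(token):
            out.append((ch, (token_pos, sub_pos)))
    return out


pairs = unigram_sub_positions("abc fgh")

# Token-level adjacency ignores the sub position; gram-level order
# within a token is still recoverable from it.
a_pos = dict(pairs)["a"]
f_pos = dict(pairs)["f"]
tokens_adjacent = f_pos[0] - a_pos[0] == 1
```

A representation like this would keep both pieces of information: "a"
and "f" sit in adjacent tokens, and "a" still precedes "b" inside the
first token.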
-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]