Re: NGrams and positions

Doug Cutting Thu, 15 May 2008 09:55:27 -0700

The conventional use of ngrams when searching is to treat them not as a set but as a sequence. Thus, for "foola" you could index the sequence ["_f", "fo", "oo", "ol", "la", "a_"], and then search for the phrase ["oo", "ol"] to find all occurrences of "ool". This is useful in languages that use logograms without spaces, like Japanese and Chinese, and in other cases (e.g., Nutch uses word-ngrams to optimize searches for phrases containing very common terms).
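To make the sequence idea concrete, here is a minimal sketch in plain Python. It is not Lucene code: the helper names `char_ngrams` and `phrase_positions` are made up for illustration, and the "_" padding simply mirrors the example above.

```python
def char_ngrams(word, n=2, pad="_"):
    """Return the character n-grams of `word` in order, padded at both ends."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def phrase_positions(sequence, phrase):
    """Return every start index where `phrase` occurs as consecutive items."""
    k = len(phrase)
    return [i for i in range(len(sequence) - k + 1) if sequence[i:i + k] == phrase]


indexed = char_ngrams("foola")
print(indexed)                                   # ['_f', 'fo', 'oo', 'ol', 'la', 'a_']
print(phrase_positions(indexed, ["oo", "ol"]))   # [2]

# "oloo" contains both grams, but not adjacent and in order, so a phrase
# search finds nothing even though a set-style match would succeed.
print(phrase_positions(char_ngrams("oloo"), ["oo", "ol"]))  # []
```

The last line shows what the ordering buys you: treated as a set, "oloo" would look like a hit for "ool"; treated as a sequence, it is correctly rejected.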

Do you have a use-case for the alternative, where n-grams are treated as a set, rather than a sequence?


Doug

Grant Ingersoll wrote:

See https://issues.apache.org/jira/browse/LUCENE-1224
Do people have an opinion on what positions ngrams should be output at? For instance, given 1-grams on "abc fgh", these are currently output as: a, b, c, f, g, h, all with a position increment of 1. That seems somewhat reasonable, but it has tradeoffs, namely you can't query for something like: "a f" without some amount of slop, which I think is a reasonable thing to do (but don't have an actual use case for at the moment.) An alternative way might be to output a, b, c all at the same position, then increment for f and then put g and h at the same position.
I am _wondering_ whether it makes more sense to add an option to the NGram token streams such that we could have the choice of either outputting the n-grams within a "token" at the same position or at successive positions (to be back-compatible.) It isn't clear to me which is correct, or if there is even a notion of correctness here, insofar as they are both correct if that is the functionality you want in your application. As DM Smith noted, if Lucene supported the notion of "sub" positions, one could output 1.a, 1.b, 1.c, 2.a, 2.b and 2.c for the example above, but that capability doesn't exist in Lucene right now, AFAIK.
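The two positioning schemes discussed above can be compared with a small sketch in plain Python. This is not the Lucene API: `one_grams_with_positions` and `phrase_matches` are hypothetical helpers, and the slop check is a deliberate simplification (a position gap only), not Lucene's actual slop semantics.

```python
def one_grams_with_positions(text, same_position=False):
    """Return (term, position) pairs for the 1-grams of whitespace-split tokens.

    same_position=False: every gram advances the position by 1.
    same_position=True:  all grams of one token share that token's position.
    """
    out = []
    pos = -1
    for token in text.split():
        if same_position:
            pos += 1
            out.extend((ch, pos) for ch in token)
        else:
            for ch in token:
                pos += 1
                out.append((ch, pos))
    return out


def phrase_matches(postings, first, second, slop=0):
    """Simplified check: `second` occurs 1..1+slop positions after `first`."""
    firsts = [p for t, p in postings if t == first]
    seconds = [p for t, p in postings if t == second]
    return any(1 <= s - f <= 1 + slop for f in firsts for s in seconds)


successive = one_grams_with_positions("abc fgh")
stacked = one_grams_with_positions("abc fgh", same_position=True)

print(successive)  # [('a', 0), ('b', 1), ('c', 2), ('f', 3), ('g', 4), ('h', 5)]
print(stacked)     # [('a', 0), ('b', 0), ('c', 0), ('f', 1), ('g', 1), ('h', 1)]

print(phrase_matches(successive, "a", "f"))          # False
print(phrase_matches(successive, "a", "f", slop=2))  # True
print(phrase_matches(stacked, "a", "f"))             # True
print(phrase_matches(stacked, "a", "b"))             # False
```

Under these assumptions, stacking grams at one position makes "a f" match without slop, but an in-token phrase such as "a b" stops matching, which is the tradeoff an option on the token streams would let each application choose.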
-Grant

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



