I can only speak to the "avoid matching stemmed or canonical forms" part...
Yes, but you've got to do some fancy dancing when you index, something like adding a special signifier to, say, the original token. I'll ignore the canonical part of your question for the sake of brevity.

Consider indexing "running". You'd index "run" and "running$". Now, whenever you care about the original token, you append the '$' to the term and search for that.

This has one other advantage. Say the original word is "run" itself, so that under the scheme above you index "run" and "run$". If you don't do something like adding the $ to the original, a hit on "run" can't tell you whether it came from the stem or from the original. That is, you can't distinguish between a hit where the original word was "run" and one where the original was "running". This may be important for "exact match".

Best
Erick

On 5/25/07, Enis Soztutar <[EMAIL PROTECTED]> wrote:
Hi,

In Nutch we have a use case in which we need to store tokens with their original text, plus their stemmed form, plus their canonical form (through some asciification). From my understanding of Lucene, it makes sense to write a TokenStream which generates several tokens for each "word", but places all the tokens for the "word" at the same position with Token#setPositionIncrement(0). This way we would be able to search over this field using any form (stemmed, canonical, original) of the "word".

Actually I have two questions here. First, is there any way to keep stemmed or canonical forms from matching in a phrase query? Second, it seems that adding multiple forms of each "word" alters the statistics used for scoring, especially tf and idf, because the frequency of the root form is incremented for every word that shares that root. Is there any way we could avoid that?
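For illustration, here is a minimal sketch of the marker scheme from the reply at the top of this message, written in plain Java with no Lucene dependency. Everything named here is an assumption made for the example: the class MarkerTokens, the Tok holder, the lookup-table "stemmer", and the choice of '$' as the marker (which only works if the tokenizer never produces '$' on its own). In a real analyzer, the same idea would live in a token filter that emits the extra token with a position increment of 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MarkerTokens {
    // Hypothetical marker appended to the original (unstemmed) form.
    static final String EXACT_MARKER = "$";

    // Toy stand-in for a real stemmer: a fixed lookup table (assumption).
    static final Map<String, String> STEMS = Map.of("running", "run", "runs", "run");

    // Minimal token holder: the term text plus its position increment.
    static final class Tok {
        final String term;
        final int positionIncrement;

        Tok(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    static String stem(String word) {
        return STEMS.getOrDefault(word, word);
    }

    // For each word, emit the stem at a new position, then the marked
    // original at the same position (position increment 0).
    static List<Tok> expand(List<String> words) {
        List<Tok> out = new ArrayList<>();
        for (String w : words) {
            String lower = w.toLowerCase(Locale.ROOT);
            out.add(new Tok(stem(lower), 1));
            out.add(new Tok(lower + EXACT_MARKER, 0));
        }
        return out;
    }

    // Query side: search the marked term for an exact match, the stem otherwise.
    static String queryTerm(String word, boolean exact) {
        String lower = word.toLowerCase(Locale.ROOT);
        return exact ? lower + EXACT_MARKER : stem(lower);
    }

    public static void main(String[] args) {
        System.out.println(expand(List.of("running", "run")));
        System.out.println(queryTerm("running", true));
        System.out.println(queryTerm("running", false));
    }
}
```

With this layout, a phrase query built only from marked terms ("running$", "run$", ...) can match only the original forms, while a query built from stems matches any word sharing that stem. Note that the word "run" still produces both "run" and "run$", which is what lets an exact search tell it apart from "running".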