Anders Nielsen wrote:
>Can't you just keep 2 fields, one with the stemmed version of the text used
>for indexing purposes (index but not stored) and a second field with the
>original text (un-indexed but stored). Then when you know you got a match on
>the nth term in the stemmed version, you can use the same Analyzer but
>without the stemming on the stored text field, and take the nth term from
>that?
>
Yes, that is an option in some applications. Unfortunately, what I need
to do involves collation of terms from many documents (those selected by
some query). The implementation I've been using stored information in
the document itself and then retrieved the documents, re-parsed the
information, and collated the terms. The problem is that
retrieving documents is comparatively slow, especially if they
contain large amounts of data. As a result, this solution is not
workable beyond, say, 1500 documents or so for real-time queries. So I'm
looking for a better option.
What I may be able to do is to add term vector storage for documents and
then have two fields: one indexed and tvstored with stemmed terms and
another not indexed but tvstored with original words. This might work out
because (hopefully) retrieval of term vectors would be faster than
retrieval of documents.
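
To make the two-field idea concrete, here is a minimal sketch of the collation step. It assumes each document yields two position-aligned term lists (stemmed and original); plain Python lists stand in for the term vectors, and the names collate and docs are made up for illustration, not part of any library API.

```python
from collections import Counter, defaultdict


def collate(docs):
    """Group original word forms under their stem across many documents.

    docs is an iterable of (stemmed_terms, original_terms) pairs, where the
    two lists are assumed to be position-aligned (same length, same order).
    """
    forms = defaultdict(Counter)
    for stemmed, original in docs:
        # If the stemmer dropped or merged terms, the lists no longer line up.
        if len(stemmed) != len(original):
            raise ValueError("fields are not position-aligned")
        for stem, word in zip(stemmed, original):
            forms[stem][word] += 1
    return forms


# Hypothetical data standing in for term vectors read from three documents.
docs = [
    (["run", "fast"], ["running", "fast"]),
    (["run", "dog"], ["runs", "dogs"]),
    (["run"], ["running"]),
]
forms = collate(docs)
```

The length check is where the concern about a stemmer skipping or merging terms would show up: alignment by position only works if both fields produce exactly one term per position.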
>
>The only trouble I can see with that is if the stemmer either skips terms or
>makes two terms into one.
>
I've thought about this and the conclusion I came to is that we might
want to separate term re-writing from stemming and treat them as
distinct phases of the analyzer's process. This would provide a nice
framework for handling languages that use compound words. An
example would be German (and I'm not a German speaker myself), where
someone who wants to say "black pen" says it as one word. However, when
searching for a black pen, they might search for "pen", regardless of
the color of its ink. So, I'm thinking that the term re-writing phase
would output the original term and any other terms that can be derived
from it (using a dictionary lookup of some sort).
This stuff is longer-term for me though, because our app's first priority
is English, where these things occur, but not as often.
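
A rough sketch of what I mean by separate phases, under some loud assumptions: the compound dictionary, the made-up compound "blackpen", and the toy stemmer are all hypothetical stand-ins, and the derived terms are emitted at the same position as the original term so that position-based alignment is not disturbed.

```python
# Hypothetical compound dictionary; a real one would come from a lexicon.
COMPOUNDS = {"blackpen": ["black", "pen"]}


def rewrite(terms, compounds=COMPOUNDS):
    """Phase 1: emit each original term plus any terms derived from it.

    Derived terms share the position of the term they came from.
    """
    for position, term in enumerate(terms):
        yield position, term
        for part in compounds.get(term, []):
            yield position, part


def toy_stem(term):
    """Phase 2: a deliberately naive stemmer (strips one trailing 's')."""
    if term.endswith("s") and len(term) > 3:
        return term[:-1]
    return term


def analyze(terms):
    """Run re-writing first, then stemming, as two distinct phases."""
    return [(position, toy_stem(term)) for position, term in rewrite(terms)]


result = analyze(["blackpen", "refills"])
# result == [(0, "blackpen"), (0, "black"), (0, "pen"), (1, "refill")]
```

With this split, a search for "pen" matches the compound at position 0, and the stemmer never has to know anything about compounds.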
>
>
>regards,
>Anders Nielsen
>
btw, I used to work with someone named Andrew Nelson :)