OK, I've been processing things for a while. I came up with an idea that I want your advice on -- is there a way I could stem the Hebrew words in my analyzer, yet keep some note of the original term from which the stem was derived, WITHOUT affecting frequency/proximity data? I'd say this is a "must" for Hebrew word searches if one wants them to be quick, efficient and precise, but it could also be very useful for stem-based indexing in other languages.
Here are some more details: Since stemming in Hebrew is ugly -- loads of ambiguities, and a final stem that is a 3-letter word (which can be built into many other forms later -- and that's the good case, where the stemmer settles on the correct stem), I've been looking for an alternative. So it was either stemming differently (finding a way to stem to larger, more descriptive stems), or not stemming at all and using query inflation (which should yield much better, more relevant hits, but has a performance impact, and can grow to more than a few hundred terms per word).

I'm still brainstorming on the best way to achieve top-notch Hebrew indexing and searching with Lucene. While doing so, I've had an idea -- assuming the following is the IndexReader workflow (more or less):

1. Access the index terms list and find the reference to the records for each word in Terms.
2. Access the Term record in the index, load all occurrences of it (Doc ID, Field ID along with Freq and Prox data) and do the math and scoring according to the Query logic.

Is it possible, for every occurrence of a word in that array mentioned above (you probably have a name for it...), to store the original word that produced the stem this hit refers to? So for example, if I look for "got" it becomes "get" when passed through a stemmer, and can then be found via any of the words built from the stem "get" -- "gotten", "got", "gets" and so on. Now, looking at that array of occurrences I know the stem "get" was found in a specific document, but if I wanted to score the original word higher than its sibling words built from the same stem, having the original word stored along with the occurrence data would certainly help (not only for scoring -- also for specific searches where I require the exact word only).

As I said, for Hebrew this would greatly help. Is it possible using Lucene as it is? If not, I guess this will require a file format change?
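
To make the idea concrete, here is a rough sketch in plain Python rather than Lucene code. It is a toy in-memory index, not Lucene's file format or API: `StemIndex`, `TOY_STEMS`, `exact_boost` and `exact_only` are names I made up for illustration, the "stemmer" is just a lookup table, and the scoring is a simple count rather than Lucene's real scoring. The point is only that postings are keyed by the stem, while each occurrence also carries the original word.

```python
from collections import defaultdict

# Toy stemmer: a lookup table standing in for a real (e.g. Hebrew) stemmer.
TOY_STEMS = {"get": "get", "gets": "get", "got": "get", "gotten": "get"}


def stem(word):
    """Return the stem for a word, or the word itself if it is unknown."""
    return TOY_STEMS.get(word, word)


class StemIndex:
    """Inverted index keyed by stem; each occurrence keeps its original word."""

    def __init__(self):
        # stem -> doc_id -> list of (position, original word)
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        for position, word in enumerate(text.lower().split()):
            self.postings[stem(word)][doc_id].append((position, word))

    def search(self, word, exact_boost=2.0, exact_only=False):
        """Look up by stem; exact matches of the original word score higher."""
        word = word.lower()
        scores = {}
        for doc_id, occurrences in self.postings.get(stem(word), {}).items():
            score = 0.0
            for _position, original in occurrences:
                if original == word:
                    score += exact_boost      # the exact word the user typed
                elif not exact_only:
                    score += 1.0              # a sibling built from the same stem
            if score > 0:
                scores[doc_id] = score
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = StemIndex()
index.add(1, "he gets it")
index.add(2, "she got what she had gotten")

results = index.search("got")                  # stem match, exact word boosted
exact = index.search("got", exact_only=True)   # only the exact word counts
```

Here `results` ranks document 2 first because it contains the exact word "got", and `exact` drops document 1 altogether. There is still only one postings list per stem and document, so the frequency and positions are those of the stem and the original word is just extra data riding along with each occurrence.
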
If it does require a format change, can this be done by deriving from existing classes, or would it need to be hacked into the existing code? If you have any idea how to achieve similar results in a more convenient way, please by all means do let me know...

Itamar.