RE: Lucene, HTML and Hebrew

Itamar Syn-Hershko Wed, 30 Jan 2008 12:10:43 -0800

OK, I've been thinking things over for a while. I came up with an idea that I
want your advice on -- is there a way I could stem the Hebrew words in my
analyzer, yet keep a note of some sort of the original term that produced
each stem, WITHOUT affecting frequency/proximity data? I guess you could
call this a "must" for Hebrew word searches if one wants them to be quick,
efficient and precise, but it could also be very useful for stem-based
indexing in other languages.


Here are some more details:

Stemming in Hebrew is ugly: loads of ambiguities, and a final stem that is a
3-letter word (from which many other word forms can be built -- and that's
the good case, where the stemmer settles on the correct stem). So I've been
looking for an alternative. It was either stemming differently (finding a
way to stem to larger, more descriptive stems), or not stemming at all and
using query inflation (which should yield much better, more relevant hits,
but has a performance impact, and can grow to more than a few hundred terms
per word).
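
To make the query inflation option concrete, here is a minimal sketch in
Python. The inflection table and the inflate/inflate_query helpers are
hypothetical (English stands in for Hebrew), and none of this is Lucene API:

```python
# Hypothetical inflection table: lemma -> surface forms that may appear in text.
# A real Hebrew table would hold far more forms per word.
INFLECTIONS = {
    "get": ["get", "gets", "got", "gotten", "getting"],
}

def inflate(word, table=INFLECTIONS):
    """Return every surface form that shares a lemma with `word`."""
    for forms in table.values():
        if word in forms:
            return list(forms)
    return [word]  # unknown word: search for it as-is

def inflate_query(words):
    """Expand each query word into an OR-group of surface forms."""
    return [inflate(w) for w in words]

groups = inflate_query(["got", "lucene"])
total_terms = sum(len(g) for g in groups)
```

Each query word turns into a group of alternatives, so the number of terms
the searcher has to look up grows with the size of the inflection table,
which is where the performance concern comes from.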

I'm still brainstorming on the best way to achieve top-notch Hebrew indexing
and searching with Lucene. While doing so, I've had an idea -- assuming the
following is the IndexReader workflow (more or less):
1. Access Index terms list, find the reference to the records for each word
in Terms
2. Access the Term record in the Index, load all occurrences of it (Doc ID,
Field ID along with Freq and Prox data) and do the math and scoring
according to the Query logic.
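
The two steps above can be modelled roughly as follows. This is a toy
inverted index written for illustration only; build_index and lookup are
made-up names, and it says nothing about how Lucene actually lays out its
files:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}; freq is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def lookup(index, term):
    """Step 1: find the term. Step 2: load its (doc_id, freq, positions) records."""
    postings = index.get(term, {})
    return [(doc_id, len(positions), positions)
            for doc_id, positions in sorted(postings.items())]

docs = {1: "get the book and get out", 2: "she gets the book"}
index = build_index(docs)
hits = lookup(index, "get")
```

Without stemming, "get" and "gets" are separate terms here, so only
document 1 is returned for "get".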

Is it possible, for every occurrence of a word in the array mentioned above
(you probably have a name for it...), to store the original word from which
the stem for that hit was derived? For example, if I look for "got", it
becomes "get" when passed through a stemmer, and can then be found via any
of the words built from the stem "get" -- "gotten", "got", "gets" and so
on. Now, looking at that array of occurrences, I know the stem "get" was
found in a specific document, but if I wanted to score the original word
higher than its sibling words built from the same stem, having the original
word stored along with the occurrence data would certainly help (and not
only for scoring, but also for specific searches where I require the exact
word only).
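
A rough sketch of what I have in mind, again as a toy model in Python rather
than anything Lucene provides. The STEMS table, the exact_boost value and the
function names are all assumptions made up for the example:

```python
from collections import defaultdict

# Hypothetical stemmer: a lookup table standing in for a real one.
STEMS = {"got": "get", "gotten": "get", "gets": "get", "get": "get"}

def stem(word):
    return STEMS.get(word, word)

def build_index(docs):
    """Map stem -> {doc_id: [(position, original_word), ...]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[stem(word)].setdefault(doc_id, []).append((pos, word))
    return index

def search(index, word, exact_boost=2.0, exact_only=False):
    """Score docs by stem hits; occurrences of the exact word count more."""
    scores = {}
    for doc_id, occurrences in index.get(stem(word), {}).items():
        score = 0.0
        for _pos, original in occurrences:
            if original == word:
                score += exact_boost   # the word the user actually typed
            elif not exact_only:
                score += 1.0           # a sibling built from the same stem
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

docs = {1: "he got it", 2: "she gets it", 3: "i have gotten it"}
index = build_index(docs)
ranked = search(index, "got")
exact = search(index, "got", exact_only=True)
```

The index is still keyed by the stem, so the frequency and position data per
stem stay the same; the original word only rides along with each occurrence.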

As I said, for Hebrew this will greatly help.

Is this possible using Lucene as it is? If not, I guess it would require a
file format change? And if so, can it be done by deriving from existing
classes, or does it need to be hacked into the existing code?

If you have any idea on how to achieve similar results in a more convenient
way, please by all means do let me know...

Itamar.


