[EMAIL PROTECTED] wrote:
Grant,
considering the answer from Karl, it seems that we have the choice to put all
the documents in one index or use an index for each language. You are using
an index for each language. We are currently discussing the pros and cons
for both solutions. Thus we would be very interested to find out about your
reasons to use a separate index for each language.
Hmmm, it was a while ago and I am not 100% convinced I would make the
same decision again, but I seem to recall a few reasons:
1. I thought it would be easier to manage them separately, as they all
come from distinct collections. We can easily take one language/index
off line w/o affecting the other indexes. I think this has been proved
out over time in our case. We often index the same set of documents
several different ways (different stemmers, adding proper nouns,
phrases, transliteration, translation, etc.), trying to evaluate which
one gives us the best results. Having them all in the same collection
makes this harder. What works for you probably depends on how often you
update/delete, etc.
2. I am not certain of what it means to have an English query match
against Arabic documents that contain English in them (or any other
language). On the surface this seems fine, since it is a term match,
but I am not sure if it is meaningful when it comes to meeting a user's
information need. This is just a hunch and I am not sure it is a
correct one. I think it would probably warrant more study from a user's
perspective. I would like to hear others' opinions on this.
3. Logically, to me, our collections are separate. They come from
different sources, they are in different languages, they have different
styles, different authors, etc. It just _feels_ like they should be
kept separate. You may not care about this distinction.
One question I have always had is whether storing everything in the same
index skews the IDF values by giving a term more importance than it
really warrants. My guess is it doesn't b/c this would happen evenly
for all terms. However, sometimes you have terms that occur in
both collections, so the IDF for these may be smaller than it would be
if the collections had been indexed separately. Is this valid or
am I talking crazy?
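
A rough way to check the IDF question with numbers is sketched below. It
assumes one common smoothed form, idf = 1 + ln(numDocs / (docFreq + 1)); the
Similarity in your setup may compute it differently. The collection sizes and
document frequencies are made up for illustration and do not come from any
real index.

```python
import math

def idf(doc_freq, num_docs):
    """Smoothed IDF (assumed form): 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical collection sizes and document frequencies.
N_EN, N_AR = 10_000, 10_000
df_en_only = 500                       # term in 500 English docs, no Arabic docs
df_shared_en, df_shared_ar = 500, 100  # term that occurs in both collections

# IDF as seen from a separate English index.
separate_en_only = idf(df_en_only, N_EN)
separate_shared = idf(df_shared_en, N_EN)

# IDF as seen from one combined index.
combined_en_only = idf(df_en_only, N_EN + N_AR)
combined_shared = idf(df_shared_en + df_shared_ar, N_EN + N_AR)

# How much each term's IDF moves when the indexes are merged.
shift_en_only = combined_en_only - separate_en_only
shift_shared = combined_shared - separate_shared

print(f"English-only term: {separate_en_only:.3f} -> {combined_en_only:.3f}")
print(f"Shared term:       {separate_shared:.3f} -> {combined_shared:.3f}")
print(f"Shift (English-only): {shift_en_only:.3f}")
print(f"Shift (shared):       {shift_shared:.3f}")
```

Under these assumptions, a term that occurs only in English documents shifts
by the same constant, ln((N_EN + N_AR) / N_EN), whatever its document
frequency, so such terms keep their relative order. The shared term shifts by
less, so in the combined index it ends up with a smaller IDF than an
English-only term that looked identical in the separate English index.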