[EMAIL PROTECTED] wrote:
Grant,

considering the answer from Karl, it seems that we have the choice to put all
the documents in one index or use an index for each language. You are using
an index for each language. We are currently discussing the pros and cons
for both solutions. Thus we would be very interested to find out about your
reasons to use a separate index for each language.

Hmmm, it was a while ago and I am not 100% convinced I would make the same decision again, but I seem to recall a few reasons:

1. I thought it would be easier to manage them separately, as they all come from distinct collections. We can easily take one language/index offline w/o affecting the other indexes. I think this has been borne out over time in our case. We often index the same set of documents several different ways (different stemmers, adding proper nouns, phrases, transliteration, translation, etc.), trying to evaluate which one gives us the best results. Having them all in the same index makes this harder. What works for you probably depends on how often you update/delete, etc.

2. I am not certain what it means to have an English query match against Arabic documents that contain English in them (or any other language). On the surface this seems fine, since it is a term match, but I am not sure it is meaningful when it comes to meeting a user's information need. This is just a hunch and I am not sure it is a correct one. I think it would probably warrant more study from a user's perspective. I would like to hear others' opinions on this.

3. Logically, to me, our collections are separate. They come from different sources, they are in different languages, they have different styles, different authors, etc. It just _feels_ like they should be kept separate. You may not care about this distinction.

One question I have always had is whether storing everything in the same index skews the IDF values by giving a term more importance than it really warrants. My guess is it doesn't b/c this would happen evenly for all terms. However, sometimes you have terms that occur in both collections, so the IDF for these may be smaller than it would be if the collections had been indexed separately. Is this valid or am I talking crazy?
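
To make the IDF question concrete, here is a rough back-of-the-envelope sketch. It assumes the textbook formula idf = ln(N / df), which is not necessarily the exact formula your search library uses, and the collection sizes and document frequencies are made-up numbers, not measurements from our indexes.

```python
import math

def idf(num_docs, doc_freq):
    """Textbook inverse document frequency, ln(N / df). An assumption:
    a real search library may use a smoothed variant."""
    return math.log(num_docs / doc_freq)

# Hypothetical sizes: two collections of 1000 documents each.
N_EN, N_AR = 1000, 1000

# A term that occurs in 10 English documents and in no Arabic documents.
df_only_en = 10
# A term that occurs in 10 documents of each collection.
df_shared_en, df_shared_ar = 10, 10

# IDF when the English collection is indexed on its own.
separate_only = idf(N_EN, df_only_en)
separate_shared = idf(N_EN, df_shared_en)

# IDF when both collections live in one index: N and df are both summed.
combined_only = idf(N_EN + N_AR, df_only_en)
combined_shared = idf(N_EN + N_AR, df_shared_en + df_shared_ar)

print(f"single-language term: separate={separate_only:.3f} combined={combined_only:.3f}")
print(f"shared term:          separate={separate_shared:.3f} combined={combined_shared:.3f}")
```

With these particular numbers, the single-language term gains ln(2) in the combined index because N doubles while its df stays put, whereas the shared term's df doubles along with N and its IDF does not move. So relative to single-language terms, the shared term ends up with less weight in the combined index than it would have in a separate one. With other sizes and frequencies the shift will look different.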
