At 4:47 PM +0800 11/2/00, Mathias Körber wrote:
>a) index pages which may occur in any of 2 or more languages

Well, sure.

>b) automatically identify which language the files are in (no,
>there is no identifier, this is an email archive which has
>mails in English, German and a few other languages)

No, I'm afraid not. There isn't much "intelligence" in this regard.
In any case, you pose a difficult problem--the code would need to
"recognize" the language from the text itself, which is one of the
harder problems in text processing. The HTML standard offers several
methods for indicating the language of a document, which would help,
but from what you say, these are not used on your pages.
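
To make the two approaches concrete, here is a rough Python sketch. It is not ht://Dig code. It first looks for a language declared in the HTML (the `lang` attribute on `<html>` or a `Content-Language` META tag), then falls back to a crude stopword count over the text. The names `declared_language`, `guess_language` and the tiny `STOPWORDS` lists are assumptions made up for illustration; a usable detector would need far more data than this.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stopword lists (an assumption, not real training data).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "that", "this", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "ein"},
}


class DeclaredLanguageParser(HTMLParser):
    """Records the first language declared via <html lang> or a META tag."""

    def __init__(self):
        super().__init__()
        self.language = None

    def handle_starttag(self, tag, attrs):
        if self.language is not None:
            return  # keep the first declaration found
        attributes = dict(attrs)
        value = ""
        if tag == "html":
            value = attributes.get("lang") or ""
        elif tag == "meta":
            http_equiv = (attributes.get("http-equiv") or "").lower()
            if http_equiv == "content-language":
                # The header may list several languages; keep the first.
                value = (attributes.get("content") or "").split(",")[0]
        value = value.strip().lower()
        if value:
            self.language = value


def declared_language(html_text):
    """Return the declared language code, or None if the page has none."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    return parser.language


def guess_language(text):
    """Crude fallback: pick the language whose stopwords occur most often."""
    words = re.findall(r"[a-zäöüß]+", text.lower())
    counts = {
        lang: sum(word in stops for word in words)
        for lang, stops in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Stopword counting like this breaks down quickly on short mails, quoted text in another language, or closely related languages, which is part of why recognition from the text alone is hard.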

>c) use more than one .aff file, the correct one for each language?

Certainly it would help if ht://Dig kept some metadata for the 
language of a document--this would enable language-specific searches 
and language-specific fuzzy matching as you describe. But this would 
likely be dependent on the META information available in the 
documents themselves.
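
As a sketch of what such per-document metadata would enable, the following Python picks an affix file per document based on a stored language code. This is hypothetical: as said above, ht://Dig does not keep this metadata, and `COMMON_DIR`, the file names in `AFFIX_FILES`, and both functions are invented for illustration only.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical install location and file layout, for illustration only.
COMMON_DIR = PurePosixPath("/opt/htdig/common")
AFFIX_FILES = {
    "en": COMMON_DIR / "english.aff",
    "de": COMMON_DIR / "german" / "german.aff",
}
DEFAULT_LANGUAGE = "en"


def affix_file_for(language):
    """Map a language code such as 'de' or 'de-AT' to an affix file.

    Unknown or missing languages fall back to the default language.
    """
    primary = (language or DEFAULT_LANGUAGE).split("-")[0].lower()
    return AFFIX_FILES.get(primary, AFFIX_FILES[DEFAULT_LANGUAGE])


def group_by_affix_file(documents):
    """Group (url, language) pairs by the affix file that would apply."""
    groups = defaultdict(list)
    for url, language in documents:
        groups[affix_file_for(language)].append(url)
    return dict(groups)
```

The fallback to a default language mirrors the situation in the archive described here: documents without any usable language hint would still be processed with a single set of files.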

>The FAQ seems to say that I should create a subdir $COMMON/german
>and install the german language files there, but that would make the
>English ones unused, no?

That is correct. Of course you can perform searches on all languages
at the same time--the only restriction is that most fuzzy algorithms
won't work well for documents in languages other than the one whose
files are installed.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

