At 4:47 PM +0800 11/2/00, Mathias Körber wrote:
>a) index pages which may occur in any of 2 or more languages
Well, sure.
>b) automatically identify which language the files are in (no,
>there is no identifier, this is an email archive which has
>mails in English, German and a few other languages)
No, I'm afraid not. There isn't much "intelligence" in this regard.
Even so, what you ask is a difficult problem: the code would need to
"recognize" the language from the text itself, which is one of the
harder problems in text processing. The HTML standard offers several
methods for indicating the language of a document, which would help,
but from what you say, these are not used on your pages.
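
To give a feel for what "recognizing" the language from the text involves, here is a rough sketch of the crudest approach: counting common stopwords per language. This is not something ht://Dig does. The word lists, the `guess_language` name and the `min_hits` threshold are all made up for illustration, and a real mail archive would need far larger lists and would still be fooled by short or mixed-language messages.

```python
import re

# Tiny hand-picked stopword lists (illustrative assumption, not from ht://Dig).
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "that", "in", "it", "you", "not"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ich", "zu", "ein", "mit"},
}


def guess_language(text, min_hits=2):
    """Return the language code with the most stopword hits, or None."""
    # Split into runs of letters (Unicode-aware), ignoring digits and underscores.
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(1 for word in words if word in stops)
        for lang, stops in STOPWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_hits), (_, second_hits) = ranked[0], ranked[1]
    # Refuse to guess when there is too little evidence or a tie.
    if best_hits < min_hits or best_hits == second_hits:
        return None
    return best
```

Returning None on weak evidence matters here: a wrong guess would send a message through the wrong language's fuzzy rules, which is worse than applying none.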
>c) use more than one .aff file, the correct one for each language?
Certainly it would help if ht://Dig kept some metadata for the
language of a document--this would enable language-specific searches
and language-specific fuzzy matching as you describe. But this would
likely be dependent on the META information available in the
documents themselves.
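
As a sketch of what reading that information out of a document could look like, the following uses Python's standard `html.parser` to pick up either the `lang` attribute on the `<html>` tag or a `META http-equiv="Content-Language"` tag. It is only an illustration of the idea, not how ht://Dig parses documents; the `DeclaredLanguage` class and `declared_language` function are hypothetical names.

```python
from html.parser import HTMLParser


class DeclaredLanguage(HTMLParser):
    """Record the first language a document declares about itself, if any."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if self.lang is not None:
            return  # keep the first declaration found
        # HTMLParser lowercases tag and attribute names; values may be None.
        attributes = dict(attrs)
        if tag == "html" and attributes.get("lang"):
            self.lang = attributes["lang"].lower()
        elif tag == "meta":
            http_equiv = (attributes.get("http-equiv") or "").lower()
            content = (attributes.get("content") or "").strip()
            if http_equiv == "content-language" and content:
                # The value may list several languages; take the first.
                self.lang = content.split(",")[0].strip().lower()


def declared_language(html_text):
    """Return the declared language tag in lowercase, or None if absent."""
    parser = DeclaredLanguage()
    parser.feed(html_text)
    return parser.lang
```

For an archive like yours, where the pages carry no such markup, this would simply return None for every document.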
>The FAQ seems to say that I should create a subdir $COMMON/german
>and install the german language files there, but that would make the
>English ones unused, no?
That is correct. Of course you can perform searches on all languages
at the same time--the only restriction is that most fuzzy algorithms
won't work well for languages other than the one whose files are
installed.
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
------------------------------------
List archives: <http://www.htdig.org/mail/menu.html>
FAQ: <http://www.htdig.org/FAQ.html>