According to Markus Fabritius:
> I'm using htdig version 3.1.6 on solaris, my problem is that htdig should
> index a German website which is completely utf-8 encoded. Htdig creates
> db.wordlist but since htdig is not familiar with the German umlauts, it
> just splits the words, e.g. "europäisch" is split in "europ" and "isch". So
> you cannot search for words with German umlauts.
>
> I used the htdig.conf with:
> translate_latin1: false
> translate_lt_gt: false
> translate_quot: false
> translate_amp: false
> locale: de.UTF-8@euro
>
> The locale on solaris is also set to de.UTF-8@euro
>
> The htdig website shows on its TODO list that they are working on "Better
> Internationalization - Support for UTF-8". It is not possible to switch the
> website to ISO-8859-1.
>
> Is there anyone who had the same problem and solved it??
Good question. Not to my knowledge, anyway. Most people indexing
western European languages seem to be using ISO-8859-1, not UTF-8.
Support for UTF-8 is certainly something we want added to htdig, but
there is no developer currently working on it.
The only workaround I can think of, right at the moment, is to write
an external converter to map all UTF-8 characters for western European
languages to their ISO-8859-1 equivalents, and map anything else to
some specific punctuation character (e.g. "?"). You could add it to
your config like this:
    locale: de_DE
    external_parsers: text/html->text/html-internal \
                      /usr/local/bin/utf8tolatin1
The locale would have to be set to one that uses ISO-8859-1, not UTF-8.
Of course, htsearch's output would then be in ISO-8859-1, not UTF-8,
but if you really need the latter for search results output, you could
write a wrapper program for htsearch that does the reverse mapping.
The translate_* attributes should all be set to true for this.
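
As a rough illustration of such a converter, here is a minimal Python
sketch. It assumes htdig passes the path of the fetched document as the
first command-line argument and reads the converted document from
standard output; please check the external_parsers documentation for
your version before relying on that. The function name utf8_to_latin1
is made up for this example, and the script leaves any charset
declaration inside the HTML untouched.

```python
#!/usr/bin/env python3
"""Sketch of a utf8tolatin1 converter (untested against htdig itself)."""
import sys


def utf8_to_latin1(data: bytes) -> bytes:
    # Decode as UTF-8; malformed byte sequences become U+FFFD.
    text = data.decode("utf-8", errors="replace")
    # Encode as ISO-8859-1; any character outside Latin-1
    # (including U+FFFD) is replaced by "?".
    return text.encode("latin-1", errors="replace")


def main(path: str) -> None:
    # Assumption: the document to convert is the file named by `path`.
    with open(path, "rb") as infile:
        data = infile.read()
    # Write raw bytes so Python applies no further encoding.
    sys.stdout.buffer.write(utf8_to_latin1(data))


# Only act when a file argument was actually supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Installed as /usr/local/bin/utf8tolatin1 and made executable, it would
be picked up by the external_parsers line shown above.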
Any other external parsers or converters you add to external_parsers
should also produce latin1 output, unless they produce UTF-8 HTML which
can then be refiltered by your converter. Any converter that outputs
latin1 HTML should be indicated as producing "text/html-internal" rather
than "text/html", so it goes right to the internal parser without being
filtered through utf8tolatin1.
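
The htsearch wrapper mentioned earlier, for the reverse mapping, could
be sketched along the same lines. This is only an outline under some
assumptions: the path in HTSEARCH is hypothetical (the real binary
would have to be renamed or moved so that the wrapper takes its place
as the CGI program), and the names latin1_to_utf8 and run_wrapped are
invented for the example. It converts the output only. Query words
submitted from a UTF-8 page would probably need the opposite conversion
before htsearch sees them, which this sketch does not attempt.

```python
#!/usr/bin/env python3
"""Sketch of an htsearch wrapper that re-encodes results as UTF-8."""
import os
import subprocess
import sys

# Hypothetical location of the real htsearch binary.
HTSEARCH = "/usr/local/libexec/htsearch.real"


def latin1_to_utf8(data: bytes) -> bytes:
    # Every byte value is a valid ISO-8859-1 character, so this cannot fail.
    return data.decode("latin-1").encode("utf-8")


def run_wrapped(command: list) -> bytes:
    # The child inherits the CGI environment and stdin unchanged;
    # only its standard output is captured and converted.
    result = subprocess.run(command, stdout=subprocess.PIPE)
    return latin1_to_utf8(result.stdout)


# Do nothing unless the binary exists at the assumed path.
if __name__ == "__main__" and os.access(HTSEARCH, os.X_OK):
    sys.stdout.buffer.write(run_wrapped([HTSEARCH] + sys.argv[1:]))
```

The result templates would also need to declare UTF-8 as their charset,
since the wrapper does not rewrite headers or meta tags.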
I hope this helps. I'd be interested in knowing if anyone has found
another solution, or has spotted problems with what I'm suggesting here.
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)