According to Markus Fabritius:
> I'm using htdig version 3.1.6 on solaris, my problem is that htdig should
> index a German website which is completely utf-8 encoded. Htdig creates
> db.wordlist but since htdig is not familiar with the German umlauts, it
> just splits the words, e.g. "europäisch" is split in "europ" and "isch". So
> you cannot search for words with German umlauts.
> 
> I used the htdig.conf with:
> translate_latin1:  false
> translate_lt_gt:   false
> translate_quot:    false
> translate_amp:     false
> locale:            de.UTF-8@euro
> 
> The locale on solaris is also set to de.UTF-8@euro
> 
> The htdig website shows on its TODO list that they are working on "Better
> Internationalization - Support for UTF-8". It is not possible to switch the
> website to ISO-8859-1.
> 
> Is there anyone who had the same problem and solved it??

Good question.  Not to my knowledge, anyway.  Most people indexing
western European languages seem to be using ISO-8859-1, not UTF-8.
Support for UTF-8 is certainly something we want added to htdig, but
there is no developer currently working on it.

The only workaround I can think of, right at the moment, is to write
an external converter to map all UTF-8 characters for western European
languages to their ISO-8859-1 equivalents, and map anything else to
some specific punctuation character (e.g. "?").  You could add it to
your config like this:

locale:                 de_DE
external_parsers:       text/html->text/html-internal \
                        /usr/local/bin/utf8tolatin1
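
A minimal sketch of what such a converter could look like, in Python 3,
is shown here. It is an illustration under assumptions, not a tested
htdig add-on: it assumes htdig calls the converter with the path of the
document as its first argument and reads the converted document from
standard output, as described for external_parsers. The function names
are made up for this example. Note that it converts the whole byte
stream and does not touch numeric HTML entities or any charset
declaration inside the page.

```python
#!/usr/bin/env python3
"""utf8tolatin1: sketch of a UTF-8 to ISO-8859-1 converter for htdig."""
import sys


def utf8_to_latin1(data: bytes) -> bytes:
    """Map UTF-8 input to ISO-8859-1, replacing anything unmappable by '?'."""
    # Malformed UTF-8 sequences decode to U+FFFD, which has no
    # ISO-8859-1 equivalent and therefore also ends up as "?".
    text = data.decode("utf-8", errors="replace")
    # Characters outside ISO-8859-1 (e.g. the euro sign) become "?".
    return text.encode("latin-1", errors="replace")


def convert_file(path: str) -> None:
    """Read the document at `path` and write the converted bytes to stdout."""
    with open(path, "rb") as source:
        data = source.read()
    sys.stdout.buffer.write(utf8_to_latin1(data))


# Assumption: the first argument is the file to convert. Any further
# arguments (content type, URL, config file) are ignored in this sketch.
if __name__ == "__main__" and len(sys.argv) > 1:
    convert_file(sys.argv[1])
```

With this mapping, "europäisch" reaches the internal parser as a single
ISO-8859-1 word, so it should no longer be split at the umlaut.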

The locale would have to be set to one that uses ISO-8859-1, not UTF-8.
Of course, htsearch's output would then be in ISO-8859-1, not UTF-8,
but if you really need the latter for search results output, you could
write a wrapper program for htsearch that does the reverse mapping.
The translate_* attributes should all be set to true for this.
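
The reverse mapping could be sketched as a small CGI wrapper like the
one below. This is only an outline under several assumptions: the path
in HTSEARCH is hypothetical and must be changed to wherever htsearch is
installed, only GET requests are handled, and the search form is assumed
to submit the query words as UTF-8. The function names are invented for
this example. It re-encodes the query string to ISO-8859-1 before
calling htsearch, then re-encodes htsearch's output to UTF-8.

```python
#!/usr/bin/env python3
"""Sketch of a UTF-8 wrapper around an ISO-8859-1 htsearch."""
import os
import subprocess
import sys
from urllib.parse import parse_qsl, urlencode

# Hypothetical location; adjust to the real htsearch binary.
HTSEARCH = "/usr/local/apache/cgi-bin/htsearch"


def query_to_latin1(query_string: str) -> str:
    """Re-encode percent-escaped UTF-8 form data as ISO-8859-1."""
    pairs = parse_qsl(query_string, keep_blank_values=True,
                      encoding="utf-8", errors="replace")
    # Characters without an ISO-8859-1 equivalent become "?".
    return urlencode(pairs, encoding="latin-1", errors="replace")


def output_to_utf8(raw: bytes) -> bytes:
    """Re-encode htsearch's ISO-8859-1 output (headers and body) as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def run_search() -> None:
    """Call htsearch with a converted query and emit converted output."""
    env = dict(os.environ)
    env["QUERY_STRING"] = query_to_latin1(env.get("QUERY_STRING", ""))
    result = subprocess.run([HTSEARCH], env=env, stdout=subprocess.PIPE)
    sys.stdout.buffer.write(output_to_utf8(result.stdout))


# Only act when invoked as a CGI program with a query string present.
if __name__ == "__main__" and "QUERY_STRING" in os.environ:
    run_search()
```

The result templates would also need to declare UTF-8 as their character
set so that browsers interpret the re-encoded pages correctly.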

Any other external parsers or converters you add to external_parsers
should also produce latin1 output, unless they produce UTF-8 HTML which
can then be refiltered by your converter.  Any converter that outputs
latin1 HTML should be indicated as producing "text/html-internal" rather
than "text/html", so it goes right to the internal parser without being
filtered through utf8tolatin1.

I hope this helps.  I'd be interested in knowing if anyone has found
another solution, or has spotted problems with what I'm suggesting here.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

