According to Gilles Detillieux:
>According to Lennart Almkvist:
>> Some more testing gave the following results:
>>
>> The German flower name "Stiefmütterchen" and the Icelandic
>> "þrenningarfjóla" are treated differently in meta content
>> than in the body or title part of an HTML document.
>>
>> When in the body or in the title, the "ü", "þ" and "ó"
>> are decoded to one-byte characters in the .wordlist and .words.db files.
>>
>> In meta content, however, these words are decoded to "stiefmuuml;t"
>> and "thorn;rennin" in the .wordlist and .words.db files. That is, the "&" is
>> removed and the rest is kept as letters ("&" is in valid_punctuation but
>> ";" is not, by default).
>>
>> Shouldn't they be decoded the same way as in the title or body?
>
>Here's a patch for 3.1.2 that should do what you want. Please give it a
>try and let us know if it fixes this bug.
[...]
> }
>
Something else is going wrong now.
It seems that you also strip off one character after the entity
somewhere (not everywhere, but in most cases).
E.g. instead of "über" I get "üer".
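
To make the expected behaviour concrete, here is a minimal Python sketch.
It is not the htdig code or the patch; decode_entities and
strip_punctuation are hypothetical names used only for illustration. The
first function consumes an entity up to and including its ";" and nothing
more, so the character after the entity survives. The second one mimics
what was reported for meta content, assuming only "&" is stripped.

```python
import re
from html.entities import name2codepoint

# Matches a named entity such as "&uuml;"; the match ends at the ";".
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace known named entities; leave unknown ones untouched."""
    def repl(match):
        codepoint = name2codepoint.get(match.group(1))
        return chr(codepoint) if codepoint is not None else match.group(0)
    return _ENTITY.sub(repl, text)


def strip_punctuation(text, valid_punctuation="&"):
    """Hypothetical stand-in for the reported meta-content behaviour:
    drop the punctuation characters and keep everything else."""
    return "".join(ch for ch in text if ch not in valid_punctuation)


print(decode_entities("&uuml;ber"))                     # über
print(decode_entities("&thorn;renningarfj&oacute;la"))  # þrenningarfjóla
print(strip_punctuation("Stiefm&uuml;tterchen"))        # Stiefmuuml;tterchen
```

The last line shows where "stiefmuuml;t" comes from once the word is
lowercased and cut to the indexed word length. With the patch, the first
call is the one that seems to lose the "b".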
cheers,
Torsten
--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14 Tel: +49-4101-403605
D-25474 Ellerbek Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED] Internet: http://www.inwise.de