According to Gilles Detillieux:
>According to Lennart Almkvist:
>> Some more testing gave the following results:
>> 
>> The German flower name "Stiefmütterchen" and the Icelandic
>> "þrenningarfjóla" are treated differently in meta content
>> than in the body or title part of an HTML document.
>> 
>> When in the body or in the title, the "ü", "þ" and "ó"
>> are decoded to one-byte characters in the .wordlist and .words.db files.
>> 
>> In meta content, however, these words are decoded to "stiefmuuml;t"
>> and "thorn;rennin" in the .wordlist and .words.db files. That is, the "&" is
>> removed and the rest is kept as letters ("&" is in valid_punctuation but
>> the ";" is not, by default).
>> 
>> Shouldn't they be decoded the same way as in the title or body?
>
>Here's a patch for 3.1.2 that should do what you want.  Please give it a
>try and let us know if it fixes this bug.
[...]
> }
>

Something else is going wrong now.

It seems that the patch also strips off one character after the
entity somewhere (not everywhere, but in most cases).

e.g. instead of "über" I'll get "üer"
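
To make the expected behaviour concrete, here is a minimal sketch in
Python (not ht://Dig's C++, and not the code from the patch). The
function name decode_entities is hypothetical. The point is that the
decoder should consume exactly the text from "&" through ";" and leave
the character that follows untouched:

```python
import re
from html.entities import name2codepoint

# Matches a named entity such as "&uuml;". The trailing ";" is part of
# the match, so nothing after it is consumed.
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def decode_entities(text):
    """Replace known named entities; leave unknown ones as they are.

    Hypothetical helper for illustration only.
    """
    def replace(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            return match.group(0)  # unknown entity: keep the original text
        return chr(codepoint)

    return _ENTITY.sub(replace, text)


print(decode_entities("&uuml;ber"))
print(decode_entities("Stiefm&uuml;tterchen"))
print(decode_entities("&thorn;renningarfj&oacute;la"))
```

If the scan position is advanced one step too far after the ";", the
"b" in "&uuml;ber" is lost, which would match what I am seeing.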


cheers,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]            Internet: http://www.inwise.de
