https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8272

--- Comment #9 from Sidney Markowitz <sid...@sidney.com> ---
After looking at the comment thread in bug 7126, it seems the logic is pretty
much as I described. The "last resort" is settling for garbage-in / garbage-out:
decoding as Windows-1252 in a way that can't fail, so that it produces something
even if the result is garbage.

If the data has an explicitly declared charset of UTF-8, I think it is more
likely that it really is UTF-8 with a small number of errors than that it
really is Windows-1252. Decoding it as UTF-8 without failing on error would
result in only the bad bytes (plus up to 3 more bytes per error byte) decoding
as garbage. Decoding such a string as Windows-1252 would turn every multibyte
character into garbage.
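
To make the difference concrete, here is a small illustration in Python rather
than in SpamAssassin's Perl. The sample string and the variable names are made
up for this example, and Python's "replace" error handler only stands in for
decoding without FB_CROAK; it is not the actual SpamAssassin code path.

```python
# Valid UTF-8 text containing three two-byte characters.
good = "naïve café, señor".encode("utf-8")

# Simulate a small amount of damage: one stray invalid byte after "na".
damaged = good[:2] + b"\xff" + good[2:]

# Lenient UTF-8: only the bad byte turns into U+FFFD, the rest survives.
as_utf8 = damaged.decode("utf-8", errors="replace")

# Windows-1252: every byte maps to some character, so each multibyte
# UTF-8 sequence becomes two unrelated characters.
as_cp1252 = damaged.decode("cp1252", errors="replace")

print(as_utf8)
print(as_cp1252)
```

In the lenient UTF-8 result the accented characters are intact and only the
stray byte is replaced. In the Windows-1252 result all three accented
characters are mangled.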

So I propose that before getting to the "last resort" we add a step: if the
charset is declared as UTF-8, we decode as UTF-8 without the FB_CROAK flag.
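
A rough sketch of the proposed ordering, written in Python for illustration
only. The function name decode_with_fallback and the three-step chain are
assumptions made for this example; the real decoding logic discussed in bug
7126 may have more steps, and errors="replace" here merely plays the role of
decoding without FB_CROAK.

```python
def decode_with_fallback(data: bytes, declared_charset: str) -> str:
    """Hypothetical helper showing where the proposed step would sit."""
    # Step 1: strict decode with the declared charset (fails on any error).
    try:
        return data.decode(declared_charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        pass  # malformed data or unknown charset name

    # Step 2 (the proposed addition): if UTF-8 was declared, trust the
    # declaration and decode leniently, so only bad bytes become U+FFFD.
    if declared_charset.lower().replace("_", "-") in ("utf-8", "utf8"):
        return data.decode("utf-8", errors="replace")

    # Step 3: the existing "last resort", a can't-fail Windows-1252 decode.
    return data.decode("cp1252", errors="replace")


print(decode_with_fallback(b"caf\xc3\xa9 \xff", "UTF-8"))
print(decode_with_fallback(b"caf\xe9", "us-ascii"))
```

With this ordering, slightly damaged mail that declares UTF-8 never reaches
the Windows-1252 step, while mail with other declared charsets falls through
to the last resort as before.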

I see from bug 7126 that at that time Mark Martinec had the most understanding
of the issues and had run tests of the results of decoding in many mails. Mark,
that's from 9 years ago, but do you by chance have any thoughts to weigh in on
this?

