Steven wrote:

> I routinely use mhfixmsg to clean up incoming messages, using this command
> in a shell script invoked through procmail:
>
>     mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
>         -reformat -fixcte -fixboundary -noreplacetextplain \
>         -fixtype application/octet-stream -noverbose -file - \
>         -outfile $destination < $source
> original message:
>
>     Veuillez ne pas r=E9
>
> This should decode to the following (represented in UTF-8):
>
>     Veuillez ne pas ré
>
> ...but mhfixmsg turns that into
>
>     Veuillez ne pas rÃ©

(I truncated the examples to focus on the first errant conversion, see below.)

> My questions are then:
>
> 1) Is this a bug in mhfixmsg, or am I just using it incorrectly?
>
> 2) If the former, is there further information I can supply to help track
>    this down, or further tests I can conduct on the message in question?
>
> 3) ...or if the latter, what am I doing wrong, and what should I be doing
>    instead?

Good questions, and thank you for your detailed report.

Looking at the first 8-bit character in the excerpt, E9 in iso8859-1, that
should have been converted to C3A9 in UTF-8.  iconv correctly does that:

    $ printf '\xE9' | iconv -f iso-8859-1 -t utf-8 | hexdump -C
    00000000  c3 a9                                             |..|

Instead, it got converted to C383C2A9.  I'm not sure why.

I expect that your environment is close enough to:

    $ iconv --version
    iconv (GNU libc) 2.34
    $ locale
    LANG=en_CA.utf8
    LC_CTYPE="en_CA.utf8"
    LC_NUMERIC="en_CA.utf8"
    LC_TIME="en_CA.utf8"
    LC_COLLATE="en_CA.utf8"
    LC_MONETARY="en_CA.utf8"
    LC_MESSAGES="en_CA.utf8"
    LC_PAPER="en_CA.utf8"
    LC_NAME="en_CA.utf8"
    LC_ADDRESS="en_CA.utf8"
    LC_TELEPHONE="en_CA.utf8"
    LC_MEASUREMENT="en_CA.utf8"
    LC_IDENTIFICATION="en_CA.utf8"

With this small example:

    $ cat 3
    MIME-Version: 1.0
    Content-Type: multipart/alternative; boundary="mime-boundary"
    Content-Transfer-Encoding: 8bit

    --mime-boundary
    Content-Transfer-Encoding: quoted-printable
    Content-Type: text/plain; charset=iso-8859-1

    =E9

    --mime-boundary
    Content-Transfer-Encoding: quoted-printable
    Content-Type: text/html; charset=iso-8859-1

    é

    --mime-boundary--

I see correct conversion of the quoted-printable E9 to UTF-8 C3A9:

    $ mhfixmsg -decodetext 8bit -decodetypes text -textcharset UTF-8 \
          -reformat -fixcte -fixboundary -noreplacetextplain \
          -fixtype application/octet-stream -noverbose -file - \
          -out - < 3 | hexdump -C | egrep a9
    000000c0  65 74 3d 22 55 54 46 2d  38 22 0a 0a c3 a9 0a 0a  |et="UTF-8"......|

Does adding -verbose to your mhfixmsg invocation provide any clues?

    mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, decode text/plain; charset=iso-8859-1
    mhfixmsg: /tmp/mhfixmsgUgtVK1 part 1, decode text/html; charset=iso-8859-1
    mhfixmsg: /tmp/mhfixmsgUgtVK1 part 2, convert iso-8859-1 to UTF-8

David
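
P.S. One observation about the C383C2A9 bytes mentioned above, offered as a
possible lead rather than a diagnosis: that sequence is what comes out when
E9 is converted from iso-8859-1 to UTF-8 and the result is then converted
from iso-8859-1 to UTF-8 a second time.  The sketch below only reproduces
the byte pattern.  It says nothing about where a second conversion might
happen in the procmail pipeline, and hex_bytes is a hypothetical helper
introduced here for illustration.  It assumes iconv and od are on PATH, and
uses the octal escape \351 for the byte E9 because not every shell's printf
accepts \x escapes.

```shell
# hex_bytes (hypothetical helper): print stdin as space-separated hex bytes.
hex_bytes() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# One pass: the ISO-8859-1 byte E9 converted to UTF-8.
once=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes)

# Two passes: the UTF-8 output is treated as ISO-8859-1 and converted again.
twice=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 \
                      | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes)

echo "once:  $once"
echo "twice: $twice"
```

If the second line matches what ends up in $destination, it may be worth
checking whether the message is already UTF-8 by the time mhfixmsg sees it
while its Content-Type still says iso-8859-1, or whether it passes through
the script more than once.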
