New submission from Martijn Pieters <m...@python.org>:

The From header in the following message is not decoded correctly. Both the Subject and From headers contain UTF-8 data encoded as RFC 2047 encoded-words, and in both cases a multi-byte UTF-8 code point has been split between the two encoded-word tokens:

>>> msgdata = '''\
From: =?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=
 =?utf-8?b?seGbiw==?= <mart...@example.com>
Subject: =?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=
 =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?=
'''
>>> from io import StringIO
>>> from email.parser import Parser
>>> from email import policy
>>> msg = Parser(policy=policy.default).parse(StringIO(msgdata))
>>> print(msg['Subject'])  # correct
sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ
>>> print(msg['From'])  # incorrect
ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖ� �ᛋ <mart...@example.com>

Note the two FFFD placeholders in the From line.

The issue is that the raw values of the From and Subject headers contain the folding space at the start of the continuation lines:

>>> for name, value in msg.raw_items():
...     if name in {'Subject', 'From'}:
...         print(name, repr(value))
...
From '=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n =?utf-8?b?seGbiw==?= <mart...@example.com>'
Subject '=?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=\n =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?='

For the Subject header, _header_value_parser.get_unstructured is used, which *expects* there to be spaces between encoded words; it inserts EWWhiteSpaceTerminal tokens in between, which are turned into empty strings. The AddressHeader parser does not do this for the From header: the space at the start of the continuation line is retained, so the surrogate escapes at the end of one encoded-word and the start of the next never adjoin, and the later step that turns surrogates back into proper data fails.
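
The failure mode described above can be modelled outside the email package. The following is only a simplified sketch, not the package's actual code path: it decodes the two base64 payloads from the From header separately with the surrogateescape error handler, then tries to recover the text with and without a space between the halves. The names head, tail and recover are made up for this illustration.

```python
import base64

# Base64 payloads of the two encoded-words in the From header above.
first = base64.b64decode("4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=")
second = base64.b64decode("seGbiw==")

# Decode each word on its own; bytes that do not form a complete UTF-8
# sequence are kept as lone surrogates instead of raising an error.
head = first.decode("utf-8", "surrogateescape")
tail = second.decode("utf-8", "surrogateescape")


def recover(text):
    """Turn surrogate escapes back into bytes, then decode as UTF-8 again."""
    raw = text.encode("utf-8", "surrogateescape")
    return raw.decode("utf-8", "replace")


# Escapes adjoin, so the split sequence E1 9A + B1 is whole again.
adjacent = recover(head + tail)
# A retained folding space keeps the two halves apart.
separated = recover(head + " " + tail)

print(adjacent)
print(separated)
```

In this model the first payload ends with the bytes E1 9A and the second starts with B1, which together form a single three-byte sequence. With nothing in between, the text is recovered without placeholders; with a space in between, the orphaned bytes on either side are replaced by U+FFFD.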

Since unstructured header parsing doesn't mind if a space is missing between encoded-word atoms, the work-around is to explicitly remove the space at the start of every continuation line; this can be done in a custom policy:

import re
from email.policy import EmailPolicy

class UnfoldingHeaderEmailPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove the folding whitespace character at the start of each
        # continuation line before further processing
        value = re.sub(r'(?<=[\n\r])([\t ])', '', value)
        return super().header_fetch_parse(name, value)

custom_policy = UnfoldingHeaderEmailPolicy()

after which the From header comes out without placeholders:

>>> msg = Parser(policy=custom_policy).parse(StringIO(msgdata))
>>> msg['from']
'ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖᚱᛋ <mart...@example.com>'
>>> msg['subject']
'sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ'

This issue was found by way of https://stackoverflow.com/q/53868584/100297

----------
messages: 332243
nosy: mjpieters
priority: normal
severity: normal
status: open
title: email.parser / email.policy does not correctly handle multiple RFC2047 encoded-word tokens across RFC5322 folded headers

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35547>
_______________________________________