New submission from Martijn Pieters <m...@python.org>:

The From header in the following email is not decoded correctly. Both the 
Subject and From headers contain UTF-8 data encoded as RFC 2047 
encoded-words, and in both cases a multi-byte UTF-8 code point has been split 
between two encoded-word tokens:

>>> msgdata = '''\
From: =?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=
 =?utf-8?b?seGbiw==?= <mart...@example.com>
Subject: =?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=
 =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?=
'''
>>> from io import StringIO
>>> from email.parser import Parser
>>> from email import policy
>>> msg = Parser(policy=policy.default).parse(StringIO(msgdata))
>>> print(msg['Subject'])  # correct
sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ
>>> print(msg['From'])  # incorrect
ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖ� �ᛋ <mart...@example.com>

Note the two FFFD placeholders in the From line.
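
The placeholders can be reproduced without the email package. The sketch 
below only illustrates the byte-level cause and is not the parser's actual 
code path: it base64-decodes the two From payloads copied from the message 
above and decodes them as UTF-8, first separately and then joined. The 
variable names are made up for this example.

```python
import base64

# Base64 payloads of the two encoded-words in the From header above.
first = base64.b64decode("4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=")
second = base64.b64decode("seGbiw==")

# Decoded separately, the split code point is invalid on both sides,
# so each half yields a U+FFFD replacement character.
separate = (first.decode("utf-8", errors="replace")
            + " "
            + second.decode("utf-8", errors="replace"))

# Decoded after joining the bytes, the code point is whole again.
joined = (first + second).decode("utf-8")

print(separate)
print(joined)
```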

The issue is that the raw values of the From and Subject headers contain the 
folding space at the start of the continuation lines:

>>> for name, value in msg.raw_items():
...     if name in {'Subject', 'From'}:
...         print(name, repr(value))
...
From '=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n =?utf-8?b?seGbiw==?= <mart...@example.com>'
Subject '=?utf-8?b?c8qHdcSxb2THnXBvyZQgOC3ihLLiiqXiiKkgx53Kh8qOcS3E?=\n =?utf-8?b?scqHyoNuya8gyaXKh8Sxyo0gx53Gg8mQc3PHncmvIMqHc8edyocgybnHncaDdW/Kgw==?='

For the Subject header, _header_value_parser.get_unstructured is used, which 
*expects* there to be spaces between encoded words; it inserts 
EWWhiteSpaceTerminal tokens in between, which are turned into empty strings. 
The AddressHeader parser used for the From header does not do this: the space 
at the start of the continuation line is retained, so the surrogate escapes at 
the end of one encoded-word and the start of the next never adjoin, and the 
later step that turns surrogates back into proper data fails.
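
The effect of that retained space can be sketched with Python's 
surrogateescape error handler. This is a simplified stand-in for what the 
parser does, assuming each encoded-word payload is decoded on its own and the 
surrogates are resolved in a later pass; the helper name resolve_surrogates 
is hypothetical.

```python
import base64

def resolve_surrogates(text):
    # Hypothetical helper: turn escaped bytes back into real characters,
    # replacing anything that is still not valid UTF-8 with U+FFFD.
    raw = text.encode("utf-8", errors="surrogateescape")
    return raw.decode("utf-8", errors="replace")

# Tail of the first From encoded-word, and the whole second one.
head = base64.b64decode("4Zo=").decode("utf-8", errors="surrogateescape")
tail = base64.b64decode("seGbiw==").decode("utf-8", errors="surrogateescape")

adjacent = resolve_surrogates(head + tail)         # escapes adjoin
separated = resolve_surrogates(head + " " + tail)  # folding space retained

print(repr(adjacent))
print(repr(separated))
```

When the two halves sit next to each other the escaped bytes recombine into 
valid UTF-8; with a space between them, each half is invalid on its own.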

Since unstructured header parsing doesn't mind if the space between 
encoded-word atoms is missing, the workaround is to explicitly remove the 
space or tab at the start of every continuation line; this can be done in a 
custom policy:

import re
from email.policy import EmailPolicy

class UnfoldingHeaderEmailPolicy(EmailPolicy):
    def header_fetch_parse(self, name, value):
        # remove any leading whitespace from header lines
        # before further processing
        value = re.sub(r'(?<=[\n\r])([\t ])', '', value)
        return super().header_fetch_parse(name, value)

custom_policy = UnfoldingHeaderEmailPolicy()

after which the From header comes out without placeholders:

>>> msg = Parser(policy=custom_policy).parse(StringIO(msgdata))
>>> msg['from']
'ᛗᚫᚱᛏᛁᛃᚾᛈᛁᛖᛏᛖᚱᛋ <mart...@example.com>'
>>> msg['subject']
'sʇuıodǝpoɔ 8-Ⅎ⊥∩ ǝʇʎq-ıʇʃnɯ ɥʇıʍ ǝƃɐssǝɯ ʇsǝʇ ɹǝƃuoʃ'
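
What the substitution in header_fetch_parse does to a raw value can be 
checked in isolation. A small sketch, using the raw From value shown earlier 
with the address swapped for a placeholder (user@example.com is made up for 
this example):

```python
import re

# Raw folded value: the continuation line starts with one space.
raw = ("=?utf-8?b?4ZuX4Zqr4Zqx4ZuP4ZuB4ZuD4Zq+4ZuI4ZuB4ZuW4ZuP4ZuW4Zo=?=\n"
       " =?utf-8?b?seGbiw==?= <user@example.com>")

# Same pattern as the policy above: drop one space or tab that
# directly follows a line break.
unfolded = re.sub(r'(?<=[\n\r])([\t ])', '', raw)

print(repr(unfolded))
```

Only a single whitespace character per continuation line is removed, so a 
header folded with a deeper indent would still keep some leading whitespace; 
a pattern such as [\t ]+ would probably be needed to cover that case.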

This issue was found by way of https://stackoverflow.com/q/53868584/100297

----------
messages: 332243
nosy: mjpieters
priority: normal
severity: normal
status: open
title: email.parser / email.policy does not correctly handle multiple RFC2047 
encoded-word tokens across RFC5322 folded headers

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue35547>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
