[issue34424] Unicode names break email header

Jens Troeger Fri, 17 Aug 2018 16:04:34 -0700

New submission from Jens Troeger <[email protected]>:

See also this comment and ensuing conversation: 
https://bugs.python.org/issue24218?#msg322761


Consider an email message with the following:

from email.headerregistry import Address
from email.message import EmailMessage

message = EmailMessage()
message["From"] = Address(addr_spec="[email protected]", display_name="Jens Troeger")
message["To"] = Address(addr_spec="[email protected]", display_name="Martín Córdoba")

It’s important here that the email addresses themselves are ASCII-encodable, 
but a display name is not. Flattening the object 
(https://github.com/python/cpython/blob/master/Lib/smtplib.py#L964) incorrectly 
inserts stray carriage returns ('\r') after the affected header, which breaks 
the header block and mangles the entire email:

flatmsg: b'From: Jens Troeger <[email protected]>\r\nTo: Fernando =?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <[email protected]>\r\r\r\r\r\nSubject:\r\n Confirmation: …\r\n…'
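
A minimal reproduction sketch follows. It assumes that flattening with 
BytesGenerator and linesep='\r\n' approximates what smtplib does before 
sending. The example.com addresses, the subject and body text, and the helper 
names flatten_with_crlf() and has_stray_cr() are hypothetical and not part of 
the original report. On an interpreter where this bug has been fixed, no stray 
'\r' should be reported.

```python
import re
from email.generator import BytesGenerator
from email.headerregistry import Address
from email.message import EmailMessage
from io import BytesIO


def flatten_with_crlf(message):
    """Flatten a message to bytes using CRLF line endings (hypothetical helper)."""
    buf = BytesIO()
    BytesGenerator(buf).flatten(message, linesep="\r\n")
    return buf.getvalue()


def has_stray_cr(flat):
    """Return True if any CR is not immediately followed by LF (hypothetical helper)."""
    return re.search(rb"\r(?!\n)", flat) is not None


message = EmailMessage()
# Placeholder addresses; the real ones are not given in the report.
message["From"] = Address(display_name="Jens Troeger", addr_spec="jens@example.com")
message["To"] = Address(display_name="Martín Córdoba", addr_spec="martin@example.com")
message["Subject"] = "Confirmation"
message.set_content("Hello")

flat = flatten_with_crlf(message)
print(flat)
print("stray CR found:", has_stray_cr(flat))
```

Checking for a '\r' that is not followed by '\n' detects the breakage without 
depending on exactly how many extra characters were inserted.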

After an initial investigation into the BytesGenerator (used to flatten an 
EmailMessage object), here is what’s happening.

Flattening the body and attachments of the EmailMessage object works. 
Eventually _write_headers() is called to flatten the headers, which happens 
entry by entry 
(https://github.com/python/cpython/blob/master/Lib/email/generator.py#L417-L418).
Flattening a header entry is a recursive process over the parse tree of the 
entry: it descends into the tree, encodes the individual “parts” (tokens of 
the header entry), and concatenates them into the final flattened string.
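
To look at a single header entry in isolation, it can be folded directly 
through the policy, which is the step _write_headers() performs per entry 
according to the call stack in this report. This is only a sketch: the 
example.com address is a placeholder, and the exact folded output depends on 
the Python version.

```python
from email.headerregistry import Address
from email.message import EmailMessage
from email.policy import SMTP

# SMTP is the standard EmailPolicy variant with linesep="\r\n".
message = EmailMessage(policy=SMTP)
# Placeholder address; the real one is not given in the report.
message["To"] = Address(display_name="Martín Córdoba", addr_spec="martin@example.com")

# Fold just this one header entry to bytes, as the generator does per entry.
folded = SMTP.fold_binary("To", message["To"])
print(folded)
```

On an affected interpreter the printed value is expected to show extra '\r' 
characters before the final '\r\n'.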

The parse tree for a header entry like "Martín Córdoba <[email protected]>" 
eventually yields the correct flattened string:

    '=?utf-8?q?Mart=C3=ADn_C=C3=B3rdoba?= <[email protected]>\r\n'

at the bottom of the recursion for this “Mailbox” part. The recursive call 
stack at that point is:

    _refold_parse_tree _header_value_parser.py:2687
    fold [Mailbox] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Address] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [AddressList] _header_value_parser.py:144
    _refold_parse_tree _header_value_parser.py:2630
    fold [Header] _header_value_parser.py:144
    fold [_UniqueAddressHeader] headerregistry.py:258
    _fold [EmailPolicy] policy.py:205
    fold_binary [EmailPolicy] policy.py:199
    _write_headers [BytesGenerator] generator.py:418
    _write [BytesGenerator] generator.py:195

The problem now arises from the interplay of 

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2629
    encoded_part = part.fold(policy=policy)[:-1] # strip nl

which strips the '\n' from the returned string, and

    # https://github.com/python/cpython/blob/master/Lib/email/_header_value_parser.py#L2686
    return policy.linesep.join(lines) + policy.linesep

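which appends the policy’s line separator, linesep="\r\n", to the end of the 
flattened string as the recursion unwinds. Stripping a single character removes 
only the '\n', so a stray '\r' is left behind each time a nested part is folded.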

I am not sure about a proper fix here, but considering that policy.linesep can 
be a string of any length (in this case len("\r\n") == 2), a fixed truncation 
of one character with [:-1] seems wrong. Instead, using:

    encoded_part = part.fold(policy=policy)[:-len(policy.linesep)] # strip nl

seems to work for entries with and without Unicode characters in their display 
names.
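
The effect of the two slicing variants can be illustrated with a toy model of 
the recursion. This is not the code from _header_value_parser.py: refold(), 
strip_one_char(), and strip_linesep() are hypothetical names, and the model 
only assumes that each nesting level strips the child's trailing newline and 
then re-appends linesep.

```python
def strip_one_char(text, linesep):
    """Fixed one-character truncation, as in `[:-1]`."""
    return text[:-1]


def strip_linesep(text, linesep):
    """Truncate by the actual separator length, as in `[:-len(policy.linesep)]`."""
    return text[:-len(linesep)]


def refold(text, depth, linesep, strip):
    """Toy model: the innermost part appends linesep; every enclosing level
    strips the child's trailing newline and appends linesep again."""
    if depth == 0:
        return text + linesep
    child = refold(text, depth - 1, linesep, strip)
    return strip(child, linesep) + linesep


# With a two-character separator, [:-1] leaves one '\r' behind per level.
print(repr(refold("x", 3, "\r\n", strip_one_char)))  # 'x\r\r\r\r\n'
# Stripping len(linesep) characters keeps the result clean.
print(repr(refold("x", 3, "\r\n", strip_linesep)))   # 'x\r\n'
# With linesep="\n" both variants behave identically.
print(repr(refold("x", 3, "\n", strip_one_char)))    # 'x\n'
```

In this model the number of stray '\r' characters grows with the nesting 
depth, and the problem only appears when the separator is longer than one 
character.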

----------
components: email
messages: 323686
nosy: _savage, barry, r.david.murray
priority: normal
severity: normal
status: open
title: Unicode names break email header
type: behavior
versions: Python 3.7

_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34424>
_______________________________________