Hi, I just stumbled over some curious behaviour of the stdlib email parsing APIs which accept strings rather than bytes. It appears that you can't parse an 8-bit UTF-8 message you have as a str without first encoding it.

The docs <https://docs.python.org/3/library/email.parser.html#feedparser-api> do mention some problems (which I saw after the fact):

> class email.parser.FeedParser(_factory=None, *, policy=policy.compat32)
>
> Works like BytesFeedParser except that the input to the feed() method
> must be a string. This is of limited utility, since the only way for such a
> message to be valid is for it to contain only ASCII text or, if utf8 is True,
> no binary attachments.
>
> Changed in version 3.3: Added the policy keyword.

Okay, cool - let's try parsing a message with text only (no attachments, no BINARYMIME), with a UTF-8 Content-Type, and a policy with utf8=True.

Python 3.7.1rc2 (default, Oct 14 2018, 15:27:05)
[GCC 8.2.1 20180831 [gcc-8-branch revision 264010]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import email.parser, email.policy
>>> pol = email.policy.SMTPUTF8
>>> pol.utf8
True
>>> pol.cte_type
'8bit'
>>> msg = '''MIME-Version: 1.0
... Content-Type: text/plain; charset="utf-8"
... Content-Transfer-Encoding: 8bit
... Subject: ¿Will it parse? Нет.
... 
... ¡This message contains two (٢) non-ASCII characters!
... '''
>>> fp = email.parser.FeedParser(policy=pol)
>>> fp.feed(msg)
>>> msg_obj = fp.close()
>>> msg_obj
<email.message.EmailMessage object at 0x7ff028012e10>
>>> print(msg_obj.get_content())
�This message contains two (\u0662) non-ASCII characters!
>>> print(msg_obj['Subject'])
¿Will it parse? Нет.

I don't know WHAT it's doing with the body there... It doesn't look like utf8 mode actually did anything. Interesting that the subject header survived! Maybe this is what the utf8=True does?

>>> email.policy.default.utf8
False
>>> fp2 = email.parser.FeedParser(policy=email.policy.default)
>>> fp2.feed(msg)
>>> msg_obj2 = fp2.close()
>>> print(msg_obj2['Subject'])
¿Will it parse? Нет.

Nope. Apparently, contrary to what my reading of the docs suggests, the utf8 flag does nothing at all when parsing.
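
As for what happens to the body: the garbled get_content() output above is consistent with the str payload being encoded with the 'raw-unicode-escape' codec and the resulting bytes then being decoded as UTF-8 with errors='replace'. That is an assumption inferred from the output, not something the docs state. A minimal sketch that reproduces the same string without touching the email package at all:

```python
# Assumption: this only mimics the observed output; it is not taken from the docs.
body = '¡This message contains two (٢) non-ASCII characters!\n'

# 'raw-unicode-escape' keeps code points below U+0100 as single Latin-1 bytes
# and writes every other non-ASCII character as a literal \uXXXX escape.
raw = body.encode('raw-unicode-escape')

# Decoding those bytes as UTF-8: the lone byte 0xA1 (from '¡') is not valid
# UTF-8 and is replaced by U+FFFD, while the ASCII escape text passes through.
garbled = raw.decode('utf-8', errors='replace')
```

If that guess is right, the '¡' is lost because its Latin-1 byte is not valid UTF-8 on its own, and the '٢' survives only as the six ASCII characters \u0662.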

Just to check that this was in fact a perfectly valid email:

>>> bfp = email.parser.BytesFeedParser(policy=pol)
>>> bfp.feed(msg.encode('utf-8'))
>>> msg_objb = bfp.close()
>>> print(msg_objb.get_content())
¡This message contains two (٢) non-ASCII characters!
>>> print(msg_objb['Subject'])
¿Will it parse? Нет.

BytesFeedParser is happy.

Question: Is this a bug? Am I missing something? Does the clause in the docs about utf8 mean anything?

Cheers
Thomas