New submission from John Howroyd:
From the library documentation, it is an intended feature that an email part
with content_maintype == "message" is treated as multipart. This does not
seem to comply with the MIME specification, nor does it match the expected
semantics. The attached email (from the dnc wikileaks collection) is a good
example of where this logic breaks down.
Code:
import email.parser
import pathlib
pth = pathlib.Path("16155.eml")  # path to example
parser = email.parser.BytesParser()
with pth.open("rb") as fl:  # open in binary mode for BytesParser
    eml = parser.parse(fl)
pts = list(eml.walk())
len(pts)  # returns 52
Here pts[0] is the root part, of content_type 'multipart/report'.
Then pts[1] has content_type 'multipart/alternative' and contains the
'text/plain' pts[2] and the 'text/html' pts[3] (which most email clients would
consider the message of this email). All good so far.
The problem is pts[4], of content_type 'message/delivery-status', for which
pts[4].is_multipart() returns True and which contains 46 subparts as returned
by pts[4].get_payload(): these are pts[5], ... , pts[50]. Finally, pts[51] has
content_type 'text/rfc822-headers', which is fine.
Each of the subparts of pts[4] (pts[5:51]) returns "" from
pts[n].get_payload(), because its content is treated as headers, whereas
pts[4].as_bytes() includes the header (and separating blank line) for that
part; namely, b'Content-Type: message/delivery-status\n\n'.
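The behaviour described above can be reproduced without the attached file.
The following is only a minimal sketch: the message is a hand-written,
hypothetical delivery status notification (all addresses and the "BOUND"
boundary are made up), not 16155.eml, and it uses the default policy of
BytesParser.

```python
from email.message import Message
from email.parser import BytesParser

# Hypothetical, hand-written bounce message (not the attached 16155.eml).
raw = (
    b"From: postmaster@example.com\n"
    b"To: sender@example.com\n"
    b"Subject: Delivery failure\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/report; report-type=delivery-status; boundary="BOUND"\n'
    b"\n"
    b"--BOUND\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"Delivery failed.\n"
    b"--BOUND\n"
    b"Content-Type: message/delivery-status\n"
    b"\n"
    b"Reporting-MTA: dns; mail.example.com\n"
    b"\n"
    b"Final-Recipient: rfc822; nobody@example.com\n"
    b"Action: failed\n"
    b"Status: 5.1.1\n"
    b"\n"
    b"--BOUND--\n"
)

eml = BytesParser().parsebytes(raw)

# Second sub-part of the report: the machine-readable delivery status.
status = eml.get_payload(1)
# Its payload is a list of Message objects, one per block of status fields.
blocks = status.get_payload()

print(status.get_content_type(), status.is_multipart())
for block in blocks:
    # Each block carries its fields as headers and has an empty body.
    print(block.keys(), repr(block.get_payload()))
```

In this sketch the two blocks of status fields appear as nested Message
objects, so eml.walk() visits them as if they were MIME parts, even though no
boundary marker separates them in the raw message.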
Looking at the raw file, and in particular at the MIME boundary markers, it
would seem to me that pts[4] should not be considered multipart, and that
nothing indicates that a part with a content-type of 'message/delivery-status'
should, or even could, be considered an (rfc822) email.
Moreover, as the main developer of a system to forensically analyse large
(million +) corpora of emails, I find this behaviour undesirable even for
parts of the specific content-type 'message/rfc822'; typically, these occur
as part of bounce messages and have their content-disposition set to
'attachment'. As a developer, what would seem more natural in the case that
this behaviour is wanted would be to test parts for the content-type
'message/rfc822' and pass the .get_payload(decode=True) to the bytes parser's
parser.parse() method.
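As a rough illustration of those semantics with the library as it stands, the
sketch below walks a message while treating message/* parts as leaves, and
re-parses an embedded email only on request. The helpers walk_top_level and
embedded_bytes are hypothetical names, not part of the email package, and the
sample message is made up. Note that with the current parser,
get_payload(decode=True) returns None for a part already treated as multipart,
so the sketch recovers the embedded bytes from the nested Message instead.

```python
from email.parser import BytesParser

# Hypothetical message with an attached email (addresses and boundary made up).
raw = (
    b"From: a@example.com\n"
    b"To: b@example.com\n"
    b"Subject: bounce\n"
    b"MIME-Version: 1.0\n"
    b'Content-Type: multipart/mixed; boundary="XX"\n'
    b"\n"
    b"--XX\n"
    b"Content-Type: text/plain\n"
    b"\n"
    b"Your message bounced.\n"
    b"--XX\n"
    b"Content-Type: message/rfc822\n"
    b"Content-Disposition: attachment\n"
    b"\n"
    b"From: b@example.com\n"
    b"To: c@example.com\n"
    b"Subject: original\n"
    b"\n"
    b"Original body.\n"
    b"--XX--\n"
)


def walk_top_level(msg):
    """Yield msg and its MIME sub-parts, treating message/* parts as leaves."""
    yield msg
    # Descend only into genuine multipart/* containers.
    if msg.get_content_maintype() == "multipart" and msg.is_multipart():
        for sub in msg.get_payload():
            yield from walk_top_level(sub)


def embedded_bytes(part):
    """Return the serialised message wrapped by a message/rfc822 part."""
    # The parser stores the embedded email as the single item of the payload list.
    return part.get_payload(0).as_bytes()


eml = BytesParser().parsebytes(raw)

default_types = [p.get_content_type() for p in eml.walk()]
leaf_types = [p.get_content_type() for p in walk_top_level(eml)]

# Opt in to parsing the attachment only when that is actually wanted.
attached = [p for p in walk_top_level(eml)
            if p.get_content_type() == "message/rfc822"]
inner = BytesParser().parsebytes(embedded_bytes(attached[0]))

print(default_types)
print(leaf_types)
print(inner["Subject"])
```

Here eml.walk() also visits the body of the attached email, while the
hypothetical walk_top_level stops at the 'message/rfc822' part.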
I appreciate the need to support backwards compatibility, so presumably this
would require an addition to email.policy to govern which parts should be
treated as multipart. I would be more than happy to submit a patch for this
but fear it would be rejected out of hand (as the original intent is clearly to
parse out contained emails).
----------
components: Library (Lib)
files: 16155.eml
messages: 405091
nosy: jdhowroyd
priority: normal
severity: normal
status: open
title: Email part with content type message is multipart.
type: behavior
versions: Python 3.8
Added file: https://bugs.python.org/file50396/16155.eml
_______________________________________
Python tracker
<https://bugs.python.org/issue45626>
_______________________________________