Long story short: we're using a custom Archiver to extract attachments
from email messages and write them to file storage, where a separate
service indexes them.
However, when we try to parse the message inside our archiver class
(reading headers and so on), it hard-fails with NoneType errors or
similar.
This goes along with a customized 'thread ID' system that we also built
and deployed. It uses Mailman's handlers and pipelines to put a custom
Thread ID on each message thread, tracking it via the subject line and
custom headers. Suffice it to say, that has worked well for over a year.
The following is an example of where we hit problems (note: assume all
import statements are present; I'm not including them for brevity):
P = re.compile(r'%\d*d')
TID_PATTERN = re.compile(r'[[{<][a-zA-Z0-9_-]+-\d+[]}>]')
NUMERIC_PATTERN = re.compile(r'\d+')
BASE_PATH = "/opt/mailman/attachment_extraction"


@public
@implementer(IArchiver)
class AttachmentArchiver:
    """
    Third party archiver class that extracts mailing list message
    attachments to external storage locations, and logs in a Database.
    """

    name = 'attachment_archiver'
    is_enabled = False

    @staticmethod
    def list_url(mlist):
        return None

    @staticmethod
    def permalink(mlist, msg) -> None:
        return None

    @staticmethod
    def archive_message(mlist, msg):
        # msg data here
        message_sender = msg.get('From', 'UNKNOWN SENDER')
        message_subject = msg.get('Subject', '(No Subject)')
        dt = dateparser.parse(msg.get('Date'))

        # MList data here
        listserv_name = mlist.posting_address
        prefix = mlist.subject_prefix

        # Calculated or Computed data (regex extracted thread ID for example)
        prefix_pattern = re.escape(prefix)
        # Unescape '%'
        prefix_pattern = '%'.join(prefix_pattern.split(r'\%'))
        if P.search(prefix, 1):
            prefix_pattern = P.sub(r'\\s*\\d+\\s*', prefix_pattern)

        try:
            thread_id = TID_PATTERN.search(message_subject)[0]
        except TypeError:
            thread_id = None
            # If there's no thread ID in the subject, then this archiver
            # was misconfigured.
            log.error(f"[attachment_extractor] ERROR: No Thread-ID found "
                      f"in message archived for mailing list "
                      f"'{mlist.list_name}' - if this mailing list doesn't "
                      f"have threading, DISABLE attachment_extractor "
                      f"in that list's archivers; if it does, then this "
                      f"is an Improperly Handled Message Error.")
            raise

        thread_storage_path = f"{BASE_PATH}/{thread_id}"

        with open(f"/tmp/test/{msg.get('Message-Id')}.eml", mode="wb") as f:
            f.write(msg.as_bytes())
When we reach the TID_PATTERN.search(message_subject)[0] line, though,
we get a TypeError saying it expected a string or bytes-like object.
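For what it's worth, here is a minimal sketch that reproduces the same
error outside Mailman using only the standard library. It rests on an
assumption I have not confirmed: that the Subject header reaches the
archiver as an email.header.Header object rather than a str (for
example, because an earlier handler assigned it that way). The
build_message helper and the sample subject are made up for the
illustration, and the pattern is TID_PATTERN from above with the
brackets escaped.

```python
import re
from email.header import Header
from email.message import Message

# Same pattern as TID_PATTERN above, with the brackets escaped.
TID_PATTERN = re.compile(r'[\[{<][a-zA-Z0-9_-]+-\d+[\]}>]')


def build_message(subject) -> Message:
    """Hypothetical helper: build a bare message with only a Subject."""
    msg = Message()
    msg['Subject'] = subject
    return msg


# One message whose Subject was set as a plain str...
plain = build_message('[support-42] Printer on fire')
# ...and one whose Subject was set as a Header object (the assumption).
wrapped = build_message(Header('[support-42] Printer on fire', 'utf-8'))

plain_subject = plain.get('Subject')      # comes back as str
wrapped_subject = wrapped.get('Subject')  # comes back as Header

# Searching the str works as expected.
print(TID_PATTERN.search(plain_subject)[0])

# Searching the Header object raises TypeError, because re only
# accepts str or bytes-like objects.
try:
    TID_PATTERN.search(wrapped_subject)
except TypeError as exc:
    print(f"TypeError: {exc}")

# Converting the Header to str first makes the search work again.
print(TID_PATTERN.search(str(wrapped_subject))[0])
```

If that assumption holds, it would explain why the try/except in
archive_message fires even when the subject visibly contains a thread
ID.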
So, I need some information (hint: type hints in your examples would be
wonderful):
(1) What is the datatype of `msg` in the archive_message method? Is it
an email.message.EmailMessage, an email.message.Message, or some
Mailman-specific representation of a message?
(2) If `msg` is neither an email.message.Message nor an
email.message.EmailMessage, how should we go about getting data and
headers *out* of the message?
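To make question (2) concrete, this is roughly the kind of type-hinted
helper we'd like to end up with. It is only a sketch: it assumes `msg`
is (or subclasses) email.message.Message and that a header value comes
back as a str, an email.header.Header, or None. header_text is a name I
made up for the example; it is not part of Mailman.

```python
from email.header import Header
from email.message import Message
from typing import Optional, Union

# Assumed set of types a header lookup may return.
HeaderValue = Union[str, Header, None]


def header_text(msg: Message, name: str,
                default: Optional[str] = None) -> Optional[str]:
    """Return the named header as a plain str, or default if absent."""
    value: HeaderValue = msg.get(name)
    if value is None:
        return default
    # str() is a no-op for str and decodes a Header object to text.
    return str(value)


# Usage with a message whose Subject is a Header object.
example = Message()
example['Subject'] = Header('[support-42] Printer on fire', 'utf-8')

print(header_text(example, 'Subject', '(No Subject)'))
print(header_text(example, 'From', 'UNKNOWN SENDER'))
```

If `msg` turns out to be something else entirely, the signature above
is the part we would need to change.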
Thomas