So, long story short, we're using a custom Archiver to extract attachments from email messages and write them to file storage, where a separate service indexes them.

However, when we try to parse the message inside our archiver class (reading headers and so on), it hard-fails with NoneType errors or similar.

This goes along with a customized 'thread ID' system that we also made and deployed that utilizes Mailman's handlers and pipelines to implement custom Thread IDs on each message thread, tracking it via subject line and custom headers and such.  Suffice it to say, that has worked well for over a year.

The following is an example of where we hit problems (note: assume all import statements are present; I'm not including them all, for brevity):

P = re.compile(r'%\d*d')
# Bracket literals are escaped so newer Python versions do not emit a FutureWarning.
TID_PATTERN = re.compile(r'[\[{<][a-zA-Z0-9_-]+-\d+[\]}>]')
NUMERIC_PATTERN = re.compile(r'\d+')
BASE_PATH = "/opt/mailman/attachment_extraction"

@public
@implementer(IArchiver)
class AttachmentArchiver:
    """
    Third party archiver class that extracts mailing list message attachments
    to external storage locations, and logs in a Database.
    """

    name = 'attachment_archiver'
    is_enabled = False


    @staticmethod
    def list_url(mlist):
        return None

    @staticmethod
    def permalink(mlist, msg) -> None:
        return None

    @staticmethod
    def archive_message(mlist, msg):
        # msg data here
        message_sender = msg.get('From', 'UNKNOWN SENDER')
        message_subject = msg.get('Subject', '(No Subject)')
        dt = dateparser.parse(msg.get('Date'))

        # MList data here
        listserv_name = mlist.posting_address
        prefix = mlist.subject_prefix

        # Calculated or Computed data (regex extracted thread ID for example)
        prefix_pattern = re.escape(prefix)
        # Unescape '%'
        prefix_pattern = '%'.join(prefix_pattern.split(r'\%'))
        if P.search(prefix, 1):
            prefix_pattern = P.sub(r'\\s*\\d+\\s*', prefix_pattern)
        try:
            thread_id = TID_PATTERN.search(message_subject)[0]
        except TypeError:
            thread_id = None
            # If there's no thread ID in the subject, then this archiver was misconfigured.
            log.error(f"[attachment_extractor] ERROR: No Thread-ID found in message archived for mailing list "
                      f"'{mlist.list_name}' - if this mailing list doesn't have threading, DISABLE attachment_extractor "
                      f"in that list's archivers; if it does, then this is an Improperly Handled Message Error.")
            raise

        thread_storage_path = f"{BASE_PATH}/{thread_id}"
        with open(f"/tmp/test/{msg.get('Message-Id')}.eml", mode="wb") as f:
            f.write(msg.as_bytes())

When we get to the `TID_PATTERN.search(message_subject)[0]` line, though, we get an error saying it expected a string or bytes-like object.
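
One thing worth noting about that traceback: `re` raises TypeError when it is handed a non-string, and subscripting a `None` match raises TypeError as well, so the `except TypeError` above cannot tell "no thread ID in the subject" apart from "the subject was not a str". A minimal sketch of the distinction, using only the standard library. `extract_thread_id` is a hypothetical helper, and whether Mailman's `msg` behaves like a stdlib `email.message.Message` under the default compat32 policy is an assumption here, not something confirmed:

```python
import re
from email.header import Header
from email.message import Message
from typing import Optional

# Same pattern as in the listing above, with the bracket literals escaped.
TID_PATTERN = re.compile(r'[\[{<][a-zA-Z0-9_-]+-\d+[\]}>]')


def extract_thread_id(msg: Message) -> Optional[str]:
    """Return the first thread ID found in the Subject header, or None."""
    # str() turns a Header object into text and leaves a plain str unchanged.
    subject = str(msg.get('Subject', '(No Subject)'))
    match = TID_PATTERN.search(subject)
    return match[0] if match else None


# Under the default compat32 policy, a header that was stored as a Header
# object comes back from get() as a Header object, not as a str.
msg = Message()
msg['Subject'] = Header('[support-123] Printer on fire', 'utf-8')
raw_subject = msg.get('Subject')
thread_id = extract_thread_id(msg)
```

With the coercion in place, a missing thread ID shows up as a `None` return rather than as a TypeError, so the two cases can be logged separately.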

So, I need some information (hint: type hinting in your examples, etc. would be wonderful):

(1) What is the datatype of `msg` in the archive_message method? Is it an email.message.EmailMessage, an email.message.Message, or some Mailman-specific representation of a message?

(2) If the msg datatype is not of email.message.Message or email.message.EmailMessage, how should we go about getting data and headers *out* of the message for the process?
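
For reference, a quick way to check the type empirically is to log the class hierarchy of whatever object arrives. This is only a sketch: `describe_message_type` is a hypothetical helper, and it is demonstrated on a stdlib `EmailMessage` because the class Mailman actually passes in is the open question.

```python
import email.message
from typing import List


def describe_message_type(msg: object) -> List[str]:
    """Return the dotted class names in msg's method resolution order."""
    return [f"{cls.__module__}.{cls.__qualname__}" for cls in type(msg).__mro__]


# Demonstrated on a stdlib EmailMessage; inside archive_message the same call
# would be made on the msg argument that Mailman passes in.
mro = describe_message_type(email.message.EmailMessage())
```

If `email.message.Message` appears anywhere in that list for the real `msg`, the usual `get()`, `get_all()`, and `walk()` methods should be available on it.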


Thomas