[Mailman-Developers] Some Confusion over Archivers and data types

Thomas Ward via Mailman-Developers Thu, 29 May 2025 09:51:04 -0700

So, long story short, we're using an Archiver as a solution to parse attachments from email message contents and into file storage for indexing by a separate service.

However, when we try to parse the message inside our class (reading headers, etc.), it simply hard-fails with None types or similar.

This goes along with a customized 'thread ID' system that we also made and deployed that utilizes Mailman's handlers and pipelines to implement custom Thread IDs on each message thread, tracking it via subject line and custom headers and such. Suffice it to say, that has worked well for over a year.

The following is an example of where we hit problems (note: assume all import statements are here, I'm not including them all for brevity):


P = re.compile(r'%\d*d')
TID_PATTERN = re.compile(r'[[{<][a-zA-Z0-9_-]+-\d+[]}>]')
NUMERIC_PATTERN = re.compile(r'\d+')
BASE_PATH = "/opt/mailman/attachment_extraction"

@public
@implementer(IArchiver)
class AttachmentArchiver:
    """
    Third-party archiver class that extracts mailing list message attachments
    to external storage locations, and logs in a database.
    """

    name = 'attachment_archiver'
    is_enabled = False


    @staticmethod
    def list_url(mlist):
        return None

    @staticmethod
    def permalink(mlist, msg) -> None:
        return None

    @staticmethod
    def archive_message(mlist, msg):
        # msg data here
        message_sender = msg.get('From', 'UNKNOWN SENDER')
        message_subject = msg.get('Subject', '(No Subject)')
        dt = dateparser.parse(msg.get('Date'))

        # MList data here
        listserv_name = mlist.posting_address
        prefix = mlist.subject_prefix

        # Calculated or computed data (regex-extracted thread ID, for example)

        prefix_pattern = re.escape(prefix)
        # Unescape '%'
        prefix_pattern = '%'.join(prefix_pattern.split(r'\%'))
        if P.search(prefix, 1):
            prefix_pattern = P.sub(r'\\s*\\d+\\s*', prefix_pattern)
        try:
            thread_id = TID_PATTERN.search(message_subject)[0]
        except TypeError:
            thread_id = None

            # If there's no thread ID in the subject, then this archiver was misconfigured.
            log.error(f"[attachment_extractor] ERROR: No Thread-ID found in message archived for mailing list "
                      f"'{mlist.list_name}' - if this mailing list doesn't have threading, DISABLE attachment_extractor "
                      f"in that list's archivers; if it does, then this is an Improperly Handled Message Error.")

            raise

        thread_storage_path = f"{BASE_PATH}/{thread_id}"

        with open(f"/tmp/test/{msg.get('Message-Id')}.eml", mode="wb") as f:
            f.write(msg.as_bytes())

When we get to the TID_PATTERN.search(message_subject)[0] line, though, we get an error about it expecting a string or bytes-like object.
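One possible cause, offered as an assumption rather than a confirmed diagnosis: with the standard library's default (compat32) policy, msg.get() returns whatever object was stored for the header, so a Subject that was set as an email.header.Header instance comes back as a Header, not a str, and re rejects it. The sketch below reproduces that with a plain email.message.Message and shows str() as a coercion. The helper name find_thread_id and the sample subject are made up for illustration, and the pattern is the same TID_PATTERN with its brackets escaped.

```python
import re
from email.header import Header
from email.message import Message
from typing import Optional

# Same character classes as TID_PATTERN in the post, with brackets escaped.
TID_PATTERN = re.compile(r'[\[{<][a-zA-Z0-9_-]+-\d+[\]}>]')


def find_thread_id(msg: Message) -> Optional[str]:
    """Return the first thread ID in the Subject header, or None (hypothetical helper)."""
    # str() decodes a Header object to text and leaves a plain str unchanged.
    subject = str(msg.get('Subject', '(No Subject)'))
    match = TID_PATTERN.search(subject)
    return match[0] if match else None


# Assumed scenario: the Subject was stored as a Header object, not a str.
demo = Message()
demo['Subject'] = Header('[proj-42] Hello', 'utf-8')

raw_value = demo.get('Subject')  # a Header instance under the compat32 policy
try:
    TID_PATTERN.search(raw_value)
    search_error = None
except TypeError as exc:
    # This TypeError comes from search() itself, not from subscripting None.
    search_error = exc

thread_id = find_thread_id(demo)
```

Note that `except TypeError` in the archiver above catches both this error and the "NoneType is not subscriptable" case, so the "No Thread-ID found" log message can appear even when the subject does contain an ID. Testing the match object explicitly, as the helper does, keeps the two cases apart.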

So, I need some information (hint: type hinting in your examples, etc. would be wonderful):

(1) What is the datatype of `msg` in the archive_message method? Is it an email.message.EmailMessage or email.message.Message or some Mailman datatype representation of a message?

(2) If the msg datatype is not of email.message.Message or email.message.EmailMessage, how should we go about getting data and headers *out* of the message for the process?
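For reference on question (2), here is a minimal type-hinted sketch of pulling attachments out of a message using only the standard library's email.message.Message API (walk(), get_filename(), get_payload(decode=True)). It assumes the object passed to archive_message behaves like email.message.Message, which is not confirmed here. The Attachment dataclass, the iter_attachments name, and the demo message are all hypothetical.

```python
from dataclasses import dataclass
from email.message import Message
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from typing import Iterator


@dataclass
class Attachment:
    """Hypothetical container for one extracted attachment."""
    filename: str
    content_type: str
    data: bytes


def iter_attachments(msg: Message) -> Iterator[Attachment]:
    """Yield every leaf MIME part that carries a filename."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts have no payload of their own
        filename = part.get_filename()
        if not filename:
            continue  # parts without a filename are treated as body text
        # decode=True undoes base64 / quoted-printable and returns bytes.
        data = part.get_payload(decode=True) or b''
        yield Attachment(filename, part.get_content_type(), data)


# Demo message: one text body plus one PDF-like attachment.
demo = MIMEMultipart()
demo.attach(MIMEText('see attached'))
pdf = MIMEApplication(b'%PDF-demo', _subtype='pdf')
pdf.add_header('Content-Disposition', 'attachment', filename='report.pdf')
demo.attach(pdf)

found = list(iter_attachments(demo))
```

To check question (1) empirically, logging type(msg).__mro__ from inside archive_message would show which classes the object actually derives from.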



Thomas
