Hi list,

This post follows up on something I brought up here almost 2 years ago: http://mail.python.org/pipermail/email-sig/2004-November/000181.html

I finally have time to look at this again and have some (hopefully) better ideas on how to accomplish custom email payload storage. Where I work we need to be able to handle huge email messages which don't always fit in RAM. We are using some pretty awful hacks on the Python email libs to store payloads on disk instead of in memory.

The attached patch against 4.0a2 is a rough sketch of a relatively clean way to solve the problem. It is incomplete; I've posted it here to get some feedback before I head too far down the path of implementing a particular solution. A simple demo script is also included.

The Message class has been modified so it can handle payloads that are either a string (as now) or an instance of a new Payload class. An iter_payload() method has been added to Message to allow payload data to be streamed out (regardless of the payload type underneath).

I've included 2 sample Payload classes. One (MemoryPayload) is a simple memory store. The other (FilePayload) stores payloads in temporary files on disk, so the payload doesn't sit in RAM. Future payload classes could:

- use mixed memory/disk storage, storing only large payloads on disk so there's minimal I/O overhead for small payloads

- cache the decoded copy of a payload so that decoding is only done once if the decoded payload is required multiple times

- do crazy things like storing payloads across a network.

The possibilities are endless :)
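
As a rough illustration of the first idea (mixed memory/disk storage), here is a minimal standalone sketch. It is not part of the attached patch: SpooledPayload and max_memory are made-up names, decoding is left out entirely, and it is written for a current Python so it runs without the patched package. It leans on tempfile.SpooledTemporaryFile, which keeps data in memory until max_size is exceeded and then rolls over to a real temporary file.

```python
import tempfile


class SpooledPayload:
    """Hypothetical payload store: in RAM up to max_memory bytes, then on disk."""

    chunk_size = 8192

    def __init__(self, max_memory=64 * 1024):
        # SpooledTemporaryFile buffers in memory and switches to a real
        # temporary file once more than max_size bytes have been written.
        self._f = tempfile.SpooledTemporaryFile(max_size=max_memory)

    def add(self, chunk):
        # Always append at the end, wherever the last read left the position.
        self._f.seek(0, 2)
        self._f.write(chunk)

    def iter(self):
        # Feed the payload out in fixed-size chunks (no decoding in this sketch).
        self._f.seek(0)
        while True:
            buf = self._f.read(self.chunk_size)
            if not buf:
                break
            yield buf


small = SpooledPayload(max_memory=16)
small.add(b'x' * 10)    # still fits in memory
small.add(b'y' * 100)   # exceeds max_memory, so the data moves to disk
data = b''.join(small.iter())
```

The caller sees the same add()/iter() interface either way; only the storage underneath changes.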

More work is required on the parsing side. The FeedParser needs to accept an optional Payload factory class and generate payloads of that type as it parses. This should be an easy change.
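
To make the factory idea a bit more concrete, here is a toy sketch of the shape I have in mind. None of this exists in the email package or in the attached patch: BodyCollector is a greatly simplified, hypothetical stand-in for the body-handling part of FeedParser, and ListPayload and payload_factory are illustrative names only.

```python
class ListPayload:
    """Stand-in for a Payload class: simply keeps the chunks in a list."""

    def __init__(self):
        self.chunks = []

    def add(self, chunk):
        self.chunks.append(chunk)


class BodyCollector:
    """Hypothetical parser fragment that builds payloads via a factory."""

    def __init__(self, payload_factory=ListPayload):
        # The parser never names a concrete payload class; it only calls
        # the factory it was given, so callers choose the storage.
        self._payload = payload_factory()

    def feed(self, data):
        # Hand each piece of body text straight to the payload object
        # instead of accumulating it in the parser.
        self._payload.add(data)

    def close(self):
        return self._payload


collector = BodyCollector()
collector.feed('Hello ')
collector.feed('world')
payload = collector.close()
```

Passing a disk-backed class as payload_factory would then keep the body out of RAM without any other change to the parser.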

The Generator class also needs to be modified. It should use the new iter_payload() method so that a payload stored in a memory-efficient way is never loaded into RAM in full.
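
The change on the output side amounts to writing chunks as they arrive rather than joining them first. A minimal standalone sketch, assuming the payload is available as an iterable of string chunks (which is what iter_payload() yields in the patch); write_payload is a hypothetical helper, not a Generator method:

```python
import io


def write_payload(chunks, outfp):
    """Write each chunk to outfp as it arrives; return the characters written."""
    total = 0
    for chunk in chunks:
        # Only one chunk is held in memory at a time.
        outfp.write(chunk)
        total += len(chunk)
    return total


out = io.StringIO()
written = write_payload(iter(['Hello ', 'world']), out)
```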

These changes are backwards compatible with the existing email API.

Thoughts/questions/flames?

Regards,
Menno Smits
diff -Naur --exclude '*.pyc' email-4.0a2-orig/email/message.py email-4.0a2/email/message.py
--- email-4.0a2-orig/email/message.py	2006-03-05 19:58:33.000000000 +0000
+++ email-4.0a2/email/message.py	2006-07-14 13:11:00.000000000 +0100
@@ -7,15 +7,13 @@
 __all__ = ['Message']
 
 import re
-import uu
-import binascii
-import warnings
 from cStringIO import StringIO
 
 # Intrapackage imports
 import email.charset
 from email import utils
 from email import errors
+from email import payloads
 
 SEMISPACE = '; '
 
@@ -180,34 +178,15 @@
         is returned.
         """
         if i is None:
-            payload = self._payload
+            return ''.join(self.iter_payload(decode))
         elif not isinstance(self._payload, list):
             raise TypeError('Expected list, got %s' % type(self._payload))
         else:
-            payload = self._payload[i]
-        if decode:
-            if self.is_multipart():
+            # Multipart container
+            if decode:
                 return None
-            cte = self.get('content-transfer-encoding', '').lower()
-            if cte == 'quoted-printable':
-                return utils._qdecode(payload)
-            elif cte == 'base64':
-                try:
-                    return utils._bdecode(payload)
-                except binascii.Error:
-                    # Incorrect padding
-                    return payload
-            elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
-                sfp = StringIO()
-                try:
-                    uu.decode(StringIO(payload+'\n'), sfp, quiet=True)
-                    payload = sfp.getvalue()
-                except uu.Error:
-                    # Some decoding problem
-                    return payload
-        # Everything else, including encodings with 8bit or 7bit are returned
-        # unchanged.
-        return payload
+            else:
+                return self._payload[i]
 
     def set_payload(self, payload, charset=None):
         """Set the payload to the given value.
@@ -215,10 +194,29 @@
         Optional charset sets the message's default character set.  See
         set_charset() for details.
         """
+        if isinstance(payload, payloads.Payload):
+            if payload.encoding:
+                if self.has_key('Content-Transfer-Encoding'):
+                    self.replace_header('Content-Transfer-Encoding', payload.encoding)
+                else:
+                    self['Content-Transfer-Encoding'] = payload.encoding
+
         self._payload = payload
         if charset is not None:
             self.set_charset(charset)
 
+    def iter_payload(self, decode=False):
+        #XXX: document
+        if isinstance(self._payload, str):
+            payload = payloads.MemoryPayload(self._payload)
+        elif isinstance(self._payload, payloads.Payload):
+            payload = self._payload
+        else:
+            raise TypeError('unsupported payload type for iterating')
+
+        for buf in payload.iter(decode):
+            yield buf
+
     def set_charset(self, charset):
         """Set the charset of the payload to a given character set.
 
diff -Naur --exclude '*.pyc' email-4.0a2-orig/email/payloads.py email-4.0a2/email/payloads.py
--- email-4.0a2-orig/email/payloads.py	1970-01-01 01:00:00.000000000 +0100
+++ email-4.0a2/email/payloads.py	2006-07-14 13:10:00.000000000 +0100
@@ -0,0 +1,120 @@
+import tempfile
+import binascii
+import quopri
+import uu
+import base64
+from email import utils
+
+class Payload:
+    def __init__(self, encoding=None, value=''):
+        self.encoding = encoding
+        self.set(value)
+
+    def set(self, buffer):
+        raise NotImplementedError
+
+    def get(self, decode):
+        return ''.join(self.iter(decode))
+
+    def add(self, chunk):
+        raise NotImplementedError
+
+    def iter(self, decode):
+        raise NotImplementedError
+
+class MemoryPayload(Payload):
+
+    def set(self, value):
+        self._buffer = [value]
+
+    def add(self, chunk):
+        self._buffer.append(chunk)
+
+    def iter(self, decode):
+        if decode:
+            decoder = get_string_decoder(self.encoding)
+        else:
+            decoder = None
+
+        # Just yield the payload all at once since it's all in memory already
+        if decoder:
+            yield decoder(''.join(self._buffer))
+        else:
+            yield ''.join(self._buffer)
+
+class FilePayload(Payload):
+
+    chunk_size = 8192
+
+    def __init__(self, encoding=None, value=''):
+        self._f = None
+        Payload.__init__(self, encoding, value)
+
+    def __del__(self):
+        self._close()
+
+    def _close(self):
+        if self._f:
+            self._f.close()
+
+    def set(self, buf):
+        self._close()
+        self._f = tempfile.TemporaryFile()
+        self._f.write(buf)
+
+    def add(self, chunk):
+        self._f.seek(0, 2)
+        self._f.write(chunk)
+
+    def iter(self, decode):
+        if decode:
+            decoder = get_file_decoder(self.encoding)
+        else:
+            decoder = None
+
+        if decoder:
+            # Decode into a separate file first to ensure there are no decoding
+            # errors. 
+            fout = tempfile.TemporaryFile()
+            self._f.seek(0, 0)
+            decoder(self._f, fout)
+        else:
+            fout = self._f
+
+        # Feed out the payload in chunks
+        fout.seek(0, 0)
+        while 1:
+            buf = fout.read(self.chunk_size)
+            if buf:
+                yield buf
+            else:
+                break
+
+        # If decoding occurred, close the temporary file for the decoded version
+        if fout != self._f:
+            fout.close()
+            
+def get_string_decoder(cte):
+    cte = (cte or '').lower()  # encoding defaults to None
+
+    if cte == 'quoted-printable':
+        return utils._qdecode
+    elif cte == 'base64':
+        return utils._bdecode
+    elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
+        return binascii.a2b_uu
+    else:
+        return None
+
+def get_file_decoder(cte):
+    cte = (cte or '').lower()  # encoding defaults to None
+
+    if cte == 'quoted-printable':
+        return quopri.decode
+    elif cte == 'base64':
+        return base64.decode
+    elif cte in ('x-uuencode', 'uuencode', 'uue', 'x-uue'):
+        return uu.decode
+    else:
+        return None
+
from email.message import Message
from email.payloads import FilePayload
import sys

p = FilePayload('quoted-printable')

p.add('Hello world. =\nThis is some text.\n')
p.add('More stuff.')

msg = Message()
msg.set_type('text/plain')
msg.set_payload(p)

for x in msg.iter_payload(decode=True):
    sys.stdout.write(x)

print
print '-'*70
print msg.as_string()
