Re: [Mailman-Users] privacy options, SPAM, regex

Mark Sapiro Thu, 27 Nov 2008 12:21:24 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Mark Sapiro wrote:
> Helmut Schneider wrote:
>> Interesting, with "^subject:.*Declined.*"
>>
>> Subject: Declined: [Somelist] Invitation to workshop on 13rd Dec. 2008
>>
>> matches while
>>
>> Subject: [Somelist] Declined:  Invitation to workshop on 13rd Dec. 2008
>>
>> does not. Huh?!
> 
> 
> It turns out that RFC 2047 encoded headers are not decoded before
> matching against the regexps. Is that the issue here? What do the raw
> headers look like?
> 
> I think that the headers should be decoded, but I wonder if people are
> currently working around this with regexps that match encoded headers
> and wouldn't match decoded headers.
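
A minimal sketch of the suspected mismatch, written for Python 3 rather
than the Python 2 that Mailman 2.x runs on. The subject text, the forced
base64 encoding, and the IGNORECASE and MULTILINE flags are assumptions
made for illustration only; whether the original report actually
involves an RFC 2047 encoded header is still unknown.

```python
import re
from email.charset import BASE64, Charset
from email.header import Header, decode_header, make_header

# Assumed flags; the rule itself is the one quoted above.
rule = re.compile(r"^subject:.*Declined.*", re.IGNORECASE | re.MULTILINE)

# Hypothetical subject containing a non-ASCII character (the "ü").
subject = "[Somelist] Declined: Einladung für den Workshop"

# Force base64 so the result is an opaque RFC 2047 encoded-word.
charset = Charset("utf-8")
charset.header_encoding = BASE64
encoded = Header(subject, charset, maxlinelen=200).encode()

# What the filter sees without decoding, and what it would see with it.
raw_header = "Subject: " + encoded
decoded_header = "Subject: " + str(make_header(decode_header(encoded)))

raw_match = rule.search(raw_header) is not None
decoded_match = rule.search(decoded_header) is not None
print(raw_match, decoded_match)
```

In this sketch the word "Declined" never appears literally in the raw
header, so the rule can only match once the header has been decoded.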



I have developed a patch for SpamDetect.py that decodes RFC 2047
encoded headers. This is somewhat problematic because the decoded
headers will presumably contain non-ASCII characters, and while the
character sets of the headers are known (and there can be different
headers or even different parts of a single header encoded in different
character sets), the character set of the regexps in header_filter_rules
is not known.

The patch builds a unicode object containing all the headers, unfolded
and RFC 2047 decoded, with one complete header per line. It then encodes
that object into the character set of the list's preferred_language, and
the regexps search the result. As long as the regexps contain only ASCII
and the raw headers contain no non-ASCII characters, this should give
the expected results. If the regexps contain non-ASCII characters, or
the headers contain non-ASCII characters that are not RFC 2047 encoded,
results may be unexpected.
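
The approach described above can be sketched in Python 3 roughly as
follows. This is not the patch itself (which targets Python 2 and
appears in full after the signature); the function name
decoded_header_text and the fallback to us-ascii for unknown character
sets are assumptions made for illustration.

```python
import re
from email import message_from_string
from email.header import decode_header


def decoded_header_text(msg, cset="utf-8"):
    """Return msg's headers, unfolded and RFC 2047 decoded, one per line.

    Characters that cannot be represented in cset are replaced with "?".
    """
    lines = []
    for name, value in msg.items():
        # Unfold continuation lines before decoding.
        unfolded = re.sub(r"\n\s", " ", str(value))
        parts = []
        for frag, charset in decode_header(unfolded):
            # Fragments are bytes when the header has encoded-words,
            # and a plain str when it has none.
            if isinstance(frag, bytes):
                try:
                    frag = frag.decode(charset or "us-ascii", "replace")
                except LookupError:
                    # Unknown charset name: assumed fallback.
                    frag = frag.decode("us-ascii", "replace")
            parts.append(frag)
        lines.append("%s: %s" % (name, "".join(parts)))
    text = "\n".join(lines) + "\n"
    # Round-trip through the list's character set, as the patch does.
    return text.encode(cset, "replace").decode(cset)


sample = message_from_string(
    "Subject: =?iso-8859-1?q?Declined=3A_f=FCr_Workshop?=\n"
    "To: list@example.com\n"
    "\n"
    "body\n"
)
print(decoded_header_text(sample, "utf-8"))
```

Passing a narrower character set such as us-ascii shows the lossy case:
the "ü" in the sample subject is replaced, which is one way a regexp
containing non-ASCII characters could fail to match.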

If, in fact, the original issue is due to RFC 2047 encoded headers, try
the patch and let us know how it works.

- --
Mark Sapiro <[EMAIL PROTECTED]>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)

iD8DBQFJLwEfVVuXXpU7hpMRArKTAKCiDYtwz3VENF8Qww1tEw3lUMzUnQCgoGNh
K8vySqy57Vn8w0EHpj6LeJM=
=0pk1
-----END PGP SIGNATURE-----

--- f:/test-mailman-2.2/Mailman/Handlers/SpamDetect.py  2007-07-17 11:06:14.000000000 -0700
+++ f:/test-mailman/Mailman/Handlers/SpamDetect.py      2008-11-27 11:53:59.468750000 -0800
@@ -26,9 +26,8 @@
 """
 
 import re
-from cStringIO import StringIO
 
-from email.Generator import Generator
+from email.Header import decode_header
 
 from Mailman import mm_cfg
 from Mailman import Errors
@@ -60,34 +59,21 @@
 
 
 
-class Tee:
-    def __init__(self, outfp_a, outfp_b):
-        self._outfp_a = outfp_a
-        self._outfp_b = outfp_b
-
-    def write(self, s):
-        self._outfp_a.write(s)
-        self._outfp_b.write(s)
-
-
-# Class to capture the headers separate from the message body
-class HeaderGenerator(Generator):
-    def __init__(self, outfp, mangle_from_=True, maxheaderlen=78):
-        Generator.__init__(self, outfp, mangle_from_, maxheaderlen)
-        self._headertxt = ''
-
-    def _write_headers(self, msg):
-        sfp = StringIO()
-        oldfp = self._fp
-        self._fp = Tee(oldfp, sfp)
-        try:
-            Generator._write_headers(self, msg)
-        finally:
-            self._fp = oldfp
-        self._headertxt = sfp.getvalue()
+def getDecodedHeaders(msg, cset='utf-8'):
+    """Returns a string containing all the headers of msg, unfolded and
+    RFC 2047 decoded and encoded in cset.
+    """
 
-    def header_text(self):
-        return self._headertxt
+    headers = ''
+    for h, v in msg.items():
+        uvalue = u''
+        v = decode_header(re.sub('\n\s', ' ', v))
+        for frag, cs in v:
+            if not cs:
+                cs = 'us-ascii'
+            uvalue += unicode(frag, cs, 'replace')
+        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'replace'))
+    return headers
 
 
 
@@ -106,13 +92,10 @@
     # TK: Collect headers in sub-parts because attachment filename
     # extension may be a clue to possible virus/spam.
     headers = ''
+    # Get the character set of the list's preferred language for headers
+    cset = mm_cfg.LC_DESCRIPTIONS[mlist.preferred_language][1]
     for p in msg.walk():
-        g = HeaderGenerator(StringIO())
-        g.flatten(p)
-        headers += g.header_text()
-    # Now reshape headers (remove extra CR and connect multiline).
-    headers = re.sub('\n+', '\n', headers)
-    headers = re.sub('\n\s', ' ', headers)
+        headers += getDecodedHeaders(p, cset)
     for patterns, action, empty in mlist.header_filter_rules:
         if action == mm_cfg.DEFER:
             continue

