On 10/12/2011 6:58 PM, William Yardley wrote:

> Does Mailman base64 decode the subject before applying a regex, and if
> so, can I use UTF-8 character names in the regex to match various
> types of 8-bit characters?
No. header filter rules regexps are matched against the raw headers. If
a header is RFC2047 encoded, it is not decoded.

> Say, for example, that I want to block messages with "电话卡" somewhere
> in the subject line.
>
> Obviously, the actual raw Subject header will be more like:
>
> Subject: =?GB2312?B?[encoded stuff here]?=
> Subject: =?utf-8?B?[encoded stuff here]?=
>
> I tried putting in a regex to hold messages matching:
> Subject: .*\u7535\u8bdd\u5361
>
> And that didn't seem to work. As far as I can tell, there is no way to
> find a substring that will always match when the Subject header is
> base64 encoded.

I think this is correct. Each 3 bytes which are base64 encoded result in
a 4-character base64 substring. If the characters you are looking for
are encoded as a multiple of 3 bytes and begin on a 3-byte boundary,
they will encode to a unique base64 string, but if they don't begin and
end on a 3-byte boundary the base64 substring will be affected by what
comes before and/or after.

Thus, I don't think you can reliably match, even if you are only dealing
with a single character set.

-- 
Mark Sapiro <[email protected]>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
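
To illustrate the 3-byte boundary point, here is a minimal Python sketch.
It is not Mailman code and does not touch Mailman's header filtering; the
helper name b64 and the ASCII prefixes "a", "ab" and "abc" are made up for
the example. It assumes the subject is UTF-8, where each of the three
characters happens to be 3 bytes long, and it encodes the same text after
prefixes of different lengths to show when the aligned base64 form survives
as a substring.

```python
import base64

TARGET = "电话卡"  # 9 bytes in UTF-8 (3 bytes per character)


def b64(text: str) -> str:
    """Base64-encode the UTF-8 bytes of text (hypothetical helper)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


# Base64 of the target when it starts exactly on a 3-byte boundary.
aligned = b64(TARGET)

# Put 0, 1, 2 and 3 bytes of ASCII in front of the target and check
# whether the aligned encoding still appears in the result.
for prefix in ("", "a", "ab", "abc"):
    encoded = b64(prefix + TARGET)
    print(repr(prefix), encoded, aligned in encoded)
```

Only the prefixes whose length is a multiple of 3 bytes leave the aligned
string intact; with 1 or 2 bytes in front, every 4-character group is built
from different bit positions and the substring no longer appears. A GB2312
subject would produce yet another set of bytes, so the same regexp would not
cover both character sets either.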
