[Bug 7249] Decode MIME-encoded filenames in attachments

bugzilla-daemon Mon, 05 Oct 2015 17:03:06 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249


--- Comment #6 from Mark Martinec <[email protected]> ---
> The "real" filename in this example is 'документы для отдела кадров.pdf'.
> But spamassassin does not decode such filenames. This bug/feature_missing
> leads to Bayes missing actual spam words used in filename and PDFInfo plugin
> completely ignoring such attachments because it does not find .pdf extension
> in MIME-encoded version of filename.

These words are also missing from Bayes tokens, although the code
path there is different: header decoding goes through
Message::Node::_decode_header, which intentionally avoids decoding
MIME-words in Content-* header fields. The reason is probably
RFC 2047, which explicitly excludes the use of MIME-words there,
although the later RFC 2184 (since obsoleted by RFC 2231) introduced
an encoding for such parameter values.
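
To illustrate why the PDFInfo check fails on the raw value, here is a
minimal Python sketch (not SpamAssassin's Perl code). It assumes the
sender put an RFC 2047 base64 encoded-word into the filename parameter;
encode_word and decode_words are hypothetical helper names used only
for this example.

```python
import base64
from email.header import decode_header


def encode_word(text, charset="utf-8"):
    """Build one RFC 2047 base64 encoded-word (hypothetical helper).

    For simplicity this ignores the 75-character limit on encoded-words;
    real mailers would split a name this long into several words.
    """
    payload = base64.b64encode(text.encode(charset)).decode("ascii")
    return f"=?{charset}?B?{payload}?="


def decode_words(value):
    """Decode any RFC 2047 encoded-words in a header value into a str."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


real_name = "документы для отдела кадров.pdf"
encoded_name = encode_word(real_name)

# The extension is not visible until the encoded-word is decoded.
print(encoded_name.lower().endswith(".pdf"))
print(decode_words(encoded_name).endswith(".pdf"))
```

An extension test applied to the encoded form cannot match, because the
base64 alphabet contains no "." character at all.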

Will see what can be done with __decode_header() and _normalize()
to get such names decoded.

Interestingly, at some point in the distant past it seems to have
been decided that Encode::decode("MIME-Header",...) was not the best
choice, and a custom decoder was implemented instead
(Mail::SpamAssassin::Util::qp_decode,
Mail::SpamAssassin::Util::base64_decode, __decode_header). Not sure
what the rationale was, possibly some bug in Encode::MIME::Header
back then. It now seems suboptimal to use two different decoding
implementations for the same header field in two places.

>> Btw, the MIME encoding in the provided sample is incorrect, it breaks
>> the RFC 2047 section 5 requirement:
>
> This example was taken from real spam message which was created by some
> non-rfc-compliant software.

I made some modifications to my copy of Message::Node.pm to deal
with it better: it now mangles only the split character instead of
giving up on UTF-8 decoding entirely and falling back to Windows 1250,
which yields true gibberish. This needs some more testing.
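
The difference between the two strategies can be sketched in Python
(an illustration only, not the actual change to Message::Node.pm). It
assumes an encoded-word boundary falls inside a two-octet UTF-8
character and that each chunk is decoded on its own; both function
names are hypothetical.

```python
def decode_strict_or_fallback(octets):
    """Give up on UTF-8 at the first error and reinterpret as Windows-1250."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return octets.decode("cp1250", errors="replace")


def decode_lenient(octets):
    """Stay with UTF-8; replace only the undecodable octets with U+FFFD."""
    return octets.decode("utf-8", errors="replace")


octets = "кадров.pdf".encode("utf-8")
# Simulate a chunk boundary that splits the two-octet character "к".
first_chunk, second_chunk = octets[:1], octets[1:]

fallback = (decode_strict_or_fallback(first_chunk)
            + decode_strict_or_fallback(second_chunk))
lenient = decode_lenient(first_chunk) + decode_lenient(second_chunk)

print(repr(fallback))  # the Cyrillic text is lost to the fallback charset
print(repr(lenient))   # only the split character is mangled
```

With the lenient variant only the split character is lost, so the rest
of the name remains usable for tokenization and extension checks.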

> Do you recommend reverting the change?

It can definitely stay in trunk/4.0 but needs more work to deal with
such cases elsewhere in the code. I'm slowly chipping away at the
characters vs. octets choices, and this is one more welcome piece of
the puzzle.

On a quick look it seems that $msg->{'name'} is hardly used anywhere
except in the PDFInfo plugin, so a change there for the 3.4 branch
will likely have only a local effect in this plugin and is probably
alright. It might be safer to encode the obtained characters into
UTF-8 octets for the 3.4 branch, so that octets stay octets.
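
As a rough Python analogy for "octets stay octets" (str standing in for
Perl characters, bytes for octets), a sketch under the assumption that
callers expect octets; name_as_octets is a hypothetical helper, not
part of SpamAssassin.

```python
def name_as_octets(name):
    """Return a filename as UTF-8 octets; octets pass through unchanged."""
    if isinstance(name, str):
        return name.encode("utf-8")
    return name


decoded_name = "документы для отдела кадров.pdf"
stored_name = name_as_octets(decoded_name)

# Callers written for octets keep working, including extension checks.
print(stored_name.endswith(b".pdf"))
```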


> Perhaps a normalize_charset config true check encapsulating the
> change can help then?

The normalize_charset is not involved in this code path, so it should
not matter whether it is on or off.

-- 
You are receiving this mail because:
You are the assignee for the bug.
