https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7249
--- Comment #6 from Mark Martinec <[email protected]> ---

> The "real" filename in this example is 'документы для отдела кадров.pdf'.
> But spamassassin does not decode such filenames. This bug/feature_missing
> leads to Bayes missing actual spam words used in filename and PDFInfo plugin
> completely ignoring such attachments because it does not find .pdf extension
> in MIME-encoded version of filename.

These words are also missing from Bayes tokens, although the code path
there is different: header decoding goes through
Message::Node::_decode_header, which intentionally avoids decoding
MIME-words in Content-* header fields. The reason is probably RFC 2047,
which explicitly excluded the use of MIME-words there, although a later
RFC 2184 introduced such encodings.

Will see what can be done with __decode_header() and _normalize()
to get such names decoded.

Interestingly, at some point in the distant past it seems to have been
decided that Encode::decode("MIME-Header",...) may not be the best
choice, and a separate decoder was implemented instead
(Mail::SpamAssassin::Util::qp_decode,
Mail::SpamAssassin::Util::base64_decode, __decode_header).
Not sure what the rationale was, possibly some bug in
Encode::MIME::Header back then. It now seems suboptimal to use two
different decoding implementations for the same header field
in two places.

>> Btw, the MIME encoding in the provided sample is incorrect, it breaks
>> the RFC 2047 section 5 requirement:
>
> This example was taken from real spam message which was created by some
> non-rfc-compliant software.

I made some modifications to my copy of Message::Node.pm to better
deal with it: just mangle the split character instead of giving up
on UTF-8 decoding entirely and falling back to Windows 1250,
which yields true gibberish. Needs some more testing.

> Do you recommend reverting the change?

It can definitely stay in trunk/4.0, but it needs more work to deal
with such cases elsewhere in the code.
I'm slowly working through the characters vs. octets choices,
and this is one more welcome piece of the puzzle.

On a quick look it seems that $msg->{'name'} is hardly used anywhere
except in the PDFInfo plugin, so a change there for the 3.4 branch
will likely only have a local effect in this plugin, and is probably
alright. It might be safer to encode the obtained characters into
UTF-8 octets for the 3.4 branch, so that octets stay octets.

> Perhaps a normalize_charset config true check encapsulating the
> change can help then?

The normalize_charset setting is not involved in this code path,
so it should not matter whether it is on or off.
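
To illustrate the split-character problem discussed above, here is a
minimal Python sketch. It is not SpamAssassin code and not the patch
mentioned in the comment. All function names are hypothetical, the
sample is generated by the script itself (it is not the spam sample
attached to the bug), and the value is assumed to consist only of
encoded-words. It shows that the encoded form carries no visible .pdf
extension, that strict per-word decoding fails when a multi-byte
character is split across two encoded-words, and that decoding with
replacement characters mangles only the split character. Joining the
octets before decoding is shown purely as an alternative; the comment
does not describe that approach.

```python
import base64
import quopri
import re

# One RFC 2047 encoded-word: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def word_octets(encoding: str, payload: str) -> bytes:
    """Return the raw octets carried by one encoded-word payload."""
    if encoding.upper() == "B":
        return base64.b64decode(payload)
    # "Q" encoding; header=True also maps "_" to a space.
    return quopri.decodestring(payload.encode("ascii"), header=True)


def decode_per_word(value: str, errors: str = "strict") -> str:
    """Decode each encoded-word on its own and concatenate the results."""
    return "".join(
        word_octets(enc, payload).decode(charset, errors)
        for charset, enc, payload in ENCODED_WORD.findall(value)
    )


def decode_joined(value: str) -> str:
    """Join the octets of all words, then decode once.

    Assumption: all words share the charset of the first word.
    """
    words = ENCODED_WORD.findall(value)
    octets = b"".join(word_octets(enc, payload) for _, enc, payload in words)
    return octets.decode(words[0][0])


def make_broken_sample(name: str, split_at: int) -> str:
    """Build two UTF-8 "B" encoded-words, cutting the octets at split_at."""
    raw = name.encode("utf-8")
    chunks = (raw[:split_at], raw[split_at:])
    return " ".join(
        "=?UTF-8?B?" + base64.b64encode(chunk).decode("ascii") + "?="
        for chunk in chunks
    )


NAME = "документы для отдела кадров.pdf"
# Octet 5 lies inside the two-octet letter "к", so the split breaks
# the RFC 2047 section 5 rule on whole characters per encoded-word.
SAMPLE = make_broken_sample(NAME, 5)

if __name__ == "__main__":
    print("encoded form ends with .pdf:", SAMPLE.lower().endswith(".pdf"))
    try:
        decode_per_word(SAMPLE)
    except UnicodeDecodeError as exc:
        print("strict per-word decoding failed:", exc)
    print("per-word with replacement:", decode_per_word(SAMPLE, errors="replace"))
    print("octets joined first:", decode_joined(SAMPLE))
```

With replacement enabled, the decoded name still ends in .pdf, which
is all an extension check would need; only the split letter is lost.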
