https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8129
Bug ID: 8129
Summary: Subject gets UTF-8 encoded twice in some circumstances
Product: Spamassassin
Version: 4.0.0
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Libraries
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: Undefined
Created attachment 5884
--> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5884&action=edit
sample email - Simplified Chinese
In the process of developing a new plugin, I believe I found a bug in SA. As
you're aware, the first line of the body text is actually the message subject.
However, there are cases where the first line of body text gets mangled because
the subject is already UTF-8 encoded and then it gets UTF-8 encoded again when
the whole body gets encoded. This causes body rules to not find matches in the
subject.
It doesn't affect messages where the subject is 7-bit ASCII since UTF-8
encoding a 7-bit ASCII string is a no-op. I've included a patch along with a
test case in Simplified Chinese.
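
For illustration only, here is a minimal sketch of the double-encoding effect described above. It is written in Python rather than SA's Perl and is not taken from the patch. The function name encode_again is hypothetical, and the sketch assumes the second pass treats each already-encoded byte as a Latin-1 character and encodes it again. The test phrase is the six characters that the byte pattern in the rules below decodes to.

```python
import re

# Byte pattern copied from the SUBJ_TEST / BODY_TEST rules in this report.
PATTERN = re.compile(
    rb"\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91"
)


def encode_again(utf8_bytes: bytes) -> bytes:
    """Simulate a second UTF-8 encode pass (hypothetical helper).

    Assumption: each existing byte is treated as a Latin-1 character and
    encoded to UTF-8 once more.
    """
    return utf8_bytes.decode("latin-1").encode("utf-8")


# The six characters the rule pattern decodes to, UTF-8 encoded once.
phrase_once = "\u5916\u8d38\u5ba2\u6237\u5f00\u53d1".encode("utf-8")
phrase_twice = encode_again(phrase_once)

# Encoded once: the byte pattern matches.
print(PATTERN.search(phrase_once) is not None)   # True
# Encoded twice: every byte >= 0x80 became two bytes, so the match is lost.
print(PATTERN.search(phrase_twice) is not None)  # False

# 7-bit ASCII survives a second pass unchanged, so such subjects are unaffected.
ascii_subject = b"Hello world"
print(encode_again(ascii_subject) == ascii_subject)  # True
```

The 18-byte phrase grows to 36 bytes after the second pass, which is why a byte-level body rule can no longer find it.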
To reproduce, create the following two rules:
header SUBJ_TEST Subject =~ /\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91/
body BODY_TEST /\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91/
Run SA with the attached email. Both rules should fire but only the SUBJ_TEST
rule fires.
I validated the patch with the full test suite and it passes. But I would
appreciate some feedback to make sure I'm doing it right. For example, is it
safe to assume that the subject is always UTF-8 encoded, or is the call to
utf8::is_utf8 really necessary? From what I've read, the is_utf8 function is
deprecated and shouldn't be used anymore.
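
To make the question concrete, here is a rough sketch of the "encode only if not already encoded" idea. It is a Python analogy, not the Perl code from the patch. The helper name to_utf8_bytes is hypothetical. Python's str/bytes distinction is only an approximate stand-in for Perl's internal UTF8 flag, which is what utf8::is_utf8 inspects.

```python
def to_utf8_bytes(value):
    """Return UTF-8 bytes, encoding only when the value is still text.

    Hypothetical helper: bytes are assumed to be UTF-8 already and are
    passed through untouched, so a second call can never double-encode.
    """
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


text_subject = "\u5916\u8d38"            # character string, not yet encoded
encoded = to_utf8_bytes(text_subject)    # encoded exactly once
encoded_again = to_utf8_bytes(encoded)   # already bytes: returned as-is

print(encoded == encoded_again)  # True
```

A guard like this is only as good as its assumption that byte input is already valid UTF-8. That is the same assumption the question above asks about for the subject.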
Thanks
Kent