[Bug 8129] New: Subject gets UTF-8 encoded twice in some circumstances

bugzilla-daemon Thu, 11 May 2023 17:53:08 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8129


            Bug ID: 8129
           Summary: Subject gets UTF-8 encoded twice in some circumstances
           Product: Spamassassin
           Version: 4.0.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: Undefined

Created attachment 5884
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5884&action=edit
sample email - Simplified Chinese

While developing a new plugin, I believe I found a bug in SA. As you're aware,
the first line of the body text is actually the message subject. However, in
some cases that first line gets mangled: the subject is already UTF-8 encoded,
and it is then UTF-8 encoded a second time when the whole body is encoded. As
a result, body rules fail to match text in the subject.

It doesn't affect messages where the subject is 7-bit ASCII since UTF-8
encoding a 7-bit ASCII string is a no-op. I've included a patch along with a
test case in Simplified Chinese. 

To reproduce, create the following two rules:

  header SUBJ_TEST Subject =~ /\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91/
  body   BODY_TEST /\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91/

Run SA with the attached email. Both rules should fire but only the SUBJ_TEST
rule fires.
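
A minimal sketch of why the body pattern can miss, written in Python purely
for illustration (SA itself is Perl, and this is not the attached patch). It
assumes the second pass treats each already-encoded byte as a single Latin-1
character and encodes it again; the byte string is copied from the two rules
above, and the variable names are made up for the example.

```python
import re

# Bytes copied from the SUBJ_TEST / BODY_TEST patterns: the subject already
# encoded as UTF-8 (six CJK characters, three bytes each).
once = b"\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91"

# Assumption: the second pass reads each byte as one Latin-1 character and
# encodes it again, which is what "UTF-8 encoded twice" amounts to.
twice = once.decode("latin-1").encode("utf-8")

# The same byte pattern that both rules use.
pattern = re.compile(
    rb"\xE5\xA4\x96\xE8\xB4\xB8\xE5\xAE\xA2\xE6\x88\xB7\xE5\xBC\x80\xE5\x8F\x91"
)

print(len(once), len(twice))              # every byte >= 0x80 becomes two bytes
print(pattern.search(once) is not None)   # single-encoded text matches
print(pattern.search(twice) is not None)  # double-encoded text does not

# 7-bit ASCII survives a second pass unchanged, so such subjects are unaffected.
ascii_subject = b"Plain ASCII subject"
print(ascii_subject.decode("latin-1").encode("utf-8") == ascii_subject)
```

This prints `18 36`, `True`, `False`, `True`: the double-encoded form is twice
as long and no longer contains the byte sequence the rule looks for.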

I validated the patch against the full test suite and it passes, but I would
appreciate some feedback to make sure I'm doing it right. For example, is it
safe to assume that the subject is always UTF-8 encoded, or is the call to
utf8::is_utf8 really necessary? From what I've read, the is_utf8 function is
deprecated and shouldn't be used anymore.
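
The question comes down to making sure a value is encoded exactly once. A
loose illustration in Python, not the attached patch and not SA code:
`encode_once` is a hypothetical helper. Python tells text from bytes by type,
whereas Perl's utf8::is_utf8 tests whether a string is marked internally as
UTF-8, so the analogy is only approximate.

```python
def encode_once(value):
    """Return UTF-8 bytes, encoding only if the value is still text."""
    if isinstance(value, str):      # character string: needs encoding
        return value.encode("utf-8")
    return value                    # already bytes: leave untouched


subject_text = "\u5916\u8d38"                   # two characters, as text
subject_bytes = subject_text.encode("utf-8")    # the same subject, as bytes

# Both forms end up as the same bytes; the encoded form is not encoded again.
print(encode_once(subject_text) == subject_bytes)
print(encode_once(subject_bytes) == subject_bytes)
```

Without a guard of this kind, the caller has to know for certain which form
it holds, which is the assumption being asked about.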

Thanks
Kent

-- 
You are receiving this mail because:
You are the assignee for the bug.
