https://bz.apache.org/SpamAssassin/show_bug.cgi?id=8310

Bug ID: 8310
Summary: Issue with Matching UTF-8 Anchor Text in URIDetail plugin
Product: Spamassassin
Version: 4.0.2
Hardware: All
OS: Linux
Status: NEW
Severity: major
Priority: P2
Component: Plugins
Assignee: dev@spamassassin.apache.org
Reporter: thana...@gmail.com
Target Milestone: Undefined

Created attachment 5997
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5997&action=edit
sample email

SpamAssassin's uri_detail plugin fails to match Unicode characters written in \x{00} notation within anchor text.

ASCII hex escapes (\x6f) and non-hex escapes (\s) work, but Unicode hex escapes (\x{E0}) fail within uri_detail rules. The same Unicode hex escapes work correctly in regular body rules, so the issue is specific to the uri_detail context.

The (?^aa: prefix in the regex might be related, but removing it doesn't solve the problem. Even pasting the raw Unicode character directly into the regex fails.

The rule:

uri_detail UNICODE_LINK_TEXT text =~ /\\x{E0}\\x{B8}\\x{97}\\x{E0}\\x{B8}\\x{B1}\\x{E0}\\x{B8}\\x{99}\\x{E0}\\x{B8}\\x{97}\\x{E0}\\x{B8}\\x{B5}/

The anchor text is:

\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}

-- 
You are receiving this mail because:
You are the assignee for the bug.
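
As a sanity check on the data in the report, the following Python sketch decodes the \x{HH} escapes from the rule and from the anchor text into raw bytes and confirms that the rule's byte sequence occurs in the anchor text, so a match would be expected. It only inspects the quoted strings and does not reproduce SpamAssassin's matching. The helper name escapes_to_bytes is hypothetical, and the rule is written here with single backslashes, which is an assumption (the rule as quoted above shows doubled ones).

```python
import re

# Escapes copied from the report. Assumption: single-backslash form for both.
RULE_PATTERN = (
    r"\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}\x{E0}\x{B8}\x{99}"
    r"\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
)
ANCHOR_TEXT = (
    r"\x{E0}\x{B8}\x{95}\x{E0}\x{B9}\x{88}\x{E0}\x{B8}\x{AD}"
    r"\x{E0}\x{B8}\x{AD}\x{E0}\x{B8}\x{B2}\x{E0}\x{B8}\x{A2}"
    r"\x{E0}\x{B8}\x{B8}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B1}"
    r"\x{E0}\x{B8}\x{99}\x{E0}\x{B8}\x{97}\x{E0}\x{B8}\x{B5}"
)


def escapes_to_bytes(escaped: str) -> bytes:
    """Convert a run of \\x{HH} escapes into the raw bytes they name.

    Hypothetical helper for this check only; it is not part of SpamAssassin.
    """
    hex_pairs = re.findall(r"\\x\{([0-9A-Fa-f]{2})\}", escaped)
    return bytes(int(pair, 16) for pair in hex_pairs)


pattern_bytes = escapes_to_bytes(RULE_PATTERN)
anchor_bytes = escapes_to_bytes(ANCHOR_TEXT)

# Both sequences are valid UTF-8, so they can be shown as readable text.
print("rule pattern:", pattern_bytes.decode("utf-8"))
print("anchor text: ", anchor_bytes.decode("utf-8"))

# The rule's bytes appear in the anchor text (here, at its very end).
print("pattern found in anchor:", pattern_bytes in anchor_bytes)  # True
```

Both byte sequences decode cleanly as UTF-8 Thai text, and the rule's five characters are the final five characters of the anchor text.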