https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727
Bug ID: 7727
Summary: New Plugin TesseractOcr
Product: Spamassassin
Version: unspecified
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Plugins
Assignee: dev@spamassassin.apache.org
Reporter: john.me...@mailcleaner.net
Target Milestone: Undefined

Created attachment 5663
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5663&action=edit
TesseractOcr plugin

Greetings,

On behalf of Fastnet SA/MailCleaner Software, Inc., I have developed a new OCR plugin. The primary reasons behind this are the poor performance of FuzzyOCR in its default configuration and the need to maintain a separate set of rules for that plugin.

For those who are not directly familiar, FuzzyOCR has a separate wordlist, with a value for how approximate each match can be, which is fairly opaque. When it finds a match, it simply passes the hit rule back.

TesseractOcr performs very well and efficiently. Instead of keeping a separate wordlist, the plugin I have written passes the parsed text back to the parent SpamAssassin process, where the content can be matched by regular body rules. Fuzzy matching can then be handled through regular expressions, which are already largely written with false positives in mind and which are much easier to debug.

The plugin is currently operating on several MailCleaner machines without issue. Given the performance overhead of any OCR plugin, it is not proposed to be enabled by default.

I look forward to any feedback.

Regards,
John Mertz
john.me...@mailcleaner.net
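
To illustrate the idea of handling fuzzy matching with ordinary regular expressions rather than a separate wordlist, here is a minimal Python sketch. It is not the plugin's code (the plugin is the attachment above). The rule names, the patterns, and the `match_rules` helper are all hypothetical; each pattern simply stands in for a body rule that tolerates common OCR character confusions such as `1` for `l` or `3` for `e`.

```python
import re

# Hypothetical rules: the names and patterns are illustrative assumptions,
# not taken from the plugin or from any SpamAssassin ruleset.
RULES = {
    "OCR_CLICK_HERE": re.compile(r"\bc[l1i]{1,2}ck\s+h[e3]re\b", re.IGNORECASE),
    "OCR_FREE_OFFER": re.compile(r"\bfr[e3]{2}\s+[o0]ff[e3]r\b", re.IGNORECASE),
}


def match_rules(ocr_text, rules=RULES):
    """Return the sorted names of the rules whose pattern matches the OCR text."""
    return sorted(
        name for name, pattern in rules.items() if pattern.search(ocr_text)
    )


if __name__ == "__main__":
    # Text as an OCR engine might return it from an image, with typical misreads.
    print(match_rules("FR33 0FFER, cl1ck h3re"))
    print(match_rules("Meeting notes attached"))
```

Because the tolerance lives in the pattern itself, a false positive can be traced to one readable expression instead of an approximation threshold.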