https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727

            Bug ID: 7727
           Summary: New Plugin TesseractOcr
           Product: Spamassassin
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Plugins
          Assignee: dev@spamassassin.apache.org
          Reporter: john.me...@mailcleaner.net
  Target Milestone: Undefined

Created attachment 5663
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5663&action=edit
TesseractOcr plugin

Greetings,

On behalf of Fastnet SA/MailCleaner Software, Inc., I have developed a new OCR
plugin.

The primary reasons behind this are the poor performance of FuzzyOCR in its
default configuration and the need to maintain a separate set of rules for
that plugin. For those not familiar with it, FuzzyOCR uses its own wordlist,
in which each entry carries a value for how approximate a match may be, a
fairly opaque mechanism. When it finds a match, it simply reports the rule
hit back.

TesseractOcr performs well and efficiently. Instead of keeping a separate
wordlist, the plugin I have written passes the parsed text back to the parent
SpamAssassin process, where the content can be matched by regular body rules.
Fuzzy matching can then be handled through regular expressions, which are
already largely written with false positives in mind and are much easier to
debug.
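
As a rough illustration of the regex-based fuzzy matching described above,
here is a minimal Python sketch. It is not code from the attached plugin and
does not use SpamAssassin itself: the pattern, the function name
matches_body_rule, and the sample strings are all invented for this example.
It only shows how character classes in an ordinary regular expression can
absorb common OCR misreadings.

```python
import re

# Hypothetical pattern, written the way a body-rule regex might be.
# Each character class absorbs a common OCR confusion (a/@, y/i/1/l).
FUZZY_PHARMACY = re.compile(r"\bph[a@]rm[a@]c[yi1l]\b", re.IGNORECASE)


def matches_body_rule(ocr_text: str) -> bool:
    """Return True if the OCR-extracted text hits the illustrative pattern."""
    return FUZZY_PHARMACY.search(ocr_text) is not None


# Invented strings standing in for text recovered from an image.
print(matches_body_rule("Cheap onl1ne Ph@rmacy deals"))
print(matches_body_rule("Quarterly report attached"))
```

Because the tolerance lives in the expression itself, a false positive can be
traced to a specific character class rather than to a numeric distance
threshold.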

The plugin is currently running on several MailCleaner machines without
issue. Given the performance overhead inherent in any OCR plugin, I am not
proposing that it be enabled by default.

I look forward to any feedback.

Regards,
John Mertz
john.me...@mailcleaner.net

-- 
You are receiving this mail because:
You are the assignee for the bug.
