https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727

--- Comment #15 from John Mertz <john.me...@mailcleaner.net> ---
Hello all,

As mentioned in the previous comment, a fairly major revision to the plugin has
been made which adds image preprocessing with OpenCV. I've been testing it
successfully for more than a week on a handful of machines and am currently
looking at deploying it to some larger installations.

The code is now on our GitHub:

https://github.com/MailCleaner/TesseractOcr

Note that there are two significant branches. master is configured to work
with Tesseract version 4 and above, whose training data gives significantly
better results than earlier versions. Using this branch is encouraged unless
your system does not support it. It also provides additional configuration
variables to define the training data location and the languages to be passed
when executing Tesseract.
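
To illustrate what the training data location and language settings turn into
when Tesseract is run, here is a minimal sketch in Python. It is not the
plugin's own code: the function and parameter names are hypothetical. It
relies only on the Tesseract 4 command-line options --tessdata-dir and -l,
with multiple languages joined by "+".

```python
from typing import List, Optional


def build_tesseract_command(
    image_path: str,
    languages: List[str],
    tessdata_dir: Optional[str] = None,
) -> List[str]:
    """Build an argv list for the tesseract CLI (hypothetical helper).

    "stdout" as the output base tells tesseract to print the recognized
    text instead of writing a file.
    """
    cmd = ["tesseract", image_path, "stdout"]
    if tessdata_dir:
        # Location of the *.traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    if languages:
        # Several languages are passed as one "+"-joined argument.
        cmd += ["-l", "+".join(languages)]
    return cmd
```

The resulting list can be handed to subprocess.run() without a shell, which
avoids quoting problems with attachment file names.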

If your system does not provide Tesseract version 4 or above, there is a branch
called 3.00 which will continue to support older versions.
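
As a rough way to decide which branch applies, the major version can be read
from the text printed by "tesseract --version". The sketch below is only an
illustration with hypothetical function names. It assumes the first line looks
like "tesseract 4.1.1", and it takes the version text as a plain string so it
does not depend on how that text was captured.

```python
import re
from typing import Optional


def tesseract_major_version(version_output: str) -> Optional[int]:
    """Extract the major version from `tesseract --version` text."""
    match = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(match.group(1)) if match else None


def branch_for(version_output: str) -> Optional[str]:
    """Return the branch name from the message above, or None if unknown."""
    major = tesseract_major_version(version_output)
    if major is None:
        return None
    return "master" if major >= 4 else "3.00"
```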

On the issue of adopting the plugin into the core distribution or advertising
it as a 3rd party plugin: I don't mind either way now that we've moved
distribution of the plugin to GitHub. I would note that the dependency on
tesseract-ocr and libopencv-dev adds over 100MB to the plugin's otherwise
modest size. Including this bloat for a disabled-by-default plugin is probably
not great. If no one has any additional input, we can close this thread and I
will announce further updates to the relevant mailing list.

For emails with many or large images, the impact on scan times can be
significant. As it is, the plugin has configuration variables for various time,
size and dimension limits, so outlier messages should not be too catastrophic
for overall performance, but the next feature is likely to be caching to help
reduce some additional load.
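
The idea of limits plus caching can be sketched as follows. This is an
assumption about the general approach, not the plugin's implementation: the
class names, field names and default values are all hypothetical, and the
time limit is left out (it would typically be a timeout on the OCR call).

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Limits:
    # Hypothetical defaults; real values would come from configuration.
    max_bytes: int = 1_000_000
    max_width: int = 2000
    max_height: int = 2000


def within_limits(size_bytes: int, width: int, height: int, limits: Limits) -> bool:
    """Return True if the image is small enough to be worth scanning."""
    return (
        size_bytes <= limits.max_bytes
        and width <= limits.max_width
        and height <= limits.max_height
    )


class OcrCache:
    """In-memory cache keyed by a digest of the image bytes."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_compute(self, image_bytes: bytes, ocr: Callable[[bytes], str]) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            # Only run the expensive OCR step for images not seen before.
            self._store[key] = ocr(image_bytes)
        return self._store[key]
```

Keying on a digest of the image content means the same image attached to many
messages is only processed once, regardless of its file name.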
