[Bug 7727] New Plugin TesseractOcr

bugzilla-daemon Tue, 25 Jun 2019 08:37:40 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727


John Mertz <john.me...@mailcleaner.net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |john.me...@mailcleaner.net

--- Comment #4 from John Mertz <john.me...@mailcleaner.net> ---
(In reply to Henrik Krohns from comment #1)
> Hi there, some comments. I'm always blunt, so keep that in mind. :-)

Understood. I'm here for the advice. :)

> - Quite hesitant about $p->set_rendered() use and the whole concept of
>   adding "random crap" to rendered body, which could have many side
>   effects.

Well the idea is that it not be "random crap", but crap that has always
been in the message, but never considered to be part of the message in
the past. The email's reader doesn't really distinguish between a jpeg
of a word and the word in plaintext, and so the filter shouldn't really
either to the extent possible (from the point of view that it's all
text, not that the presence of an image is not an indicator for other
reasons).

In the testing I have done, very little content that is not text gets
added. This is mostly things like pipes being added where an actual
vertical line is being used as a separator. The characters matched are
not always perfect, of course, but I have not seen it add anything
completely unexpected.

>   Nothing uses set_rendered() in current codebase, there
>   needs to be some investigation if it actually works properly, how
>   module priorities should be set, whether stuff end up in Bayes and
>   other plugins that read rendered body etc.

I'm aware that set_rendered is not used elsewhere, it took quite some
time to discover exactly how to do the bits because I couldn't find any
other use of it for reference and there is limited documentation in
this area. My intent is that this content IS part of the text of the
message and should be treated the same as the literal text. This would
mean bayes consideration, etc.

>   Of course this is also
>   bad performance wise if all tests have to wait for OCR to complete.
>   An async model would be much more generally usable, but it would
>   have to have own rules (FuzzyOCR style?). Or perhaps body rules
>   could be tflagged with "ocr" so they can be run independently
>   against that data.

Of course. Given the intention that the content be treated as if it is
part of the body, the Fuzzy method isn't really possible. My
understanding of the tflags would be that the rules would run later but
that they would all be done after the message was completely parsed. I
didn't intend to have ocr-specific rules, so I don't think that
tflagging is constructive (I have limited understanding here).

> - Absolutely no reason to use a 9 year old Image::OCR::Tesseract
>   module, from an author who seems to pretty much left CPAN business
>   long time ago. The modules only purpose seems to be running
>   tesseract/convert shell commands, for which a proper SA way is using
>   $pms->enter_helper_run_mode() / Timeout / helper_app_pipe_open - see
>   Pyzor.pm for example.

This was a simple case of not wanting to duplicate effort. If there is
a proper SA way to do things I'll be happy to make that change.

> - It is _horrible_ security wise to use mime provided $fname for a
>   filesystem name. Just use the static counter, and $ext if needed to
>   construct a fixed safe name. This also removes need for the quite
>   iffy chr(65+$unique++) loop..

That part I stole directly from Fuzzy. I didn't see the need to use the
filename, but assumed there was a reason Fuzzy did it, so I did too.
Because fuzzy did it, I assumed that SA was sanitizing the $fname for
me. I'll be more than happy to switch to a counter.

> - Unneeded dependency File::Which, should use
>   Mail::SpamAssassin::Util::find_executable_in_env_path

Happy to do that as well. Thanks for bringing this to my attention.

> - Proper way to combine path and filename string:
>   File::Spec->catfile($tmpdir, $fname)

Thanks.

> - "print PICT" should also check for errors (disk could be full?)

Absolutely right. Will do.

> - Should not use parse_config to accept any random tocr_* line, use
>   the set_config() stanza that defines exact supported options, and
>   sets their defaults and types, validated

Ah. Yes. I was crude about this initially and neglected to improve it.
I'll do that.

> - tesseract_type() has unused $w,$h (remove?)

Correct. _skip was originally part of _type until being give it's own
function. I'll clean that up too.

> - should use more common dbg method (also use info() only for serious
>   errors, not for "found attachment" things)
>
> sub dbg {
>  my $msg = shift;
>  Mail::SpamAssassin::Plugin::dbg("TesseractOct: $msg", @_);
>}

Great. I'll re-jigger this too.

> - A bit shame to lock into tesseract, as people had good success with
>  gocr previously. Would have been nice to be able to select engine(s).
>  Did you try gocr?

I was having exceedingly poor results with Fuzzy when it was set up
with gocr. To the point that bold, unobscured names of certain
pharmaceuticals were not registering at all despite a very loose
matching valiue. When I configured it to use Tesseract, the results
were much better on the same images, so when I wanted to make a plugin
with differing operation Tesseract was the obvious choice.

I had considered making the OCR package configurable as in Fuzzy, but
was dissuaded because it adds unnecessary development and user
complexity. I certainly have nothing against others using gocr, if they
prefer it, but in my opinion a fork for each is simpler. To make
either/all engines usable in the same module either requires complexity
for handling differences within the module itself, or complexity
configuration for the user, or both. I'm not opposed to putting the
burden on the module to properly handle all of the differences, but
more complexity also means more potential bugs.

I'll address the comments mentioned above and provide an updated
tarball.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7727] New Plugin TesseractOcr

Reply via email to