[Bug 7727] New Plugin TesseractOcr

bugzilla-daemon Tue, 25 Jun 2019 12:18:01 -0700

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727


--- Comment #8 from Henrik Krohns <apa...@hege.li> ---
(In reply to John Mertz from comment #4)
> In the testing I have done, very little content that is not text gets
> added. This is mostly things like pipes being added where an actual
> vertical line is being used as a separator. The characters matched are
> not always perfect, of course, but I have not seen it add anything
> completely unexpected.

OK, sounds promising. I might try extracting images from my corpus to see what
tesseract outputs.
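
A minimal sketch of that kind of corpus check, assuming the corpus is
available as raw RFC 822 messages and that a `tesseract` binary is on PATH.
The helper names `extract_images` and `ocr_image` are hypothetical and are not
part of the proposed plugin or of SpamAssassin:

```python
import email
import subprocess
from email import policy


def extract_images(raw_message: bytes):
    """Return (filename, payload_bytes) for every image/* part of one message."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    images = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes the Content-Transfer-Encoding (base64 etc.)
        payload = part.get_payload(decode=True)
        if payload:
            images.append((part.get_filename() or "unnamed", payload))
    return images


def ocr_image(path: str) -> str:
    """Run the tesseract CLI on one image file and return the recognised text.

    Assumption: `tesseract` is installed; passing "stdout" as the output base
    makes it print the text instead of writing a file.
    """
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Writing each extracted payload to a temporary file and passing its path to
`ocr_image` would show how much non-text noise, such as stray pipes, ends up
in the output.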

> I'm aware that set_rendered is not used elsewhere, it took quite some
> time to discover exactly how to do the bits because I couldn't find any
> other use of it for reference and there is limited documentation in
> this area. My intent is that this content IS part of the text of the
> message and should be treated the same as the literal text. This would
> mean bayes consideration, etc.

I didn't even know set_rendered existed, and I've been at this for quite a
while. ;-) It's just a matter of checking that it does things properly and is
suitable for other possible use cases too.

> Of course. Given the intention that the content be treated as if it is
> part of the body, the Fuzzy method isn't really possible. My
> understanding of the tflags would be that the rules would run later but
> that they would all be done after the message was completely parsed. I
> didn't intend to have ocr-specific rules, so I don't think that
> tflagging is constructive (I have limited understanding here).

Thinking about it more, flagging would be pointless. Just scan it as if it were
a textual MIME part.

> This was a simple case of not wanting to duplicate effort. If there is
> a proper SA way to do things I'll be happy to make that change.

We should always try to use modules that are packaged in the most common OS
repositories, and only ones that do enough to justify the risk of version
updates breaking something. I'm really wary of installing from CPAN manually
these days: sometimes it's a dependency nightmare, and it's hard to verify that
the code doesn't contain backdoors etc. (just look at the recent npm
event-stream bitcoin hack).

-- 
You are receiving this mail because:
You are the assignee for the bug.
