[Bug 7727] New Plugin TesseractOcr

bugzilla-daemon Thu, 12 Dec 2019 04:17:50 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727


spamassas...@arcsin.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spamassas...@arcsin.de

--- Comment #10 from spamassas...@arcsin.de ---
This thread has been mentioned on the users mailing list, so I gave the second
attached version a try.

0. I like the idea of providing a more general approach for passing recognized
text back to SA.
1. There is a call to cleanup() where it should be clean_up().
2. There is a call to kill_pid() which is undefined.
3. Some tests:
3.1:
I trained a bayes database from a personal ham and a spam corpus without
TesseractOcr. Then I compared the classification with TesseractOcr enabled vs.
disabled.
3.1.1. A run against the sample provided in [1]:
3.1.1.1. Without ocr it hits BAYES_40.
3.1.1.2. With ocr it hits BAYES_50 and additionally FUZZY_BROWSER.
3.1.2. A run against a current "Deutsche Burger werden reich" sample:
3.1.2.1. Without ocr it hits BAYES_99/BAYES_999.
3.1.2.2. With ocr it hits BAYES_95 and provides nothing additional, so the
total score actually decreased.

3.2:
I trained a new bayes database from the same corpora with TesseractOcr and made
the same quick tests.
3.2.1. A run against the sample provided in [1] gave the same results as in
3.1.1.
3.2.2. A run against a current "Deutsche Burger werden reich" sample provides
identical test results, i.e. the bayes scores match. This is good, as one can
improve the situation with custom rules.

So my takeaway is that one should probably retrain bayes.
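
For anyone repeating the comparison on their own corpus, the with/without
difference from 3.1 can be tabulated with a few lines of Python. This is only
a sketch: the rule names are copied from the runs above, while the helper
diff_hits() and the RUNS layout are made up for illustration and are not part
of the plugin or of SA.

```python
# Rule hits copied from the test runs reported in 3.1.1 and 3.1.2.
# The dictionary layout is an assumption made for this illustration.
RUNS = {
    "sample from [1]": {
        "without_ocr": {"BAYES_40"},
        "with_ocr": {"BAYES_50", "FUZZY_BROWSER"},
    },
    "Deutsche Burger werden reich": {
        "without_ocr": {"BAYES_99", "BAYES_999"},
        "with_ocr": {"BAYES_95"},
    },
}


def diff_hits(without_ocr, with_ocr):
    """Return (rules hit only without ocr, rules hit only with ocr), sorted."""
    lost = sorted(without_ocr - with_ocr)
    gained = sorted(with_ocr - without_ocr)
    return lost, gained


if __name__ == "__main__":
    for name, hits in RUNS.items():
        lost, gained = diff_hits(hits["without_ocr"], hits["with_ocr"])
        print(f"{name}: lost={lost} gained={gained}")
```

It only lists which rules appear or disappear; whether the total score goes up
or down still depends on the local score configuration.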

[1]
https://mail-archives.apache.org/mod_mbox/spamassassin-users/201912.mbox/browser

-- 
You are receiving this mail because:
You are the assignee for the bug.
