https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727
spamassas...@arcsin.de changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |spamassas...@arcsin.de

--- Comment #10 from spamassas...@arcsin.de ---
This thread has been mentioned on the users mailing list, so I gave the second
attached version a try.

0. I like the idea of providing a more general approach for passing recognized
   text back to SA.
1. There is a call to cleanup() where it should be clean_up().
2. There is a call to kill_pid(), which is undefined.
3. Some tests:
3.1: I trained a Bayes database from a personal ham corpus and a spam corpus
     without TesseractOcr. Then I compared the classification with
     TesseractOcr enabled vs. disabled.
3.1.1. A run against the sample provided in [1]:
3.1.1.1. Without OCR it hits BAYES_40.
3.1.1.2. With OCR it hits BAYES_50 and additionally FUZZY_BROWSER.
3.1.2. A run against a current "Deutsche Burger werden reich" sample:
3.1.2.1. Without OCR it hits BAYES_99/BAYES_999.
3.1.2.2. With OCR it hits BAYES_95 and nothing additional, so the total score
         actually decreased.
3.2: I trained a new Bayes database from the same corpora with TesseractOcr and
     ran the same quick tests.
3.2.1. A run against the sample provided in [1] gave the same results as in
       3.1.1.
3.2.2. A run against a current "Deutsche Burger werden reich" sample gives
       identical test results, i.e. the Bayes scores match. This is good, as
       one can improve the situation with custom rules.

So my takeaway is that one should probably retrain Bayes after enabling
TesseractOcr.

[1]
https://mail-archives.apache.org/mod_mbox/spamassassin-users/201912.mbox/browser
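
For anyone who wants to repeat the enabled-vs-disabled comparison from 3.1 and 3.2, here is a minimal sketch of how the rule hits of two runs could be diffed. It assumes each run's X-Spam-Status header value has already been saved as a string. The helper names rule_hits() and compare_runs() are hypothetical and not part of SA or the attached plugin, and the two header values are made up for illustration (only the rule names come from test 3.1.1).

```python
import re


def rule_hits(status: str) -> set[str]:
    """Return the rule names listed in the tests= field of an X-Spam-Status value."""
    # Folded headers can break the list after a comma; rejoin those pieces first.
    joined = re.sub(r",\s+", ",", status)
    match = re.search(r"\btests=(\S*)", joined)
    if match is None:
        return set()
    # "none" is what the field contains when no rule fired.
    return {name for name in match.group(1).split(",") if name and name != "none"}


def compare_runs(without_ocr: str, with_ocr: str) -> dict[str, list[str]]:
    """Split the rule hits of two runs into shared and run-specific names."""
    base, ocr = rule_hits(without_ocr), rule_hits(with_ocr)
    return {
        "only_without_ocr": sorted(base - ocr),
        "only_with_ocr": sorted(ocr - base),
        "both": sorted(base & ocr),
    }


# Hypothetical header values; only the rule names are taken from test 3.1.1.
without_ocr = "No, tests=BAYES_40 autolearn=no"
with_ocr = "No, tests=BAYES_50,\n\tFUZZY_BROWSER autolearn=no"

print(compare_runs(without_ocr, with_ocr))
```

This only compares rule names, not scores, so a drop such as BAYES_99 to BAYES_95 shows up as a changed rule but its effect on the total score still has to be checked by hand.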