https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727
Henrik Krohns <apa...@hege.li> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |apa...@hege.li --- Comment #1 from Henrik Krohns <apa...@hege.li> --- Hi there, some comments. I'm always blunt, so keep that in mind. :-) - Quite hesitant about $p->set_rendered() use and the whole concept of adding "random crap" to rendered body, which could have many side effects. Nothing uses set_rendered() in current codebase, there needs to be some investigation if it actually works properly, how module priorities should be set, whether stuff end up in Bayes and other plugins that read rendered body etc. Of course this is also bad performance wise if all tests have to wait for OCR to complete. An async model would be much more generally usable, but it would have to have own rules (FuzzyOCR style?). Or perhaps body rules could be tflagged with "ocr" so they can be run independently against that data. - Absolutely no reason to use a 9 year old Image::OCR::Tesseract module, from an author who seems to pretty much left CPAN business long time ago. The modules only purpose seems to be running tesseract/convert shell commands, for which a proper SA way is using $pms->enter_helper_run_mode() / Timeout / helper_app_pipe_open - see Pyzor.pm for example. - It is _horrible_ security wise to use mime provided $fname for a filesystem name. Just use the static counter, and $ext if needed to construct a fixed safe name. This also removes need for the quite iffy chr(65+$unique++) loop.. - Unneeded dependency File::Which, should use Mail::SpamAssassin::Util::find_executable_in_env_path - Proper way to combine path and filename string: File::Spec->catfile($tmpdir, $fname) - "print PICT" should also check for errors (disk could be full?) - Should not use parse_config to accept any random tocr_* line, use the set_config() stanza that defines exact supported options, and sets their defaults and types, validated - tesseract_type() has unused $w,$h (remove?) - should use more common dbg method (also use info() only for serious errors, not for "found attachment" things) sub dbg { my $msg = shift; Mail::SpamAssassin::Plugin::dbg("TesseractOct: $msg", @_); } - A bit shame to lock into tesseract, as people had good success with gocr previously. Would have been nice to be able to select engine(s). Did you try gocr? I would not delay 3.4.3 any further, there's really no need to include this there in this state. Just improve it in trunk, if voted so? I'm still +-0 because of all the architectural issues. Of course having an experimental and disabled by default plugin there is no problem. -- You are receiving this mail because: You are the assignee for the bug.