https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7727
John Mertz <john.me...@mailcleaner.net> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |john.me...@mailcleaner.net --- Comment #4 from John Mertz <john.me...@mailcleaner.net> --- (In reply to Henrik Krohns from comment #1) > Hi there, some comments. I'm always blunt, so keep that in mind. :-) Understood. I'm here for the advice. :) > - Quite hesitant about $p->set_rendered() use and the whole concept of > adding "random crap" to rendered body, which could have many side > effects. Well the idea is that it not be "random crap", but crap that has always been in the message, but never considered to be part of the message in the past. The email's reader doesn't really distinguish between a jpeg of a word and the word in plaintext, and so the filter shouldn't really either to the extent possible (from the point of view that it's all text, not that the presence of an image is not an indicator for other reasons). In the testing I have done, very little content that is not text gets added. This is mostly things like pipes being added where an actual vertical line is being used as a separator. The characters matched are not always perfect, of course, but I have not seen it add anything completely unexpected. > Nothing uses set_rendered() in current codebase, there > needs to be some investigation if it actually works properly, how > module priorities should be set, whether stuff end up in Bayes and > other plugins that read rendered body etc. I'm aware that set_rendered is not used elsewhere, it took quite some time to discover exactly how to do the bits because I couldn't find any other use of it for reference and there is limited documentation in this area. My intent is that this content IS part of the text of the message and should be treated the same as the literal text. This would mean bayes consideration, etc. > Of course this is also > bad performance wise if all tests have to wait for OCR to complete. > An async model would be much more generally usable, but it would > have to have own rules (FuzzyOCR style?). Or perhaps body rules > could be tflagged with "ocr" so they can be run independently > against that data. Of course. Given the intention that the content be treated as if it is part of the body, the Fuzzy method isn't really possible. My understanding of the tflags would be that the rules would run later but that they would all be done after the message was completely parsed. I didn't intend to have ocr-specific rules, so I don't think that tflagging is constructive (I have limited understanding here). > - Absolutely no reason to use a 9 year old Image::OCR::Tesseract > module, from an author who seems to pretty much left CPAN business > long time ago. The modules only purpose seems to be running > tesseract/convert shell commands, for which a proper SA way is using > $pms->enter_helper_run_mode() / Timeout / helper_app_pipe_open - see > Pyzor.pm for example. This was a simple case of not wanting to duplicate effort. If there is a proper SA way to do things I'll be happy to make that change. > - It is _horrible_ security wise to use mime provided $fname for a > filesystem name. Just use the static counter, and $ext if needed to > construct a fixed safe name. This also removes need for the quite > iffy chr(65+$unique++) loop.. That part I stole directly from Fuzzy. I didn't see the need to use the filename, but assumed there was a reason Fuzzy did it, so I did too. Because fuzzy did it, I assumed that SA was sanitizing the $fname for me. I'll be more than happy to switch to a counter. > - Unneeded dependency File::Which, should use > Mail::SpamAssassin::Util::find_executable_in_env_path Happy to do that as well. Thanks for bringing this to my attention. > - Proper way to combine path and filename string: > File::Spec->catfile($tmpdir, $fname) Thanks. > - "print PICT" should also check for errors (disk could be full?) Absolutely right. Will do. > - Should not use parse_config to accept any random tocr_* line, use > the set_config() stanza that defines exact supported options, and > sets their defaults and types, validated Ah. Yes. I was crude about this initially and neglected to improve it. I'll do that. > - tesseract_type() has unused $w,$h (remove?) Correct. _skip was originally part of _type until being give it's own function. I'll clean that up too. > - should use more common dbg method (also use info() only for serious > errors, not for "found attachment" things) > > sub dbg { > my $msg = shift; > Mail::SpamAssassin::Plugin::dbg("TesseractOct: $msg", @_); >} Great. I'll re-jigger this too. > - A bit shame to lock into tesseract, as people had good success with > gocr previously. Would have been nice to be able to select engine(s). > Did you try gocr? I was having exceedingly poor results with Fuzzy when it was set up with gocr. To the point that bold, unobscured names of certain pharmaceuticals were not registering at all despite a very loose matching valiue. When I configured it to use Tesseract, the results were much better on the same images, so when I wanted to make a plugin with differing operation Tesseract was the obvious choice. I had considered making the OCR package configurable as in Fuzzy, but was dissuaded because it adds unnecessary development and user complexity. I certainly have nothing against others using gocr, if they prefer it, but in my opinion a fork for each is simpler. To make either/all engines usable in the same module either requires complexity for handling differences within the module itself, or complexity configuration for the user, or both. I'm not opposed to putting the burden on the module to properly handle all of the differences, but more complexity also means more potential bugs. I'll address the comments mentioned above and provide an updated tarball. -- You are receiving this mail because: You are the assignee for the bug.