We deal with a lot of documents that contain plenty of white space. A common headache is patches (lines) of streaky noise between valid text lines. I have started writing a filter: I run ocrad with -x, extract the line numbers from the exported file, and store the numbers of the lines with a low height. Then I split the OCRed text into lines and purge those lines from it.

Cumbersome, I thought. It seems to me this could be done much more easily within ocrad itself, with an option, something like ocrad -h <height>, that simply suppresses the output of lines with a height of <height> or lower. In my humble opinion, the file created with -x should still show everything, but the text output would be much cleaner of dirt and dust if any 'line' whose height does not exceed the threshold were simply dropped when this option is used.

I am well aware that this would also drop a straight horizontal line, but that would not matter in our case. We struggle much more with patches and dots of dirt on the scanner surface, which usually mess up the inter-line white space by adding dots, dashes and underscores to the text.
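For reference, the external filter I describe above could look roughly like the sketch below. It rests on two assumptions: that each text line in the file written by -x starts with a header of the form "line <n> chars <count> height <h>", and that the n-th such header corresponds to the n-th line of the text output (blank separator lines between text blocks would break this and need extra handling). The function names are made up for illustration.

```python
import re

# Assumed shape of a per-line header in the file written by `ocrad -x`:
#   line 3 chars 17 height 21
LINE_HEADER = re.compile(r"^line\s+\d+\s+chars\s+\d+\s+height\s+(\d+)\s*$")


def low_line_positions(orf_text, max_height):
    """Return 0-based positions of line records with height <= max_height.

    Positions are counted across the whole file, so per-block line
    numbers that restart at 1 do not matter.
    """
    low = set()
    position = 0
    for raw in orf_text.splitlines():
        match = LINE_HEADER.match(raw)
        if match is None:
            continue  # character records, block headers, comments
        if int(match.group(1)) <= max_height:
            low.add(position)
        position += 1
    return low


def drop_low_lines(ocr_text, orf_text, max_height):
    """Remove from ocr_text every line flagged as too low in orf_text.

    Assumes the n-th line record matches the n-th line of ocr_text.
    """
    low = low_line_positions(orf_text, max_height)
    kept = [
        line
        for index, line in enumerate(ocr_text.splitlines())
        if index not in low
    ]
    return "\n".join(kept)
```

One would read the -x file and the text output into two strings and call drop_low_lines(text, orf, 5) to drop every line of height 5 or lower. It works, but it needs two files and a second pass, which is why an option inside ocrad would be so much nicer.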
Uwe
