On 02.12.18 at 11:21, Lorenzo Bolzani wrote:
> Why do you expect any difference between the images? If you scan the same
> page 10 times in a row I would expect no more than a couple of pixels of
> significant difference.
OCR can stumble over seemingly insignificant differences.
> Does the acquisition happen in an extremely
> dusty place? Large dots will likely be present on the paper and will be
> the same for all the pages.
I see two kinds of artifacts: Color noise from the sensors, and speckles
from dust that vary in place from scan to scan.
I'm not 100% sure what's happening, but the OCR results of a standard scan
were bad. Like dozens of errors per page.
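
Speckles that land in a different place on every scan can, in principle, be voted out pixel by pixel. A minimal sketch with numpy, assuming the scans are grayscale arrays of identical shape that are already pixel-aligned (a document feeder will not guarantee that, so a registration step would be needed first). The name median_stack and the synthetic page are made up for illustration:

```python
import numpy as np


def median_stack(scans):
    """Combine aligned grayscale scans of one page by per-pixel median."""
    # np.stack raises ValueError if the scans differ in shape.
    stack = np.stack([np.asarray(s, dtype=np.uint8) for s in scans], axis=0)
    # A speckle present in only a minority of scans is outvoted at that pixel.
    return np.median(stack, axis=0).astype(np.uint8)


# Synthetic page: white background with one dark "text" stroke.
clean = np.full((8, 8), 255, dtype=np.uint8)
clean[3, 1:7] = 0

# Three scans, each with a dust speckle at a different position.
scans = []
for row, col in [(0, 0), (5, 5), (7, 2)]:
    scan = clean.copy()
    scan[row, col] = 0
    scans.append(scan)

combined = median_stack(scans)
```

An odd number of scans keeps the median an actual observed value; dust that sits on the paper itself shows up in every scan and is not removed this way.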
> OCR is very robust to this kind of small noise; this kind of "frame
> subtraction" won't make much difference, and a fine-tuned OCR
> despeckle/denoise won't ruin the text much.
Issue is that I'd need to fine-tune the OCR for each book.
At 20,000 books, that's no longer feasible for a single person.
(The books are all paperbacks, but between 5 and 50 years old. Printing
technology changed, paper changed, paper aged. So I expect very
different kinds of finetuning.)
> Acquisition and processing are different steps, you can experiment with
> different denoise settings and see what works best without risk.
> You can also OCR the 10 pages and look for differences/majority at the
> text level.
I tried: I scanned a few sample pages a dozen times or so and OCRed them.
I'll have to retry and check out whether the OCR gives me different
results for each scan of the same page; I've been assuming it's
different because the misrecognition rate was unexpectedly high, but
assumptions are indeed not what one should go by in such matters :-)
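
The text-level majority idea could be sketched roughly like this, using only the Python standard library. It assumes the OCR output of each scan is already available as a plain string; majority_text is a hypothetical helper, and the alignment is deliberately crude (each reading is aligned against the first one, and only words that line up one-to-one get a vote):

```python
from collections import Counter
from difflib import SequenceMatcher


def majority_text(readings):
    """Word-level majority vote over several OCR readings of one page."""
    ref = readings[0].split()
    # One vote counter per word position of the first reading.
    votes = [Counter([word]) for word in ref]
    for other in readings[1:]:
        words = other.split()
        matcher = SequenceMatcher(a=ref, b=words, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Inserted, deleted, or unevenly replaced runs cast no vote.
            if tag in ("equal", "replace") and i2 - i1 == j2 - j1:
                for k in range(i2 - i1):
                    votes[i1 + k][words[j1 + k]] += 1
    return " ".join(v.most_common(1)[0][0] for v in votes)


readings = [
    "the qu1ck brown fox",
    "the quick brown fox",
    "the quick brown f0x",
]
voted = majority_text(readings)
```

Comparing the readings against each other this way would also show directly whether the OCR output really differs from scan to scan.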
> This looks like a programming task to me,
I'm a software engineer and shy away from the sheer amount of work that
would go even into a modest script :-)
> I would use opencv and numpy
> (I work on OCR programming). And tesseract 4.x, maybe with custom fine
> tuning for each book or for the weird ones.
Is there any hope of decent OCR results without custom fine tuning?
I mean, some of the OCR results were simply ludicrous. Which essentially
means that my mental model of what OCR actually does must be very
different from what OCR *actually* does, not that Tesseract is bad :-)
Tesseract 4.0 happens to be the software I was working with.
> But if the scans are good, the font is reasonable and you add a good
> spell checker you'll get very good results even with standard settings.
> Try this first on a few sample pages.
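
As a rough illustration of the spell-checker step, here is a minimal sketch using difflib from the Python standard library. It is not a real spell checker: the vocabulary list, the correct_word helper, and the 0.6 similarity cutoff are all assumptions made for the example, and a real setup would use a proper dictionary for the book's language:

```python
from difflib import get_close_matches


def correct_word(word, vocabulary, cutoff=0.6):
    """Return the closest vocabulary entry, or the word unchanged."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    matches = get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    # Leave the word alone when nothing in the vocabulary is close enough.
    return matches[0] if matches else word


vocabulary = ["the", "quick", "brown", "fox"]  # hypothetical word list
ocr_line = "the qu1ck brown fox"
fixed = [correct_word(w, vocabulary) for w in ocr_line.split()]
```

Names and rare words are not in any word list, so automatic replacement like this is best limited to near-certain matches.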
Well, it's a document-feed scanner. Flatbed would be better but even
turning 400,000 pages would take far too much time, let alone manually
checking the quality of each result.
Thanks for the feedback!
Regards,
Jo