On 02.12.18 at 11:21, Lorenzo Bolzani wrote:
> Why do you expect any difference between the images? If you scan the same page 10 times in a row I would expect no more than a couple of pixels of significant difference.

OCR can stumble over seemingly insignificant differences.

> Does the acquisition happen in an extremely dusty place? Large dots will likely be present on the paper and will be the same for all the pages.

I see two kinds of artifacts: color noise from the sensors, and speckles from dust that vary in place from scan to scan. I'm not 100% sure what's happening, but the OCR results of a standard scan were bad. Like dozens of errors per page.
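
To make the multi-scan idea concrete: if the speckles really land in a different place on every scan, a per-pixel median over several scans of the same page should drop them while keeping the print. This is only a minimal sketch using numpy. It assumes the scans are already aligned pixel-for-pixel (a document feeder will not guarantee that), and `median_stack` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def median_stack(scans):
    """Per-pixel median of several aligned grayscale scans of the same page."""
    stack = np.stack([np.asarray(s) for s in scans], axis=0)
    # A speckle present in only a minority of scans is outvoted at each pixel.
    return np.median(stack, axis=0).astype(stack.dtype)

# Synthetic demo: white page with a dark stroke, one dust speckle per scan,
# each at a different position.
page = np.full((5, 5), 255, dtype=np.uint8)
page[2, 1:4] = 0
scans = []
for i in range(5):
    s = page.copy()
    s[0, i] = 0  # speckle moves from scan to scan
    scans.append(s)

clean = median_stack(scans)
```

Dust sitting on the paper itself shows up identically in every scan and would survive this step.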

> OCR engines are very robust to this kind of small noise; this kind of "frame subtraction" won't make much difference, and a fine-tuned OCR despeckle/denoise won't ruin the text much.

The issue is that I'd need to fine-tune the OCR for each book.
At 20,000 books, that's no longer feasible for a single person.
(The books are all paperbacks, but between 5 and 50 years old. Printing technology changed, paper changed, paper aged. So I expect very different kinds of fine-tuning.)

> Acquisition and processing are different steps, you can experiment with different denoise settings and see what works best without risk.

> You can also OCR the 10 pages and look for differences/majority at the text level.
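
A rough sketch of what majority voting at the text level could look like, using only the Python standard library. It assumes every OCR run yields the same number of whitespace-separated words, so that position N refers to the same word in each run. Real output with dropped or split words would need an alignment step first. `majority_text` is a hypothetical name.

```python
from collections import Counter
from itertools import zip_longest

def majority_text(ocr_outputs):
    """Pick the most common word at each position across several OCR runs."""
    token_lists = [text.split() for text in ocr_outputs]
    voted = []
    for column in zip_longest(*token_lists, fillvalue=""):
        word, _count = Counter(column).most_common(1)[0]
        if word:  # skip positions where "no word" wins the vote
            voted.append(word)
    return " ".join(voted)

readings = [
    "The quick brown fox",
    "The qu1ck brown fox",
    "The quick brovvn fox",
]
result = majority_text(readings)
```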

I tried: I scanned a few sample pages a dozen times or so and OCRed the results.
I'll have to retry and check out whether the OCR gives me different results for each scan of the same page; I've been assuming it's different because the misrecognition rate was unexpectedly high, but assumptions are indeed not what one should go by in such matters :-)
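
Checking that assumption need not be much work. A minimal sketch, assuming the OCR text of each scan is already available as a string: compare every pair of outputs with `difflib` from the standard library. `pairwise_similarity` is a hypothetical helper name.

```python
import difflib
from itertools import combinations

def pairwise_similarity(ocr_outputs):
    """Map each pair of scan indices (i, j) to a similarity ratio in [0, 1]."""
    return {
        (i, j): difflib.SequenceMatcher(None, a, b).ratio()
        for (i, a), (j, b) in combinations(enumerate(ocr_outputs), 2)
    }

sims = pairwise_similarity(["abc def", "abc def", "abc dcf"])
```

A ratio of 1.0 for every pair would mean the OCR output is identical across scans, and the errors come from something all scans share rather than from scan-to-scan noise.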

> This looks like a programming task to me,

I'm a software engineer and shy away from the sheer amount of work that would go even into a modest script :-)

> I would use opencv and numpy (I work on OCR programming). And tesseract 4.x, maybe with custom fine tuning for each book or for the weird ones.

Is there any hope of decent OCR results without custom fine tuning?
I mean, some of the OCR results were simply ludicrous. Which essentially means that my mental model of what OCR actually does must be very different from what OCR *actually* does, not that Tesseract is bad :-)

Tesseract 4.0 happens to be the software I was working with.

> But if the scans are good, the font is reasonable and you add a good spell checker you'll get very good results even with standard settings. Try this first on a few sample pages.

Well, it's a document-feed scanner. Flatbed would be better but even turning 400,000 pages would take far too much time, let alone manually checking the quality of each result.

Thanks for the feedback!

Regards,
Jo
____________________________________________________________________________
darktable user mailing list
to unsubscribe send a mail to [email protected]