Re: [darktable-user] Is Darktable for "multi-image moved-object removal"?

Joachim Durchholz Sun, 02 Dec 2018 10:13:47 -0800

On 02.12.18 at 11:21, Lorenzo Bolzani wrote:

> Why do you expect any difference between the images? If you scan the same page 10 times in a row I would expect no more than a couple of pixels of significant difference.


OCR can stumble over seemingly insignificant differences.

> Does the acquisition happen in an extremely dusty place? Large dots will likely be present on the paper and will be the same for all the pages.

I see two kinds of artifacts: color noise from the sensors, and speckles from dust that vary in place from scan to scan. I'm not 100% sure what's happening, but the OCR results of a standard scan were bad. Like dozens of errors per page.
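For artifacts that move from scan to scan, one common approach is a per-pixel median over several scans of the same page. The following is only a minimal sketch using numpy. It assumes the scans are already aligned pixel for pixel and have the same shape, which a document feeder will not guarantee without a separate registration step. The function name `median_stack` is made up for illustration.

```python
import numpy as np


def median_stack(scans):
    """Per-pixel median over several aligned scans of the same page.

    Assumption: every scan has the same shape and is already registered,
    so a given pixel position shows the same spot on the paper in each scan.
    """
    stack = np.stack([np.asarray(s) for s in scans], axis=0)
    # A speckle that appears at a given position in only a minority of
    # the scans is outvoted by the clean pixels at that position.
    return np.median(stack, axis=0).astype(stack.dtype)
```

With an odd number of scans the median is always one of the observed values, so no new gray levels are invented. Dust that sits on the paper itself appears in every scan and is not removed this way.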

> OCR engines are very robust to this kind of small noise. This kind of "frame subtraction" won't make much difference, and a fine-tuned OCR despeckle/denoise won't ruin the text much.


The issue is that I'd need to fine-tune the OCR for each book.
At 20,000 books, that's not feasible for a single person.

(The books are all paperbacks, but between 5 and 50 years old. Printing technology changed, paper changed, paper aged. So I expect very different kinds of fine-tuning.)

> Acquisition and processing are different steps, you can experiment with different denoise settings and see what works best without risk.
> You can also OCR the 10 pages and look for differences/majority at the text level.
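A majority vote at the text level could look roughly like the sketch below. It assumes the OCR output of each scan is already available as a plain string; the function name `majority_vote` is hypothetical and the word alignment relies on Python's standard `difflib`. Line breaks are dropped, so this only illustrates the voting idea, not a layout-preserving merge.

```python
from collections import Counter
from difflib import SequenceMatcher


def majority_vote(ocr_texts):
    """Combine several OCR readings of the same page by per-word voting."""
    token_lists = [text.split() for text in ocr_texts]
    # Use the reading with the median word count as the alignment reference.
    reference = sorted(token_lists, key=len)[len(token_lists) // 2]
    votes = [Counter() for _ in reference]
    for tokens in token_lists:
        matcher = SequenceMatcher(a=reference, b=tokens, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only count spans that map one-to-one onto reference words;
            # insertions, deletions and uneven replacements are skipped.
            if tag in ("equal", "replace") and (i2 - i1) == (j2 - j1):
                for offset in range(i2 - i1):
                    votes[i1 + offset][tokens[j1 + offset]] += 1
    # The reference always votes for itself, so every counter is non-empty.
    return " ".join(counter.most_common(1)[0][0] for counter in votes)
```

This only helps when the errors differ between scans. A word that is misread the same way in most scans wins the vote.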


I tried: I scanned a few sample pages a dozen times or so and OCRed the results.

I'll have to retry and check whether the OCR gives me different results for each scan of the same page. I've been assuming it's different because the misrecognition rate was unexpectedly high, but assumptions are indeed not what one should go by in such matters :-)
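Such a check could be scripted roughly as follows. This is a sketch under the assumption that the OCR text of each scan is available as a string; `ocr_agreement` is a made-up name, and the similarity measure is simply `difflib.SequenceMatcher.ratio()` computed on words.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ocr_agreement(ocr_texts):
    """Return a word-level similarity ratio (0.0 to 1.0) for every pair of readings."""
    token_lists = [text.split() for text in ocr_texts]
    ratios = {}
    for (i, a), (j, b) in combinations(enumerate(token_lists), 2):
        # ratio() is 1.0 for identical word sequences and drops as they diverge.
        ratios[(i, j)] = SequenceMatcher(a=a, b=b, autojunk=False).ratio()
    return ratios
```

If all ratios are 1.0, the OCR is deterministic across scans and the errors come from somewhere else. Ratios clearly below 1.0 mean the scans themselves differ enough to change the recognition.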

> This looks like a programming task to me,

I'm a software engineer and shy away from the sheer amount of work that would go even into a modest script :-)


> I would use opencv and numpy (I work on OCR programming). And tesseract 4.x, maybe with custom fine-tuning for each book or for the weird ones.


Is there any hope of decent OCR results without custom fine tuning?

I mean, some of the OCR results were simply ludicrous. Which essentially means that my mental model of what OCR does must be very different from what OCR *actually* does, not that Tesseract is bad :-)


Tesseract 4.0 happens to be the software I was working with.

> But if the scans are good, the font is reasonable and you add a good spell checker you'll get very good results even with standard settings. Try this first on a few sample pages.

Well, it's a document-feed scanner. A flatbed would be better, but even turning 400,000 pages would take far too much time, let alone manually checking the quality of each result.


Thanks for the feedback!

Regards,
Jo