Re: [ol-discuss] Recording the quality of a book's OCR

Ralf Stephan Thu, 05 Jan 2012 00:08:48 -0800

On Jan 4, 2012, at 4:29 PM, Lars Aronsson wrote:

> On 01/04/2012 08:06 AM, Ralf Stephan wrote:
>> Regarding a subproblem that nevertheless makes books
>> unreadable:
>> 
>> Are language-other-than-English (LOTE) books OCR'd
>> with English set? Could they be re-OCR'd when the language
>> option is changed? Can a user trigger re-OCR?
> 
> A few years back (2008?) the Internet Archive switched to
> ABBYY Finereader, where the language is set when a user
> uploads scanned images. After this, OCR quality is quite good.


It looks to me as if there is still room for optimization. Yesterday I uploaded
http://www.archive.org/details/ZurOntogenieDerKnochenfische

While the OCR quality surprised me, the OCR simply ignored the
non-ASCII characters in the text and tried to emulate them with ASCII.
So I don't think the OCR always adapts to the language.
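
A rough way to spot such books automatically might be to check how
many letters in the OCR text are non-ASCII. The following is only a
sketch: the function names and the 0.5% threshold are my own
assumptions, not anything the Internet Archive or Open Library
actually uses, and the threshold would need tuning per language.

```python
def non_ascii_letter_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are non-ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_ascii = sum(1 for ch in letters if ord(ch) > 127)
    return non_ascii / len(letters)


def looks_ascii_emulated(text: str, min_ratio: float = 0.005) -> bool:
    """Flag text whose non-ASCII share is below min_ratio.

    min_ratio is a hypothetical threshold: for a language such as
    German one would expect umlauts and eszett to show up regularly,
    so a ratio near zero hints that the OCR was run with the wrong
    language setting. Text without any letters is flagged as well.
    """
    return non_ascii_letter_ratio(text) < min_ratio


if __name__ == "__main__":
    sample = "Ueber die Ontogenie der Knochenfische"
    print(non_ascii_letter_ratio(sample), looks_ascii_emulated(sample))
```

This only makes sense for books whose catalog language is known to
use non-ASCII letters; for English text it would flag everything.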

What's more, this recent German Fraktur upload
http://www.archive.org/details/DasProtoplasma
would be much better off with Tesseract OCR, as I said before.

> One problem is if older scans were OCRed with older
> software and worse results. Should one go back and
> run a new OCR on these? Perpetually every 5 years?

What's easier? Replace the OCR or write rules that only catch
the quirks of a specific OCR software+language+font combination?
Clearly the former, IMHO.

ralf
_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
To unsubscribe from this mailing list, send email to 
[email protected]
