On Jan 4, 2012, at 4:29 PM, Lars Aronsson wrote:
> On 01/04/2012 08:06 AM, Ralf Stephan wrote:
>> Regarding a subproblem that nevertheless makes books
>> unreadable:
>>
>> Are language-other-than-English (LOTE) books OCR'd
>> with English set? Could they be re-OCR'd when the language
>> option is changed? Can a user trigger re-OCR?
>
> A few years back (2008?) the Internet Archive switched to
> ABBYY Finereader, where the language is set when a user
> uploads scanned images. After this, OCR quality is quite good.

As far as I can see, there is still room for optimization. Yesterday, I uploaded
http://www.archive.org/details/ZurOntogenieDerKnochenfische
While I was surprised by the OCR quality, it simply ignored the non-ASCII
characters in the text and tried to emulate them with ASCII. So I don't think
the OCR always adapts to the language.

What's more, this recent German Fraktur upload
http://www.archive.org/details/DasProtoplasma
would be much better off with Tesseract OCR, as said before.

> One problem is if older scans were OCRed with older
> software and worse results. Should one go back and
> run a new OCR on these? Perpetually every 5 years?

What's easier: replacing the OCR, or writing rules that only catch the
quirks of a specific OCR software+language+font combination?
Clearly the former, IMHO.

ralf
_______________________________________________
Ol-discuss mailing list
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
