I've been wondering about ways to indicate the quality of electronic book transcriptions.
Let's say we have an OCR'd PDF, such as this one:
http://ia700408.us.archive.org/24/items/spoonriveranthol00mastiala/spoonriveranthol00mastiala.pdf

The text on the title page is easy for me to type in:

----
COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY. COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY. Set up and electrotyped. Published April, 1915. Norwood Press J. S. Cushing Co. — Berwick & Smith Co. Norwood, Mass., U.S.A.
----

But the text as copied from OS X Preview is this:

----
COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY. COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY. up and electrotyped. Published April, 1915. NortoonU tyrezs J. 8. Gushing Co. Berwick & Smith C. Norwood, Maas., U.S.A.
----

This seems to me to be pretty poor. The publisher information is barely recognizable. An entire word ("Set") has been ignored. Blackletter type has totally confused the OCR. Line breaks are missed.

Is there any standard practice for measuring the quality of an OCR transcription? Or any other transcription?

For example, a random full page of text could be proofread and given a score, which could be tagged onto the digital text. OCR engine makers would have a handy library of problematic texts.

It would at least be good to be able to mark those texts that have been thoroughly checked - something that any important edition surely deserves. And it would be good to mark those which have failed, such as this:
http://books.google.com/books?id=IrY9AAAAcAAJ&pg=PT41#v=onepage&q&f=false

Note that Google doesn't seem to understand the long s - ſ - transcribing it as f. Search that book above for ipſe, ipse and ipfe.

Any thoughts?

- L
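As a rough sketch of the page-scoring idea above: one commonly used measure is the character error rate, i.e. the edit distance between a proofread reference and the OCR output, divided by the length of the reference. The function names below (levenshtein, char_error_rate) are made up for illustration, and the two strings are the title-page samples quoted above with line breaks ignored. This is only one possible score, not an established standard for tagging texts.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, ocr: str) -> float:
    """Edit distance divided by the length of the proofread reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, ocr) / len(reference)


# Hand-typed transcription (treated as the proofread reference).
reference = (
    "COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY. "
    "COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY. "
    "Set up and electrotyped. Published April, 1915. "
    "Norwood Press J. S. Cushing Co. — Berwick & Smith Co. "
    "Norwood, Mass., U.S.A."
)

# Text as copied from the OCR layer of the PDF.
ocr = (
    "COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY. "
    "COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY. "
    "up and electrotyped. Published April, 1915. "
    "NortoonU tyrezs J. 8. Gushing Co. Berwick & Smith C. "
    "Norwood, Maas., U.S.A."
)

print(f"Character error rate: {char_error_rate(reference, ocr):.1%}")
```

A score like this could be computed for one proofread sample page and attached to the digital text. It treats every character alike, so a dropped word and a misread letter are both just edits, and something like the long s read as f (ipſe vs. ipfe) counts as a single substitution.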
