I've been wondering about ways to indicate the quality of an electronic book 
transcription.

Let's say we have an OCR'd PDF, such as this one:

http://ia700408.us.archive.org/24/items/spoonriveranthol00mastiala/spoonriveranthol00mastiala.pdf

The text on the title page is easy for me to type in:

----
COPYRIGHT, 1914 AND 1915,
BY WILLIAM MARION REEDY.
COPYRIGHT, 1915 AND 1916,
BY THE MACMILLAN COMPANY.
Set up and electrotyped. Published April, 1915.
Norwood Press
J. S. Cushing Co. — Berwick & Smith Co.
Norwood, Mass., U.S.A.
----

But the text as copied from OS X Preview is this:

----
COPYRIGHT, 1914 AND 1915, BY WILLIAM MARION REEDY.
COPYRIGHT, 1915 AND 1916, BY THE MACMILLAN COMPANY.
up and electrotyped. Published April, 1915.
NortoonU tyrezs J. 8. Gushing Co.       Berwick & Smith C. Norwood, Maas., 
U.S.A.
----

This seems to me to be pretty poor. The publisher information is barely 
recognizable. An entire word ("Set") has been ignored. Blackletter type has 
totally confused the OCR. Line breaks are missed.

Is there any standard practice for measuring the quality of an OCR 
transcription? Or any other transcription? For example, a random full page of 
text could be proofread and given a score, which could be tagged onto the 
digital text. OCR engine makers would have a handy library of problematic texts.
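
As a rough illustration of what such a score might look like, here is a minimal sketch that computes a character error rate: the edit distance between a proofread page and the OCR output, divided by the length of the proofread text. This is only one possible measure, not an established practice I can vouch for. The function names are my own, and collapsing whitespace is an assumption that means missed line breaks are not penalized.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    # Assumption: runs of whitespace (including line breaks) count as one space.
    return " ".join(text.split())


def character_error_rate(proofread: str, ocr: str) -> float:
    """Edit distance divided by the length of the proofread text (0.0 = identical)."""
    ref = normalize(proofread)
    hyp = normalize(ocr)
    if not ref:
        raise ValueError("proofread text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Two lines from the page above: hand-typed versus copied from the PDF.
typed = "Set up and electrotyped. Published April, 1915.\nNorwood Press"
copied = "up and electrotyped. Published April, 1915.\nNortoonU tyrezs"
print(round(character_error_rate(typed, copied), 3))
```

A score like this could be stored alongside the text, together with a note of which page was sampled and who proofread it.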

It would at least be good to be able to mark those texts that have been 
thoroughly checked - something that any important edition surely deserves.

And it would be good to mark those which have failed, such as this:

http://books.google.com/books?id=IrY9AAAAcAAJ&pg=PT41#v=onepage&q&f=false

Note that Google doesn't seem to understand the long s - ſ - transcribing it as 
f. Search that book above for ipſe, ipse and ipfe.
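
To show the kind of handling I have in mind, here is a small sketch that folds the long s to a plain s for searching, and lists the spellings a word might turn up under in a badly transcribed text. The function names are hypothetical, and the rule used (every s except a word-final one was printed as a long s) is a simplifying assumption; real typesetting conventions varied.

```python
LONG_S = "\u017f"  # the long s character


def fold_long_s(text: str) -> str:
    """Replace each long s with an ordinary s so that searches match."""
    return text.replace(LONG_S, "s")


def ocr_variants(word: str) -> set:
    """Spellings to try when searching a transcription of an old book.

    Assumption: every "s" except a word-final one was printed as a long s,
    which OCR may keep as a long s or misread as "f".
    """
    head, last = word[:-1], word[-1:]
    return {
        word,                                # modern spelling
        head.replace("s", LONG_S) + last,    # faithful transcription
        head.replace("s", "f") + last,       # long s misread as f
    }


print(sorted(ocr_variants("ipse")))
```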

Any thoughts?

- L

_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss