On 01/05/2012 02:19 PM, Laurence Penney wrote:
> With a human proofreading UI, it seems essential to be able to “approve”
> pages even if no errors are found, to dissuade future OCR from making changes.

In my little Scandinavian e-text archive, this graph shows what we did in 2011:
http://runeberg.org/admin/klart-2011.svg

The green line at the bottom shows that 22,000 scanned pages were proofread. That's 60 pages per day, at an even pace. The blue line is pages we scanned, which were only 10,000 in the first 5 months and then went to 65,000 pages by the end of the year.

That is a typical year. Every year, we scan and OCR some 40,000 pages more than we have volunteers to proofread. The Internet Archive scans and OCRs far more than this, but proofreading takes so much longer per page and the number of volunteers is limited. So there will be no shortage of untouched old OCR that can be replaced by improved OCR some years later.

In my project, since we have always run OCR semi-manually with proofreading in mind, I don't think we need to redo our 5-year-old OCR, but perhaps some that we did 10 years ago. Some books printed in blackletter/Fraktur could need a new OCR though, if some good software comes around.

-- 
Lars Aronsson
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
