Re: [ol-discuss] Recording the quality of a book's OCR

Tom Morris Tue, 03 Jan 2012 15:50:39 -0800

On Tue, Jan 3, 2012 at 5:49 PM, Laurence Penney <[email protected]> wrote:


> I see in the Thoreau that there are numerous cases where ‘ll’ is mistaken for 
> ‘U’. It would be splendid if, after just a few of these were fixed manually, 
> something could suggest performing numerous other replacements — particularly 
> cases where ‘ll’ was already a candidate for the OCR of that word-part. Is 
> this something that Abbyy can be induced to do?
>

The best place to incorporate this feedback is the recognition
engine's training process, where the engine can combine it with all
the other information it has available at that point.
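
For comparison, the post-processing idea in the quoted message could be
sketched roughly as below. This is only an illustration, not anything
the current pipeline is known to do: the function name, the word list,
and the lexicon are all hypothetical, and it simply proposes a swap
wherever an unknown word becomes a known word after the substitution.

```python
def suggest_replacements(words, wrong, right, lexicon):
    """Propose (index, original, candidate) for unknown words fixed by the swap.

    `lexicon` is a hypothetical set of known lowercase words; a real
    system would more likely use the OCR engine's own alternatives.
    """
    suggestions = []
    for index, word in enumerate(words):
        # Skip words without the suspect string, or that are already valid.
        if wrong not in word or word.lower() in lexicon:
            continue
        candidate = word.replace(wrong, right)
        # Only suggest the swap when it yields a known word.
        if candidate.lower() in lexicon:
            suggestions.append((index, word, candidate))
    return suggestions
```

Calling it with wrong="U" and right="ll" would flag a token like "wiU"
as a candidate for "will" while leaving "Under" alone, provided both
"will" and "under" are in the lexicon. Suggestions would still need a
human to confirm them.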

Is there a description of the scanning, image processing, recognition,
and text post-processing pipeline anywhere?  The software was described
as open source when it was introduced, but the referenced source
repository (http://sourceforge.net/projects/scribesw/) hasn't been
touched in 5+ years, so it seems pretty unlikely that it represents the
software which is actually in use.

There's a blog post from last summer discussing correction of IA texts:
http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html

The Firefox plugin could be used directly if the files were stored in
hOCR format instead of ABBYY's proprietary XML, but converting from
one to the other is straightforward.
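
A minimal sketch of that conversion, line by line, might look like the
following. It assumes the ABBYY XML uses page, line, and charParams
elements with l/t/r/b bounding-box attributes on each line; those
names, and the function name abbyy_to_hocr, are assumptions made for
illustration rather than a description of what IA actually runs, and
the output is only the page and line level of hOCR.

```python
import xml.etree.ElementTree as ET
from html import escape


def local(tag):
    """Strip any XML namespace so element names can be compared directly."""
    return tag.rsplit("}", 1)[-1]


def abbyy_to_hocr(xml_text):
    """Convert ABBYY-style XML to a fragment of hOCR pages and lines.

    Assumed input layout: page > ... > line > ... > charParams, with
    l, t, r, b attributes giving each line's bounding box.
    """
    root = ET.fromstring(xml_text)
    out = []
    pages = [e for e in root.iter() if local(e.tag) == "page"]
    for page_no, page in enumerate(pages, start=1):
        out.append(f"<div class='ocr_page' id='page_{page_no}'>")
        for line in (e for e in page.iter() if local(e.tag) == "line"):
            # Each charParams element is assumed to hold one character.
            text = "".join(
                c.text or "" for c in line.iter() if local(c.tag) == "charParams"
            )
            # hOCR bounding boxes are ordered left, top, right, bottom.
            bbox = " ".join(line.get(k, "0") for k in ("l", "t", "r", "b"))
            out.append(
                f"<span class='ocr_line' title='bbox {bbox}'>{escape(text)}</span>"
            )
        out.append("</div>")
    return "\n".join(out)
```

A complete converter would also need to carry over word boundaries,
paragraphs, and per-character confidence, which this sketch ignores.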

Tom