[Reply sent directly to Lars by mistake, meant to send it to the list]

On 2011-12-30 11:33, Lars Aronsson wrote:
> Hi Edward,
>> A while back I built a prototype for correcting OCR errors in Internet
>> Archive scanned books.
>>
>> http://edwardbetts.com/correct
>
> Very interesting! I tried to add
> http://edwardbetts.com/correct/item/8T20601NOR
> but the text comes one page ahead of the images. (This was scanned
> in Paris, where they use extra rulers along the edges of the page.)

This is one of the bugs.

> A page that I proofread in 20,000 Leagues Under the Sea
> wasn't saved. When I go back, my corrections are not there.
> Was this because I wasn't logged in?

The prototype is incomplete. Yes, you need to log in to save changes.
I'm not sure if saving is working right now.

>
>> It shows a page at a time and lets you see the lines of text as images
>> and text. You can click on a word to correct it. The prototype is very
>> rough, it is ugly, incomplete and contains bugs.
>
> Sorry to focus on the bugs. This is worth more work. What if
> the OCR software made mistakes in segmentation, could the
> proofreader correct this by drawing text boxes manually?

This is a good question. The OCR can identify images. I think it might
already have a box around the word, but has classified it as an image.
I should add this to the display somehow, with a button to switch it
from being an image to being a word. That way we already know the
coordinates.
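
A rough sketch of that switch in Python. This is not taken from the
prototype: the Region type, its fields and the image_to_word name are
all made up for illustration, and the box is assumed to be a
(left, top, right, bottom) tuple in image pixels.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region:
    """Hypothetical OCR region: a box plus what the OCR thinks it is."""
    kind: str          # "word" or "image"
    box: tuple         # (left, top, right, bottom), assumed pixel units
    text: str = ""


def image_to_word(region, text):
    """Relabel an image region as a word, keeping the OCR's coordinates."""
    if region.kind != "image":
        raise ValueError("only image regions can be switched to words")
    return replace(region, kind="word", text=text)
```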

Maybe it would be enough to have people click on the first letter of
the word and type the word in. We probably know the height of the text
from other words on the line, and can guess the width of the word from
the number of letters. The highlighting doesn't need to be exact. This
doesn't work if the OCR misses a line entirely.
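
To make the idea concrete, here is a minimal sketch in Python. It is
only an assumption about how the estimate could work, not code from the
prototype: estimate_word_box is a made-up name, boxes are assumed to be
(left, top, right, bottom) tuples, and the width guess uses the average
width per character of the words the OCR did find on that line.

```python
from statistics import median


def estimate_word_box(click_x, line_words, typed_text):
    """Guess a box for a word the OCR missed.

    click_x:    x coordinate of the click on the word's first letter
    line_words: (text, (left, top, right, bottom)) pairs for the words
                the OCR found on the same line
    typed_text: the word the proofreader typed in
    """
    total_chars = sum(len(text) for text, _ in line_words)
    if total_chars == 0:
        # Nothing known about this line, e.g. the OCR missed it entirely.
        raise ValueError("no known words on this line to estimate from")

    # Text height: take the typical top and bottom of the known words.
    top = median(box[1] for _, box in line_words)
    bottom = median(box[3] for _, box in line_words)

    # Word width: average width per character times the letter count.
    total_width = sum(box[2] - box[0] for _, box in line_words)
    char_width = total_width / total_chars

    return (click_x, top, click_x + char_width * len(typed_text), bottom)
```

The result is deliberately approximate, which should be fine for
highlighting.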

I forgot to mention that the other place the highlighting is used is in
the read-aloud mode. As the book is read, the sentences are highlighted.

I can think of other places where we might use the coordinates: to let
people select text in the image and get plain text out, or to click a
word to look it up in a dictionary or Wikipedia.
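
Both uses come down to hit-testing the word boxes. A small sketch in
Python, with the caveat that the function names and the
(text, (left, top, right, bottom)) word format are assumptions made for
this example, not what the prototype stores:

```python
def text_in_selection(words, selection):
    """Return the plain text of the words inside a dragged rectangle.

    words:     (text, (left, top, right, bottom)) pairs in reading order
    selection: (left, top, right, bottom) rectangle on the page image
    A word counts as selected when its centre lies inside the rectangle.
    """
    sel_left, sel_top, sel_right, sel_bottom = selection
    picked = []
    for text, (left, top, right, bottom) in words:
        centre_x = (left + right) / 2
        centre_y = (top + bottom) / 2
        if sel_left <= centre_x <= sel_right and sel_top <= centre_y <= sel_bottom:
            picked.append(text)
    return " ".join(picked)


def word_at(words, x, y):
    """Return the text of the word under a click, or None if there is none."""
    for text, (left, top, right, bottom) in words:
        if left <= x <= right and top <= y <= bottom:
            return text
    return None
```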

Code is here: https://github.com/EdwardBetts/corrections
-- 
Edward.
_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss