Re: [ol-discuss] Recording the quality of a book's OCR

Edward Betts Tue, 03 Jan 2012 10:46:36 -0800

I agree we need a way to correct our books. I find it difficult to read 
the ePub or Kindle versions of scanned books because of the OCR errors. 
I'm sure the OCR errors are irritating when using text-to-speech.


On 2011-12-30 16:28, Roger Loran Bailey wrote:
> As a matter of fact, I happen to be blind and do read the Daisy books
> that I can download from Open Library. I am only one person, but I don't
> have much interest in those images myself. I need something that my text
> to speech software can read. I am not a Braille reader, but I am sure
> that those who are would be interested in something that can be read by
> their text to Braille software. That includes searching the text too. I am
> vexed by the scanning errors in Open Library books because they make the
> text hard to read. As I said, I am only one person and the people who are
> interested in the photographic images might not be interested in the
> text. It would seem to me though that if the images can be searched just
> as easily by searching an underlying text then that should satisfy
> everyone. I also do not see why some people would be more interested in
> the photos than the text if it reads just the same anyway, but there
> might be reasons that I have not thought of. But since a sighted person
> can read text just as well as the photographs and since blind people
> cannot read the photographs then it would seem to me that priority
> should be given to making that text as readable and searchable as possible.
>
> On 12/30/2011 4:20 PM, Lee Passey wrote:
>> On Fri, December 30, 2011 10:35 am, Edward Betts wrote:
>>
>>> I'll explain scanned image page coordinates for words.
>>>
>>> In the book reader we show the scanned images. This is the image of the
>>> page captured by the book scanner. When you search with the book reader
>>> it highlights matched words. It does this by drawing a box around the
>>> word in the image. To draw the box we need to know the location of the
>>> word on the page, that is the x and y coordinates and the height and
>>> width of the word in pixels.
>>>
>>> This is the reason we're still looking for software to provide
>>> corrections for OCR. You're right, that one solution is to let people
>>> download the text, correct it and upload Word files, but it doesn't
>>> match our requirements.
>> I suspect that this is an unnecessary requirement.
>>
>> The purpose of word coordinates is to support text searching in the various
>> "photo album" formats: FlipBook, DjVu, and PDF. In these formats the user is
>> not presented with digitized text, she is presented with the picture of a
>> page. Of course, one cannot search for text in a blob of pixels, so these
>> formats have a hidden layer containing text; when a user searches for a word,
>> the program searches the hidden text, gets the perceived coordinates of that
>> word on a certain picture, and then presents the picture with that rectangle
>> highlighted.
>>
>> For this purpose, crappy OCR is probably good enough. Garbage "words" are
>> irrelevant, as the end user never sees them. If a word is mis-recognized, and
>> it just happens to be a word that a user is searching for, the search
>> algorithm will not find that particular instance of the word, but in the big
>> picture that failure just doesn't matter much. So for the purpose of backing
>> "photo album" formats, improved text is probably unnecessary.
>>
>> This part of my message is where it gets important.
>>
>> <strong>The people who are asking for a method to improve text don't want to
>> use a "photo album," they want to use the OCRed text directly.</strong>
>>
>> For the purpose of improving text files, maintaining word coordinates is
>> unnecessary. There is no expectation that this improved text become the
>> backing text for a "photo album" format, nor is there any need for it to do
>> so. Indeed, the ".txt" files available at archive.org do not maintain word
>> coordinates, so I can see no reason why anyone would object to those files
>> being updated by proofreaders.
>>
>> As I understand it, ePub files are created by dynamically generating HTML from
>> the Abbyy xml format
>> (http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml) then packaging
>> that HTML into the ePub container. I also understand that if an improved HTML
>> file already exists in the output directory then /that/ file will be used and
>> a new HTML file will not be generated. If this understanding is correct, then
>> there is no reason why the HTML file could not also be incrementally improved
>> without maintaining word coordinates.
>>
>> My suggestion is to move forward with some sort of incremental improvement
>> process for .txt and HTML files. There is no need to try and maintain word
>> coordinates for these files. The current method of developing the text backing
>> for the "photo album" formats can continue as it currently is without being
>> impacted by improvements in other formats.
>>
>> _______________________________________________
>> Ol-discuss mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
>> To unsubscribe from this mailing list, send email to [email protected]

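To make the word-coordinate discussion quoted above concrete, here is a
minimal sketch of how a search over the hidden text layer might turn into
highlight boxes on the page image. It is not the book reader's actual code:
the Word record, the find_highlights function, and the sample data are all
hypothetical, and it assumes the OCR output has already been reduced to one
record per word with pixel coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR word with its position on the scanned page image, in pixels."""
    text: str
    page: int
    x: int       # left edge of the word
    y: int       # top edge of the word
    width: int
    height: int


def find_highlights(words, query):
    """Return (page, x, y, width, height) boxes for words matching query.

    Matching is case-insensitive and ignores surrounding punctuation.
    A mis-recognized word simply fails to match, so that instance is
    not highlighted.
    """
    needle = query.strip().lower()
    boxes = []
    for w in words:
        if w.text.strip(".,;:!?\"'()").lower() == needle:
            boxes.append((w.page, w.x, w.y, w.width, w.height))
    return boxes


# Hypothetical OCR output for part of one page.
# "wha1e." is a mis-recognized "whale."
hidden_text = [
    Word("The", 12, 100, 200, 48, 22),
    Word("whale", 12, 156, 200, 80, 22),
    Word("and", 12, 244, 200, 46, 22),
    Word("wha1e.", 12, 298, 200, 86, 22),
]

print(find_highlights(hidden_text, "whale"))
# prints [(12, 156, 200, 80, 22)]
```

The mis-recognized "wha1e" is never matched, which is the failure mode Lee
describes: that one instance goes unhighlighted, and nothing else breaks.
Corrected text uploaded without coordinates could not feed this lookup,
which is why the two uses of the text have different requirements.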

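Lee's understanding of the ePub step in the quoted message above (reuse an
improved HTML file if one already exists in the output directory, otherwise
generate one from the OCR XML) can be sketched in a few lines of Python.
Whether the real pipeline behaves this way is not confirmed in the thread;
the file name book.html and both function names are assumptions made for
illustration only.

```python
from pathlib import Path
import tempfile


def generate_html_from_ocr(ocr_xml: Path) -> str:
    """Stand-in for the real OCR-XML-to-HTML conversion (hypothetical)."""
    return f"<html><body><!-- generated from {ocr_xml.name} --></body></html>"


def html_for_epub(output_dir: Path, ocr_xml: Path,
                  html_name: str = "book.html") -> str:
    """Prefer a hand-corrected HTML file; fall back to generating one."""
    html_path = output_dir / html_name
    if html_path.exists():
        # A proofreader has already supplied improved HTML: use it as-is.
        return html_path.read_text(encoding="utf-8")
    return generate_html_from_ocr(ocr_xml)


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    xml = out / "book_abbyy.xml"  # hypothetical name; never read by the stub

    generated = html_for_epub(out, xml)

    # Simulate a proofreader saving corrected HTML into the output directory.
    (out / "book.html").write_text(
        "<html><body>Corrected text</body></html>", encoding="utf-8"
    )
    corrected = html_for_epub(out, xml)

    print("generated" in generated, "Corrected" in corrected)
    # prints True True
```

Nothing in this path touches word coordinates, which is the point of Lee's
suggestion: the text formats could be improved incrementally while the
coordinate-backed formats keep using the original OCR output.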