Re: [ol-discuss] Recording the quality of a book's OCR

Roger Loran Bailey Fri, 30 Dec 2011 16:28:36 -0800

As a matter of fact, I happen to be blind and do read the DAISY books 
that I can download from Open Library. I am only one person, but I don't 
have much interest in those images myself. I need something that my 
text-to-speech software can read. I am not a Braille reader, but I am 
sure that those who are want something that can be read by their 
text-to-Braille software. That includes searching the text too. I am 
vexed by the scanning errors in Open Library books because they make the 
books hard to read. As I said, I am only one person and the people who are 
interested in the photographic images might not be interested in the 
text. It would seem to me though that if the images can be searched just 
as easily by searching an underlying text then that should satisfy 
everyone. I also do not see why some people would be more interested in 
the photos than the text if it reads just the same anyway, but there 
might be reasons that I have not thought of. But since a sighted person 
can read text just as well as the photographs and since blind people 
cannot read the photographs then it would seem to me that priority 
should be given to making that text as readable and searchable as possible.


On 12/30/2011 4:20 PM, Lee Passey wrote:
> On Fri, December 30, 2011 10:35 am, Edward Betts wrote:
>
>> I'll explain scanned image page coordinates for words.
>>
>> In the book reader we show the scanned images. This is the image of the
>> page captured by the book scanner. When you search with the book reader
>> it highlights matched words. It does this by drawing a box around the
>> word in the image. To draw the box we need to know the location of the
>> word on the page, that is the x and y coordinates and the height and
>> width of the word in pixels.
>>
>> This is the reason we're still looking for software to provide
>> corrections for OCR. You're right, that one solution is to let people
>> download the text, correct it and upload Word files, but it doesn't
>> match our requirements.
> I suspect that this is an unnecessary requirement.
>
> The purpose of word coordinates is to support text searching in the various
> "photo album" formats: FlipBook, DjVu, and PDF. In these formats the user is
> not presented with digitized text, she is presented with the picture of a
> page. Of course, one cannot search for text in a blob of pixels, so these
> formats have a hidden layer containing text; when a user searches for a word,
> the program searches the hidden text, gets the perceived coordinates of that
> word on a certain picture, and then presents the picture with that rectangle
> highlighted.
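
To make the lookup described above concrete, here is a minimal sketch in Python. It is not the book reader's actual code: the `WordBox` type, its field names, the `find_word` helper, and the sample data are all assumptions made up for illustration. It only shows the idea of searching a hidden text layer and getting back rectangles to highlight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word and its rectangle on a page image, in pixels (hypothetical type)."""
    text: str
    page: int
    x: int
    y: int
    width: int
    height: int


def find_word(hidden_layer, query):
    """Return the boxes whose OCR text equals the query, ignoring case."""
    needle = query.casefold()
    return [box for box in hidden_layer if box.text.casefold() == needle]


# Made-up hidden layer for one page; "tbe" stands in for a mis-recognized "the".
layer = [
    WordBox("the", 1, 40, 100, 30, 12),
    WordBox("quick", 1, 75, 100, 48, 12),
    WordBox("tbe", 1, 40, 120, 30, 12),
]

# Each hit carries the rectangle a viewer would draw over the page image.
hits = find_word(layer, "The")
```

In this sketch the garbled "tbe" is simply never matched, so that one instance is missed by the search while the reader still sees the correct page image.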
>
> For this purpose, crappy OCR is probably good enough. Garbage "words" are
> irrelevant, as the end user never sees them. If a word is mis-recognized, and
> it just happens to be a word that a user is searching for, the search
> algorithm will not find that particular instance of the word, but in the big
> picture that failure just doesn't matter much. So for the purpose of backing
> "photo album" formats, improved text is probably unnecessary.
>
> This part of my message is where it gets important.
>
> *The people who are asking for a method to improve text don't want to
> use a "photo album," they want to use the OCRed text directly.*
>
> For the purpose of improving text files, maintaining word coordinates is
> unnecessary. There is no expectation that this improved text become the
> backing text for a "photo album" format, nor is there any need for it to do
> so. Indeed, the ".txt" files available at archive.org do not maintain word
> coordinates, so I can see no reason why anyone would object to those files
> being updated by proofreaders.
>
> As I understand it, ePub files are created by dynamically generating HTML from
> the Abbyy xml format
> (http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml) then packaging
> that HTML into the ePub container. I also understand that if an improved HTML
> file already exists in the output directory then /that/ file will be used and
> a new HTML file will not be generated. If this understanding is correct, then
> there is no reason why the HTML file could not also be incrementally improved
> without maintaining word coordinates.
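
If that understanding of the ePub pipeline is right, the behavior could be sketched roughly as follows. This is only an illustration under stated assumptions: the file name `book.html`, the function `html_for_epub`, and the `generate_html` callback are hypothetical and are not taken from the archive.org code.

```python
from pathlib import Path


def html_for_epub(output_dir, generate_html):
    """Return the HTML to package: a stored, proofread file wins over generation.

    `generate_html` is a hypothetical callback standing in for the step that
    builds HTML from the OCR XML.
    """
    html_path = Path(output_dir) / "book.html"  # hypothetical file name
    if html_path.exists():
        # An improved file is already there, so use it and skip generation.
        return html_path.read_text(encoding="utf-8")
    # No stored file: fall back to generating HTML on the fly.
    return generate_html()
```

The point of the sketch is that a corrected HTML file needs no word coordinates at all; it only has to exist where the packaging step looks for it.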
>
> My suggestion is to move forward with some sort of incremental improvement
> process for .txt and HTML files. There is no need to try and maintain word
> coordinates for these files. The current method of developing the text backing
> for the "photo album" formats can continue as it currently is without being
> impacted by improvements in other formats.
>
> _______________________________________________
> Ol-discuss mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> To unsubscribe from this mailing list, send email to 
> [email protected]
