As a matter of fact, I happen to be blind, and I do read the DAISY books that I can download from Open Library. I am only one person, but I don't have much interest in those images myself. I need something that my text-to-speech software can read. I am not a Braille reader, but I am sure that those who are want something that their text-to-Braille software can read. That includes searching the text too. I am vexed by the scanning errors in Open Library books because they make the books hard to read.

As I said, I am only one person, and the people who are interested in the photographic images might not be interested in the text. It would seem to me, though, that if the images can be searched just as easily by searching an underlying text, then that should satisfy everyone. I also do not see why some people would be more interested in the photos than the text if it reads just the same anyway, but there might be reasons that I have not thought of. But since a sighted person can read text just as well as the photographs, and since blind people cannot read the photographs, it would seem to me that priority should be given to making that text as readable and searchable as possible.

On 12/30/2011 4:20 PM, Lee Passey wrote:
> On Fri, December 30, 2011 10:35 am, Edward Betts wrote:
>
>> I'll explain scanned image page coordinates for words.
>>
>> In the book reader we show the scanned images. This is the image of the
>> page captured by the book scanner. When you search with the book reader
>> it highlights matched words. It does this by drawing a box around the
>> word in the image. To draw the box we need to know the location of the
>> word on the page, that is the x and y coordinates and the height and
>> width of the word in pixels.
>>
>> This is the reason we're still looking for software to provide
>> corrections for OCR. You're right, that one solution is to let people
>> download the text, correct it and upload Word files, but it doesn't
>> match our requirements.
> I suspect that this is an unnecessary requirement.
>
> The purpose of word coordinates is to support text searching in the various
> "photo album" formats: FlipBook, DjVu, and PDF. In these formats the user is
> not presented with digitized text, she is presented with the picture of a
> page. Of course, one cannot search for text in a blob of pixels, so these
> formats have a hidden layer containing text; when a user searches for a word,
> the program searches the hidden text, gets the perceived coordinates of that
> word on a certain picture, and then presents the picture with that rectangle
> highlighted.
>
> For this purpose, crappy OCR is probably good enough. Garbage "words" are
> irrelevant, as the end user never sees them. If a word is mis-recognized, and
> it just happens to be a word that a user is searching for, the search
> algorithm will not find that particular instance of the word, but in the big
> picture that failure just doesn't matter much. So for the purpose of backing
> "photo album" formats, improved text is probably unnecessary.
>
> This part of my message is where it gets important.
>
> <strong>The people who are asking for a method to improve text don't want to
> use a "photo album," they want to use the OCRed text directly.</strong>
>
> For the purpose of improving text files, maintaining word coordinates is
> unnecessary. There is no expectation that this improved text become the
> backing text for a "photo album" format, nor is there any need for it to do
> so. Indeed, the ".txt" files available at archive.org do not maintain word
> coordinates, so I can see no reason why anyone would object to those files
> being updated by proofreaders.
>
> As I understand it, ePub files are created by dynamically generating HTML from
> the Abbyy xml format
> (http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml) then packaging
> that HTML into the ePub container. I also understand that if an improved HTML
> file already exists in the output directory then /that/ file will be used and
> a new HTML file will not be generated. If this understanding is correct, then
> there is no reason why the HTML file could not also be incrementally improved
> without maintaining word coordinates.
>
> My suggestion is to move forward with some sort of incremental improvement
> process for .txt and HTML files. There is no need to try and maintain word
> coordinates for these files. The current method of developing the text backing
> for the "photo album" formats can continue as it currently is without being
> impacted by improvements in other formats.
>
> _______________________________________________
> Ol-discuss mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> To unsubscribe from this mailing list, send email to
> [email protected]
