I agree we need a way to correct our books. I find it difficult to read the ePub or Kindle versions of scanned books because of the OCR errors. I'm sure the OCR errors are irritating when using text-to-speech.
On 2011-12-30 16:28, Roger Loran Bailey wrote:
> As a matter of fact, I happen to be blind and do read the Daisy books
> that I can download from Open Library. I am only one person, but I don't
> have much interest in those images myself. I need something that my text
> to speech software can read. I am not a Braille reader, but I am sure
> that those who are are interested in something that can be read by their
> text to Braille software. That includes searching the text too. I am
> vexed by the scanning errors in Open Library books because it makes it
> hard to read. As I said, I am only one person and the people who are
> interested in the photographic images might not be interested in the
> text. It would seem to me though that if the images can be searched just
> as easily by searching an underlying text then that should satisfy
> everyone. I also do not see why some people would be more interested in
> the photos than the text if it reads just the same anyway, but there
> might be reasons that I have not thought of. But since a sighted person
> can read text just as well as the photographs and since blind people
> cannot read the photographs then it would seem to me that priority
> should be given to making that text as readable and searchable as possible.
>
> On 12/30/2011 4:20 PM, Lee Passey wrote:
>> On Fri, December 30, 2011 10:35 am, Edward Betts wrote:
>>
>>> I'll explain scanned image page coordinates for words.
>>>
>>> In the book reader we show the scanned images. This is the image of the
>>> page captured by the book scanner. When you search with the book reader
>>> it highlights matched words. It does this by drawing a box around the
>>> word in the image. To draw the box we need to know the location of the
>>> word on the page, that is the x and y coordinates and the height and
>>> width of the word in pixels.
>>>
>>> This is the reason we're still looking for software to provide
>>> corrections for OCR.
>>> You're right, that one solution is to let people
>>> download the text, correct it and upload Word files, but it doesn't
>>> match our requirements.
>> I suspect that this is an unnecessary requirement.
>>
>> The purpose of word coordinates is to support text searching in the various
>> "photo album" formats: FlipBook, DjVu, and PDF. In these formats the user is
>> not presented with digitized text, she is presented with the picture of a
>> page. Of course, one cannot search for text in a blob of pixels, so these
>> formats have a hidden layer containing text; when a user searches for a word,
>> the program searches the hidden text, gets the perceived coordinates of that
>> word on a certain picture, and then presents the picture with that rectangle
>> highlighted.
>>
>> For this purpose, crappy OCR is probably good enough. Garbage "words" are
>> irrelevant, as the end user never sees them. If a word is mis-recognized, and
>> it just happens to be a word that a user is searching for, the search
>> algorithm will not find that particular instance of the word, but in the big
>> picture that failure just doesn't matter much. So for the purpose of backing
>> "photo album" formats, improved text is probably unnecessary.
>>
>> This part of my message is where it gets important.
>>
>> <strong>The people who are asking for a method to improve text don't want to
>> use a "photo album," they want to use the OCRed text directly.</strong>
>>
>> For the purpose of improving text files, maintaining word coordinates is
>> unnecessary. There is no expectation that this improved text become the
>> backing text for a "photo album" format, nor is there any need for it to do
>> so. Indeed, the ".txt" files available at archive.org do not maintain word
>> coordinates, so I can see no reason why anyone would object to those files
>> being updated by proofreaders.
>>
>> As I understand it, ePub files are created by dynamically generating HTML
>> from the Abbyy xml format
>> (http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml) then
>> packaging that HTML into the ePub container. I also understand that if an
>> improved HTML file already exists in the output directory then /that/ file
>> will be used and a new HTML file will not be generated. If this
>> understanding is correct, then there is no reason why the HTML file could
>> not also be incrementally improved without maintaining word coordinates.
>>
>> My suggestion is to move forward with some sort of incremental improvement
>> process for .txt and HTML files. There is no need to try and maintain word
>> coordinates for these files. The current method of developing the text
>> backing for the "photo album" formats can continue as it currently is
>> without being impacted by improvements in other formats.
>>
>> _______________________________________________
>> Ol-discuss mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
>> To unsubscribe from this mailing list, send email to
>> [email protected]
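
For anyone following the thread who wants to see the word-coordinate idea concretely, here is a minimal sketch of the search-and-highlight step described in the quoted messages: search the hidden text layer, look up each match's box, and hand those rectangles to whatever draws on the page image. It is only an illustration. The WordBox record, the find_highlights function, and the sample data are all made up for this example and are not taken from the book reader's actual code or from any archive.org file format.

```python
from dataclasses import dataclass
import string


@dataclass(frozen=True)
class WordBox:
    """One OCR word plus its position on a scanned page (hypothetical record)."""
    text: str
    page: int
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int   # box width, in pixels
    height: int  # box height, in pixels


def normalize(word: str) -> str:
    """Lowercase a word and strip surrounding punctuation before comparing."""
    return word.strip(string.punctuation).lower()


def find_highlights(words, query):
    """Return the boxes of every word in the hidden layer that matches query."""
    target = normalize(query)
    return [w for w in words if normalize(w.text) == target]


# Made-up hidden text layer for two pages, including one OCR misread.
hidden_layer = [
    WordBox("The", 1, 100, 50, 42, 18),
    WordBox("whale,", 1, 150, 50, 70, 18),
    WordBox("tbe", 1, 228, 50, 40, 18),   # OCR misread of "the"
    WordBox("Whale", 2, 90, 120, 68, 18),
]

whale_hits = find_highlights(hidden_layer, "whale")
the_hits = find_highlights(hidden_layer, "the")

for hit in whale_hits:
    print(f"page {hit.page}: draw box at ({hit.x}, {hit.y}) "
          f"size {hit.width}x{hit.height}")
```

Searching for "the" in this sample finds only one of the two occurrences, because the misread "tbe" never matches. That is the failure Lee describes as tolerable for highlighting, and it is exactly the kind of error that matters to someone reading the text directly.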
