On Fri, December 30, 2011 10:35 am, Edward Betts wrote:
> I'll explain scanned image page coordinates for words.
>
> In the book reader we show the scanned images. This is the image of the
> page captured by the book scanner. When you search with the book reader
> it highlights matched words. It does this by drawing a box around the
> word in the image. To draw the box we need to know the location of the
> word on the page, that is the x and y coordinates and the height and
> width of the word in pixels.
>
> This is the reason we're still looking for software to provide
> corrections for OCR. You're right, that one solution is to let people
> download the text, correct it and upload Word files, but it doesn't
> match our requirements.

I suspect that maintaining word coordinates for corrected text is an unnecessary requirement.

The purpose of word coordinates is to support text searching in the various "photo album" formats: FlipBook, DjVu, and PDF. In these formats the user is not presented with digitized text; she is presented with a picture of the page. Of course, one cannot search for text in a blob of pixels, so these formats carry a hidden text layer. When a user searches for a word, the program searches the hidden text, looks up the coordinates recorded for that word on a particular page image, and then presents the image with that rectangle highlighted.

For this purpose, crappy OCR is probably good enough. Garbage "words" are irrelevant, as the end user never sees them. If a word is misrecognized, and it happens to be a word that a user is searching for, the search will miss that particular instance of the word, but in the big picture that failure doesn't matter much. So for the purpose of backing "photo album" formats, improved text is probably unnecessary.

This is the important part of my message: the people who are asking for a method to improve text don't want to use a "photo album"; they want to use the OCRed text directly.

For the purpose of improving text files, maintaining word coordinates is unnecessary. There is no expectation that this improved text become the backing text for a "photo album" format, nor is there any need for it to do so. Indeed, the ".txt" files available at archive.org do not maintain word coordinates, so I can see no reason why anyone would object to those files being updated by proofreaders.

As I understand it, ePub files are created by dynamically generating HTML from the ABBYY XML format (http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml), then packaging that HTML into the ePub container.
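To illustrate the kind of conversion I mean, here is a minimal sketch. It is not archive.org's actual code. The sample XML is a hypothetical, heavily simplified stand-in loosely modeled on the ABBYY format (no namespace, only par/line/charParams elements with l, t, r, b box attributes), and paragraphs_to_html is a name I made up. The point is only that the character boxes can be ignored when producing HTML.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample loosely modeled on ABBYY FineReader XML.
# Real files are namespaced and much richer; this is only an illustration.
SAMPLE = """
<document>
  <page>
    <par>
      <line>
        <charParams l="10" t="20" r="18" b="34">H</charParams>
        <charParams l="19" t="20" r="24" b="34">i</charParams>
      </line>
      <line>
        <charParams l="10" t="40" r="18" b="54">a</charParams>
        <charParams l="19" t="40" r="23" b="54">l</charParams>
        <charParams l="24" t="40" r="28" b="54">l</charParams>
      </line>
    </par>
    <par>
      <line>
        <charParams l="10" t="70" r="20" b="84">A</charParams>
        <charParams l="21" t="70" r="31" b="84">&amp;</charParams>
        <charParams l="32" t="70" r="42" b="84">B</charParams>
      </line>
    </par>
  </page>
</document>
"""


def paragraphs_to_html(xml_text):
    """Turn each <par> into an HTML <p>, discarding all coordinates."""
    root = ET.fromstring(xml_text.strip())
    out = []
    for par in root.iter("par"):
        lines = []
        for line in par.iter("line"):
            # Only the character text is read; l/t/r/b are never consulted.
            chars = [c.text or "" for c in line.iter("charParams")]
            lines.append("".join(chars))
        text = " ".join(ln for ln in lines if ln)
        if text:
            out.append("<p>" + html.escape(text) + "</p>")
    return "\n".join(out)
```

Calling paragraphs_to_html(SAMPLE) yields two <p> elements and no coordinate data, which is all a proofreader would need to work with.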
I also understand that if an improved HTML file already exists in the output directory, then /that/ file will be used and a new HTML file will not be generated. If this understanding is correct, then there is no reason why the HTML file could not also be incrementally improved without maintaining word coordinates.

My suggestion is to move forward with some sort of incremental improvement process for .txt and HTML files. There is no need to try to maintain word coordinates for these files. The current method of developing the text backing for the "photo album" formats can continue as it is, unaffected by improvements in other formats.
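P.S. To make the hidden-text lookup in the "photo album" formats concrete, here is a minimal sketch under my own assumptions. The Word record, the find_boxes helper, and the sample data are all hypothetical; I am not claiming any real reader is implemented this way. It shows why a misrecognized or garbage "word" simply never matches, while correctly recognized instances still get a highlight box.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One entry in a hypothetical hidden text layer."""
    text: str    # OCR output, possibly wrong
    page: int
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def find_boxes(words, query):
    """Return (page, x, y, width, height) for every exact match of query."""
    q = query.casefold()
    return [
        (w.page, w.x, w.y, w.width, w.height)
        for w in words
        if w.text.casefold() == q
    ]


# Made-up sample layer: two good instances, one misread, one garbage token.
LAYER = [
    Word("whale", 1, 120, 300, 88, 24),
    Word("wha1e", 1, 410, 512, 86, 24),  # misrecognized: never matches "whale"
    Word("~;'", 2, 40, 60, 30, 10),      # garbage: never matches anything useful
    Word("Whale", 2, 95, 140, 92, 25),
]

boxes = find_boxes(LAYER, "whale")
```

Here boxes holds the two rectangles for the correctly recognized instances; the misread one on page 1 is silently skipped, which is the failure I am arguing doesn't matter much.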
