On 2011-12-30 13:20, Lee Passey wrote:
> On Fri, December 30, 2011 10:35 am, Edward Betts wrote:
>
>> I'll explain scanned image page coordinates for words.
>>
>> In the book reader we show the scanned images. This is the image of the
>> page captured by the book scanner. When you search with the book reader
>> it highlights matched words. It does this by drawing a box around the
>> word in the image. To draw the box we need to know the location of the
>> word on the page, that is the x and y coordinates and the height and
>> width of the word in pixels.
>>
>> This is the reason we're still looking for software to provide
>> corrections for OCR. You're right, that one solution is to let people
>> download the text, correct it and upload Word files, but it doesn't
>> match our requirements.
>
> I suspect that this is an unnecessary requirement.
>
> The purpose of word coordinates is to support text searching in the various
> "photo album" formats: FlipBook, DejaVu, and PDF. In these formats the user is
> not presented with digitized text, she is presented with the picture of a
> page. Of course, one cannot search for text in a blob of pixels, so these
> formats have a hidden layer containing text; when a user searches for a word,
> the program searches the hidden text, gets the perceived coordinates of that
> word on a certain picture, and then presents the picture with that rectangle
> highlighted.
>
> For this purpose, crappy OCR is probably good enough. Garbage "words" are
> irrelevant, as the end user never sees them. If a word is mis-recognized, and
> it just happens to be a word that a user is searching for, the search
> algorithm will not find that particular instance of the word, but in the big
> picture that failure just doesn't matter much. So for the purpose of backing
> "photo album" formats, improved text is probably unnecessary.
>
> This part of my message is where it gets important.
>
> <strong>The people who are asking for a method to improve text don't want to
> use a "photo album," they want to use the OCRed text directly.</strong>
>
> For the purpose of improving text files, maintaining word coordinates is
> unnecessary. There is no expectation that this improved text become the
> backing text for a "photo album" format, nor is there any need for it to do
> so. Indeed, the ".txt" files available at archive.org do not maintain word
> coordinates, so I can see no reason why anyone would object to those files
> being updated by proofreaders.
>
> As I understand it, ePub files are created by dynamically generating HTML from
> the Abbyy xml format
> (http://www.abbyy.com/FineReader_xml/FineReader6-schema-v1.xml) then packaging
> that HTML into the ePub container. I also understand that if an improved HTML
> file already exists in the output directory then /that/ file will be used and
> a new HTML file will not be generated. If this understanding is correct, then
> there is no reason why the HTML file could not also be incrementally improved
> without maintaining word coordinates.
>
> My suggestion is to move forward with some sort of incremental improvement
> process for .txt and HTML files. There is no need to try and maintain word
> coordinates for these files. The current method of developing the text backing
> for the "photo album" formats can continue as it currently is without being
> impacted by improvements in other formats.
Lee, you're right, we could drop our requirement for corrected text that
works with "photo album" formats.

There is still work that would need to be done to coordinate corrections.
It would be good to have a web interface that lets people see the image
and the text, make corrections, mark when a page is complete and keeps
track of who has made the most corrections with a leader board.

-- 
Edward.
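
For anyone following the thread who wants the word-coordinate lookup spelled out, here is a minimal Python sketch of the idea described above: search the hidden OCR text, then return the box to highlight on the page image. It is only an illustration, not the book reader's actual code. The WordBox type, the find_word_boxes function and the sample coordinates are all made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word plus its box on the page image (hypothetical type)."""
    text: str
    page: int
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int   # in pixels
    height: int  # in pixels


def find_word_boxes(words, query):
    """Return the boxes of every word whose text matches the query.

    Matching is case-insensitive and ignores surrounding punctuation.
    """
    needle = query.casefold()
    punctuation = '.,;:!?"\'()'
    return [w for w in words if w.text.strip(punctuation).casefold() == needle]


# Invented sample data; a real hidden text layer would come from the OCR output.
words = [
    WordBox("Call", 1, 120, 340, 58, 22),
    WordBox("me", 1, 186, 340, 34, 22),
    WordBox("Ishmael.", 1, 228, 340, 110, 22),
    WordBox("lshmael", 2, 90, 512, 98, 22),  # mis-recognized: lowercase L for I
]

hits = find_word_boxes(words, "ishmael")
for hit in hits:
    print(f"page {hit.page}: highlight x={hit.x} y={hit.y} "
          f"w={hit.width} h={hit.height}")
```

The mis-recognized word on page 2 is simply not found, which is the failure Lee describes: that one instance is missed, and every other match still highlights correctly.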
