On 12/30/2011 10:20 PM, Lee Passey wrote:
> For this purpose, crappy OCR is probably good enough. Garbage "words" are
> irrelevant, as the end user never sees them. If a word is mis-recognized, and
> it just happens to be a word that a user is searching for, the search
> algorithm will not find that particular instance of the word, but in the big
> picture that failure just doesn't matter much. So for the purpose of backing
> "photo album" formats, improved text is probably unnecessary.
>
> This part of my message is where it gets important.
>
> <strong>The people who are asking for a method to improve text don't want to
> use a "photo album," they want to use the OCRed text directly.</strong>

I think you and Edward have different use cases in mind. The actual
layout and typography are very important in many cases, including
illustrated books, books with multiple columns, encyclopedias, and
newspapers. The plain e-text output of, say, Project Gutenberg is
insufficient for these cases.

Searching for individual words is most important for the less common
names of people and places. This is also where OCR is most likely
to fail, because OCR is guided by a dictionary that only contains
the common words that nobody searches for.

If someone takes the effort to correct an OCR error, that correction
should be made useful for the next person who searches for just that
word. And the search hit should be mapped to the word's coordinates
on the page. This is more important for large pages (especially
newspapers), where the layout is complex, print quality is seldom
perfect, and OCR errors are more likely.

Here is an example of a newspaper page that has been proofread
without preserving the word coordinates,
http://sv.wikisource.org/wiki/Sida:Post-_och_Inrikes_Tidningar_1836-01-28.djvu/1

This page only has three columns, so it is quite small. But still,
if you get a search hit for "Socker-raffineringsverken", you will
have to look around the scanned image to find that word. It would
have been far better if the word was highlighted directly in the
image.

In the diff, you can see that the original OCR had two errors in
this word, "Socker-rafjineringsvcrken" (f > j, e > c),
http://sv.wikisource.org/w/index.php?title=Sida%3APost-_och_Inrikes_Tidningar_1836-01-28.djvu%2F1&action=historysubmit&diff=94668&oldid=72066

The diff shows that the original OCR had many such errors.

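To make the idea concrete, here is a minimal sketch of what "correcting the text while keeping the coordinates" could look like. This is only an illustration, not how Wikisource, DjVu, or any OCR package actually stores its data: the `Word` class, the `correct` and `search` helpers, the pixel coordinates, and the neighbouring word are all made up for the example. Only the misread and corrected spellings come from the page above.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR token plus its bounding box on the scanned page (hypothetical layout)."""
    text: str
    box: tuple  # (x, y, width, height) in pixels; values below are invented


def correct(words, index, new_text):
    """Replace the recognized text of one word; its bounding box is left untouched."""
    words[index].text = new_text


def search(words, term):
    """Return the bounding boxes of all words matching the term (case-insensitive)."""
    wanted = term.lower()
    return [w.box for w in words if w.text.lower() == wanted]


# A tiny stand-in for the word list of one page.
page = [
    Word("Socker-rafjineringsvcrken", (412, 1630, 298, 31)),  # raw OCR, two wrong letters
    Word("Stockholm", (732, 1630, 120, 31)),                  # made-up neighbouring word
]

# Before proofreading, the search misses the word entirely.
before = search(page, "Socker-raffineringsverken")

# A proofreader fixes the text; the coordinates survive the edit.
correct(page, 0, "Socker-raffineringsverken")

# Now the hit comes back with a box that a viewer could highlight in the image.
after = search(page, "Socker-raffineringsverken")

print(before)
print(after)
```

The point of the sketch is that the correction is applied per word rather than to a flat page of text, so the link between each word and its place in the image is never lost.
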
-- 
Lars Aronsson ([email protected])
Aronsson Datateknik - http://aronsson.se
Project Runeberg - free Nordic literature - http://runeberg.org/
