Let me explain what word coordinates on the scanned page image are. The book reader shows the scanned images, that is, the images of each page as captured by the book scanner. When you search inside a book, the reader highlights the matched words by drawing a box around each one in the page image. To draw the box we need to know where the word sits on the page: its x and y coordinates and its width and height, all in pixels.
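As a rough illustration of the idea, here is a minimal sketch in Python. It is not the actual book reader code: the names `WordBox`, `find_highlights` and `scale_box` are made up for this example, and I am assuming that coordinates are measured from the top-left corner of the page image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCR word and its location on the page image (hypothetical structure)."""
    text: str
    x: int       # left edge in pixels (assumed origin: top-left of the image)
    y: int       # top edge in pixels
    width: int   # box width in pixels
    height: int  # box height in pixels


def find_highlights(words, query):
    """Return the boxes of all words matching the query.

    Matching ignores case and punctuation attached to the word.
    """
    needle = query.strip().lower()
    return [w for w in words if w.text.strip(".,;:!?\"'()").lower() == needle]


def scale_box(box, factor):
    """Convert a box from scan pixels to display pixels for a zoomed page."""
    return (round(box.x * factor), round(box.y * factor),
            round(box.width * factor), round(box.height * factor))


# A few words from an imaginary page, with invented coordinates.
page = [
    WordBox("The", 120, 200, 58, 30),
    WordBox("whale,", 190, 200, 104, 30),
    WordBox("the", 306, 200, 52, 30),
    WordBox("Whale", 120, 244, 98, 30),
]

hits = find_highlights(page, "whale")
for hit in hits:
    # Each tuple is (x, y, width, height) of the rectangle to draw at half size.
    print(hit.text, scale_box(hit, 0.5))
```

Whatever the real storage format looks like, the point is that every word carries its own box alongside its text.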
This is the reason we're still looking for software to provide corrections for OCR. You're right that one solution is to let people download the text, correct it and upload Word files, but it doesn't meet our requirements: a corrected Word file no longer carries the coordinates of each word, so we could not highlight search matches on the page image.

On 2011-12-30 08:29, Roger Loran Bailey wrote:
> Let me say that when it comes to technical matters I am an ignoramus. So
> when I asked earlier about collaborating with Bookshare in getting
> Bookshare volunteers to proofread Open Library books and I was asked how
> that was done, with a web form or something like Word and was asked
> about word coordinates I did not exactly understand what was being asked
> about. I said that Bookshare volunteers download a scanned book and
> proofread it and then upload it with corrections and the corrected copy
> goes into the collection. I said that Bookshare volunteers commonly use
> Word to do this. Actually, Bookshare volunteers use whatever word
> processing software they happen to have. They are not supplied with any.
> However, Bookshare has its own set of tools that do conversions and that
> make the book a Daisy book before it is ready for download. I do not
> know what kind of tools these are, but on rereading the quoted material
> in this email reply I noted something that I had not noticed before.
> That was the comment about retaining word coordinates so that it would
> be possible to search inside the book. As a matter of fact, Bookshare's
> search engine does search inside the books. When you do a search of the
> collection both titles and text from inside the books themselves are
> returned in the results. I do not understand what word coordinates are
> nor much about the other technical aspects of search engines or other
> matters that make these books available in Daisy format, but if the
> Bookshare search engine searches inside the text of the books then
> perhaps Bookshare books are compatible with Open Library books after
> all.
>
> My suggestion was that perhaps scanned Open Library books could be
> supplied to Bookshare to be proofread by Bookshare volunteers and then
> could be added to the Bookshare collection and a copy could then be
> returned to Open Library to be added to the Open Library protected Daisy
> collection. That way both Bookshare and Open Library would benefit. Does
> the information that Bookshare books can be searched inside the text
> make that sound a bit more feasible? Questions about word coordinates
> are not something I would be able to answer though.
>
> On 12/30/2011 1:12 AM, Janusz S. Bień wrote:
>> On Thu, 29 Dec 2011 Edward Betts <[email protected]> wrote:
>>
>>> We don't currently have a system for recording the quality of the OCR or
>>> correcting mistakes.
>>>
>>> As you point out the OCR doesn't properly handle blackletter type.
>> There is a solution to it, but it is expensive:
>>
>> http://www.frakturschrift.com/
>>
>>> A system for correcting OCR is often requested, conceptually it is quite
>>> simple.
>> But not in practice...
>>
>>> Just a web page that shows the page image and a way to edit the
>>> text. We are keen to maintain page coordinate information for each word so
>>> that we can highlight words in the book reader and search inside. This
>>> makes the problem more difficult.
>>>
>>> We would like to build a correction system, but we don't have the resources.
>> Building such a system seems to be a goal of several projects, but I
>> haven't yet found anything satisfactory for my purposes. The IMPACT
>> project developed a system that looks nice but again it is probably
>> quite expensive:
>>
>> http://www.digitisation.eu/index.php?id=109
>>
>> Best regards
>>
>> Janusz
>>
> _______________________________________________
> Ol-discuss mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
> To unsubscribe from this mailing list, send email to
> [email protected]
