Re: [ol-discuss] Recording the quality of a book's OCR

Roger Loran Bailey Fri, 30 Dec 2011 08:29:36 -0800

Let me say that when it comes to technical matters, I am an ignoramus.
Earlier I asked about collaborating with Bookshare to have Bookshare
volunteers proofread Open Library books. I was asked how that
proofreading is done (with a web form or with something like Word) and
about word coordinates, and I did not exactly understand what was being
asked. I said that Bookshare volunteers download a scanned book,
proofread it, and upload it with corrections, and that the corrected
copy goes into the collection. I also said that Bookshare volunteers
commonly use Word to do this. Actually, they use whatever word
processing software they happen to have; they are not supplied with any.

However, Bookshare has its own set of tools that do conversions and
make the book a Daisy book before it is ready for download. I do not
know what kind of tools these are, but on rereading the quoted material
in this reply I noticed something I had missed before: the comment
about retaining word coordinates so that it would be possible to search
inside the book. As a matter of fact, Bookshare's search engine does
search inside the books. When you search the collection, both titles
and text from inside the books themselves are returned in the results.
I do not understand what word coordinates are, nor much about the other
technical aspects of search engines or of what makes these books
available in Daisy format, but if the Bookshare search engine searches
inside the text of the books, then perhaps Bookshare books are
compatible with Open Library books after all.

My suggestion was that scanned Open Library books could be supplied to
Bookshare to be proofread by Bookshare volunteers, then added to the
Bookshare collection, with a copy returned to Open Library to be added
to the Open Library protected Daisy collection. That way both Bookshare
and Open Library would benefit. Does the information that Bookshare
books can be searched inside the text make that sound a bit more
feasible? Questions about word coordinates are not something I would be
able to answer, though.


On 12/30/2011 1:12 AM, Janusz S. Bień wrote:
> On Thu, 29 Dec 2011  Edward Betts<[email protected]>  wrote:
>
>> We don't currently have a system for recording the quality of the OCR or
>> correcting mistakes.
>>
>> As you point out, the OCR doesn't properly handle blackletter type.
> There is a solution to it, but it is expensive:
>
>        http://www.frakturschrift.com/
>
>> A system for correcting OCR is often requested; conceptually it is quite
>> simple.
> But not in practice...
>
>> Just a web page that shows the page image and a way to edit the
>> text. We are keen to maintain page coordinate information for each word so
>> that we can highlight words in the book reader and search inside. This
>> makes the problem more difficult.
>>
>> We would like to build a correction system, but we don't have the resources.
> Building such a system seems to be a goal of several projects, but I
> haven't yet found anything satisfactory for my purposes. The IMPACT
> project developed a system that looks nice, but again it is probably
> quite expensive:
>
>       http://www.digitisation.eu/index.php?id=109
>
> Best regards
>
> Janusz
>
_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
To unsubscribe from this mailing list, send email to 
[email protected]
