[ol-discuss] Updating OCR from handwritten texts

Ben Brumfield Fri, 15 Oct 2010 11:33:54 -0700

This was prompted by the OL blog post "Why Computers Can't Do The Job."


Over the past few years, I've been developing FromThePage, an open-source
system for crowdsourcing (or "nerd-sourcing") the transcription of
hand-written, free-form text.  A couple of months ago, I was approached by a
museum that wanted to use my software to transcribe a set of diaries they'd
scanned.  The material looked like a good fit for my software
(beta.fromthepage.com), but when I started exploring, I learned that they
were already hosting the material on Archive.org.  The image scans were very
well done, but of course the OCR versions of the hand-written diary entries
were garbage.

To my delight, I found that the OpenLibrary BookReader was open-source, and
that Archive.org content was easily embeddable in third-party applications
like mine.  Furthermore, the dev team was responsive to questions, helpful,
and enthusiastic about my mash-up.  At this point, I think I've got
everything I need to get the museum up and running with my transcription
software and Archive.org's scanning and hosting.  Best of all, the
BookReader/Archive.org combination is far superior to my own hand-rolled
ajax/ImageMagick implementation of zoom, so my users' experience is improved
by switching to IA hosting.

I'd like to explore the possibilities here and would like to learn more
about the directions Archive.org is going.  In my dream world, my
application would get out of the business of storing and serving images
altogether.  People who wanted to use FromThePage to transcribe texts would
first upload them to the Internet Archive, then use FromThePage to
transcribe the text via BookReader.  When transcriptions were saved, they'd
be posted back to Archive.org to update the OCR text associated with each
page, which would allow all the spiffy e-book and PDF stuff to serve
transcriptions rather than garbage.

However, so far as I can tell, the only mechanism for updating OCR is to
re-upload a work via Project Gutenberg.  That seems a poor solution for
manuscript material, since ideally you never divorce the transcription from
the original handwritten text.  Are there any plans to allow OCR
correction -- especially in the case of manuscripts?

Another obstacle to this distant goal is that I haven't yet been able to
get Archive.org to create DjVu files when I upload scanned pages of my own
to test books in the community text collection.  I suspect that this is
more a technical question than a strategic one, and would appreciate a
pointer to the right forum for it.

Finally, I wonder whether my vision squares with OL's plans for handwritten
material.  I'm entirely new to this list--and to the Internet Archive in
general--and so I worry that you may already have some plans for manuscripts
that this would clash with.

Ben Brumfield
http://manuscripttranscription.blogspot.com/
http://beta.fromthepage.com/

