This message was prompted by the OL blog post "Why Computers Can't Do The Job."
Over the past few years, I've been developing FromThePage, an open-source system for crowdsourcing (or "nerd-sourcing") the transcription of hand-written, free-form text. A couple of months ago, I was approached by a museum that wanted to use my software to transcribe a set of diaries they'd scanned. The material looked like a good fit for my software (beta.fromthepage.com), but when I started exploring, I learned that they were already hosting the material on Archive.org. The image scans were very well done, but of course the OCR versions of the hand-written diary entries were garbage.

To my delight, I found that the OpenLibrary BookReader was open-source, and that Archive.org content was easily embeddable in third-party applications like mine. Furthermore, the dev team was responsive to questions, helpful, and enthusiastic about my mash-up. At this point, I think I've got everything I need to get the museum up and running with my transcription software and Archive.org's scanning and hosting. Best of all, the BookReader/Archive.org combination is far superior to my own hand-rolled Ajax/ImageMagick implementation of zoom, so my users' experience is improved by switching to IA hosting.

I'd like to explore the possibilities here and to learn more about the directions Archive.org is going. In my dream world, my application would get out of the business of storing and serving images altogether. People who wanted to use FromThePage to transcribe texts would first upload them to the Internet Archive, then use FromThePage to transcribe the text via BookReader. When transcriptions were saved, they'd be posted back to Archive.org to update the OCR text associated with each page, which would allow all the spiffy e-book and PDF stuff to serve transcriptions rather than garbage.

However, so far as I can tell, the only mechanism for updating OCR is to re-upload a work via Project Gutenberg. This seems unlikely to be a good solution for manuscript material, since ideally you never divorce the transcription from the original handwritten text. Are there any plans to allow OCR correction -- especially in the case of manuscripts?

Another obstacle to this distant goal is that I haven't yet been able to reproduce the creation of DjVu files by Archive.org when I upload scanned pages of my own to test books in the community text collection. I suspect that this is more a technical question than a strategic one, and would appreciate a pointer to the right forum for it.

Finally, I wonder whether my vision squares with OL's plans for handwritten material. I'm entirely new to this list--and to the Internet Archive in general--and so I worry that you may already have some plans for manuscripts that this would clash with.

Ben Brumfield
http://manuscripttranscription.blogspot.com/
http://beta.fromthepage.com/
_______________________________________________
Ol-discuss mailing list
[email protected]
http://mail.archive.org/cgi-bin/mailman/listinfo/ol-discuss
To unsubscribe from this mailing list, send email to [email protected]
