A medium-sized museum might conservatively have 5,000 boxes of archival material with, let's say very conservatively, 2,000 pages per box, or 10 million pages to digitize, many of which are fragile or subject to copyright. Many organizations this size struggle to have folder-level documentation on their collection of boxes. So let an intern start scanning... let's say they are amazing and they can scan 500 pages a day, 5 days a week. At that amazing rate you will be done in about 75 years, if your institution does not create another 10 million pages in those 75 years.
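For anyone who wants to check the back-of-the-envelope numbers above, here is a minimal sketch. The box, page, and scanning-rate figures are the ones quoted above; the 52 working weeks per year (no holidays, no downtime) is my assumption, and the function name is only illustrative.

```python
# Rough check of the single-intern scanning estimate.
BOXES = 5_000
PAGES_PER_BOX = 2_000
PAGES_PER_DAY = 500
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52  # assumption: scanning every week, no holidays or downtime


def years_to_scan(total_pages, pages_per_day, days_per_week, weeks_per_year):
    """Return how many years one person needs to scan total_pages."""
    pages_per_year = pages_per_day * days_per_week * weeks_per_year
    return total_pages / pages_per_year


total_pages = BOXES * PAGES_PER_BOX
years = years_to_scan(total_pages, PAGES_PER_DAY, DAYS_PER_WEEK, WEEKS_PER_YEAR)
print(f"{total_pages:,} pages, about {years:.0f} years for one person")
```

With these inputs the result is roughly 77 years, which is in line with the "about 75 years" figure; allowing for vacation or equipment downtime only makes it longer.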
However, let's say you invest in 6 staff for 2 years and find the most important documents in the 10 million pages... let's say they find 50,000 of them. You scan them, organize them, catalog them, and get permission to republish them online. This results in something that the institution, scholars, and the public can use immediately.

I think the second scenario is how a museum person would approach the problem you are trying to solve. The best solution is probably somewhere in the middle, but it inevitably requires staff support to succeed.

Rich

-----Original Message-----
From: mcn-l-bounces at mcn.edu [mailto:[email protected]] On Behalf Of Christopher J. Mackie
Sent: Friday, August 08, 2008 2:54 PM
To: mcn-l at mcn.edu
Subject: Re: [MCN-L] RESPONSES: Low-cost digitization rig (Cherry, Rich)

Rich; responses to your questions/concerns inline.

----------------------------------------------------------------------
From: "Cherry, Rich" <[email protected]>

<snip>

Who leads the project (gets the institution behind it, finds the free labor source, finds space, organizes tasks and manages a schedule)?

> One goal is to reduce the costs, space, and skill demands substantially enough that this becomes far less challenging. The software has inherent workflow and project-management capabilities.

Who selects the material? Who moves the material to the location for scanning (is the free labor a security issue)? Who reviews the material to see if there are copyright issues?

> All good questions. Remember that we're not trying to reproduce the Million Book Project; the goal is to help with lots of small collections, for which, taken individually, these questions are not impossibly intimidating.

Who proofs the final product to see if errors were made?

> The software supports real-time QA for common errors; some additional work might be required, presumably by a staffer. How much work that is depends on the quality of the source, etc. The new software should reduce the QA load as compared to anything else we've seen. If you're scanning ordinary books of reasonable quality, the staffer's effort should be minimal. Fixing OCR is, of course, another story.

Where will the product live when the funding for online archives disappears?

> The product will support one-button archiving online; if you're OK with Internet Archive as a host, this problem is solved. Proprietary content is your problem.

If there is no cataloging for access other than the OCR, is the only use a huge repository of unconnected individual pages? Or, if it's books and collections, who catalogs them and connects them?

> The software automatically structures documents, including books and collections; one of its improvements over commercial OCR (both accuracy and usability) is that it's *designed* for compound docs, as well as individual pages. How much human effort is required is a function of how much individualized metadata entry you want to do; the system will automate all the batch stuff, but if you want to mark up each word, you can.

I do think that the online archive piece might move a few organizations closer to doing it. It might even be more attractive if some of the OCR processing took place there as well. Is part of the plan to use something like Amazon Web Services for this?

> We've been talking about this. It's possible one of the 'big' digitizers might be willing to do remote OCR--but we're focusing on small projects, and the OCR runs fine on a laptop, so I'm not sure why this is necessary. Remember that we're not trying to put Google out of business; we're trying to help with materials that wouldn't make it to Google or IA on their own.
--Chris

_______________________________________________
You are currently subscribed to mcn-l, the listserv of the Museum Computer Network (http://www.mcn.edu)

To post to this list, send messages to: mcn-l at mcn.edu

To unsubscribe or change mcn-l delivery options visit:
http://toronto.mediatrope.com/mailman/listinfo/mcn-l
