Rich; responses to your questions/concerns inline.
----------------------------------------------------------------------
From: "Cherry, Rich" <[email protected]>
<snip>
Who leads the project (gets the institution behind it, finds the free
labor source, finds space, organizes tasks and manages a schedule)?
> One goal is to reduce the costs, space, and skill-demands
substantially enough that this becomes far less challenging. The
software has built-in workflow and project management capabilities.
Who selects the material? Who moves the material to the location for
scanning (is the free labor a security issue)? Who reviews the material
to see if there are copyright issues?
> All good questions. Remember that we're not trying to reproduce the
Million Book Project; the goal is to help with lots of small
collections, for which, taken individually, these questions are not
impossibly intimidating.
Who proofs the final product to see if errors were made?
> The software supports real-time QA for common errors; some additional
work might be required, presumably by a staffer. How much work that is
depends on the quality of the source, etc. The new software should
reduce the QA load as compared to anything else we've seen. If you're
scanning ordinary books of reasonable quality, the staffer's effort
should be minimal. Fixing OCR is, of course, another story.
Where will the product live when the funding for online archives
disappears?
> The product will support one-button archiving online; if you're OK
with Internet Archive as a host, this problem is solved. Proprietary
content is your problem.
If there is no cataloging for access other than the OCR, is the only use
a huge repository of unconnected individual pages? Or, if it's books and
collections, who catalogs them and connects them?
> The software automatically structures documents, including books and
collections; one of its improvements over commercial OCR (both accuracy
and usability) is that it's *designed* for compound docs, as well as
individual pages. How much human effort is required is a function of how
much individualized metadata entry you want to do; the system will
automate all the batch stuff, but if you want to mark up each word, you
can.
I do think that the online archive piece might move a few organizations
closer to doing it. It might even be more attractive if some of the OCR
processing took place there as well. Is part of the plan to use
something like Amazon Web Services for this?
> We've been talking about this. It's possible one of the 'big'
digitizers might be willing to do remote OCR--but we're focusing on
small projects, and the OCR runs fine on a laptop, so I'm not sure why
this is necessary. Remember that we're not trying to put Google out of
business; we're trying to help with materials that wouldn't make it to
Google or IA on their own.
--Chris