I've been following this discussion, and for what it's worth, I must say
that I fully support those who have emphasized the labor issues. There's
no lack of software (or hardware) solutions. The REAL problem is capable
manpower, and what it costs. That goes for museums of any size.
 
The "one-button, anyone can do it" idea is usually misleading. Before and after 
that "for dummies" methodology can be used, a lot of professional staff has to 
do a lot of work -- there is a huge "labor bottleneck" involved in any project 
of this kind.  That costs time and money, and that is the real problem in 
search of a funding solution.
 
 
Amalyah Keshet
Head of Image Resources & Copyright Management
The Israel Museum, Jerusalem
 
 
________________________________

From: [email protected] on behalf of Cherry, Rich
Sent: Fri 08/08/2008 00:49
To: Museum Computer Network Listserv
Subject: Re: [MCN-L] RESPONSES: Low-cost digitization rig

I think your answers address some of the labor issues, but not all. I
don't want to sound like I am against this (I am not), but I think there
are some major practical issues that in most cases cannot be addressed
with interns and volunteers:

Who leads the project (gets the institution behind it, finds the free
labor source, finds space, organizes tasks and manages a schedule)?
Who selects the material?
Who moves the material to the location for scanning (and is the free
labor a security issue)?
Who reviews the material to see if there are copyright issues?
Who proofs the final product to see if errors were made?
Where will the product live when the funding for online archives
disappears?
If there is no cataloging for access other than the OCR, is the only
result a huge repository of unconnected individual pages? And if it's
books and collections, who catalogs them and connects them?

I think some institutions will find a way to accomplish the tasks above,
and the product described below would help them. But my guess is that if
they can get the buy-in required to expend the effort listed above, very
few would be stopped by software that is a little expensive or hard to
use.

To put it another way: I don't think there are a lot of institutions
sitting on the fence waiting for a low-cost software solution to the
problem. They are trying to figure out the challenges listed above.

I do think that the online archive piece might move a few organizations
closer to doing it.  It might even be more attractive if some of the OCR
processing took place there as well.

Is part of the plan to use something like Amazon Web Services for this?

Rich


Rich Cherry
Director of Operations
Skirball Cultural Center
2701 N. Sepulveda Blvd.
Los Angeles, CA 90049
Work: (310) 440-4777
Fax: (310) 440-4595
rcherry at skirball.org



-----Original Message-----
From: mcn-l-bounces at mcn.edu [mailto:[email protected]] On Behalf Of
Christopher J. Mackie
Sent: Thursday, August 07, 2008 8:13 AM
To: mcn-l at mcn.edu
Subject: [MCN-L] RESPONSES: Low-cost digitization rig

Thanks to everyone who responded to my query yesterday about
digitization rigs. The responses were extremely helpful, and I am most
grateful.

There were so many responses, on- and off-list, some of which included
several back-and-forths, that I'm not going to attempt a summary of the
questions and concerns; instead, let me respond by providing what I hope
will be a clearer and more comprehensive explanation of what we're
talking about (with my apologies for not making all of it clearer the
first time :-).
       
1) The focus of the project is on the software, not the hardware. We're
using off-the-shelf, consumer-grade hardware so that we don't have to
think about hardware at all; the whole reason to create the software
now, as some of you noted, is that the hardware has reached the point
where it's no longer the bottleneck. The sub-$2,000 configuration we
describe will produce 600dpi output, for example--and runs fast enough
that the page-turner, not the imaging hardware, is the throughput
bottleneck. (For those who really want to know, it's a two-camera,
stereo rig that shoots both pages of a book at once and uses the stereo
plus some other tricks to dewarp and align text.)
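
For those who want a feel for what the dewarping step involves, here's
a minimal sketch -- illustrative only, not our code. It assumes Python
with the OpenCV bindings and that the four page corners have already
been detected; the real pipeline uses the stereo depth data to correct
true page curvature rather than a flat perspective transform.

    # Illustrative sketch only: a flat perspective "dewarp" of a
    # photographed page. The actual system uses stereo depth data,
    # which this does not attempt.
    import cv2
    import numpy as np

    def dewarp_page(image, corners):
        # corners: four (x, y) points in the source image, ordered
        # top-left, top-right, bottom-right, bottom-left
        width, height = 2400, 3600  # roughly a 4x6in page at 600dpi
        src = np.float32(corners)
        dst = np.float32([[0, 0], [width, 0],
                          [width, height], [0, height]])
        matrix = cv2.getPerspectiveTransform(src, dst)
        return cv2.warpPerspective(image, matrix, (width, height))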

2) The software will be, literally, one-button. Anyone who can turn a
page and press a button will be able to use it--no training, no special
skills required. Someone will have to assemble the rig (we anticipate a
one-sheet instruction set, with room for many pictures :-), but once
it's up, it will keep itself in calibration and perform all of the
ancillary tasks of digitizing--dewarping, aligning, etc.--without user
input. It will auto-structure the text, but there will also be
opportunities for user interaction.

3) The PDF output is the only *final* output we have discussed, and it's
required for the 'flowing' of text for reader device independence. But
the system preserves interim formats as well, including the original
scanned TIFFs and the interim hOCR format, so getting data into other
formats will be relatively straightforward.
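
As a concrete illustration of why keeping hOCR matters: hOCR is just
HTML with the OCR geometry stored in class and title attributes, so
getting the recognized text into another format needs nothing more
than an HTML parser. A quick sketch in Python (the file name is
invented, and this is only a sketch, not part of the system):

    # Sketch: extract the recognized text, line by line, from an hOCR
    # file. hOCR marks each text line with class="ocr_line"; we just
    # collect the character data inside those elements.
    from html.parser import HTMLParser

    class HocrTextExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.depth = 0    # > 0 while inside an ocr_line element
            self.buf = []
            self.lines = []

        def handle_starttag(self, tag, attrs):
            if self.depth:
                self.depth += 1
            elif 'ocr_line' in (dict(attrs).get('class') or ''):
                self.depth = 1
                self.buf = []

        def handle_endtag(self, tag):
            if self.depth:
                self.depth -= 1
                if self.depth == 0:
                    self.lines.append(' '.join(''.join(self.buf).split()))

        def handle_data(self, data):
            if self.depth:
                self.buf.append(data)

    extractor = HocrTextExtractor()
    extractor.feed(open('page_001.hocr', encoding='utf-8').read())
    print('\n'.join(extractor.lines))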

4) The metadata to be input will be at the discretion of the institution
doing the digitizing, and can be applied at the level of the individual
image or collection of images (e.g., the book, or even a collection of
books).
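
To make that inheritance concrete, here is a purely hypothetical
example -- the schema itself will be whatever the institution chooses:

    # Hypothetical example only. Collection-level metadata applies to
    # every page beneath it unless overridden at the image level.
    collection = {
        'title': 'Annual Reports, 1890-1920',
        'rights': 'Public domain',
        'language': 'en',
    }
    page = {
        'file': 'page_001.tif',
        'sequence': 1,
        # title, rights, and language are inherited from the collection
    }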

5) The software has a service-oriented architecture (SOA) and is also
scriptable, so it can easily be
decomposed and extended (e.g., it will use the OCRopus open-source OCR
system, which can itself be implemented as a set of services). We're
going to build a seamless wrapper for institutions that don't have the
tech capacity to do their own customization, but that won't prevent
institutions that have the chops from doing whatever they want with it.
As you might suspect from this, by the way, the software will be
web-based and multi-user; this means that a larger institution could set
up several volunteer-operated rigs, all feeding one staffer who's doing
QA and adding/reviewing metadata.
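
To make "decomposed into services" a bit more concrete, here is a
hypothetical sketch of a single pipeline step wrapped as an HTTP
service -- the endpoint, port, and payload are invented for
illustration and are emphatically not the project's actual API:

    # Hypothetical sketch: one pipeline step (a placeholder "dewarp")
    # exposed over HTTP so that steps can be scripted, recombined, or
    # swapped out. Nothing here is the real interface.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class DewarpService(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get('Content-Length', 0))
            image_bytes = self.rfile.read(length)
            result = image_bytes  # a real service would dewarp here
            self.send_response(200)
            self.send_header('Content-Type', 'application/octet-stream')
            self.end_headers()
            self.wfile.write(result)

    if __name__ == '__main__':
        HTTPServer(('localhost', 8080), DewarpService).serve_forever()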

6) The system will include automated open archiving online (i.e., not at
institutional expense) for both the raw images and the finished outputs
(perhaps also the interim formats), with a stable, permanent URL;
institutions that want to host locally will be able to do that
instead/in addition. To respond in particular to Joe's point about the
Internet Archive: we've talked with IA already. I don't want to speak
for them, but it's safe to say that we've taken their views into account
and I think they'll be very pleased by what's delivered, should we move
forward.
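
As a sketch of what the automated deposit might look like from the
institution's side -- the endpoint and response format here are
invented, since none of this is settled yet:

    # Hypothetical sketch: push a finished file to the archiving
    # service and get back its stable, permanent URL. The endpoint is
    # invented for illustration.
    import urllib.request

    def deposit(path, endpoint='https://archive.example.org/deposit'):
        with open(path, 'rb') as f:
            data = f.read()
        req = urllib.request.Request(
            endpoint, data=data,
            headers={'Content-Type': 'application/octet-stream'})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode('utf-8')  # the permanent URL

    # e.g.: print(deposit('book_final.pdf'))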

7) The model we're thinking about is a "Long Tail" model; while it's
entirely possible that even 'big' digitizers might benefit from the
software, our primary purpose is to extend the capability to small
institutions with small collections. In the model we envision, these
smaller institutions won't be devoting staff--or anyone--full-time to
the project, but will be digitizing using opportunistic labor
(volunteers, interns, min-wage teens, etc.) as available. Staff will
most likely only engage in small quantities, for QA and metadata
purposes. This model trades time for money: it removes the labor
bottleneck by allowing an institution to pay nearly nothing as long as
it's willing to wait a while for the output, instead of getting results
quickly but paying a lot for them. In short, we're envisioning a very
different kind of 'scalability' than the one(s) that many of you
mentioned.

8) Several people inquired about the flowing PDFs. The technology will
be fully open-source, so it will also be usable by projects other than
this one.

9) One thing nobody mentioned, because I didn't mention it in the first
place: because this project will use OCRopus, it has the potential to be
used for texts in many languages (including math and chem formulae), not
just English. How quickly and well that capability emerges will depend
on how big and multi-lingual a community forms around OCRopus, but at
least, unlike many commercial OCR products, there's hope....

As for the questions I asked originally, here's what I think I heard:

1) Setting aside the concerns about whether/how the technology will
work, many of you feel that this is likely to be quite valuable. Of
those who do, most of you think that small institutions will be willing
to publish most of the materials openly, provided they are supported
(with hosting, etc.) in doing so.

2) Quite a few of you remain skeptical that even a low-cost rig will
enable small institutions to digitize, because you believe that labor
remains the true bottleneck. I'd be interested to know if that remains
true even in the context of the "Long Tail" model I described in (7),
above, or whether you were thinking implicitly about a 'big'
digitization project on fixed timelines.

Again, thanks *very* much to everyone who took the trouble to respond.
We're still quite a ways from a final decision, so if you think of
anything else you'd like to share, please don't hesitate to get in
touch.

All best,  --Chris

Christopher J. Mackie
Associate Program Officer
Research in Information Technology
The Andrew W. Mellon Foundation
--
282 Alexander Rd.
Princeton, NJ 08540
--
140 E. 62nd St.
New York, NY 10065
--
+1 609.924.9424 (office: GMT - 5:00)
+1 609.933.1877 (mobile)
+1 646.274.6351 (fax)
cjmackie06 @ AIM
cjmackie5 @ Yahoo
--
http://rit.mellon.org; http://www.mellon.org

_______________________________________________
You are currently subscribed to mcn-l, the listserv of the Museum
Computer Network (http://www.mcn.edu)

To post to this list, send messages to: mcn-l at mcn.edu

To unsubscribe or change mcn-l delivery options visit:
http://toronto.mediatrope.com/mailman/listinfo/mcn-l