Hi all,

I'm in the process of looking at various technologies to implement a "digital object repository". The concept, and our current implementation, come from http://www.fedora-commons.org/. A digital object repository is a store for managing objects, where each object consists of an XML file that describes the object's structure and carries its metadata, plus one or more binary files as datastreams.

As an example, take an image object: the FOXML (Fedora Object XML) file records the location and type of each datastream, includes Dublin Core and MODS metadata in their own namespaces, and includes some RDF/XML that describes the object's relationships to other objects (e.g. isMemberOf collection). The datastreams for the image object are a thumbnail-sized image, a screen-sized image (roughly 300 x 400), and the original image at full resolution. Images are not the only content type handled by the software; pretty much anything can be managed by the repository: PDFs, audio, video, XML, text, MS Office documents, whatever you want.
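To give a sense of the shape of the metadata, here is roughly how I picture one image object landing in a single document, written out as a Python dict. This is just my own sketch; the field names and values are invented, not anything Fedora defines, and I'm assuming the MODS and RDF bits could either be broken down into JSON or kept as verbatim XML strings:

    # One image object's metadata as a single JSON-style document.
    # All field names and values are my own invention, purely illustrative.
    image_object = {
        "_id": "demo:12345",              # the object's PID as the document id
        "type": "image",
        "dc": {                           # Dublin Core, flattened into JSON
            "title": "Example photograph",
            "creator": "Imaging Lab",
            "date": "2008-04-01",
        },
        "mods_xml": "<mods>...</mods>",   # or keep the original XML verbatim
        "rels": {
            "isMemberOf": "demo:collection1",   # the RDF relationships, simplified
        },
        "datastreams": {
            "thumbnail": {"mime": "image/jpeg", "size": 4096,
                          "attachment": "thumb.jpg"},   # small enough to attach?
            "screen":    {"mime": "image/jpeg", "size": 180000,
                          "attachment": "screen.jpg"},
            "original":  {"mime": "image/tiff", "size": 52428800,
                          "url": "http://fileshare.example.edu/img/12345.tif"},
        },
    }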
The repository software provides access control, and it exposes APIs (SOAP and, to a more limited extent, REST) for managing objects, their metadata, and their binary datastreams. The XML is stored locally on the file system, and the datastreams can either be stored locally or referenced by HTTP.

The problem with the software is that it has a great architectural vision, but the implementation is of variable quality. There are lots of little pieces, and many of them are not written with best practices in mind, or have had no exposure to real-world environments, and the code reflects that. Plus, my days of slinging Java and enjoying it are long since past.

Our current implementation consists of a Java front end plus the repository on the back end. We have approximately 40 GB of images stored in the repository at the moment, from our pilot project, and four other departments want to use the software, either in a group repository or in a dedicated repository of their own. The most intimidating project is one that currently has 20+ TB of images and anticipates creating and ingesting 240+ GB more per day once it is in full swing. We don't really expect to ingest that much data directly into the repository, as our network would be a major bottleneck: the lab that creates the data is physically located a good distance from our data center, and those images are already being transferred once to a file share at the data center. If we continue with our current back end, we'll likely put a web server in front of the file share and reference the images by HTTP rather than transferring them again into the repository's storage.

Anyway, that's my current use case and my next use case. I know that CouchDB isn't finished yet, and hasn't been optimized yet, but I'd appreciate any opinions on the following:

- Would CouchDB be a reasonable fit for managing the metadata associated with each object?
- Would it be practical to store the binary datastreams in CouchDB itself, and if so, up to what size or throughput limit?
- Or would it be better to store the datastreams externally and use CouchDB to manage the metadata and access control?
- Looking down the road, are there plans for CouchDB's development that would improve its fitness for this purpose?

Thanks very much for any insight you can share,

---Peter Herndon
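P.S. To make the datastream questions a bit more concrete, here is the kind of thing I have in mind, sketched in Python against what I understand of CouchDB's HTTP API. The database name, PID, filenames, and file-share URL are all placeholders: store the metadata document under the object's PID, attach only the small thumbnail, and leave the full-resolution file on the file share, referenced by URL.

    import json
    import urllib.request

    COUCH = "http://localhost:5984/repository"   # assumes this database exists

    def put(url, body, content_type):
        """PUT a body to CouchDB and return the parsed JSON response."""
        req = urllib.request.Request(url, data=body, method="PUT",
                                     headers={"Content-Type": content_type})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # 1. Store the metadata document under the object's PID.
    doc = {"type": "image",
           "dc": {"title": "Example photograph"},
           "datastreams": {
               "original": {"mime": "image/tiff",
                            "url": "http://fileshare.example.edu/img/12345.tif"}}}
    result = put(COUCH + "/demo:12345", json.dumps(doc).encode(), "application/json")

    # 2. Attach only the small derivative; the full-resolution TIFF stays on
    #    the file share and is reached through the "url" field above.
    with open("thumb.jpg", "rb") as f:
        put(COUCH + "/demo:12345/thumb.jpg?rev=" + result["rev"],
            f.read(), "image/jpeg")

My worry is whether attachments of even tens of megabytes, let alone the full-resolution originals, are something CouchDB handles gracefully today, which is really what the size/throughput question above is getting at.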