Hi all,

I'm in the process of looking at various technologies to implement a "digital object repository". The concept, and our current implementation, come from http://www.fedora-commons.org/. A digital object repository is a store for managing objects, where each object consists of an XML file that describes the object's structure and carries its metadata, plus one or more binary files as datastreams.

As an example, take an image object: the FOXML (Fedora Object XML) file records the location and type of each datastream, includes Dublin Core and MODS metadata in their own namespaces, and includes some RDF/XML that describes the object's relationships to other objects (e.g. isMemberOf collection). The datastreams for the image object are a thumbnail-sized image, a screen-sized image (roughly 300 x 400), and the original image at full resolution. Images are not the only content type handled by the software; pretty much anything can be managed by the repository: PDFs, audio, video, XML, text, MS Office documents, whatever you want.
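To give a sense of the shape of the metadata, here is roughly how I picture one image object landing in a single document, written out as a Python dict. This is just my own sketch; the field names and values are invented, not anything Fedora defines, and I'm assuming the MODS and RDF bits could either be broken down into JSON or kept as verbatim XML strings:

    # One image object's metadata as a single JSON-style document.
    # All field names and values are my own invention, purely illustrative.
    image_object = {
        "_id": "demo:12345",              # the object's PID as the document id
        "type": "image",
        "dc": {                           # Dublin Core, flattened into JSON
            "title": "Example photograph",
            "creator": "Imaging Lab",
            "date": "2008-04-01",
        },
        "mods_xml": "<mods>...</mods>",   # or keep the original XML verbatim
        "rels": {
            "isMemberOf": "demo:collection1",   # the RDF relationships, simplified
        },
        "datastreams": {
            "thumbnail": {"mime": "image/jpeg", "size": 4096,
                          "attachment": "thumb.jpg"},   # small enough to attach?
            "screen":    {"mime": "image/jpeg", "size": 180000,
                          "attachment": "screen.jpg"},
            "original":  {"mime": "image/tiff", "size": 52428800,
                          "url": "http://fileshare.example.edu/img/12345.tif"},
        },
    }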
The repository software provides access control, and it exposes APIs (SOAP and, to a more limited extent, REST) for managing objects, their metadata, and their binary datastreams. The XML is stored locally on the file system, and the datastreams can either be stored locally or referenced by HTTP.

The problem with the software is that it has a great architectural vision, but the implementation is of variable quality. There are lots of little pieces, and many of them are not written with best practices in mind, or have had no exposure to real-world environments, and the code reflects that. Plus, my days of slinging Java and enjoying it are long since past.

Our current implementation consists of a Java front end plus the repository on the back end. We have approximately 40 GB of images stored in the repository at the moment, from our pilot project, and four other departments want to use the software, either in a group repository or in a dedicated repository of their own. The most intimidating project is one that currently has 20+ TB of images and anticipates creating and ingesting 240+ GB more per day once it is in full swing. We don't really expect to ingest that much data directly into the repository, as our network would be a major bottleneck: the lab that creates the data is physically located a good distance from our data center, and those images are already being transferred once to a file share at the data center. If we continue with our current back end, we'll likely put a web server in front of the file share and reference the images by HTTP rather than transferring them again into the repository's storage.

Anyway, that's my current use case and my next use case. I know that CouchDB isn't finished yet, and hasn't been optimized yet, but I'd appreciate any opinions on the following:

- Would CouchDB be a reasonable fit for managing the metadata associated with each object?
- Would it be practical to store the binary datastreams in CouchDB itself, and if so, up to what size or throughput limit?
- Or would it be better to store the datastreams externally and use CouchDB to manage the metadata and access control?
- Looking down the road, are there plans for CouchDB's development that would improve its fitness for this purpose?

Thanks very much for any insight you can share,

---Peter Herndon
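P.S. To make the datastream questions a bit more concrete, here is the kind of thing I have in mind, sketched in Python against what I understand of CouchDB's HTTP API. The database name, PID, filenames, and file-share URL are all placeholders: store the metadata document under the object's PID, attach only the small thumbnail, and leave the full-resolution file on the file share, referenced by URL.

    import json
    import urllib.request

    COUCH = "http://localhost:5984/repository"   # assumes this database exists

    def put(url, body, content_type):
        """PUT a body to CouchDB and return the parsed JSON response."""
        req = urllib.request.Request(url, data=body, method="PUT",
                                     headers={"Content-Type": content_type})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # 1. Store the metadata document under the object's PID.
    doc = {"type": "image",
           "dc": {"title": "Example photograph"},
           "datastreams": {
               "original": {"mime": "image/tiff",
                            "url": "http://fileshare.example.edu/img/12345.tif"}}}
    result = put(COUCH + "/demo:12345", json.dumps(doc).encode(), "application/json")

    # 2. Attach only the small derivative; the full-resolution TIFF stays on
    #    the file share and is reached through the "url" field above.
    with open("thumb.jpg", "rb") as f:
        put(COUCH + "/demo:12345/thumb.jpg?rev=" + result["rev"],
            f.read(), "image/jpeg")

My worry is whether attachments of even tens of megabytes, let alone the full-resolution originals, are something CouchDB handles gracefully today, which is really what the size/throughput question above is getting at.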