Re: [fcrepo-user] Fedora Commons for Large Datasets with Thousands of Files

Chris Wilper Fri, 28 Jan 2011 08:38:06 -0800

Hi Jamey,

On Tue, Jan 25, 2011 at 6:12 PM, Wood, Jamey <jamey.w...@nrel.gov> wrote:
> Hello,
>
> I'm trying to understand how Fedora Commons might be applied to managing 
> datasets that:
>
>  * May each consist of several thousand individual files


Such datasets would be best modeled as multiple Fedora objects, with
relationships defined between them.  Fedora has no hard limit on the
number of datastreams that can be stored in each object, but there are
practical limits (e.g. memory) that suggest you should stick to a
relatively small number of datastreams per object.  Combined with the
modeling flexibility you get by following an "atomistic" approach, I'd
suggest keeping it to fewer than a dozen per object.  One popular
approach is to have a single primary stream per object, with a few
datastreams that act as metadata for the object.

>  * May have files organized in some meaningful
> hierarchical directory structure (e.g. "type1/subtype1/file1.csv")

If you choose to have Fedora manage the datastreams (control group =
"M"), you lose control over how the paths are allocated in storage.
If tight control over the paths is needed, you can use externally
referenced datastreams instead (control group "E").  With control
group "E", the content (or just the location) can still be accessed
through the Fedora APIs, but you have to make any needed modifications
to it out of band.
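
To make the "E" option a bit more concrete, here is a minimal Python
sketch of how an existing directory tree could be turned into a list of
external locations that preserve the hierarchy.  The function name, the
record layout, and the base URL are assumptions made up for illustration;
nothing here calls a Fedora API, it only computes the URLs you would
then register as externally referenced datastreams.

```python
import os
from urllib.parse import quote


def external_refs(root_dir, base_url):
    """Map every file under root_dir to a URL that keeps its relative path.

    The returned dicts are a hypothetical record layout, not a Fedora
    structure: "control_group" is "E" to signal that the content stays
    where it is and Fedora would only hold the reference.
    """
    refs = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root_dir)
            # Use forward slashes and percent-encode each path segment.
            rel_url = "/".join(quote(part) for part in rel.split(os.sep))
            refs.append({
                "path": rel_url,
                "control_group": "E",
                "location": base_url.rstrip("/") + "/" + rel_url,
            })
    # Sort so the result does not depend on directory listing order.
    return sorted(refs, key=lambda r: r["path"])
```

Because the paths live outside Fedora, a script like this would need to
be re-run (and the affected objects updated) whenever files move.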

>  * Would benefit from some form of "whole-object" versioning
> (along the lines used by the eSciDoc project [1])

I think the approach described in that paper has worked well for the
eSciDoc project.  As mentioned in the paper, Fedora's built-in
versioning is only at the datastream level, but more powerful,
higher-order versioning can be done through the use of special
relationships that are understood by your application.
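
As a rough illustration of what "relationships understood by your
application" could mean, the Python sketch below treats each published
revision of a dataset as its own object that points back at its
predecessor.  This is not the eSciDoc scheme and not anything built into
Fedora; the class, the "previous" link, and the PIDs are all
hypothetical, and the data lives in a plain dict rather than a
repository.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    pid: str                        # hypothetical PID of this revision
    location: str                   # top-level URL of this revision's data
    previous: Optional[str] = None  # PID of the revision it supersedes


def history(versions: Dict[str, DatasetVersion], latest_pid: str) -> List[str]:
    """Walk the application-level 'previous' links, newest first."""
    chain: List[str] = []
    seen = set()
    pid: Optional[str] = latest_pid
    while pid is not None:
        if pid in seen:
            raise ValueError("cycle in version links at " + pid)
        seen.add(pid)
        version = versions[pid]
        chain.append(version.pid)
        pid = version.previous
    return chain
```

In a real repository the "previous" link would be stored as a
relationship on the object, and only your application would know to
interpret it as a version chain.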

> [...]

> One possibility I'm wondering about would be to just create
> some kind of top-level Fedora Commons object that has a
> pointer to the top-level data location (URL), but doesn't
> attempt to track individual files within the dataset.  Then if
> a new revision of the dataset is published, that top-level
> URL pointer might be directed to some new location.
> Is this a reasonable approach?
> Or would it be considered bad practice?

That's certainly a lightweight approach.  Whether it's appropriate
depends on how you plan to use Fedora.  If you just want Fedora to act
as a "registry" of your datasets (so you can describe and work with
them at the dataset level only), I think it sounds like a reasonable
approach.

On the other hand, if you want to be able to describe the individual
files within each dataset, I'd recommend having a Fedora object for
each (in which you can record fixity, format, and other important
metadata), and pointing to each using a URL (I'm assuming you'd opt
for the "E" control group).  Then they could each be related to the
dataset object via the RELS-EXT datastream.  This would open up more
options, but requires a bit more thought on how to model the
components of the dataset within Fedora, and also means you'll need to
come up with a strategy for updating the Fedora objects when/if the
individual files change (either in location or content).
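
For the per-file option, the link from a file object to its dataset
object is a small piece of RDF/XML in RELS-EXT.  The Python sketch below
builds such a document with the standard library only.  The PIDs are made
up, and the choice of the isMemberOf predicate (from Fedora's
relations-external namespace) is an assumption for illustration; another
predicate may suit your model better.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid, dataset_pid):
    """Return RELS-EXT RDF/XML saying file_pid isMemberOf dataset_pid."""
    # Register prefixes so the serialized XML uses rdf: and rel:.
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)

    rdf = ET.Element("{%s}RDF" % RDF_NS)
    # The subject of the statements is the file object itself.
    desc = ET.SubElement(
        rdf,
        "{%s}Description" % RDF_NS,
        {"{%s}about" % RDF_NS: "info:fedora/" + file_pid},
    )
    # One relationship pointing at the (hypothetical) dataset object.
    ET.SubElement(
        desc,
        "{%s}isMemberOf" % REL_NS,
        {"{%s}resource" % RDF_NS: "info:fedora/" + dataset_pid},
    )
    return ET.tostring(rdf, encoding="unicode")
```

Calling build_rels_ext("demo:file1", "demo:dataset1") gives the XML you
would store as the RELS-EXT datastream of the file object; generating it
from a script also helps with the update strategy mentioned above.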

- Chris

