Re: [fcrepo-user] Max number of datastreams of a object

Peter Cliff Tue, 21 Dec 2010 08:06:25 -0800

I do not know about M. Jallud's domain, but this brings out something the 
futureArch project at the University of Oxford have wrestled with. Here we 
needed to ingest disk images. Each image is itself a nice, self-contained file 
system. Some are small floppy disks, others are larger hard drive images and 
the like. The smallest of these (so far) is 2GB and contains several thousand 
files (I don't have the exact figure to hand, but suffice to say more than the 
number of data streams it'd be sensible to attach to a Fedora object - though 
many of those files are probably OS junk, etc.).


In an ideal world, each disk image would be appraised and the individual useful 
files extracted and put into a repository individually. In reality, sifting 
through a disk image of that number of files is about as onerous as sifting 
through a large number of boxes and so it can take time and staff and thus the 
disk image needs preserving until we get those resources to address it. 
Further, a bit-by-bit copy of the disk may contain useful research data in 
itself...

In deciding whether to use Fedora as the repository for these disk images, the 
question was how to model the image and its files in Fedora, and we thought of 
two ways:

1) Ingest the disk image and add a datastream per file, as per this thread. As 
you can imagine, that isn't a great way to use Fedora...

2) Break the image up into files, ingest each one, and create a contents list 
that records the associated file system metadata, etc. for each file. This 
seems doable, but it seems a large overhead just to use Fedora.

Which led to the conclusion that Fedora probably wasn't the tool *for this 
particular job* (don't flame me - I'm well aware of the many good uses for 
Fedora!) but this has been bugging me ever since and perhaps we're just the 
victims of a "desire to map preexisting persistence architectures"... :-)
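
For comparison, the object-per-file approach suggested in the quoted messages 
below would have each file object point back at the disk image object through 
a RELS-EXT datastream. The following is a rough sketch of generating that RDF, 
assuming the Fedora 3 relationship namespace and its isPartOf predicate; the 
PIDs and the function name rels_ext are made up for illustration.

```python
from xml.sax.saxutils import quoteattr

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace of the Fedora 3 relationship ontology (assumed here).
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def rels_ext(file_pid, image_pid):
    """Build RELS-EXT RDF/XML stating that file_pid is part of image_pid.

    Hypothetical helper: it only builds the XML string and does not
    talk to a repository.
    """
    subject = quoteattr("info:fedora/" + file_pid)
    target = quoteattr("info:fedora/" + image_pid)
    return (
        '<rdf:RDF xmlns:rdf="%s" xmlns:rel="%s">\n'
        '  <rdf:Description rdf:about=%s>\n'
        '    <rel:isPartOf rdf:resource=%s/>\n'
        '  </rdf:Description>\n'
        '</rdf:RDF>\n' % (RDF_NS, REL_NS, subject, target)
    )


# Example with made-up PIDs.
print(rels_ext("example:file-1", "example:image-1"))
```

With several thousand files per image this means several thousand small 
objects rather than one very large one, which is the trade-off under 
discussion in this thread.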

Pete Cliff
Bodleian Library

On 21 Dec 2010, at 15:15, <aj...@virginia.edu> wrote:

> That is the point I was getting at: I wonder whether M. Jallud's domain is 
> being effectively and efficiently represented in Fedora.
> 
> Something I see a great deal in early use of Fedora is the desire to map 
> preexisting persistence architectures directly onto the repository. E.g. the 
> expectation that a "directory of files" will become an "object of 
> datastreams".
> 
> I don't know what M. Jallud is thinking and I don't mean to imply any 
> criticism, but I do wonder about any Fedora-based architecture featuring 
> objects with thousands of datastreams. It can be objectively said that such 
> an architecture is not at all idiomatic.
> 
> ---
> A. Soroka
> Digital Research and Scholarship R & D and Online Library Environment
> the University of Virginia Library
> 
> 
> 
> 
> On Dec 21, 2010, at 10:06 AM, Alex Rodriguez Lopez wrote:
> 
>> Hi.
>> 
>> Maybe I'm missing something here, but wouldn't it be a better approach to 
>> create new objects (each with one, or a few, but not hundreds of 
>> datastreams) for each file and have them relate to the primary object 
>> https://wiki.duraspace.org/display/FCR30/Digital+Object+Relationships ?
>> 
>> Instead of having 1 object with 1000s of datastreams, you have 1 object 
>> linked to 1000s of objects (each with one datastream).
>> 
>> Unless you *REALLY* need all to reside in one big XML...
>> 
>> Pierre-Yves JALLUD, 21-12-2010 14:52:
>>> Thanks for your answers. That confirms my idea that the objects I
>>> wanted to store in FedoraCommons are not suited to this kind of
>>> system. I'll require the users to split their archives into an
>>> acceptable number of files. They used to have a maximum of 1000 or 2000
>>> datastreams (exceptionally), and FC's response times are acceptable at
>>> that size. That will be the limit of my system.
>>> Thank you again and greetings
>>> 
>>>> I am wondering a little about the data model in play here. I may have
>>>> missed an earlier part of this conversation, but I wonder if you could
>>>> describe your domain problem a little, M. Jallud?
>>>> Perhaps we can find a more efficient and idiomatic way to use Fedora's
>>>> CMA than is now obvious to you... to have more than a few dozen
>>>> datastreams in a content model is very unusual and
>>>> implies the possibility of useful refactoring.
>>>> 
>>>> ---
>>>> A. Soroka
>>>> Digital Research and Scholarship R & D and Online Library Environment
>>>> the University of Virginia Library
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Dec 20, 2010, at 9:00 AM, Asger Askov Blekinge wrote:
>>>> 
>>>>> Sounds about right, but this is not a hard limit.
>>>>> 
>>>>> As you know, Fedora stores the datastreams in one big xml file.
>>>>> 
>>>>> What is the maximum size of xml files? How many elements can there
>>>>> be in an xml list? How long do you want to wait for fedora to parse this
>>>>> object? Those are the relevant questions, and by answering them, you
>>>>> will have answered your original question.
>>>>> 
>>>>> Regards
>>>>> 
>>>>> 
>>>>> On Mon, 2010-12-20 at 14:54 +0100, Pierre-Yves JALLUD wrote:
>>>>>> Hi everyone,
>>>>>> I'm using version 3.2.1 of FedoraCommons. I wonder what the maximum
>>>>>> number of datastreams is that we can add to a single object. My
>>>>>> experiments seem to show that this number is around 32000
>>>>>> (32768?...). Is that true? Is that still true in the latest versions?
>>>>>> 
>>>>>> Thanks for your answers.
>>>>>> Pierre-Yves
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> Fedora-commons-users mailing list
>>> Fedora-commons-users@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/fedora-commons-users
>> 
> 
> 

