Re: [Dspace-tech] Scalability issues report, DSpace@Cambridge

Stuart Lewis Fri, 08 Oct 2010 11:32:14 -0700

Hi Hilton,

>>>> - Assetstore: random structure causes large overhead on filesystem for no 
>>>> real gain
>>> Are you able to expand on the overhead that is caused, and from your 
>>> profiling, explain how the structure could be improved?  My gut (and 
>>> uninformed) instinct would be that since asset store reads are completely 
>>> random depending on the items being viewed at the time, the layout of 
>>> directories would be irrelevant.  Writes may be slightly less efficient, 
>>> but since writes only tend to occur once, they are of less consequence.
>> Apologies for sounding cryptic; I was trying not to be too verbose in the 
>> template. :-)
>> 
>> This has mostly to do with back-ups. With about 600,000 files in random 
>> directories, it can be hard to find out what files have changed. We 
>> implemented a simple asset store structure that stores files by 
>> year/month/day. This means we can mirror new files very quickly, and only 
>> traverse the entire assetstore every other day to check if files have 
>> changed.
> 
> See: http://hdl.handle.net/10019.1/3161
> How strange, I also proposed such a thing !!
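
For anyone following along, here is a rough sketch of the year/month/day 
layout described above and why it helps with mirroring. This is not the 
Cambridge implementation; the function names, the "/assetstore" root and the 
two-day window are all assumptions made up for illustration.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath


def dated_path(root: str, stored_on: date, name: str) -> PurePosixPath:
    """Build root/YYYY/MM/DD/name for a file ingested on `stored_on`.

    Hypothetical helper: `name` stands in for whatever internal
    identifier the repository assigns to the bitstream.
    """
    return (
        PurePosixPath(root)
        / f"{stored_on:%Y}"
        / f"{stored_on:%m}"
        / f"{stored_on:%d}"
        / name
    )


def recent_dirs(root: str, today: date, days: int) -> list[PurePosixPath]:
    """List the day directories covering the last `days` days, newest first.

    A backup job only needs to mirror these to pick up new files,
    instead of walking the whole assetstore.
    """
    return [
        PurePosixPath(root) / f"{d:%Y}" / f"{d:%m}" / f"{d:%d}"
        for d in (today - timedelta(days=n) for n in range(days))
    ]


if __name__ == "__main__":
    print(dated_path("/assetstore", date(2010, 10, 8), "example-bitstream"))
    for directory in recent_dirs("/assetstore", date(2010, 10, 1), 2):
        print(directory)
```

New files only ever land in the newest directories, so a frequent mirror can 
be limited to those, with the full traversal run less often to catch changes 
to older files.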


I've just read this paper and have a question.  You state the following:

----
At the moment, December 2009, the following two are the most widely used 
software packages for building and maintaining institutional repositories 
according to the OpenDOAR website.

•       http://www.dspace.org with 502 installations.
•       http://www.eprints.org with 261 installations. 

The digital objects and store are located as follows for the above:

•       DSpace => $DSPACE_HOME/assetstore
•       EPrints => $EPRINTS_HOME/disk0 

None of the above use a time/date based file system for storing digital 
objects. None of them use UUIDs to create unique digital objects and stores.

In one hundred years time how can any of the above satisfy a future researcher 
that the digital object is unique and has remained persistently so during the 
years to 2109.
----

Are you able to expand on your reasoning that repositories which do not use 
datestamped directories and filenames containing UUIDs will not satisfy future 
researchers?

Storing a file in a datestamped location under a UUID makes it no more or 
less likely that it has remained unique and persistent.  Filenames alone cannot 
guarantee this - it is up to the repository to manage the integrity of the 
stored items, and the wider system to ensure that this is the case. This is 
where the notion of a 'trusted repository' comes into play - the fact that the 
repository platform and the system as a whole are trusted to have maintained 
the integrity of the contents.
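
To make that concrete, here is a minimal sketch of the kind of integrity 
check I mean: record a checksum when a file is stored, then re-verify it 
later, whatever the file happens to be called. This is an illustration only, 
not DSpace's own code; the manifest format (relative path mapped to a SHA-256 
hex digest) and the function names are assumptions.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(manifest: dict[str, str], root: Path) -> list[str]:
    """Return the relative paths that are missing or no longer match.

    `manifest` is a hypothetical record made at ingest time, mapping
    each file's path (relative to `root`) to its expected digest.
    """
    failures = []
    for relative, expected in manifest.items():
        candidate = root / relative
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(relative)
    return failures
```

The check never looks at how the file is named or where it sits, only at 
whether its contents still match what was recorded.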

[A side note: You'll find a lot of the work that Tim has been leading recently 
regarding AIPs is of interest in this area. 
https://wiki.duraspace.org/display/DSPACE/AipBackupRestore ]

Cheers,


Stuart Lewis
IT Innovations Analyst and Developer
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928

