On Mon, 18 Dec 2006, Robert Tansley wrote:

> First, I specified the number 200,000 items because that's the order
> of size of the biggest known DSpace instance, Cambridge's
> (www.dspace.cam.ac.uk).  That's not a magic size, I'm sure more is
> possible.  We are lacking in concrete performance data for DSpace
> (volunteers?) which has led to some speculative bad press.

When we did a recent big code update (to 1.4 with some minor tweaks), we 
disabled the dreadful object cache by commenting out the two routines that 
implement it (only a quick hack). This fixed our main problem, i.e. that 
DSpace would run out of memory because the object cache would just keep 
growing. It also gave us a noticeable performance boost (tested with 
siege), probably because the garbage collector doesn't have to kick in 
nearly as often.
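
For context, "disabling" here really just means making those two routines 
do nothing. A minimal sketch of the idea; the class and method names below 
are illustrative of the shape, not necessarily the exact DSpace signatures:

    // Sketch only: a cache whose put keeps no reference and whose lookup
    // always misses, so the heap cannot grow and every read falls through
    // to the database.
    public class NoOpObjectCache
    {
        /** Pretend to cache, but retain nothing. */
        public void cache(Object o, int id)
        {
            // intentionally empty
        }

        /** Always report a miss, forcing a re-read from the database. */
        public Object fromCache(Class<?> objectClass, int id)
        {
            return null;
        }
    }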

> That said, I think no. of items is the significant factor, because the
> main culprits in slowdown are a) the browse code (for which only items
> are relevant) as opposed to the Lucene search engine which deals with
> the full-text indexing, and which scales far more handsomely; and b)
> the in-memory object cache growing out of control during
> import/re-indexing, which as of 1.4 should be able to use constant
> memory regardless of repo size.  (May need a couple of code tweaks to
> fix this -- ping the list tomorrow and I'll check).

I wish you'd either get rid of the object cache or use an open source 
cache implementation. However, given the nature of DSpace (and the fact 
that most of the time you won't get the same item being accessed twice in 
quick succession), I don't think it needs an object cache at all. And, as 
I just said, disabling it makes it *faster*.
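
If a cache really is wanted, even the standard LinkedHashMap LRU idiom 
would at least keep it bounded. A sketch (the generics and the size limit 
are mine, not anything in DSpace):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A size-capped LRU cache: once maxEntries is reached, the least
    // recently accessed entry is evicted, so memory use stays constant.
    public class BoundedObjectCache<K, V> extends LinkedHashMap<K, V>
    {
        private final int maxEntries;

        public BoundedObjectCache(int maxEntries)
        {
            super(16, 0.75f, true); // access-order, so eviction is LRU
            this.maxEntries = maxEntries;
        }

        protected boolean removeEldestEntry(Map.Entry<K, V> eldest)
        {
            return size() > maxEntries;
        }
    }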

Currently, I see no problems with the pure speed of the web application. 
However, both the importer and indexer still get slower and slower over 
time with big imports. To mitigate that, we run the indexer in batches of 
a few hundred items at a time, then flush the index; this doesn't get rid 
of the slowdown, but it does lessen it, and it also means the indexer 
doesn't run out of memory.
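
The batching is roughly the shape below. indexItem(), flushIndex() and 
releaseBatchState() are hypothetical stand-ins for the real indexer 
calls, not DSpace API; the point is just that state is committed and 
released every few hundred items, so memory use stays roughly flat:

    import java.util.List;

    // Batched indexing: flush and release per-batch state every
    // BATCH_SIZE items instead of holding everything until the end.
    public abstract class BatchedIndexer<T>
    {
        private static final int BATCH_SIZE = 300; // "a few hundred"

        protected abstract void indexItem(T item) throws Exception;
        protected abstract void flushIndex() throws Exception;
        protected abstract void releaseBatchState();

        public void indexAll(List<T> items) throws Exception
        {
            int count = 0;
            for (T item : items)
            {
                indexItem(item);
                if (++count % BATCH_SIZE == 0)
                {
                    flushIndex();         // commit the index so far
                    releaseBatchState();  // drop caches, DB handles, etc.
                }
            }
            flushIndex(); // final partial batch
        }
    }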

Another issue is backups: when you have as many files as we do, it gets 
hard to work out what has changed in the assetstore when making backups 
(we use rsync so we can back up only the changes; copying the entire 
assetstore across each time would be too much of a hit, even on our 
dedicated network link to our offsite backup servers).
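
Working out the changes by hand, rather than letting rsync do it, boils 
down to a tree walk comparing modification times against the last backup. 
Purely illustrative, with placeholder path and cutoff:

    import java.io.File;

    // List assetstore files modified since a given cutoff time.
    public class ChangedFiles
    {
        public static void main(String[] args)
        {
            // placeholder cutoff: 24 hours ago
            long cutoff = System.currentTimeMillis() - 24L * 3600 * 1000;
            listChangedSince(new File("/dspace/assetstore"), cutoff);
        }

        static void listChangedSince(File dir, long cutoff)
        {
            File[] entries = dir.listFiles();
            if (entries == null)
                return;
            for (File f : entries)
            {
                if (f.isDirectory())
                    listChangedSince(f, cutoff);
                else if (f.lastModified() > cutoff)
                    System.out.println(f.getPath());
            }
        }
    }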

This is just a quick braindump, because I happened to see Rob's post 
scroll past, and it is by no means exhaustive, but I think it covers our 
main current performance-related issues, such as they are. My current 
concerns lie far more with the authentication/authorization system...


For reference, our webapp runs on a machine with two dual-core CPUs and 
8GB of memory, with the database on a separate (similar, but with very 
fast disk) machine. The assetstores sit on a SAN attached over 4Gbit/s 
Fibre Channel.


Regards,

--
Tom De Mulder <[EMAIL PROTECTED]> - Cambridge University Computing Service
+44 1223 3 31843 - New Museums Site, Pembroke Street, Cambridge CB2 3QH
-> 17/01/2007 : The Moon is Waning Crescent (14% of Full)
