On 4 Oct 2010, at 15:00, Graham Triggs wrote:

> On 29 September 2010 14:17, Tom De Mulder <[email protected]> wrote:
> I know you like to talk down the problem, but that really isn't  
> helping.
>
> This isn't about talking down the problem - it's about finding where  
> the real problems are and not just patching the immediate concerns.  
> And it's about considering the interests of the nearly 1000 DSpace
> instances registered on dspace.org - many of which will probably be
> more worried about rampant resource usage in small repositories
> caused by adding overhead to cover up the problems of larger
> repositories.

Which nobody has requested, making this a massive red herring. I fail
to see how cutting back on unnecessary and redundant database access
constitutes "overhead to cover up the problems of larger
repositories". Any repository, regardless of size, will see
improvements from that kind of optimisation; I have already
highlighted at least one example, only to have my arguments shouted
down. (That, incidentally, is why I haven't bothered to open any other
JIRA tickets for the other performance issues we've seen. What would
be the point?)
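
To make concrete the kind of thing I mean - and this is only a
hypothetical sketch in the spirit of those changes, not DSpace's
actual code and not the JIRA example above (table/column names are
approximate, and the "db" connection object is assumed) - compare
querying per item inside a loop with fetching everything in a couple
of round trips:

  # Hypothetical sketch only - illustrates redundant per-item queries
  # versus batched access; not taken from the DSpace codebase.

  def load_items_naive(db, item_ids):
      results = []
      for item_id in item_ids:
          # One database round trip per item, plus one per metadata set
          item = db.query(
              "SELECT * FROM item WHERE item_id = %s", (item_id,))
          metadata = db.query(
              "SELECT * FROM metadatavalue WHERE item_id = %s", (item_id,))
          results.append((item, metadata))
      return results

  def load_items_batched(db, item_ids):
      # Two round trips in total, however many items are being listed
      items = db.query(
          "SELECT * FROM item WHERE item_id = ANY(%s)", (item_ids,))
      metadata = db.query(
          "SELECT * FROM metadatavalue WHERE item_id = ANY(%s)", (item_ids,))
      return items, metadata

Small repositories barely notice the difference; large ones do. That
is why I struggle to see it as "overhead" for anyone.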

> We run 5 DSpace instances; three of these are systems with hundreds
> of thousands of items, and they are dog slow and immensely
> resource-intensive. And yes, we want these to be single systems. Why
> shouldn't we?
>
> Surely the more pertinent question is why wouldn't you want to be  
> able to run a multi-node solution? I'm sure I don't need to tell you  
> that no matter how good a job you do of making the system perform  
> better with larger datasets, there will always be a finite limit to  
> how large the repository can be, how many users you can service, and  
> how quickly it will process requests for any given hardware  
> allocation.

The pertinent question for me is why, whenever the issue of
performance comes up, one of these "theoretical future of
repositories" screeds gets pulled out and slammed down in front of the
conversation. People are reporting problems with the systems they have
*right now*. Or rather, they were. And yes, it is true that there is a
finite limit to what any given hardware is capable of, but the quality
of the software plays a significant role in how quickly that limit is
reached. We've had this conversation before, though, and I don't
really expect it to end any better this time than it did then.

> Yes, DSpace can do a better job than it currently does, but that's
> just postponing the inevitable. How much in technology relies on
> just making things bigger and faster? Even our single-system
> hardware is generally made of multiple identical components - CPUs
> with multiple cores, memory consisting of multiple 'sticks', each
> consisting of multiple storage chips, storage combining multiple
> hard drives, each having multiple platters.

Any method of increasing the processing capabilities of a system,
whether through more powerful hardware or improvements in the
software, is "postponing the inevitable" for any repository with
continued growth. The difference lies in how much that postponement
costs any individual repository under each of those methods. Our
system, with the changes we've made to it, struggles at around 300,000
items. People are reporting problems (presumably running stock 1.6.2)
at around 50,000, from what I can gather. That puts the practical
upper limit for a single repository running unmodified 1.6.2 somewhere
below 50,000 items, which for the number of items we hold would mean
more than six separate DSpace instances - at least a sixfold increase
in hardware and operational costs. Even if higher education funding
had not just been significantly cut, that amount of money would be
rather difficult to come by. When people can point to significantly
better performance from other systems on similar hardware, it becomes
substantially more difficult still.
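
To be clear, the arithmetic there is only back-of-envelope, using the
rough figures above:

  # Rough figures only - both numbers are estimates from reports, not
  # measurements of any particular installation.
  items_held = 300000               # items across our repositories
  comfortable_per_instance = 50000  # where problems are being reported

  # Ceiling division: minimum number of stock 1.6.2 instances needed
  instances_needed = -(-items_held // comfortable_per_instance)
  print(instances_needed)  # 6 - and since the real per-instance limit
                           # is below 50,000, the true figure is higher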

> And many of our dependencies are going the same way - Oracle
> database clusters, Solr designed to get scalability from running
> over multiple shards, and even Postgres taking a major step towards
> clustering / replication with its 9.0 release.
>
> Either way, you will always hit a hard limit with keeping things on
> a single system - so at some point, something has to give, whether
> it's separating out the DSpace application, Solr and Postgres
> instances onto separate machines, or accepting this reality in the
> repository and building it to scale across multiple nodes itself.
> This in turn would bring benefits to how easily you can scale (in
> theory, it is a lot easier to scale at the repository level than to
> scale each of its individual components), as well as potentially
> better preservation and federation capabilities.

Leaving aside any theoretical ideal futures for the moment, it seems  
to me that the gist of this conversation is "DSpace does not support  
single-instance repositories over a certain size". That being the  
case, I think it would be only fair to make that lack of support  
explicit in the documentation and PR materials for the software, in  
order that all of the relevant information is readily available for  
anyone making decisions about the future of their repository.

With regard to building the repository to scale across multiple nodes,  
I think it's an excellent idea. But until it appears on a road map for  
the software, an idea is all it is.

--
Simon Brown <[email protected]> - Cambridge University Computing Service
+44 1223 3 34714 - New Museums Site, Pembroke Street, Cambridge CB2 3QH


