On 4 Oct 2010, at 15:00, Graham Triggs wrote:

> On 29 September 2010 14:17, Tom De Mulder <[email protected]> wrote:
>
>> I know you like to talk down the problem, but that really isn't
>> helping.
>
> This isn't about talking down the problem - it's about finding where
> the real problems are and not just patching the immediate concerns.
> And considering the interests of nearly 1000 DSpace instances that
> are registered on dspace.org - many of whom will probably be more
> worried about rampant resource usage for small repositories from
> adding overhead to cover up the problems of larger repositories.

Which nobody has requested, making this a massive red herring.

I fail to see how cutting back on unnecessary and redundant database
access constitutes "overhead to cover up the problems of larger
repositories". Any repository, regardless of size, will see
improvements from this kind of optimisation, at least one example of
which I have already highlighted (and had my arguments shouted down -
which is also, incidentally, why I haven't bothered to open JIRA
tickets for the other performance issues we've seen. What would be
the point?).
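To make concrete the kind of thing I mean, here is a sketch -
illustrative only, not the actual change I highlighted, and the table
and column names are only loosely modelled on the DSpace schema -
contrasting a query-per-item loop with a single batched query:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative schema: metadatavalue(item_id, text_value).
    // Not real DSpace code - just the shape of the problem.
    public class MetadataFetch {

        // The pattern to avoid: one round trip per item. Against a
        // table of hundreds of thousands of rows, the per-query
        // overhead dominates everything else.
        static Map<Integer, String> titlesOneByOne(Connection conn,
                List<Integer> itemIds) throws Exception {
            Map<Integer, String> titles = new HashMap<Integer, String>();
            for (int id : itemIds) {
                PreparedStatement ps = conn.prepareStatement(
                    "SELECT text_value FROM metadatavalue WHERE item_id = ?");
                ps.setInt(1, id);
                ResultSet rs = ps.executeQuery();
                if (rs.next()) {
                    titles.put(id, rs.getString(1));
                }
                rs.close();
                ps.close();
            }
            return titles;
        }

        // The same result in one round trip, using an IN clause with
        // one placeholder per id.
        static Map<Integer, String> titlesBatched(Connection conn,
                List<Integer> itemIds) throws Exception {
            Map<Integer, String> titles = new HashMap<Integer, String>();
            if (itemIds.isEmpty()) {
                return titles;
            }
            StringBuilder in = new StringBuilder("?");
            for (int i = 1; i < itemIds.size(); i++) {
                in.append(",?");
            }
            PreparedStatement ps = conn.prepareStatement(
                "SELECT item_id, text_value FROM metadatavalue"
                + " WHERE item_id IN (" + in + ")");
            for (int i = 0; i < itemIds.size(); i++) {
                ps.setInt(i + 1, itemIds.get(i));
            }
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                titles.put(rs.getInt(1), rs.getString(2));
            }
            rs.close();
            ps.close();
            return titles;
        }
    }

Nothing exotic - but multiplied across every page view of a large
repository, it is the difference between a usable system and an
unusable one.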
>> We run 5 DSpace instances, three of these are systems with hundreds
>> of thousands of items, and it's dog slow and immensely resource-
>> intensive. And yes, we want these to be single systems. Why
>> shouldn't we?
>
> Surely the more pertinent question is why wouldn't you want to be
> able to run a multi-node solution? I'm sure I don't need to tell you
> that no matter how good a job you do of making the system perform
> better with larger datasets, there will always be a finite limit to
> how large the repository can be, how many users you can service, and
> how quickly it will process requests for any given hardware
> allocation.

The pertinent question for me is why, whenever the issue of
performance comes up, one of these "theoretical future of
repositories" screeds gets pulled out and slammed down in front of
the conversation. People are reporting problems with the systems they
have *right now*. Or rather, they were. And yes, it is true that
there is a finite limit to what the hardware is capable of, but the
quality of the software plays a significant role in how quickly that
limit is reached.

But we've had this conversation before. I don't really expect it to
end any better this time than it did then.

> Yes, DSpace can do a better job than it currently does, but it's
> just postponing the inevitable. How much in technology relies on
> just making things bigger/faster? Even our single system hardware is
> generally made of multiple identical components - CPUs with multiple
> cores, memory consisting of multiple 'sticks', each consisting of
> multiple storage chips, storage combining multiple hard drives each
> having multiple platters.

Any method of increasing the processing capabilities of a system,
whether through more powerful hardware or through improvements in the
software, is "postponing the inevitable" for any repository with
continued growth. The difference lies in how much each of those
methods costs any individual repository.

Our system, with the changes we've made to it, struggles at around
300,000 items. People are reporting problems (presumably running
stock 1.6.2) at around 50,000, from what I can gather. That puts the
practical ceiling for a single repository running unmodified 1.6.2
somewhere under 50,000 items - or, for the number of items we hold,
more than six separate DSpace instances (300,000 / 50,000 = 6). That
is at least a sixfold increase in hardware and operational costs.
Even in a situation where higher education funding had not just been
significantly cut, that amount of money would be difficult to come
by. With people able to point to significantly better performance
from other systems on similar hardware, it becomes more difficult
still.

> And many of our dependencies are going the same way - Oracle
> database clusters, Solr is designed to get scalability from running
> over multiple shards, and even Postgres has taken a major step
> towards clustering / replication with its 9.0 release.
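As an aside, the Solr part of that needs no repository changes at
all: standard Solr distributed search will fan a query out to every
shard named in the "shards" parameter and merge the results. A
minimal sketch - the hostnames and the field name here are made up
for illustration:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLEncoder;

    public class ShardedSearch {
        public static void main(String[] args) throws Exception {
            // Two hypothetical Solr instances, each holding part of
            // the index. Shard entries are host:port/path, no scheme.
            String shards = "solr1.example.org:8983/solr,"
                          + "solr2.example.org:8983/solr";
            String query = URLEncoder.encode("title:thesis", "UTF-8");
            // Any one node, given a 'shards' parameter, fans the
            // query out to all listed shards and merges the results.
            URL url = new URL("http://solr1.example.org:8983/solr/select"
                    + "?q=" + query + "&shards=" + shards);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            in.close();
        }
    }

So the search tier can already scale sideways today; it is the
database and the asset store behind it that are the hard part.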
> Either way, you will always hit a hard limit with keeping things on
> a single system - so at some point something has to give, whether
> it's separating out the DSpace application, Solr and Postgres
> instances onto separate machines, or accepting this reality in the
> repository and building it to scale across multiple nodes itself.
> This in turn would bring benefits to how easily you can scale (in
> theory, it is a lot easier to scale at the repository level than to
> scale each of its individual components), as well as potentially
> better preservation and federation capabilities.

Leaving aside any theoretical ideal futures for the moment, it seems
to me that the gist of this conversation is "DSpace does not support
single-instance repositories over a certain size". That being the
case, I think it would only be fair to make that lack of support
explicit in the documentation and PR materials for the software, so
that all of the relevant information is readily available to anyone
making decisions about the future of their repository.

With regard to building the repository to scale across multiple
nodes: I think it's an excellent idea. But until it appears on a road
map for the software, an idea is all it is.

--
Simon Brown <[email protected]> - Cambridge University Computing Service
+44 1223 3 34714 - New Museums Site, Pembroke Street, Cambridge CB2 3QH

