On 5 October 2010 16:33, Simon Brown <[email protected]> wrote:

> Which nobody has requested, making this a massive red herring. I fail
> to see how cutting back on unnecessary and redundant database access
> constitutes "overhead to cover up the problems of larger
> repositories".


One person's "unnecessary and redundant database access" is another's very
necessary database access - well, at least it can be.

I remember the patch for reducing the updating of browse / search indexes,
and I can see why it would be useful to not do those updates during a batch
import if you have an appropriate workflow.

That won't be the case for all of the repositories - quite a few will
welcome the ability to see those items as and when they are added. There is
also the issue of how long it takes to do the one very big update at the end
of the batch run vs. incremental changes as you go - it may be less work
overall, but having one big change can be more disruptive in some cases.


> Any repository, regardless of size, will see
> improvements with this kind of optimisation, at least one example of
> which I have already highlighted (and had my arguments shouted down -
> this is also, incidentally, why I haven't bothered to open any other
> JIRA tickets on other performance issues we've seen. What would be the
> point?)
>

No, you didn't get shouted down for raising a performance issue. The
argument arose because you assumed that this would clearly be of benefit
to "any repository", when you did nothing to address the underlying
performance issues (which could have been helped quite dramatically with
some small SQL tweaks and some configuration work in Postgres), and instead
just bypassed them for one very specific use case.

It doesn't matter how large or small a repository is, if they don't perform
batch uploads using the ItemImporter, your change will do *nothing* for
them. But an alteration to the underlying SQL, and guidelines for getting
the best out of Postgres would benefit everyone - regardless of how large or
small the repository is, or the means by which they populate it.


> The pertinent question for me is why, whenever the issue of
> performance comes up, is one of these "theoretical future of
> repositories" screeds pulled out and slammed down in front of the
> conversation? People are reporting problems with the systems they have
> *right now*.


It's not meant to be a barrier to conversation, but a question as to what
you want to resolve. Do you want to address the *scalability* of DSpace, or
do you just want to avoid an immediate performance bottleneck? If we
conflate these, conversations are going to stall, and we're not going to
make any progress.


> Or rather, they were. And yes, it is true that there is a
> finite limit to what the hardware is capable of, but the quality of
> the software plays a significant role in how quickly that limit is
> reached. But we've had this conversation before. I don't really expect
> it to end any better this time than it did then.
>

I completely agree - but a solution that breaks the encapsulation of the
components in the system, and leaves important indexes in an inconsistent
state for an extended period of time is not an automatic win for the
majority of the community.

I offered a lot of suggestions as to how that code could be better
structured, improvements both to the SQL and the configuration of Postgres
to handle the load more efficiently, and suggestions for further tweaks that
would have reduced still further the number of updates the code needed to
make. All of which would have been more beneficial to the community
(not just improving batch uploads, but interactive / singular deposits and
edits) - and not only that, would have improved the performance of your
systems further than you had so far achieved.

> Any method of increasing the processing capabilities of a system,
> either through more powerful hardware or improvements in the software,
> is "postponing the inevitable" for any repository with continued
> growth. The difference is in how much cost there is to any individual
> repository in each of those methods. Our system, with the changes
> we've made to it, struggles at around 300,000 items. People are
> reporting problems (presumably running stock 1.6.2) at around 50,000,
> from what I can gather.


This is where we need to be careful about what we are reporting. Quite a few
of the issues around 1.6.x appear to be around rampant memory usage, rather
than a clear function of how many records there are in the database. There
are also different issues involved if we are talking about adding / editing
lots of records, or serving records that are simply highly accessed.

Even so, regardless of what we do to the code to make it efficient, it does
not and cannot absolve the system administrator of correctly maintaining
both DSpace itself and its dependencies. I wouldn't want to get drawn on
where that point is without any evidence, but there is a lot of scope for
altering and improving Postgres behaviour by tweaking the memory buffers
that it uses - and it's going to be vital for people to do that in order to
scale beyond a certain point.
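As a sketch of the kind of tuning I mean - the values below are purely
illustrative, not recommendations, and need to be sized against the actual
hardware and workload:

```
# postgresql.conf - illustrative values only; size these to your own
# machine and test before deploying.

# Shared page cache; a common starting point is around 25% of RAM.
shared_buffers = 512MB

# Per-sort / per-hash working memory; helps metadata-heavy queries.
work_mem = 16MB

# Memory for VACUUM and index builds - relevant to large batch imports.
maintenance_work_mem = 128MB

# A hint to the planner about total OS cache size (not an allocation).
effective_cache_size = 1536MB
```

The point is that the stock settings are deliberately conservative, and a
repository that never touches them is leaving a lot of headroom unused.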

Similarly, tables like metadatavalue are going to grow huge quite quickly,
and will probably benefit from partitioning at some point. However, an
effective partitioning scheme will likely depend on local usage patterns,
and isn't something that we can just put into the system from the start.
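To make that concrete, here is a minimal sketch of what partitioning could
look like using Postgres table inheritance. The column names are a
simplified assumption, not the real DSpace schema, and the range boundaries
are arbitrary - exactly the kind of thing that depends on local usage:

```sql
-- Illustrative only: a simplified metadatavalue partitioned by item_id
-- range via table inheritance.
CREATE TABLE metadatavalue (
    metadata_value_id  INTEGER PRIMARY KEY,
    item_id            INTEGER NOT NULL,
    metadata_field_id  INTEGER NOT NULL,
    text_value         TEXT
);

-- Child tables hold disjoint item_id ranges; the CHECK constraints let
-- the planner skip irrelevant partitions (constraint exclusion).
CREATE TABLE metadatavalue_p0 (
    CHECK (item_id >= 0 AND item_id < 100000)
) INHERITS (metadatavalue);

CREATE TABLE metadatavalue_p1 (
    CHECK (item_id >= 100000 AND item_id < 200000)
) INHERITS (metadatavalue);

-- Constraint exclusion must be enabled for the planner to use the CHECKs:
SET constraint_exclusion = on;
```

In practice you would also need insert routing (a trigger or rules) and
per-partition indexes, which is precisely why this isn't a one-size-fits-all
change we can ship by default.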


> That means that the optimum size for a single
> repository running unmodified 1.6.2 is less than 50,000 items, or more
> than six separate DSpace instances for the number of items we hold.
> That's at least a sixfold increase in hardware and operational costs.
>

I think we would want and should expect more than 50,000 items in a single
instance - but at some point that will depend on the correct local
administration of the system in order to achieve that.

That is also a crude calculation that misses out on a
lot of factors - how much time do you spend investigating and fixing
performance issues? How much time is/would be spent migrating from one
hardware instance to another vs. simply adding another box to the cluster?
If you are targeting smaller instances within the cluster, they are each
going to be less expensive than the one big box you buy to run it as a
single instance.

G
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech