Hi Tom,

On 17 February 2010 22:09, Tom De Mulder <td...@cam.ac.uk> wrote:

> I'd like to point out that this has never been substantiated, and that we
> have so far made clear that system load goes UP at the same time as these
> batches SLOW DOWN. I don't know where you get this idea from that speeding
> up batch times would negatively affect overall system performance.
>

Can I clarify that I never stated that this particular case increased system
load?

I had (days ago) made the general point that making a query run faster can
make it consume more resources, causing it to get much worse as the dataset
grows - which is almost what happened here. This particular query had
originally been worked out on Oracle to reduce the resources it takes, and
was then blindly converted to Postgres - which appeared OK initially, but
suffers with more data on an older or unoptimized Postgres instance.

But that's a general point, not specifically relating to this issue, in
order to make the case that if you want to demonstrate that this is a
SCALABILITY improvement, then you have to provide more than just the
execution time. Time elapsed is performance, and performance is NOT
scalability. It may often be the case that you simultaneously improve
performance and scalability, but it's not the case that they will always go
hand in hand.

So, I'm not saying that this patch does increase system load. However, I do
have scalability concerns with how this patch is implemented - specifically,
how many items can be batch imported in one execution? Theoretically, the
existing importer could load an infinite number of items. This modification
WILL run out of memory after a finite number of items. How many will depend
on the size of the metadata.

If you want to deal with an arbitrary size of batch import (as well as
importing into a large repository), then you are better served following
Richard's suggestion to simply disable the indexing during the batch, and
rebuild at the end. That will be more overall load than your modification,
but has more general suitability (it shouldn't limit the number of items
that you can process in a single run).
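Richard's suggestion can be sketched like this (hypothetical Python with invented names, not DSpace code):

```python
# Hypothetical sketch of 'disable indexing during the batch, rebuild
# at the end' -- the names are illustrative, not DSpace's actual API.
index = []


def rebuild_index(repository):
    """Full re-index: touches every item in the repository, so it is
    more total load than only indexing the batch, but it needs no
    per-item state and places no limit on the batch size."""
    index.clear()
    index.extend(repository)


def import_then_rebuild(items, repository):
    for item in items:
        repository.append(item)   # store only; indexing is disabled
    rebuild_index(repository)     # single full rebuild at the end
```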

> The best way to reduce system impact here is to reduce the number of times
> the indexes get pruned to 1 from N, rather than to still do them (N-1)
> times too many but slightly faster.

I quite agree that it would be good to reduce the number of times the
indexes are pruned in a batch import, which is why I voted +1 for resolving
this post-1.6. And given the potential memory cost of holding all those
items in memory, I want to modify the browse code so that we can do an
incremental re-index - meaning you can import all your items without
indexing, and then at the end index just the new (or changed) items, with a
single prune.

But what I was demonstrating was not how to make those queries slightly
faster, but how to make them more efficient - hash operations instead of
sorts, a few sequential scans instead of many loops of index scans. It's not
about how fast they are, but about understanding how they are executed and
what impact that has on the system. And, in doing so, understanding how to
install and configure the database so that the most efficient execution
plans are used.
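As a concrete example of that configuration point: on a stock Postgres install the planner will often pick sorts simply because `work_mem` is too small for a hash to fit in memory. Settings along these lines make the hash-based plans available (the values here are purely illustrative and need sizing to the actual host):

```
# postgresql.conf -- illustrative values only, tune per machine
work_mem = 16MB                   # memory per sort/hash operation; too
                                  # small and the planner falls back to
                                  # (disk) sorts instead of hashes
effective_cache_size = 1GB        # planner's estimate of how much data
                                  # the OS is likely to have cached
default_statistics_target = 100   # better row estimates, better plans
```

And after a large batch load, running ANALYZE matters too, so the planner is choosing plans from statistics that reflect the new data.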

Because by doing that, we aren't just improving the batch importer. We're
improving the ingestion of new items via SWORD, and the creation of new
items via the UI. And we're probably improving general user operations too -
like browsing items by author (and/or restricted to a particular
collection), which involves joins and will be more efficient if they use
hash operations rather than sorts.

G
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
