Re: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

Graham Triggs Wed, 27 Jan 2010 03:51:42 -0800

2010/1/21 Tom De Mulder <td...@cam.ac.uk>

> On Wed, 20 Jan 2010, Richard Rodgers wrote:
>
> > Apologies for the confusion - 'index_all' was the old name for the
> > script: I did mean index-update. One wouldn't run index-init except in cases
> > of new systems, corrupt indices or the like.
> > Index-update operates incrementally, and is *much* faster.
>
> Sadly, though, your solution touches the entire repository, and doesn't
> scale as well. Once the repository size gets large enough, even "faster"
> can take a long time.
>
>
I'm not going to advocate a specific solution here, but a philosophy. Speed
and scalability are different things, and it's dangerous to conflate the
two.


A batch import is just that - a batch process. A non-interactive job that
churns through a bunch of data until it has exhausted all its input. The
speed of a batch process should only matter when:

1) You have a specific point in time that a process must be completed by.

2) You are sitting there watching it.

3) You have so much data to process each and every day that you can't
possibly ever complete.

Even if it takes an hour to process 4000 documents, that still means
you can import nearly 100,000 in a day. How many people are close to needing
to do that?
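
For what it's worth, the arithmetic behind that daily figure can be sketched as
follows. The 4000 documents an hour is the rate quoted above; the function name
is purely illustrative, and the sketch assumes the importer runs around the
clock at a constant rate.

```python
# Back-of-the-envelope daily capacity for a single batch importer.
# Assumption: the importer runs 24 hours a day at a constant rate.
HOURS_PER_DAY = 24


def docs_per_day(docs_per_hour: int) -> int:
    """Return how many documents one importer can load in a day."""
    return docs_per_hour * HOURS_PER_DAY


print(docs_per_day(4000))  # 96000, i.e. just under 100,000
```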

As for watching it, well, surely you have better things to do! Or, as Robert
Llewellyn says about recharging electric cars - it takes 9 secs. That's how
long it takes to initiate the process - and your involvement ceases there.

Yes, having something completed by a specific point in time can be a
concern... "I need to have these articles loaded by the date I need to
submit a report about them". But that ought to only be a concern in
determining when the process needs to start.

If the reason for speeding up the batch import is because it impacts on the
usability of the repository for that duration - that is not a scalable
system. It does not matter how much faster you make it, there is always a
finite limit to how many items you can process with a single importer,
running against a single repository, that exists on a single machine.

A scalable system can run imports all day long without affecting the
functionality or performance of the repository for users accessing it
concurrently. A scalable system can run 10, 20,... 100 importers
simultaneously without detrimental effects. A scalable system lets you
import millions of items an hour, by allowing you to utilise the resources
needed to do it, not by trying to squeeze a single process into a finite
resource.
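
To put a rough number on that, here is a minimal sketch of how many importers
a given hourly target would need. It assumes idealised linear scaling (each
extra importer adds its full rate, with no contention), which is exactly the
property a scalable system would have to provide; the function name and the
one-million target are illustrative, not measurements.

```python
import math


def importers_needed(target_per_hour: int, rate_per_importer: int) -> int:
    """Importers required to hit a target, assuming perfectly linear scaling."""
    # Round up: a fraction of an importer still means running one more process.
    return math.ceil(target_per_hour / rate_per_importer)


# One million items an hour at 4000 items per importer per hour.
print(importers_needed(1_000_000, 4000))  # 250
```

No amount of tuning gets a single importer from 4000 to a million an hour; the
gain has to come from being able to run more of them.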

Or to paraphrase your statement, once the repository gets large enough, even
"faster" can never be fast enough. I would say stop worrying about how fast
you can make a batch import, and think about how you can reduce the impact
on the system. The numbers being posted are nowhere near being a concern for
a non-interactive process. Your batch import still only takes 9 secs ;)

G

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
