[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Graham Triggs updated DS-470:
-----------------------------

    Attachment: prune.patch

prune.patch alters the pruning queries to reduce database load in cases where the database isn't tuned to cope with the existing queries, or the dataset is EXTREMELY large.

Using the amended queries does not remove the need for database tuning - these queries are still more efficient running on Postgres 8.4 than they are on pre-8.4 systems, and they do need at least a respectable amount of work_mem (execution time will double in cases where work_mem is limited). But they degrade in performance more gracefully than the existing queries, and are slightly less work in the optimal cases.

> Batch import times increase drastically as repository size increases; patch to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: DS-470
>                 URL: http://jira.dspace.org/jira/browse/DS-470
>             Project: DSpace 1.x
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.6.0
>            Reporter: Simon Brown
>            Priority: Minor
>             Fix For: 1.6.1
>
>         Attachments: batch_importer_speedup.patch, prune.patch
>
>
> As mentioned by my colleague Tom De Mulder on dspace-tech and at http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/
>
> As the repository grows, the time taken for batch imports to run also increases. Having profiled the importer in our 1.6.0-RC1 install, we determined that most (80%-90%) of the time was spent in calls to IndexBrowse.pruneIndexes().
>
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so every time an item is indexed, the indexes are pruned. For any batch of size n, where n > 1, this is (n - 1) times more often than necessary.
>
> Increasing the visibility of pruneIndexes(), removing the call from IndexBrowse.indexItem(), and making a single call at the end of the BrowseConsumer.end() method reduces this to once per event queue run.
>
> However, the batch importer calls Context.commit() after each item is imported. Context.commit() runs the event queue, thus causing one event queue run per imported item.
>
> This patch addresses both of these issues in a way which has a minimal effect on the rest of the code base; I don't necessarily consider it to be the "best" way, but I wanted to keep the patch small so it could be put out. What it does is:
>
> 1. Create an IndexBrowse.indexItemNoPrune() method, which is called from the BrowseConsumer class instead of indexItem(). Other calls to indexItem() are not affected.
>
> 2. Call pruneIndexes() from BrowseConsumer.end().
>
> 3. Change the call in the batch importer from Context.commit() to Context.getDBConnection().commit(). The only effective difference between the two is that the event queue is not run; I think a better solution might be to move the code that runs the event queue from Context.commit() to Context.complete(), but I don't know what effect that would have on the rest of the code.
>
> As noted in Tom's blog post linked above, these changes, on a repository with in excess of 120,000 items, improved import speed from 4.7 seconds/item to 4.9 items/second.
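For readers following along, the core of the change described in the quoted report is simply to batch the expensive prune step: run it once per event queue run instead of once per indexed item. The sketch below is purely illustrative - the class and method names are simplified stand-ins, not the actual DSpace code or the contents of the attached patches:

    // Illustrative sketch only; simplified stand-ins, not the attached patches.
    import java.util.Arrays;
    import java.util.List;

    public class PruneOncePerBatch
    {
        // Before: prune after every item, so an n-item batch prunes n times.
        static void importPruningEachItem(List<String> items)
        {
            for (String item : items)
            {
                index(item);
                prune();              // expensive full-index maintenance, n times
            }
        }

        // After: index all items first, then prune once at the end of the batch
        // (analogous to moving the pruneIndexes() call into BrowseConsumer.end()).
        static void importPruningOnce(List<String> items)
        {
            for (String item : items)
            {
                index(item);
            }
            prune();                  // expensive full-index maintenance, once
        }

        static void index(String item)
        {
            System.out.println("indexing " + item);
        }

        static void prune()
        {
            System.out.println("pruning browse indexes");
        }

        public static void main(String[] args)
        {
            importPruningOnce(Arrays.asList("item-1", "item-2", "item-3"));
        }
    }

The same idea lies behind the Context.commit() change: committing only the database connection per item, without dispatching events, means the consumers (and therefore the prune) run once at the end of the import rather than once per item.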
--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators: http://jira.dspace.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira