Batch import times increase drastically as repository size increases; patch to 
mitigate the problem
---------------------------------------------------------------------------------------------------

                 Key: DS-470
                 URL: http://jira.dspace.org/jira/browse/DS-470
             Project: DSpace 1.x
          Issue Type: Improvement
          Components: DSpace API
    Affects Versions: 1.6.0
            Reporter: Simon Brown
            Priority: Minor
         Attachments: batch_importer_speedup.patch

As mentioned by my colleague Tom De Mulder on dspace-tech and in his blog post 
at http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/ 
the time taken for batch imports to run increases as the repository grows. 
Having profiled the importer in our 1.6.0-RC1 install, we determined that most 
(80%-90%) of the time was spent in calls to IndexBrowse.pruneIndexes(). 

The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so 
every time an item is indexed, the indexes are pruned. For any batch of size n, 
where n > 1, that is n prune calls where one would do, i.e. (n - 1) more than 
necessary.
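
To make the cost concrete, here is a toy model (not DSpace code). The class 
ToyPruneModel and its methods are hypothetical, and it assumes that one prune 
pass touches every row in the index, which the profiling above suggests but 
does not prove. It compares pruning after every item with pruning once per 
batch.

```java
// Toy model, NOT DSpace code: counts prune passes and the rows they scan.
class ToyPruneModel {
    long indexSize;        // rows currently in the hypothetical index
    long pruneCalls = 0;   // how many prune passes were run
    long rowsScanned = 0;  // total rows touched by all prune passes

    ToyPruneModel(long existingItems) {
        this.indexSize = existingItems;
    }

    // Assumption: one prune pass touches every row in the index.
    void prune() {
        pruneCalls++;
        rowsScanned += indexSize;
    }

    // Behaviour described in the report: every indexed item triggers a prune.
    void importPruningEachItem(int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            indexSize++;
            prune();
        }
    }

    // Alternative: index the whole batch, then prune once.
    void importPruningOnce(int batchSize) {
        indexSize += batchSize;
        prune();
    }

    public static void main(String[] args) {
        ToyPruneModel each = new ToyPruneModel(120_000);
        each.importPruningEachItem(100);
        ToyPruneModel once = new ToyPruneModel(120_000);
        once.importPruningOnce(100);
        System.out.println("per item:  " + each.pruneCalls + " prunes, "
                + each.rowsScanned + " rows scanned");
        System.out.println("per batch: " + once.pruneCalls + " prunes, "
                + once.rowsScanned + " rows scanned");
    }
}
```

Under that assumption the wasted work grows with both the batch size and the 
size of the existing repository, which would match the slowdown we observed.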

Increasing the visibility of pruneIndexes(), removing the call from 
IndexBrowse.indexItem(), and making a single call at the end of the 
BrowseConsumer.end() method reduces this to once per event queue run.

However, the batch importer calls Context.commit() after each item is imported. 
Context.commit() runs the event queue, thus causing one event queue run per 
imported item. 
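
The interaction can be sketched with another toy class. ToyQueueContext is a 
hypothetical stand-in, not org.dspace.core.Context; it only models the one 
behaviour described above, namely that a commit also drains the event queue.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a context with an event queue; NOT org.dspace.core.Context.
class ToyQueueContext {
    private final List<String> pendingEvents = new ArrayList<>();
    int queueRuns = 0;         // how many times the queue was drained
    int eventsDispatched = 0;  // total events handed to consumers

    void addEvent(String event) {
        pendingEvents.add(event);
    }

    // Modelled on the description above: committing also runs the event queue.
    void commit() {
        queueRuns++;
        eventsDispatched += pendingEvents.size();
        pendingEvents.clear();
    }

    // Import n items, committing after each one as the batch importer does.
    static ToyQueueContext importWithCommitPerItem(int n) {
        ToyQueueContext ctx = new ToyQueueContext();
        for (int i = 0; i < n; i++) {
            ctx.addEvent("item-" + i + " added");
            ctx.commit();
        }
        return ctx;
    }

    public static void main(String[] args) {
        ToyQueueContext ctx = importWithCommitPerItem(50);
        System.out.println(ctx.queueRuns + " queue runs for "
                + ctx.eventsDispatched + " events");
    }
}
```

With one queue run per item, anything done once per queue run (such as a 
prune in a consumer's end() method) still happens once per item.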

This patch addresses both of these issues in a way which has a minimal effect 
on the rest of the code base. I don't necessarily consider it to be the "best" 
way, but I wanted to keep the patch small so that it could be released. What 
it does is:

1. Create an IndexBrowse.indexItemNoPrune() method, which is called from the 
BrowseConsumer class instead of indexItem(). Other calls to indexItem() are not 
affected.
2. Call pruneIndexes() from BrowseConsumer.end().
3. Change the call in the batch importer from Context.commit() to 
Context.getDBConnection().commit(). The only effective difference between the 
two is that the event queue is not run. I think that a better solution might be 
to move the code that runs the event queue from the Context.commit() method to 
the Context.complete() method, but I don't know what effect that would have on 
the rest of the code.
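
The three steps above can be sketched roughly as follows. This is not the 
patch itself: SketchIndex, SketchConsumer, SketchContext and PatchShapeDemo are 
hypothetical stand-ins for IndexBrowse, BrowseConsumer, Context and the batch 
importer, and the single full commit at the end of the batch is an assumption 
of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; none of these are the real DSpace classes.
class SketchIndex {
    int indexed = 0;
    int pruneCalls = 0;

    // Step 1: index an item without pruning afterwards.
    void indexItemNoPrune(String item) {
        indexed++;
    }

    void pruneIndexes() {
        pruneCalls++;
    }
}

class SketchConsumer {
    private final SketchIndex index;

    SketchConsumer(SketchIndex index) {
        this.index = index;
    }

    void consume(String item) {
        index.indexItemNoPrune(item);
    }

    // Step 2: prune once, when the event queue run ends.
    void end() {
        index.pruneIndexes();
    }
}

class SketchContext {
    private final List<String> queue = new ArrayList<>();
    private final SketchConsumer consumer;
    int dbCommits = 0;

    SketchContext(SketchConsumer consumer) {
        this.consumer = consumer;
    }

    void addItem(String item) {
        queue.add(item);
    }

    // Step 3: stand-in for committing the database connection only.
    void commitDatabaseOnly() {
        dbCommits++;
    }

    // Full commit: database commit plus one event queue run.
    void commit() {
        dbCommits++;
        for (String item : queue) {
            consumer.consume(item);
        }
        consumer.end();
        queue.clear();
    }
}

class PatchShapeDemo {
    static SketchIndex runImport(int n) {
        SketchIndex index = new SketchIndex();
        SketchContext ctx = new SketchContext(new SketchConsumer(index));
        for (int i = 0; i < n; i++) {
            ctx.addItem("item-" + i);
            ctx.commitDatabaseOnly(); // no event queue run per item
        }
        ctx.commit(); // assumed: one full commit at the end of the batch
        return index;
    }

    public static void main(String[] args) {
        SketchIndex index = runImport(25);
        System.out.println(index.indexed + " items indexed, "
                + index.pruneCalls + " prune call(s)");
    }
}
```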

As noted in Tom's blog post linked above, these changes, on a repository with 
in excess of 120,000 items, changed the import rate from 4.7 seconds/item to 
4.9 items/second (about 0.2 seconds/item, a roughly 23-fold speedup).

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel