[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11182#action_11182 ]
Tim Donohue commented on DS-470:
--------------------------------

From discussion during DSpace Developers Mtg on Feb 17 2010
( http://www.duraspace.org/irclogs/index.php?date=2010-02-17 )

[15:37] <tdonohue> http://jira.dspace.org/jira/browse/DS-470 : Batch import times increase drastically as repository size increases; patch to mitigate the problem
[15:37] <tdonohue> Oh, this is the large problem that grahamtriggs has been helping with
[15:38] <tdonohue> +1 for 1.6.1 or 1.7 :)
...
[15:38] <kshepherd> ah, uh oh.. this issue ;)
...
[15:39] <mhwood> DS-470, +1, post-1.6.0.
[15:39] <grahamtriggs> DS-470: +1 for addressing the issues post-1.6.0. -1 for applying the patch as it stands
[15:39] <tdonohue> yea, I think DS-470 will require a bit more work/discussion..i'd agree with grahamtriggs
[15:40] <tdonohue> ok, we'll leave it at that for now
[15:40] <kshepherd> DS-470 +1 to the general idea, but graham had some reasonable objections to viewing "speeding up batch jobs" as a priority over "reducing system load"
[15:40] <mhwood> Good point.

> Batch import times increase drastically as repository size increases; patch to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
>                 Key: DS-470
>                 URL: http://jira.dspace.org/jira/browse/DS-470
>             Project: DSpace 1.x
>          Issue Type: Improvement
>          Components: DSpace API
>    Affects Versions: 1.6.0
>            Reporter: Simon Brown
>            Priority: Minor
>             Fix For: 1.6.1
>
>         Attachments: batch_importer_speedup.patch, prune.patch
>
>
> As mentioned by my colleague Tom De Mulder on dspace-tech and at
> http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/
>
> As the repository grows, the time taken for batch imports to run also
> increases. Having profiled the importer in our 1.6.0-RC1 install we
> determined that most (80%-90%) of the time was spent in calls to
> IndexBrowse.pruneIndexes().
>
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so
> every time an item is indexed, the indexes are pruned. For any batch of size
> n, where n > 1, this is (n - 1) more prune calls than necessary.
>
> Increasing the visibility of pruneIndexes(), removing the call from
> IndexBrowse.indexItem(), and making a single call at the end of the
> BrowseConsumer.end() method reduces this to once per event queue run.
>
> However, the batch importer calls Context.commit() after each item is
> imported. Context.commit() runs the event queue, thus causing one event queue
> run per imported item.
>
> This patch addresses both of these issues in a way which has a minimal effect
> on the rest of the code base; I don't necessarily consider it to be the
> "best" way, but I wanted to keep the patch small so it could be put out. What
> it does is:
>
> 1. Create an IndexBrowse.indexItemNoPrune() method, which is called from the
> BrowseConsumer class instead of indexItem(). Other calls to indexItem() are
> not affected.
> 2. Call pruneIndexes() from BrowseConsumer.end().
> 3. Change the call in the batch importer from Context.commit() to
> Context.getDBConnection().commit(). The only effective difference between the
> two is that the event queue is not run; I think that a better solution might
> be to move the code to run the event queue from the Context.commit() method
> to the Context.complete() method, but I don't know what effect that will have
> on the rest of the code.
>
> As noted in Tom's blog post linked above, these changes, on a repository with
> in excess of 120,000 items, brought import time from 4.7 seconds/item down to
> 4.9 items/second.

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly, contact one of the administrators: http://jira.dspace.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
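
For readers following the issue description, here is a minimal Java sketch of the first two changes it lists: index items without pruning, then prune once at the end of the run. It is not DSpace code. PruneSketch, BrowseIndexStub, pruneCallsPerItem and pruneCallsBatched are hypothetical names made up for illustration, and the stub's pruneIndexes() only counts calls rather than touching any index. The third change (avoiding an event queue run per item) is not modelled here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the browse index, NOT DSpace's IndexBrowse.
public class PruneSketch {

    static class BrowseIndexStub {
        final List<String> indexed = new ArrayList<>();
        int pruneCalls = 0;

        // Behaviour described in the issue: every indexed item triggers a prune.
        void indexItem(String item) {
            indexItemNoPrune(item);
            pruneIndexes();
        }

        // Analogue of change 1: index the item but skip the prune.
        void indexItemNoPrune(String item) {
            indexed.add(item);
        }

        // Stand-in for the expensive prune; here it only counts invocations.
        void pruneIndexes() {
            pruneCalls++;
        }
    }

    // Prune after every item: n prune calls for a batch of n items.
    public static int pruneCallsPerItem(int batchSize) {
        BrowseIndexStub index = new BrowseIndexStub();
        for (int i = 0; i < batchSize; i++) {
            index.indexItem("item-" + i);
        }
        return index.pruneCalls;
    }

    // Analogue of change 2: prune once when the run ends, whatever the batch size.
    public static int pruneCallsBatched(int batchSize) {
        BrowseIndexStub index = new BrowseIndexStub();
        for (int i = 0; i < batchSize; i++) {
            index.indexItemNoPrune("item-" + i);
        }
        index.pruneIndexes();
        return index.pruneCalls;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("per-item prune calls: " + pruneCallsPerItem(n));
        System.out.println("batched prune calls:  " + pruneCallsBatched(n));
    }
}
```

The sketch only shows the call-count difference (n versus 1). How much wall-clock time that saves depends on the cost of the real prune, which the issue reports as 80%-90% of import time on the reporter's install.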