Hi Richard,

This caught my eye this morning because we have a large repository (currently 122,091 Items). We too have seen our imports slow down badly as the repository grows, and we have been looking for a solution to the problem.

I just wanted to mention that the approach where you turn off the event consumers and then build the indexes (I assume you meant index-init when you said index-all?) would not work well for us, since our index-init takes up to a week to complete. Perhaps it would work for us to run just index-update afterward, as I don't think it takes nearly as long, but I'm not absolutely sure.

Sue
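For readers following the thread, here is a minimal, self-contained sketch of the idea under discussion: switching a context to a dispatcher with no indexing consumers for the duration of a batch import. It does not use DSpace classes. Only the names setDispatcher and 'noindex' come from the quoted message below; SketchContext, Consumer, CountingIndexer and indexCallsDuringImport are hypothetical stand-ins, and the real Context and event system may behave differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NoIndexDispatcherSketch {

    /** Hypothetical stand-in for an event consumer (e.g. search or browse indexing). */
    interface Consumer {
        void consume(String itemId);
    }

    /** Consumer that only counts how many items it has been asked to index. */
    static class CountingIndexer implements Consumer {
        int indexed = 0;

        @Override
        public void consume(String itemId) {
            indexed++;
        }
    }

    /** Hypothetical stand-in for a context whose dispatcher is chosen by name. */
    static class SketchContext {
        private final Map<String, List<Consumer>> dispatchers = new HashMap<>();
        private String current = "default";
        final List<String> items = new ArrayList<>();

        void defineDispatcher(String name, List<Consumer> consumers) {
            dispatchers.put(name, consumers);
        }

        void setDispatcher(String name) {
            if (!dispatchers.containsKey(name)) {
                throw new IllegalArgumentException("Unknown dispatcher: " + name);
            }
            current = name;
        }

        /** Stores the item, then notifies the consumers of the active dispatcher. */
        void addItem(String itemId) {
            items.add(itemId);
            for (Consumer c : dispatchers.get(current)) {
                c.consume(itemId);
            }
        }
    }

    /** Imports batchSize items and returns how many index calls happened during the import. */
    public static int indexCallsDuringImport(String dispatcher, int batchSize) {
        CountingIndexer indexer = new CountingIndexer();
        SketchContext ctx = new SketchContext();
        // Assumption: "default" runs the indexer, "noindex" runs no indexing consumer.
        ctx.defineDispatcher("default", Collections.<Consumer>singletonList(indexer));
        ctx.defineDispatcher("noindex", Collections.<Consumer>emptyList());
        ctx.setDispatcher(dispatcher);
        for (int i = 0; i < batchSize; i++) {
            ctx.addItem("item-" + i);
        }
        return indexer.indexed;
    }

    public static void main(String[] args) {
        System.out.println("default: " + indexCallsDuringImport("default", 500) + " index calls");
        System.out.println("noindex: " + indexCallsDuringImport("noindex", 500) + " index calls");
    }
}
```

The sketch only shows that no indexing work happens during the import itself. The imported items still have to be indexed by a separate pass afterward, and whether an incremental pass is enough or a full rebuild is required is exactly the open question raised above.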
-----Original Message-----
From: Richard Rodgers (JIRA) [mailto:a...@dspace.org]
Sent: Wednesday, January 20, 2010 10:45 AM
To: dspace-devel@lists.sourceforge.net
Subject: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

[ http://jira.dspace.org/jira/browse/DS-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=11088#action_11088 ]

Richard Rodgers commented on DS-470:
------------------------------------

Hi Simon, Tom etc.

Thanks for the careful analysis & work. Without looking closely at the patch, I'm wondering whether there might be a simpler solution. You can use a single API call (setDispatcher in the Context class) in ItemImport to use the 'noindex' dispatcher, which does not call any of the usual event consumers, including search (the dispatcher is already defined in dspace.cfg). Then, after the import, just run 'index_all'. The event system was designed to facilitate just this sort of context-specific use. I'll be glad to furnish further details if this isn't clear.

> Batch import times increase drastically as repository size increases; patch to mitigate the problem
> ---------------------------------------------------------------------------------------------------
>
> Key: DS-470
> URL: http://jira.dspace.org/jira/browse/DS-470
> Project: DSpace 1.x
> Issue Type: Improvement
> Components: DSpace API
> Affects Versions: 1.6.0
> Reporter: Simon Brown
> Priority: Minor
> Attachments: batch_importer_speedup.patch
>
>
> As mentioned by my colleague Tom De Mulder on dspace-tech and at
> http://tdm27.wordpress.com/2010/01/19/dspace-1-6-scalability-testing/
> As the repository grows, the time taken for batch imports to run also
> increases. Having profiled the importer in our 1.6.0-RC1 install we
> determined that most (80%-90%) of the time was spent in calls to
> IndexBrowse.pruneIndexes().
> The reason for this is that IndexBrowse.indexItem() calls pruneIndexes(), so
> every time an item is indexed, the indexes are pruned. For any batch of size
> n, where n > 1, this is (n - 1) more prune calls than necessary.
> Increasing the visibility of pruneIndexes(), removing the call from
> IndexBrowse.indexItem(), and making a single call at the end of the
> BrowseConsumer.end() method reduces this to once per event queue run.
> However, the batch importer calls Context.commit() after each item is
> imported. Context.commit() runs the event queue, thus causing one event queue
> run per imported item.
> This patch addresses both of these issues in a way which has a minimal effect
> on the rest of the code base; I don't necessarily consider it to be the
> "best" way, but I wanted to keep the patch small so it could be put out. What
> it does is:
> 1. Create an IndexBrowse.indexItemNoPrune() method, which is called from the
> BrowseConsumer class instead of indexItem(). Other calls to indexItem() are
> not affected.
> 2. Call pruneIndexes() from BrowseConsumer.end().
> 3. Change the call in the batch importer from Context.commit() to
> Context.getDBConnection().commit(). The only effective difference between the
> two is that the event queue is not run; I think that a better solution might
> be to move the code to run the event queue from the Context.commit() method
> to the Context.complete() method, but I don't know what effect that will have
> on the rest of the code.
> As noted in Tom's blog post linked above, these changes, on a repository with
> in excess of 120,000 items, brought import time from 4.7 seconds/item down to
> 4.9 items/second.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://jira.dspace.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

------------------------------------------------------------------------------
Throughout its 18-year history, RSA Conference consistently attracts the
world's best and brightest in the field, creating opportunities for Conference
attendees to learn about information security's most important issues through
interactions with peers, luminaries and emerging and established companies.
http://p.sf.net/sfu/rsaconf-dev2dev
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
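
As a rough illustration of the pruning arithmetic in the DS-470 description quoted above, the following self-contained sketch counts prune passes under three schemes: pruning inside every indexItem() call, pruning once per event queue run, and the special case of a single queue run for the whole batch. It is a counting model only, not DSpace code. SketchIndex, pruneCallsPerItem, pruneCallsPerQueueRun and speedup are hypothetical names; only indexItem, indexItemNoPrune and pruneIndexes are taken from the description.

```java
public class PruneCountSketch {

    /** Hypothetical stand-in for the browse index; it only counts work done. */
    static class SketchIndex {
        int indexedItems = 0;
        int pruneCalls = 0;

        void indexItemNoPrune(String itemId) {
            indexedItems++;
        }

        void pruneIndexes() {
            pruneCalls++;
        }

        /** Behaviour described as the original: index, then prune every time. */
        void indexItem(String itemId) {
            indexItemNoPrune(itemId);
            pruneIndexes();
        }
    }

    /** Prune passes when every indexed item triggers its own prune. */
    public static int pruneCallsPerItem(int n) {
        SketchIndex index = new SketchIndex();
        for (int i = 0; i < n; i++) {
            index.indexItem("item-" + i);
        }
        return index.pruneCalls;
    }

    /**
     * Prune passes when pruning happens once at the end of each queue run.
     * itemsPerRun = 1 models a commit (and queue run) after every item;
     * itemsPerRun >= n models a single queue run for the whole batch.
     */
    public static int pruneCallsPerQueueRun(int n, int itemsPerRun) {
        if (itemsPerRun < 1) {
            throw new IllegalArgumentException("itemsPerRun must be at least 1");
        }
        SketchIndex index = new SketchIndex();
        int pending = 0;
        for (int i = 0; i < n; i++) {
            index.indexItemNoPrune("item-" + i);
            pending++;
            if (pending == itemsPerRun) {
                index.pruneIndexes(); // end of one queue run
                pending = 0;
            }
        }
        if (pending > 0) {
            index.pruneIndexes(); // final, partially filled queue run
        }
        return index.pruneCalls;
    }

    /** Speed-up factor from a "seconds per item" rate to an "items per second" rate. */
    public static double speedup(double secondsPerItemBefore, double itemsPerSecondAfter) {
        return secondsPerItemBefore * itemsPerSecondAfter;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("prune per item:           " + pruneCallsPerItem(n));
        System.out.println("one queue run per item:   " + pruneCallsPerQueueRun(n, 1));
        System.out.println("one queue run per batch:  " + pruneCallsPerQueueRun(n, n));
        System.out.println("reported speed-up factor: " + speedup(4.7, 4.9));
    }
}
```

The model shows why both changes in the description matter together: moving the prune to the end of a queue run saves nothing while the importer still triggers one queue run per item. Taking the reported figures at face value, 4.7 seconds/item against 4.9 items/second works out to a speed-up of roughly 23 times.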