Hi All, I'm still new to DSpace and all its intricacies, so if this is a repeat of existing knowledge, forgive me.
Continuing Graham's findings, I thought I would throw this out there based on my experience managing PostgreSQL over the past several years.

If you are using anything less than PgSQL 8.3, consider upgrading. If you are using 7.x, REALLY upgrade. The overall performance improvements will be significant. (My experience with 8.4 is nil as of yet.)

Secondly, and probably more importantly, PgSQL (even 8.3) ships with default settings that prefer compatibility over performance, so you really must tune it for your own system. Out of the box it assumes a server with a very small amount of memory. I imagine this is to some degree what you are encountering and demonstrating in your investigations, Graham:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

I don't expect that many DSpace users have experience managing PgSQL, but it does require a bit of knowledge, even as far as modifying the kernel's shared memory values. Our repository only has around 9,000 items so far, but I know for certain that it's un-tuned and not using nearly as many resources as it has available.

If tuning PgSQL does end up solving this problem (at least for now?), then this info needs to be communicated somewhere (the wiki perhaps?). But tuning PgSQL is something of a black art: one person's settings could be disastrous for someone else.
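For what it's worth, a starting point on a dedicated box with, say, 4GB of RAM might look something like the snippet below. These are illustrative values only, not a recommendation - as I said, the right numbers depend entirely on your own system, and with a shared_buffers this large you may also need to raise the kernel's shmmax before Postgres will start:

```
# postgresql.conf - illustrative starting values for a dedicated 4GB server
shared_buffers = 1GB            # default is only a few MB; ~25% of RAM is a common rule of thumb
effective_cache_size = 3GB      # estimate of what the OS will cache; guides the planner
work_mem = 32MB                 # per-sort/per-hash memory budget
maintenance_work_mem = 256MB    # used by VACUUM, CREATE INDEX, etc.
checkpoint_segments = 16        # fewer checkpoints under heavy write load
```

Again, treat those as a sketch to be measured against your own workload, not as settings to copy blindly.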
:) --Joel

Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | (202) 786-2861 (f) | richar...@si.edu

________________________________
From: Graham Triggs <grahamtri...@gmail.com>
Date: Thu, 28 Jan 2010 15:58:05 -0500
To: Simon Brown <st...@cam.ac.uk>
Cc: Mark Diggory <mdigg...@atmire.com>, <dspace-devel@lists.sourceforge.net>
Subject: Re: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

On 28 Jan 2010, at 14:04, Simon Brown wrote:
> Having dug through the code a little more in the meantime, it seems
> that the effect of pruneIndexes() is to remove from the browse indexes
> information about items which are expunged and/or withdrawn; in that
> light it might not be necessary to call it when items are added or
> changed at all.

pruneIndexes() only removes data from the browse indexes, but the rules under which pruning must occur are more subtle than that:

1) bi_item and bi_withdrawn
   a) the bi_item table needs to be pruned if you withdraw an item.
   b) the bi_withdrawn table needs to be pruned if you reinstate an item.
   c) either table needs to be pruned when you expunge an item, depending on the state the item was in at the time.

2) metadata tables - bi_1_dis, bi_1_dmap, bi_2_dis, bi_2_dmap, etc.
   a) the _dis and _dmap tables for a given index number need to be pruned any time that the metadata (author, subject, etc.) being indexed by them is changed.
   b) all the _dis and _dmap tables need to be pruned whenever an item is withdrawn or expunged.

I've done some more research on the problem. First, the following posts:

http://archives.postgresql.org/pgsql-performance/2009-01/msg00276.php
http://archives.postgresql.org/pgsql-performance/2009-01/msg00280.php

highlight the difference between doing an EXCEPT of two SELECTs (as is currently in the browse code) and a NOT IN (which would be the alternative).
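To make the comparison concrete, the two formulations look roughly like this in terms of the browse tables (a sketch only - the actual browse code generates these per index number):

```sql
-- Form currently used by the browse code: EXCEPT between two SELECTs,
-- deleting distinct-value rows no longer referenced by the map table.
DELETE FROM bi_2_dis
 WHERE id IN (SELECT id FROM bi_2_dis
              EXCEPT
              SELECT distinct_id FROM bi_2_dmap);

-- The alternative discussed in those threads: NOT IN against the map table.
DELETE FROM bi_2_dis
 WHERE id NOT IN (SELECT distinct_id FROM bi_2_dmap);
```

Which of the two the planner handles better depends heavily on the Postgres version and the available work_mem, which is exactly what the measurements below explore.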
Further, if you look at the Postgres 8.4 release docs:

http://developer.postgresql.org/pgdocs/postgres/release-8-4.html

you'll see that EXCEPT can now use hash aggregates, which is faster than the existing implementation using sorts. The story continues, though. The post here:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00046.php

indicates that hash aggregates are only used when they can fit in work_mem. I did some testing using fabricated tables of 150,000 entries.

set work_mem = '64kB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);

Hash Semi Join  (cost=50938.90..55518.35 rows=200 width=6) (actual time=888.268..888.268 rows=0 loops=1)
  Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)
  ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.014..0.014 rows=1 loops=1)
  ->  Hash  (cost=48550.90..48550.90 rows=150000 width=4) (actual time=888.242..888.242 rows=0 loops=1)
        ->  Subquery Scan "ANY_subquery"  (cost=45550.90..48550.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)
              ->  SetOp Except  (cost=45550.90..47050.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)
                    ->  Sort  (cost=45550.90..46300.90 rows=300000 width=4) (actual time=635.657..787.194 rows=300000 loops=1)
                          Sort Key: "*SELECT* 1".id
                          Sort Method: external merge  Disk: 5272kB
                          ->  Append  (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.007..222.252 rows=300000 loops=1)
                                ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.007..94.056 rows=150000 loops=1)
                                      ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.007..43.727 rows=150000 loops=1)
                                ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.009..83.799 rows=150000 loops=1)
                                      ->  Seq Scan on bi_2_dmap  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.008..44.104 rows=150000 loops=1)
Total runtime: 954.148 ms

set work_mem = '64MB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);

Hash Semi Join  (cost=11611.00..14488.52 rows=200 width=6) (actual time=396.518..396.518 rows=0 loops=1)
  Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)
  ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.017..0.017 rows=1 loops=1)
  ->  Hash  (cost=9736.00..9736.00 rows=150000 width=4) (actual time=396.460..396.460 rows=0 loops=1)
        ->  Subquery Scan "ANY_subquery"  (cost=0.00..9736.00 rows=150000 width=4) (actual time=396.459..396.459 rows=0 loops=1)
              ->  HashSetOp Except  (cost=0.00..8236.00 rows=150000 width=4) (actual time=396.457..396.457 rows=0 loops=1)
                    ->  Append  (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.008..233.227 rows=300000 loops=1)
                          ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.008..98.401 rows=150000 loops=1)
                                ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.008..51.253 rows=150000 loops=1)
                          ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.010..86.050 rows=150000 loops=1)
                                ->  Seq Scan on bi_2_dmap  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.009..45.474 rows=150000 loops=1)
Total runtime: 399.273 ms

Setting work_mem to a value larger than required (I've not cut it back to find the cutoff point) results in an execution time that is about 40% of the original query's. I believe you also mentioned disk activity on the Postgres server, and as you can see in the initial plan, the sort spills to a 5272kB disk file. The second execution does not appear to touch the disk.

So that's a 60% improvement without altering a single line of code, immediately cutting the 5-hour import to 2 hours, but more importantly being pervasive throughout the entire repository.
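Note that work_mem need not be raised globally to get this effect - it can be set for a single session or transaction, so a batch job can use a large sort/hash budget without giving every connection one. A sketch, using the same browse query as above:

```sql
-- Raise work_mem only for this transaction; SET LOCAL reverts
-- automatically at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL work_mem = '64MB';
DELETE FROM bi_2_dis
 WHERE id IN (SELECT id FROM bi_2_dis
              EXCEPT
              SELECT distinct_id AS id FROM bi_2_dmap);
COMMIT;
```

Since every backend can allocate up to work_mem per sort or hash node, a modest global value plus per-job SET LOCAL is generally safer than a large server-wide setting.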
Every single operation to submit new records, or to edit or remove items from the DSpace instance, will see a 60% improvement, and no disk thrashing on the Postgres server, so you will likely see better throughput for non-changing operations whilst changes are being processed.

OK, 2 hours is a fair bit longer than 16 mins, but now that we've actually improved the scalability of the instance to about the level that Postgres will allow, we can look at improving the performance of the individual operations (and even your patched version will have seen a modest improvement with the optimized Postgres configuration).

Patching the batch import process to delay the pruneIndexes() call to the end is an option, and we've looked at a cleaner way of implementing the same result. There could be a residual issue with such a change, though, as you have to hold a reference to every item that you import until the end of the process. That puts a limit on the number of items that you can import in one batch.

Now, let's look back at Richard Rodger's suggestion. We've already taken 60% off the pruning part of index-update. But in your import - and in Richard's suggestion? - the SearchConsumer was still active, so you are incrementally updating the Lucene index. If you follow the approach of running index-update at the end of the batch import, that updates the search index as well as regenerating the browse entries. So we can actually remove both the SearchConsumer and BrowseConsumer from the batch import, saving more time than before.

Now, index-update itself only adds the changes to the Lucene index, but recreates the whole contents of the browse tables. That could be avoided by adding an update method that only finds and indexes item ids that are not already in the bi_item or bi_withdrawn table. (Admittedly, that's not a perfect version of update - to do that, you would also need to index modified items.
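For anyone trying the "index-update at the end" approach: the consumers are driven by the event dispatcher configuration in dspace.cfg, which in recent 1.5.x releases looks roughly like the fragment below (check the exact key names and consumer list in your own version - this is from memory, not authoritative):

```
# dspace.cfg - event dispatcher configuration (illustrative)
# Removing 'search' and 'browse' here stops incremental index updates,
# on the understanding that index-update is run after the batch import.
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = search, browse, eperson
```

A dedicated dispatcher with a trimmed consumer list, selected only for the import run, would avoid changing the behaviour of the normal web UI.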
It's easy enough to achieve if you add a timestamp column to bi_item and bi_withdrawn that records the last_modified value of the item at the time of indexing.)

But either way... tuning the Postgres installation will significantly reduce overhead and improve the overall scalability of the repository. The simple procedural change to the way the import is run is probably 'good enough' for now. Enhancing the index-update process to deal only with new and changed items will likely be equivalent to the patched importer, and without the negative scalability aspects of the increased memory usage of holding all imported items in memory.

Regards,
G

Graham Triggs
Technical Architect
Open Repository

------------------------------------------------------------------------------
The Planet: dedicated and managed hosting, cloud storage, colocation
Stay online with enterprise data centers and the best network in the business
Choose flexible plans and management services without long-term contracts
Personal 24x7 support from experienced hosting pros just a phone call away.
http://p.sf.net/sfu/theplanet-com
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
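[Editor's note: the timestamp idea Graham describes might be sketched as follows. The column name last_indexed is hypothetical, as is the exact join - the real browse schema should be checked before implementing anything like this:]

```sql
-- Hypothetical sketch of incremental index-update: record the item's
-- last_modified value at indexing time, then select only items that
-- are new or have changed since they were last indexed.
ALTER TABLE bi_item ADD COLUMN last_indexed TIMESTAMP;
ALTER TABLE bi_withdrawn ADD COLUMN last_indexed TIMESTAMP;

-- Candidates for (re)indexing: items with no browse row at all,
-- or whose last_modified is newer than the recorded last_indexed.
SELECT i.item_id
  FROM item i
  LEFT JOIN bi_item b ON b.item_id = i.item_id
 WHERE b.item_id IS NULL
    OR i.last_modified > b.last_indexed;
```

This keeps index-update incremental for the common case while still allowing a full rebuild when the schema or index definitions change.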