Hi All, I'm still new to DSpace and all its intricacies, so if this is a repeat of existing knowledge, forgive me.
Continuing Graham's findings, I thought I would throw this out there based on my experience managing PostgreSQL over the past several years.

If you are using anything less than PgSQL 8.3, consider upgrading. If you are using 7.x, REALLY upgrade. The overall performance improvements will be significant. (My experience with 8.4 is nil as of yet.)

Secondly, and probably more importantly, PgSQL (even 8.3) ships with default settings that prefer compatibility over performance, so you really must tune it for your own system. Out of the box it assumes a server with a very small amount of memory. I imagine this is to some degree what you are encountering and demonstrating in your investigations, Graham:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

I don't expect that many DSpace users have experience managing PgSQL, but it does require a bit of knowledge, even as far as modifying the kernel's shared memory values. Our repository only has around 9,000 items so far, but I know for certain that it's un-tuned and not using nearly as many resources as it has available.

If tuning PgSQL does end up solving this problem (at least for now?), then this info needs to be communicated somewhere (the wiki perhaps?). But tuning PgSQL is something of a black art: one person's settings could be disastrous for someone else.
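For what it's worth, a starting point on a dedicated box with, say, 4GB of RAM might look something like the snippet below. These are illustrative values only, not a recommendation - as I said, the right numbers depend entirely on your own system, and with a shared_buffers this large you may also need to raise the kernel's shmmax before Postgres will start:

```
# postgresql.conf - illustrative starting values for a dedicated 4GB server
shared_buffers = 1GB            # default is only a few MB; ~25% of RAM is a common rule of thumb
effective_cache_size = 3GB      # estimate of what the OS will cache; guides the planner
work_mem = 32MB                 # per-sort/per-hash memory budget
maintenance_work_mem = 256MB    # used by VACUUM, CREATE INDEX, etc.
checkpoint_segments = 16        # fewer checkpoints under heavy write load
```

Again, treat those as a sketch to be measured against your own workload, not as settings to copy blindly.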
:) --Joel

Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | (202) 786-2861 (f) | richar...@si.edu

________________________________
From: Graham Triggs <grahamtri...@gmail.com>
Date: Thu, 28 Jan 2010 15:58:05 -0500
To: Simon Brown <st...@cam.ac.uk>
Cc: Mark Diggory <mdigg...@atmire.com>, <dspace-devel@lists.sourceforge.net>
Subject: Re: [Dspace-devel] [DSJ] Commented: (DS-470) Batch import times increase drastically as repository size increases; patch to mitigate the problem

On 28 Jan 2010, at 14:04, Simon Brown wrote:
> Having dug through the code a little more in the meantime, it seems
> that the effect of pruneIndexes() is to remove from the browse indexes
> information about items which are expunged and/or withdrawn; in that
> light it might not be necessary to call it when items are added or
> changed at all.

pruneIndexes() only removes data from the browse indexes, but the rules under which pruning must occur are more subtle than that:

1) bi_item and bi_withdrawn
   a) the bi_item table needs to be pruned if you withdraw an item.
   b) the bi_withdrawn table needs to be pruned if you reinstate an item.
   c) either table needs to be pruned when you expunge an item, depending on the state the item was in at the time.

2) metadata tables - bi_1_dis, bi_1_dmap, bi_2_dis, bi_2_dmap, etc.
   a) the _dis and _dmap tables for a given index number need to be pruned any time that the metadata (author, subject, etc.) being indexed by them is changed.
   b) all the _dis and _dmap tables need to be pruned whenever an item is withdrawn or expunged.

I've done some more research on the problem. First, the following posts:

http://archives.postgresql.org/pgsql-performance/2009-01/msg00276.php
http://archives.postgresql.org/pgsql-performance/2009-01/msg00280.php

highlight the difference between doing an EXCEPT of two SELECTs (as is currently in the browse code) and a NOT IN (which would be the alternative).
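To make the comparison concrete, the two formulations look roughly like this in terms of the browse tables (a sketch only - the actual browse code generates these per index number):

```sql
-- Form currently used by the browse code: EXCEPT between two SELECTs,
-- deleting distinct-value rows no longer referenced by the map table.
DELETE FROM bi_2_dis
 WHERE id IN (SELECT id FROM bi_2_dis
              EXCEPT
              SELECT distinct_id FROM bi_2_dmap);

-- The alternative discussed in those threads: NOT IN against the map table.
DELETE FROM bi_2_dis
 WHERE id NOT IN (SELECT distinct_id FROM bi_2_dmap);
```

Which of the two the planner handles better depends heavily on the Postgres version and the available work_mem, which is exactly what the measurements below explore.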
Further, if you look at the Postgres 8.4 release docs:

http://developer.postgresql.org/pgdocs/postgres/release-8-4.html

you'll see that EXCEPT can now use hash aggregates, which is faster than the existing implementation using sorts. The story continues, though. The post here:

http://archives.postgresql.org/pgsql-performance/2009-06/msg00046.php

indicates that hash aggregates are only used when they can fit in work_mem. I did some testing using fabricated tables of 150,000 entries.

set work_mem = '64kB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);

Hash Semi Join  (cost=50938.90..55518.35 rows=200 width=6) (actual time=888.268..888.268 rows=0 loops=1)
  Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)
  ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.014..0.014 rows=1 loops=1)
  ->  Hash  (cost=48550.90..48550.90 rows=150000 width=4) (actual time=888.242..888.242 rows=0 loops=1)
        ->  Subquery Scan "ANY_subquery"  (cost=45550.90..48550.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)
              ->  SetOp Except  (cost=45550.90..47050.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)
                    ->  Sort  (cost=45550.90..46300.90 rows=300000 width=4) (actual time=635.657..787.194 rows=300000 loops=1)
                          Sort Key: "*SELECT* 1".id
                          Sort Method: external merge  Disk: 5272kB
                          ->  Append  (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.007..222.252 rows=300000 loops=1)
                                ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.007..94.056 rows=150000 loops=1)
                                      ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.007..43.727 rows=150000 loops=1)
                                ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.009..83.799 rows=150000 loops=1)
                                      ->  Seq Scan on bi_2_dmap  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.008..44.104 rows=150000 loops=1)
Total runtime: 954.148 ms

set work_mem = '64MB';
EXPLAIN ANALYZE DELETE FROM bi_2_dis WHERE id IN (SELECT id FROM bi_2_dis EXCEPT SELECT distinct_id AS id FROM bi_2_dmap);

Hash Semi Join  (cost=11611.00..14488.52 rows=200 width=6) (actual time=396.518..396.518 rows=0 loops=1)
  Hash Cond: (public.bi_2_dis.id = "ANY_subquery".id)
  ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=10) (actual time=0.017..0.017 rows=1 loops=1)
  ->  Hash  (cost=9736.00..9736.00 rows=150000 width=4) (actual time=396.460..396.460 rows=0 loops=1)
        ->  Subquery Scan "ANY_subquery"  (cost=0.00..9736.00 rows=150000 width=4) (actual time=396.459..396.459 rows=0 loops=1)
              ->  HashSetOp Except  (cost=0.00..8236.00 rows=150000 width=4) (actual time=396.457..396.457 rows=0 loops=1)
                    ->  Append  (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.008..233.227 rows=300000 loops=1)
                          ->  Subquery Scan "*SELECT* 1"  (cost=0.00..3822.00 rows=150000 width=4) (actual time=0.008..98.401 rows=150000 loops=1)
                                ->  Seq Scan on bi_2_dis  (cost=0.00..2322.00 rows=150000 width=4) (actual time=0.008..51.253 rows=150000 loops=1)
                          ->  Subquery Scan "*SELECT* 2"  (cost=0.00..3664.00 rows=150000 width=4) (actual time=0.010..86.050 rows=150000 loops=1)
                                ->  Seq Scan on bi_2_dmap  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.009..45.474 rows=150000 loops=1)
Total runtime: 399.273 ms

Setting work_mem to a value larger than required (I've not cut it back to find the cutoff point) results in an execution time that is about 40% of the original query's. I believe you also mentioned disk activity on the Postgres server, and as you can see in the initial plan, the sort spills to a 5272kB disk file. The second execution does not appear to touch the disk.

So that's a 60% improvement without altering a single line of code, immediately cutting the 5-hour import to 2 hours, but more importantly being pervasive throughout the entire repository.
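Note that work_mem need not be raised globally to get this effect - it can be set for a single session or transaction, so a batch job can use a large sort/hash budget without giving every connection one. A sketch, using the same browse query as above:

```sql
-- Raise work_mem only for this transaction; SET LOCAL reverts
-- automatically at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL work_mem = '64MB';
DELETE FROM bi_2_dis
 WHERE id IN (SELECT id FROM bi_2_dis
              EXCEPT
              SELECT distinct_id AS id FROM bi_2_dmap);
COMMIT;
```

Since every backend can allocate up to work_mem per sort or hash node, a modest global value plus per-job SET LOCAL is generally safer than a large server-wide setting.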
Every single operation to submit new records, or to edit or remove items from the DSpace instance, will see a 60% improvement, and no disk thrashing on the Postgres server, so you will likely see better throughput for non-changing operations whilst changes are being processed.

OK, 2 hours is a fair bit longer than 16 mins, but now that we've actually improved the scalability of the instance to about the level that Postgres will allow, we can look at improving the performance of the individual operations (and even your patched version will have seen a modest improvement with the optimized Postgres configuration).

Patching the batch import process to delay the pruneIndexes() call to the end is an option, and we've looked at a cleaner way of implementing the same result. There could be a residual issue with such a change, though, as you have to hold a reference to every item that you import until the end of the process. That puts a limit on the number of items that you can import in one batch.

Now, let's look back at Richard Rodger's suggestion. We've already taken 60% off the pruning part of index-update. But in your import - and in Richard's suggestion? - the SearchConsumer was still active, so you are incrementally updating the Lucene index. If you follow the approach of running index-update at the end of the batch import, that updates the search index as well as regenerating the browse entries. So we can actually remove both the SearchConsumer and BrowseConsumer from the batch import, saving more time than before.

Now, index-update itself only adds the changes to the Lucene index, but recreates the whole contents of the browse tables. That could be avoided by adding an update method that only finds and indexes item ids that are not already in the bi_item or bi_withdrawn table. (Admittedly, that's not a perfect version of update - to do that, you would also need to index modified items.
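For anyone trying the "index-update at the end" approach: the consumers are driven by the event dispatcher configuration in dspace.cfg, which in recent 1.5.x releases looks roughly like the fragment below (check the exact key names and consumer list in your own version - this is from memory, not authoritative):

```
# dspace.cfg - event dispatcher configuration (illustrative)
# Removing 'search' and 'browse' here stops incremental index updates,
# on the understanding that index-update is run after the batch import.
event.dispatcher.default.class = org.dspace.event.BasicDispatcher
event.dispatcher.default.consumers = search, browse, eperson
```

A dedicated dispatcher with a trimmed consumer list, selected only for the import run, would avoid changing the behaviour of the normal web UI.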
It's easy enough to achieve if you add a timestamp column to bi_item and bi_withdrawn that records the last_modified value of the item at the time of indexing.)

But either way... tuning the Postgres installation will significantly reduce overhead and improve the overall scalability of the repository. The simple procedural change to the way the import is run is probably 'good enough' for now. Enhancing the index-update process to deal only with new and changed items will likely be equivalent to the patched importer, and without the negative scalability aspects of the increased memory usage of holding all imported items in memory.

Regards,
G

Graham Triggs
Technical Architect
Open Repository

------------------------------------------------------------------------------
The Planet: dedicated and managed hosting, cloud storage, colocation
Stay online with enterprise data centers and the best network in the business
Choose flexible plans and management services without long-term contracts
Personal 24x7 support from experienced hosting pros just a phone call away.
http://p.sf.net/sfu/theplanet-com
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
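[Editor's note: the timestamp idea Graham describes might be sketched as follows. The column name last_indexed is hypothetical, as is the exact join - the real browse schema should be checked before implementing anything like this:]

```sql
-- Hypothetical sketch of incremental index-update: record the item's
-- last_modified value at indexing time, then select only items that
-- are new or have changed since they were last indexed.
ALTER TABLE bi_item ADD COLUMN last_indexed TIMESTAMP;
ALTER TABLE bi_withdrawn ADD COLUMN last_indexed TIMESTAMP;

-- Candidates for (re)indexing: items with no browse row at all,
-- or whose last_modified is newer than the recorded last_indexed.
SELECT i.item_id
  FROM item i
  LEFT JOIN bi_item b ON b.item_id = i.item_id
 WHERE b.item_id IS NULL
    OR i.last_modified > b.last_indexed;
```

This keeps index-update incremental for the common case while still allowing a full rebuild when the schema or index definitions change.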