I'll put my two cents in here... I haven't dug deep into ways of making these queries work better than they already do, but my experience tells me there has to be a faster way of doing these operations. Sequential scans (i.e., "Seq Scan on bi_2_dmap") are extremely costly on large tables, and these tables are getting large (186K rows).
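A quick way to confirm what the planner is doing - without actually deleting anything - is to wrap the prune query from later in this thread in a transaction and roll it back. A sketch (note that EXPLAIN ANALYZE really executes the statement, hence the ROLLBACK):

    BEGIN;
    -- EXPLAIN ANALYZE executes the DELETE for real, so keep it inside a transaction
    EXPLAIN ANALYZE
    DELETE FROM bi_2_dis
    WHERE id IN (SELECT id FROM bi_2_dis
                 EXCEPT
                 SELECT distinct_id AS id FROM bi_2_dmap);
    ROLLBACK;
    -- "Seq Scan on bi_2_dmap" in the output confirms the full-table read described above.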
And I don't mean to second-guess any of the DSpace developers, but I've worked on databases with hundreds of thousands (even millions) of records and never experienced anything this bad for regularly used queries. I know we're in a heterogeneous environment and we don't know what kind of server Susan has or how it's tuned, but any query with a cost of 300,000 is insanely high. There's got to be a better way. :)

Personally, I have never used the EXCEPT clause and so I am not familiar with it, but it does seem, Susan, that you found a faster query, and at first glance it looks like it would achieve the same results. It's faster because you've found a query that does an "Index Scan", which is almost always much faster than a "Seq Scan". But you need to satisfy yourself as to whether or not the same rows are being deleted.

Finally, PgSQL's cost-based planner depends on the table statistics being accurate, so it's best to make sure that your regular VACUUM operations also use the ANALYZE clause.

--Joel

Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | (202) 786-2861 (f) | [email protected]

________________________________
From: Graham Triggs <[email protected]>
Date: Wed, 10 Feb 2010 08:08:44 -0500
To: "Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [Dspace-tech] Sloooow submission process in DSpace 1.5.1

Hi Sue,

Hmmm.. Double Hmmm. Bear in mind that I was personally tuning queries in Oracle, and the tuning was done two years ago. I'm sure I traced the 'EXISTS' form of the query, and the 'IN... MINUS/EXCEPT' form had proven to be more efficient. Looking at it now, I can't see that the current form is better.

That said, in a tuned database environment, whilst the execution plan is significantly shorter for the EXISTS query, the actual difference in efficiency appears to be quite small - possibly negligible. It gets interesting when you take it outside that environment, though.

As you indicate, the behaviour of the 'EXISTS' query is better in pre-8.4 Postgres (I don't have one to hand to run the tests myself, but the explain plan you attach shows better behaviour). Even in Postgres 8.4, with a large work_mem, the 'EXISTS' query appears slightly better than the existing 'EXCEPT' (total runtimes of 279ms vs 394ms - although runtime isn't a good measure, as it includes the overhead of the EXPLAIN itself; the difference in execution time may be related to the length of the execution plan, not the efficiency of the operations).

NB: Note that in 8.4, the EXISTS is using a hash-join, which is more efficient than your query plan:

"Hash Anti Join  (cost=4039.00..8265.50 rows=1 width=6) (actual time=279.345..279.345 rows=0 loops=1)"
"  Hash Cond: (bi_2_dis.id = bi_2_dmap.distinct_id)"
"  ->  Seq Scan on bi_2_dis  (cost=0.00..2164.00 rows=150000 width=10) (actual time=0.016..55.114 rows=150000 loops=1)"
"  ->  Hash  (cost=2164.00..2164.00 rows=150000 width=4) (actual time=118.596..118.596 rows=150000 loops=1)"
"        ->  Seq Scan on bi_2_dmap  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.009..52.910 rows=150000 loops=1)"
"Total runtime: 279.992 ms"

In the same environment, if I drop the work_mem to 64kB, the EXISTS query doubles in execution time. Although it's still attempting to do an efficient join - and so doesn't degrade as badly as the EXCEPT query in poorly tuned environments - it's still worse than the existing query operating in a correctly tuned environment.
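That work_mem sensitivity is easy to reproduce per session, since SET only affects the current connection. A sketch using the table names from this thread (on 8.4 the plan should flip between a Sort/SetOp and a hash-based set operation):

    SET work_mem = '64kB';    -- starved: expect Sort + SetOp Except, possibly spilling to disk
    EXPLAIN ANALYZE
    SELECT id FROM bi_2_dis
    EXCEPT
    SELECT distinct_id FROM bi_2_dmap;

    SET work_mem = '64MB';    -- ample: on 8.4, expect HashSetOp Except instead
    EXPLAIN ANALYZE
    SELECT id FROM bi_2_dis
    EXCEPT
    SELECT distinct_id FROM bi_2_dmap;

    RESET work_mem;           -- back to the postgresql.conf value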
On balance, for non-8.4 users, or where the repository size would require a huge work_mem for an efficient EXCEPT, the benefits of the EXISTS query make it worthwhile to replace the existing queries. I've attached a patch that replaces the queries in the BrowseCreateDAOs. That said, you still stand to gain some efficiency from an update to Postgres 8.4, and tuning of the database parameters may also yield some improvements.

Regards,
G

On 9 February 2010 23:51, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] <[email protected]> wrote:

Hi Graham,

I have never used EXCEPT in a SQL query, so I had to look up what it did... :) I tried a different query and, if it truly would return the same results as the original query, it looks like it would be even more efficient. Take a look.

This is our EXPLAIN for the original query:

DELETE FROM bi_2_dis
WHERE id IN (SELECT id FROM bi_2_dis
             EXCEPT
             SELECT distinct_id AS id FROM bi_2_dmap)

Hash Join  (cost=33265.21..35896.25 rows=25920 width=6) (actual time=1190.510..1190.510 rows=0 loops=1)
  Hash Cond: (bi_2_dis.id = "IN_subquery".id)
  ->  Seq Scan on bi_2_dis  (cost=0.00..1676.42 rows=69542 width=10) (actual time=0.046..0.046 rows=1 loops=1)
  ->  Hash  (cost=32941.21..32941.21 rows=25920 width=4) (actual time=1190.436..1190.436 rows=0 loops=1)
        ->  Subquery Scan "IN_subquery"  (cost=31386.00..32941.21 rows=25920 width=4) (actual time=1190.435..1190.435 rows=0 loops=1)
              ->  SetOp Except  (cost=31386.00..32682.01 rows=25920 width=4) (actual time=1190.431..1190.431 rows=0 loops=1)
                    ->  Sort  (cost=31386.00..32034.00 rows=259201 width=4) (actual time=973.276..1064.024 rows=259201 loops=1)
                          Sort Key: id
                          ->  Append  (cost=0.00..8079.02 rows=259201 width=4) (actual time=0.012..475.111 rows=259201 loops=1)
                                ->  Subquery Scan "*SELECT* 1"  (cost=0.00..2371.84 rows=69542 width=4) (actual time=0.012..104.803 rows=69542 loops=1)
                                      ->  Seq Scan on bi_2_dis  (cost=0.00..1676.42 rows=69542 width=4) (actual time=0.010..64.376 rows=69542 loops=1)
                                ->  Subquery Scan "*SELECT* 2"  (cost=0.00..5707.18 rows=189659 width=4) (actual time=0.043..278.514 rows=189659 loops=1)
                                      ->  Seq Scan on bi_2_dmap  (cost=0.00..3810.59 rows=189659 width=4) (actual time=0.038..166.496 rows=189659 loops=1)
Total runtime: 1190.692 ms

Here is the EXPLAIN for the alternate query I came up with:

DELETE FROM bi_2_dis
WHERE NOT EXISTS (SELECT '1' FROM bi_2_dmap
                  WHERE distinct_id = bi_2_dis.id)

Seq Scan on bi_2_dis  (cost=0.00..266261.03 rows=34771 width=6) (actual time=672.602..672.602 rows=0 loops=1)
  Filter: (NOT (subplan))
  SubPlan
    ->  Index Scan using bi_2_dmap_dist_idx on bi_2_dmap  (cost=0.00..95.12 rows=25 width=0) (actual time=0.008..0.008 rows=1 loops=69542)
          Index Cond: (distinct_id = $0)
Total runtime: 672.716 ms

Let me know what you think.

Thanks,
Sue
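One thing worth checking before adopting the NOT EXISTS form: its good plan depends entirely on the index the inner subquery probes (bi_2_dmap_dist_idx in the plan above). Without it, the subplan becomes 69,542 sequential scans of bi_2_dmap. A sketch of the definition the plan implies - the index name is taken from the EXPLAIN output, and the actual DSpace DDL may differ:

    -- Index the column the EXISTS subquery probes; without it the anti-join degrades badly
    CREATE INDEX bi_2_dmap_dist_idx ON bi_2_dmap (distinct_id);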
________________________________
From: Graham Triggs [mailto:[email protected]]
Sent: Friday, February 05, 2010 2:37 PM
To: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Cc: [email protected]
Subject: Re: [Dspace-tech] Sloooow submission process in DSpace 1.5.1

Hi Sue,

It's hard for me to look through a Postgres configuration file and give general suggestions. I would really need to look into the specific behaviour experienced with different queries to see where the performance is being affected.

But the queries that prune the browse tables are one area I do have experience of - and if you follow the links in the email I pointed you to, you'll see that the current form of the query is more efficient than the alternatives. However, for the most efficient operation - which you'll need for a large repository - the EXCEPT query needs to be executed using a hash join plan. That is only possible if you are running Postgres 8.4 - and even then you have to have sufficient memory allocated to work_mem (64MB should be more than plenty for these circumstances). In all other circumstances, it will use a sort.

You need to run EXPLAIN ANALYZE on the query to see what Postgres is actually doing. The email linked earlier shows the different execution plans in each circumstance. Of particular note:

"  ->  SetOp Except  (cost=45550.90..47050.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)"
"        ->  Sort  (cost=45550.90..46300.90 rows=300000 width=4) (actual time=635.657..787.194 rows=300000 loops=1)"
"              Sort Key: "*SELECT* 1".id"
"              Sort Method: external merge  Disk: 5272kB"

versus

"  ->  HashSetOp Except  (cost=0.00..8236.00 rows=150000 width=4) (actual time=396.457..396.457 rows=0 loops=1)"
"        ->  Append  (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.008..233.227 rows=300000 loops=1)"

It's only in the second query on Postgres 8.4, where sufficient work_mem has been allocated (there is no difference between the first and second query as written), that a hash join is used for the EXCEPT. (Now that I've extracted the segment, I can see I read it wrong in the original email - it's a 5MB disk file that is being used, which would make sense: the default work_mem of 1MB is insufficient to do the hash join.)

G

On 5 February 2010 19:04, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] <[email protected]> wrote:

Hi Graham,

Yes, we use Postgres version 8.2 with DSpace 1.5.1. I tried making the change to work_mem - I changed it from 32MB to 64MB - but didn't see any difference in performance/speed in the item installation process. I'm wondering if I have some other setting(s) in postgresql.conf that may be a factor in the speed and efficiency of transactions? I've attached a copy of our postgresql.conf file and would appreciate any input on improvements.

Thanks!
Sue

________________________________
From: Graham Triggs [mailto:[email protected]]
Sent: Thursday, February 04, 2010 7:01 PM
To: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Cc: [email protected]; Smail, James W. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Subject: Re: [Dspace-tech] Sloooow submission process in DSpace 1.5.1

Hi Sue,

(You don't say explicitly, so I'll assume that you are using Postgres.)

I can't explain why the addition of metadata rows is taking so long - there are a number of factors that could come into play... the server getting many/expensive requests, resource contention, Postgres needing maintenance (i.e. vacuum), tuning, or upgrading - or all three.
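Worth pairing with the vacuum point above: the planner's cost estimates are only as good as the table statistics (as Joel also notes at the top of the thread), so routine maintenance should include ANALYZE. A minimal sketch, using the browse-table names from this thread; both statements are safe to run on a live database:

    -- Reclaim dead-row bookkeeping AND refresh planner statistics in one pass
    VACUUM ANALYZE bi_2_dis;
    VACUUM ANALYZE bi_2_dmap;

    -- Or, for the whole database (slower, but simple):
    VACUUM ANALYZE;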
However, the issue of slow installing (license granting) of items is quite likely answered in this post: http://sourceforge.net/mailarchive/message.php?msg_name=9efc26fb-749c-4a67-8481-f1a6ce09b...@gmail.com

To put it briefly: item installation takes on some extra work in browse indexing to ensure that general browse lookups are efficient (you only add an item once; users might issue hundreds of browse requests a day). The SQL that makes that calculation is written in its most efficient form. But you still need the database to be configured properly in order for that operation to be executed in an efficient way. If you are using the default work_mem setting, you can easily save 60% of your installation time - and generally improve the efficiency of the database - by increasing it appropriately.

Regards,
G

On 4 February 2010 23:29, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] <[email protected]> wrote:

Hello,

We have been noticing recently that, as our repository grows in size (we currently have approx. 125,000 items and are adding more each week), our online submission process is getting slower and slower. It's gotten so bad that even if I'm only trying to add more authors to an item by clicking the "Add more" button [inline screenshot], it takes a really long time (just now it took approximately 20 seconds) for the application to respond with the subsequent screen [inline screenshot]. To make things worse, once the customer clicks on the "I grant the license" button, the application hangs not just for seconds but for minutes (I saw one case of 11 minutes recently). Our customers are getting very frustrated with this.

Has anyone else had this problem, or does anyone have an idea of where DSpace might be getting bogged down? I looked at the INFO messages in the dspace.log file and there seem to be a whole lot of "update_item" lines in there while not much seems to be happening on the screen, but I didn't look at the programs in detail. Any help would certainly be appreciated.

Best regards,
Sue

Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA 23666
Office: (757) 224-4074
Fax: (757) 224-4001
Mobile: (757) 506-9903
Email: [email protected]
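For reference, the work_mem change discussed throughout this thread is a one-line edit. A sketch of the relevant postgresql.conf fragment - the 64MB value comes from this thread, not a universal recommendation, and work_mem is allocated per sort/hash node per backend, so it should be sized against your memory budget:

    # postgresql.conf (sketch)
    work_mem = 64MB        # default is 1MB; the plans above show a sort spilling
                           # ~5MB to disk ("Sort Method: external merge Disk: 5272kB")

A change to work_mem does not require a restart - a reload is enough (pg_ctl reload, or SELECT pg_reload_conf(); from psql), and new sessions pick it up; SHOW work_mem; confirms the active value.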
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

