I'll put my two cents in here... I haven't dug deep into ways of making these queries work better than they already do, but my experience tells me there has to be a faster way of doing these operations. Sequential scans (i.e., "Seq Scan on bi_2_dmap") are extremely costly on large tables, and these tables are getting large (186K rows).
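A quick way to confirm what the planner is doing - without actually deleting anything - is to wrap the prune query from later in this thread in a transaction and roll it back. A sketch (note that EXPLAIN ANALYZE really executes the statement, hence the ROLLBACK):

    BEGIN;
    -- EXPLAIN ANALYZE executes the DELETE for real, so keep it inside a transaction
    EXPLAIN ANALYZE
    DELETE FROM bi_2_dis
    WHERE id IN (SELECT id FROM bi_2_dis
                 EXCEPT
                 SELECT distinct_id AS id FROM bi_2_dmap);
    ROLLBACK;
    -- "Seq Scan on bi_2_dmap" in the output confirms the full-table read described above.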
And I don't mean to second-guess any of the DSpace developers, but I've worked on databases with hundreds of thousands (even millions) of records and never experienced anything this bad for regularly used queries. I know we're in a heterogeneous environment and we don't know what kind of server Susan has or how it's tuned, but any query with a cost of 300,000 is insanely high. There's got to be a better way. :)

Personally, I have never used the EXCEPT clause and so I am not familiar with it, but it does seem, Susan, that you found a faster query, and at first glance it looks like it would achieve the same results. It's faster because you've found a query that does an "Index Scan", which is almost always much faster than a "Seq Scan". But you need to satisfy yourself as to whether or not the same rows are being deleted.

Finally, PgSQL's cost-based planner depends on the table statistics being accurate, so it's best to make sure that your regular VACUUM operations also use the ANALYZE clause.

--Joel

Joel Richard
IT Specialist, Web Services Department
Smithsonian Institution Libraries | http://www.sil.si.edu/
(202) 633-1706 | (202) 786-2861 (f) | [email protected]

________________________________
From: Graham Triggs <[email protected]>
Date: Wed, 10 Feb 2010 08:08:44 -0500
To: "Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: [Dspace-tech] Sloooow submission process in DSpace 1.5.1

Hi Sue,

Hmmm.. Double Hmmm. Bear in mind that I was personally tuning queries in Oracle, and the tuning was done two years ago. I'm sure I traced the 'EXISTS' form of the query, and the 'IN... MINUS/EXCEPT' form had proven to be more efficient. Looking at it now, I can't see that the current form is better.

That said, in a tuned database environment, whilst the execution plan is significantly shorter for the EXISTS query, the actual difference in efficiency appears to be quite small - possibly negligible. It gets interesting when you take it outside that environment, though.

As you indicate, the behaviour of the 'EXISTS' query is better in pre-8.4 Postgres (I don't have one to hand to run the tests myself, but the explain plan you attach shows better behaviour). Even in Postgres 8.4, with a large work_mem, the 'EXISTS' query appears slightly better than the existing 'EXCEPT' (total runtimes of 279ms vs 394ms - although runtime isn't a good measure, as it includes the overhead of the EXPLAIN itself; the difference in execution time may be related to the length of the execution plan, not the efficiency of the operations).

NB: Note that in 8.4, the EXISTS is using a hash-join, which is more efficient than your query plan:

"Hash Anti Join  (cost=4039.00..8265.50 rows=1 width=6) (actual time=279.345..279.345 rows=0 loops=1)"
"  Hash Cond: (bi_2_dis.id = bi_2_dmap.distinct_id)"
"  ->  Seq Scan on bi_2_dis  (cost=0.00..2164.00 rows=150000 width=10) (actual time=0.016..55.114 rows=150000 loops=1)"
"  ->  Hash  (cost=2164.00..2164.00 rows=150000 width=4) (actual time=118.596..118.596 rows=150000 loops=1)"
"        ->  Seq Scan on bi_2_dmap  (cost=0.00..2164.00 rows=150000 width=4) (actual time=0.009..52.910 rows=150000 loops=1)"
"Total runtime: 279.992 ms"

In the same environment, if I drop the work_mem to 64kB, the EXISTS query doubles in execution time. Although it's still attempting to do an efficient join - and so doesn't degrade as badly as the EXCEPT query in poorly tuned environments - it's still worse than the existing query operating in a correctly tuned environment.
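That work_mem sensitivity is easy to reproduce per session, since SET only affects the current connection. A sketch using the table names from this thread (on 8.4 the plan should flip between a Sort/SetOp and a hash-based set operation):

    SET work_mem = '64kB';    -- starved: expect Sort + SetOp Except, possibly spilling to disk
    EXPLAIN ANALYZE
    SELECT id FROM bi_2_dis
    EXCEPT
    SELECT distinct_id FROM bi_2_dmap;

    SET work_mem = '64MB';    -- ample: on 8.4, expect HashSetOp Except instead
    EXPLAIN ANALYZE
    SELECT id FROM bi_2_dis
    EXCEPT
    SELECT distinct_id FROM bi_2_dmap;

    RESET work_mem;           -- back to the postgresql.conf value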
On balance, for non-8.4 users, or where the repository size would require a huge work_mem for an efficient EXCEPT, the benefits of the EXISTS query make it worthwhile to replace the existing queries. I've attached a patch that replaces the queries in the BrowseCreateDAOs. That said, you still stand to gain some efficiency from an update to Postgres 8.4, and tuning of the database parameters may also yield some improvements.

Regards,
G

On 9 February 2010 23:51, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] <[email protected]> wrote:

Hi Graham,

I have never used EXCEPT in a SQL query, so I had to look up what it did... :) I tried a different query and, if it truly would return the same results as the original query, it looks like it would be even more efficient. Take a look.

This is our EXPLAIN for the original query:

DELETE FROM bi_2_dis
WHERE id IN (SELECT id FROM bi_2_dis
             EXCEPT
             SELECT distinct_id AS id FROM bi_2_dmap)

Hash Join  (cost=33265.21..35896.25 rows=25920 width=6) (actual time=1190.510..1190.510 rows=0 loops=1)
  Hash Cond: (bi_2_dis.id = "IN_subquery".id)
  ->  Seq Scan on bi_2_dis  (cost=0.00..1676.42 rows=69542 width=10) (actual time=0.046..0.046 rows=1 loops=1)
  ->  Hash  (cost=32941.21..32941.21 rows=25920 width=4) (actual time=1190.436..1190.436 rows=0 loops=1)
        ->  Subquery Scan "IN_subquery"  (cost=31386.00..32941.21 rows=25920 width=4) (actual time=1190.435..1190.435 rows=0 loops=1)
              ->  SetOp Except  (cost=31386.00..32682.01 rows=25920 width=4) (actual time=1190.431..1190.431 rows=0 loops=1)
                    ->  Sort  (cost=31386.00..32034.00 rows=259201 width=4) (actual time=973.276..1064.024 rows=259201 loops=1)
                          Sort Key: id
                          ->  Append  (cost=0.00..8079.02 rows=259201 width=4) (actual time=0.012..475.111 rows=259201 loops=1)
                                ->  Subquery Scan "*SELECT* 1"  (cost=0.00..2371.84 rows=69542 width=4) (actual time=0.012..104.803 rows=69542 loops=1)
                                      ->  Seq Scan on bi_2_dis  (cost=0.00..1676.42 rows=69542 width=4) (actual time=0.010..64.376 rows=69542 loops=1)
                                ->  Subquery Scan "*SELECT* 2"  (cost=0.00..5707.18 rows=189659 width=4) (actual time=0.043..278.514 rows=189659 loops=1)
                                      ->  Seq Scan on bi_2_dmap  (cost=0.00..3810.59 rows=189659 width=4) (actual time=0.038..166.496 rows=189659 loops=1)
Total runtime: 1190.692 ms

Here is the EXPLAIN for the alternate query I came up with:

DELETE FROM bi_2_dis
WHERE NOT EXISTS (SELECT '1' FROM bi_2_dmap
                  WHERE distinct_id = bi_2_dis.id)

Seq Scan on bi_2_dis  (cost=0.00..266261.03 rows=34771 width=6) (actual time=672.602..672.602 rows=0 loops=1)
  Filter: (NOT (subplan))
  SubPlan
    ->  Index Scan using bi_2_dmap_dist_idx on bi_2_dmap  (cost=0.00..95.12 rows=25 width=0) (actual time=0.008..0.008 rows=1 loops=69542)
          Index Cond: (distinct_id = $0)
Total runtime: 672.716 ms

Let me know what you think.

Thanks,
Sue
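One thing worth checking before adopting the NOT EXISTS form: its good plan depends entirely on the index the inner subquery probes (bi_2_dmap_dist_idx in the plan above). Without it, the subplan becomes 69,542 sequential scans of bi_2_dmap. A sketch of the definition the plan implies - the index name is taken from the EXPLAIN output, and the actual DSpace DDL may differ:

    -- Index the column the EXISTS subquery probes; without it the anti-join degrades badly
    CREATE INDEX bi_2_dmap_dist_idx ON bi_2_dmap (distinct_id);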
________________________________
From: Graham Triggs [mailto:[email protected]]
Sent: Friday, February 05, 2010 2:37 PM
To: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Cc: [email protected]
Subject: Re: [Dspace-tech] Sloooow submission process in DSpace 1.5.1

Hi Sue,

It's hard for me to look through a Postgres configuration file and give general suggestions. I would really need to look into the specific behaviour experienced with different queries to see where the performance is being affected.

But the queries that prune the browse tables are one area I do have experience of - and if you follow the links in the email I pointed you to, you'll see that the current form of the query is more efficient than the alternatives. However, for the most efficient operation - which you'll need for a large repository - the EXCEPT query needs to be executed using a hash join plan. That is only possible if you are running Postgres 8.4 - and even then you have to have sufficient memory allocated to work_mem (64MB should be more than plenty for these circumstances). In all other circumstances, it will use a sort.

You need to run EXPLAIN ANALYZE on the query to see what Postgres is actually doing. The email linked earlier shows the different execution plans in each circumstance. Of particular note:

"  ->  SetOp Except  (cost=45550.90..47050.90 rows=150000 width=4) (actual time=888.241..888.241 rows=0 loops=1)"
"        ->  Sort  (cost=45550.90..46300.90 rows=300000 width=4) (actual time=635.657..787.194 rows=300000 loops=1)"
"              Sort Key: "*SELECT* 1".id"
"              Sort Method: external merge  Disk: 5272kB"

versus

"  ->  HashSetOp Except  (cost=0.00..8236.00 rows=150000 width=4) (actual time=396.457..396.457 rows=0 loops=1)"
"        ->  Append  (cost=0.00..7486.00 rows=300000 width=4) (actual time=0.008..233.227 rows=300000 loops=1)"

It's only in the second query on Postgres 8.4, where sufficient work_mem has been allocated (there is no difference between the first and second query as written), that a hash join is used for the EXCEPT. (Now that I've extracted the segment, I can see I read it wrong in the original email - it's a 5MB disk file that is being used, which would make sense: the default work_mem of 1MB is insufficient to do the hash join.)

G

On 5 February 2010 19:04, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] <[email protected]> wrote:

Hi Graham,

Yes, we use Postgres version 8.2 with DSpace 1.5.1. I tried making the change to work_mem - I changed it from 32MB to 64MB - but didn't see any difference in performance/speed in the item installation process. I'm wondering if I have some other setting(s) in postgresql.conf that may be a factor in the speed and efficiency of transactions? I've attached a copy of our postgresql.conf file and would appreciate any input on improvements.

Thanks!
Sue

________________________________
From: Graham Triggs [mailto:[email protected]]
Sent: Thursday, February 04, 2010 7:01 PM
To: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Cc: [email protected]; Smail, James W. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Subject: Re: [Dspace-tech] Sloooow submission process in DSpace 1.5.1

Hi Sue,

(You don't say explicitly, so I'll assume that you are using Postgres.)

I can't explain why the addition of metadata rows is taking so long - there are a number of factors that could come into play... the server getting many/expensive requests, resource contention, Postgres needing maintenance (i.e. vacuum), tuning, or upgrading - or all three.
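Worth pairing with the vacuum point above: the planner's cost estimates are only as good as the table statistics (as Joel also notes at the top of the thread), so routine maintenance should include ANALYZE. A minimal sketch, using the browse-table names from this thread; both statements are safe to run on a live database:

    -- Reclaim dead-row bookkeeping AND refresh planner statistics in one pass
    VACUUM ANALYZE bi_2_dis;
    VACUUM ANALYZE bi_2_dmap;

    -- Or, for the whole database (slower, but simple):
    VACUUM ANALYZE;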
However, the issue of slow installing (license granting) of items is quite likely answered in this post: http://sourceforge.net/mailarchive/message.php?msg_name=9efc26fb-749c-4a67-8481-f1a6ce09b...@gmail.com

To put it briefly: item installation takes on some extra work in browse indexing to ensure that general browse lookups are efficient (you only add an item once; users might issue hundreds of browse requests a day). The SQL that makes that calculation is written in its most efficient form. But you still need the database to be configured properly in order for that operation to be executed in an efficient way. If you are using the default work_mem setting, you can easily save 60% of your installation time - and generally improve the efficiency of the database - by increasing it appropriately.

Regards,
G

On 4 February 2010 23:29, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] <[email protected]> wrote:

Hello,

We have been noticing recently that, as our repository grows in size (we currently have approx. 125,000 items and are adding more each week), our online submission process is getting slower and slower. It's gotten so bad that even if I'm only trying to add more authors to an item by clicking the "Add more" button [inline screenshot], it takes a really long time (just now it took approximately 20 seconds) for the application to respond with the subsequent screen [inline screenshot]. To make things worse, once the customer clicks on the "I grant the license" button, the application hangs not just for seconds but for minutes (I saw one case of 11 minutes recently). Our customers are getting very frustrated with this.

Has anyone else had this problem, or does anyone have an idea of where DSpace might be getting bogged down? I looked at the INFO messages in the dspace.log file and there seem to be a whole lot of "update_item" lines in there while not much seems to be happening on the screen, but I didn't look at the programs in detail. Any help would certainly be appreciated.

Best regards,
Sue

Sue Walker-Thornton
ConITS Contract
NASA Langley Research Center
Integrated Library Systems Application & Database Administrator
130 Research Drive
Hampton, VA 23666
Office: (757) 224-4074
Fax: (757) 224-4001
Mobile: (757) 506-9903
Email: [email protected]
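For reference, the work_mem change discussed throughout this thread is a one-line edit. A sketch of the relevant postgresql.conf fragment - the 64MB value comes from this thread, not a universal recommendation, and work_mem is allocated per sort/hash node per backend, so it should be sized against your memory budget:

    # postgresql.conf (sketch)
    work_mem = 64MB        # default is 1MB; the plans above show a sort spilling
                           # ~5MB to disk ("Sort Method: external merge Disk: 5272kB")

A change to work_mem does not require a restart - a reload is enough (pg_ctl reload, or SELECT pg_reload_conf(); from psql), and new sessions pick it up; SHOW work_mem; confirms the active value.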
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

