We are experiencing very poor performance with our DSpace Browses
(DSpace 1.4.2/postgreSQL 8.2/Solaris 10).  I've spent the past day
looking at code and experimenting with queries and explains and I've got
a good idea of a sql query modification I could make to drastically
improve performance - I'm not just quite sure how to accomplish it
without disrupting other DSpace code and functionality.  Here's an
example:

 

I've been looking at the itemsbyauthor table.  Ours contains 94,401
rows.  Here is the DSpace query that clicking on the
browse-items-by-author link generates:

 

select distinct author, sort_author

    from (SELECT DISTINCT *  from ItemsByAuthor ) distinct_view

    order by sort_author LIMIT 21

 

and here is the Explain from executing the query against our database:

 

"Limit  (cost=28209.73..28211.30 rows=21 width=64) (actual
time=8382.920..8384.061 rows=21 loops=1)"

"  ->  Unique  (cost=28209.73..28917.73 rows=9441 width=64) (actual
time=8382.915..8384.045 rows=21 loops=1)"

"        ->  Sort  (cost=28209.73..28445.73 rows=94401 width=64) (actual
time=8382.910..8383.992 rows=56 loops=1)"

"              Sort Key: sort_author, author"

"              ->  Subquery Scan distinct_view  (cost=14410.61..16534.63
rows=94401 width=64) (actual time=385.620..671.460 rows=94401 loops=1)"

"                    ->  Unique  (cost=14410.61..15590.62 rows=94401
width=42) (actual time=385.611..603.050 rows=94401 loops=1)"

"                          ->  Sort  (cost=14410.61..14646.61 rows=94401
width=42) (actual time=385.605..504.765 rows=94401 loops=1)"

"                                Sort Key: items_by_author_id, item_id,
author, sort_author"

"                                ->  Seq Scan on itemsbyauthor
(cost=0.00..3383.01 rows=94401 width=42) (actual time=19.232..113.700
rows=94401 loops=1)"

"Total runtime: 8384.581 ms"

 

 

Basically,  a Sequential Scan is being done against the itemsbyauthor
table and the index is not being used.  (By the way, the response time
was around 25 seconds last night prior to running a vacuum and reindex!
- Terrible!!)

 

 

Here's my idea:  if I could somehow capture an alphabetic range to use
in a where clause, based on what the User clicks on, I could add a
"where" clause to the query which would force postgreSQL to use the
index, "sort_author_idx", and the elapsed time that it would take to
execute the query would be a fraction of what it's taking now.  For
instance, if I knew the User was starting the browse from the letter "B"
(I know this is oversimplifying, but you'll get the idea), I could
structure the query as follows:

 

select distinct author,sort_author from (SELECT DISTINCT *  from
ItemsByAuthor where sort_author > 'a' and sort_author < 'c' )
distinct_view 

where sort_author > 'a' and sort_author < 'c'

order by sort_author LIMIT 21

 

Look at the explain from this query, below and just look at the
difference in performance!  From 8.4 seconds to 1.3 seconds!!

  

"Limit  (cost=6544.60..6546.17 rows=21 width=64) (actual
time=1253.393..1253.710 rows=21 loops=1)"

"  ->  Unique  (cost=6544.60..6650.94 rows=1418 width=64) (actual
time=1253.388..1253.696 rows=21 loops=1)"

"        ->  Sort  (cost=6544.60..6580.04 rows=14179 width=64) (actual
time=1253.382..1253.641 rows=56 loops=1)"

"              Sort Key: sort_author, author"

"              ->  Subquery Scan distinct_view  (cost=4663.32..4982.35
rows=14179 width=64) (actual time=113.859..145.529 rows=11463 loops=1)"

"                    ->  Unique  (cost=4663.32..4840.56 rows=14179
width=42) (actual time=113.852..137.697 rows=11463 loops=1)"

"                          ->  Sort  (cost=4663.32..4698.77 rows=14179
width=42) (actual time=113.848..126.454 rows=11463 loops=1)"

"                                Sort Key: items_by_author_id, item_id,
author, sort_author"

"                                ->  Bitmap Heap Scan on itemsbyauthor
(cost=476.50..3199.08 rows=14179 width=42) (actual time=71.693..80.927
rows=11463 loops=1)"

"                                      Recheck Cond: ((sort_author >
'a'::text) AND (sort_author < 'c'::text) AND (sort_author > 'a'::text)
AND (sort_author < 'c'::text))"

"                                      ->  Bitmap Index Scan on
sort_author_idx  (cost=0.00..472.95 rows=14179 width=0) (actual
time=71.078..71.078 rows=11463 loops=1)"

"                                            Index Cond: ((sort_author >
'a'::text) AND (sort_author < 'c'::text) AND (sort_author > 'a'::text)
AND (sort_author < 'c'::text))"

"Total runtime: 1254.158 ms"

 

 

Here you see an index scan is used as opposed to the sequential scan
and, of course, it's much faster.

 

Basically, what I'm trying to figure out is:

1.      Where in the code to make the modification (jsp?  One of the
browse programs?)
2.      How to determine how to structure the "where" clause based on
what the user clicks on 

        a.      Jump to?
        b.      Enter the first few letters?
        c.      Next page?
        d.      Previous page?

 

If any of the developers (or anyone else who knows DSpace code inside
and out) are reading this and could help me out with this, I would
really appreciate it.

 

Thanks in advance.

Best,

Sue

 

 

Sue Walker-Thornton

ConITS Contract
NASA Langley Research Center
</></>Integrated Library Systems Application/Database Administrator

130 Research Drive

Hampton, VA  23666

Office: (757) 224-4074
Fax:    (757) 224-4001
Pager: (757) 988-2547 
Email:  [EMAIL PROTECTED]

 

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
  • [Dspace-tech] DSpa... Thornton, Susan M. (LARC-B702)[NCI INFORMATION SYSTEMS]

Reply via email to