Hi Joel,

On 12 February 2010 17:04, Richard, Joel M <[email protected]> wrote:

> I'll put my two cents in here...
>
> I haven't dug deep into ways of making these queries work better than they
> already do, but my experience tells me that there has to be a faster way of
> doing these operations. Sequence scans (i.e., "Seq Scan on bi_2_dmap") are
> extremely costly on large tables and these are getting on to being large
> (186k rows.)


> And I don't mean to second-guess any of the DSpace developers, but I've
> worked on databases with hundreds of thousands (even millions) of records
> and never experienced anything this bad for regularly-used queries. I know
> we're in a heterogeneous environment and we don't know what kind of server
> Susan has or how it's tuned, but any query with a cost of 300,000 is
> insanely high. There's got to be a better way. :)
>

It's a bit of an exaggeration to say that this is a regularly used query. It
will only be used when:

a) Installing a new item
b) Withdrawing an item
c) Reinstating an item
d) Updating the metadata of an item that has at some point been installed

It's not called during the submission / workflow process. It's not called on
any general user access. In fact, the very reason for it's existence is to
remove the need to use a costly DISTINCT when displaying browse pages to
general users - which is a much more common operation.

Personally, I have never used the EXCEPT clause and so I am not familiar
> with it, but it does seem, Susan, that you found a faster query, and from a
> first glance it looks like it would achieve the same results. It's faster
> because you've found a query that does an "Index Scan" which is always tons
> faster than a "Sequence Scan". But you need to satisfy yourself as to
> whether or not the same rows are being deleted.
>

The way I'm reading it, that's a bit of a jump ahead. In Sue's query, yes
there is an index scan being used, but as part of a NOT filter subplan.
There is still a sequence scan on the bi_2_dis table, and so it's doing a
separate index scan for each row returned by bi_2_dis - that's a loop of
around 70000 index scans.

In the EXCEPT version, it is only doing 3 sequence scans - none of the
operations are looped. However, it is the sort that it uses to implement the
EXCEPT in this case that sucks the performance / scalability out of the
query.

If you look at the query in a Postgres 8.4 environment that has been given
sufficient work_mem, then it still only does 3 scans. But the results are
hash joined, not sorted - and the resulting execution is better than the
EXISTS query in 8.2 (with it's 70000 index scans).

Now take a look at the same EXISTS query running in 8.4 - it doesn't use
index scans at all. In this case, it's gone to using just two sequence
scans, and hash joining the results. Making it the most efficient execution
of all (but not by much as two of the sequence scans in the hash joined
EXCEPTS are on the same table and can benefit from caching). But
importantly, the EXISTS degrades more gracefully when the work_mem is
insufficient.

Regards,
G
------------------------------------------------------------------------------
SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
http://p.sf.net/sfu/solaris-dev2dev
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech
  • [Dspac... Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
    • R... Graham Triggs
    • R... Richard, Joel M
      • ... Graham Triggs
        • ... Simon Brown
          • ... Graham Triggs
            • ... Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]

Reply via email to