Hi everyone. I'm new to this forum and was wondering if anyone would be kind enough to help me out with a pretty severe performance issue. I believe the problem to be rather generic, so I'll put it in generic terms. Since I'm at home and not at work (but this is really bugging me), I can't post any specifics. However, I think my explanation will suffice.
I have two tables that are getting large and will only get larger with time (exponentially, as more users sign on to the system). Right now, a table called 'shipment' contains about 16,000 rows and 'shipment_status' contains about 32,500 rows. These aren't massive tables (I keep reading about tables with millions of rows), but they will definitely get into six digits by next year, and query performance is quite poor.
Now, from what I understand about tuning, you want to specify good filters and provide good indexes on the driving filter as well as on any referential keys used while joining. This has helped me solve performance problems many times in the past (for example, taking a query from 2 seconds down to 21 milliseconds).
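To give a concrete idea, that approach usually ends up looking something like the following for me (using the table and column names from the example query further down; these exact indexes are just an illustration, not what's actually in my schema):

-- indexes on the join/referential keys
create index shipment_current_status_id_idx on shipment (current_status_id);
create index shipment_status_release_code_id_idx on shipment_status (release_code_id);
-- index on the driving filter column
create index release_code_filtered_column_idx on release_code (filtered_column);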
However, I am now tuning queries that operate on these two tables, where the filters aren't very good (the best has a filter ratio of 0.125) and the number of rows returned is very large (before taking limits into consideration).
For example, consider something like this
query that takes ~1 second to finish:
select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
  and ss.release_code_id = r.id
  and r.filtered_column = '5'
order by ss.date desc
limit 100;
Release code is just a very small table of 8 rows (judging by the production data), hence the 0.125 filter ratio. However, the data is not evenly distributed: the filtered column actually pulls out about 54% of the rows in shipment_status when it joins. Postgres seems to be doing a sequential scan to pull out all of these rows. Next, it joins approximately 17,550 rows to shipment. Since this query has a limit, it only returns the first 100, which seems like a waste.
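To spell out what I mean by the 54%: roughly that proportion of shipment_status rows survives a count along these lines (just a sketch of the kind of check I'm talking about, not literally what I ran, since the whole example is made up):

-- count the shipment_status rows whose release code matches the filter,
-- alongside the total row count, to get the effective selectivity
select count(*) as matching,
       (select count(*) from shipment_status) as total
from shipment_status ss
join release_code r on ss.release_code_id = r.id
where r.filtered_column = '5';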
Now, for this query, I know I could filter on the date instead to speed it up. For example, I could probably search for all the shipments in the last 3 days instead of limiting the result to 100. But since this isn't a real production query, I only wanted to show it as an example; many times I cannot filter by the date (and the sort may be by date or by something else irrelevant).
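Just to be clear about what I mean by filtering on the date instead, something along these lines (the 3-day cutoff is arbitrary, and I'm assuming ss.date is a date/timestamp column):

select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
  and ss.release_code_id = r.id
  and r.filtered_column = '5'
  -- restrict by date instead of relying on the limit alone
  and ss.date >= current_date - interval '3 days'
order by ss.date desc
limit 100;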
I'm just stressed out about how I can make queries like this more efficient, since all I see is a bunch of hash joins and sequential scans taking all kinds of time.
I guess here are my 2 questions:
1. Should I just beg to change the requirements so that I can make more specific queries and more screens to access them?
2. Can you recommend ways to make postgres act on big tables more efficiently? I'm not really interested in this specific case (I just made it up). I'm more interested in general solutions to the general problem of big tables with bad filters, where join orders don't seem to help much.
Thank you very much for your help.

Best Regards,
Ken Egervari