All fair questions...
Thank you for your detailed response!
On 1/4/2013 11:03 PM, Jeff Janes wrote:
On Friday, January 4, 2013, AJ Weber wrote:
Hi all,
I have a table that has about 73 million rows in it and growing.
How big is the table in MB? Its indexes?
Not sure on this. Will see if pgAdmin tells me.
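For reference, the built-in size functions will report this directly;
a quick check would be something like the following (the table name is
just a placeholder, not the real schema):

    -- table plus its indexes and TOAST data
    SELECT pg_size_pretty(pg_total_relation_size('order_detail'));
    -- heap only
    SELECT pg_size_pretty(pg_relation_size('order_detail'));
    -- all indexes on the table
    SELECT pg_size_pretty(pg_indexes_size('order_detail'));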
...
The server has 12GB RAM, 4 cores, but is shared with a big webapp
running in Tomcat -- and I only have a RAID1 disk to work on.
Woe is me...
By a RAID1 disk, do you mean two disks in a RAID1 configuration, or a
single RAID1 composed of an unspecified number of disks?
Often spending many thousands of dollars in DBA time can save you from
having to buy many hundreds of dollars in hard drives. :) On the
other hand, often you end up having to buy the extra disks anyway
after all.
I mean I have two disks in a RAID1 configuration. The server is
currently in a whitebox datacenter and I have zero control over the
hardware, so adding disks is unfortunately out of the question. I
completely understand the comment, and would love to have a larger SAN
available to me that I could configure... I just don't, and I have no
way of getting one anytime soon.
Anyway, this table is going to continue to grow, and it's used
frequently (Read and Write).
Are all rows in the table read and written with equal vigor, or are
there hot rows and cold rows that can be recognized based on the row's
values?
No, I could probably figure out a way to set up an "archive" or "older"
section of the data that is updated much less frequently. Deletes are
rare. Inserts/Updates "yes". Select on existing rows -- very frequent.
From what I read, this table is a candidate to be partitioned for
performance and scalability. I have tested some scripts to build
the "inherits" tables with their constraints and the
trigger/function to perform the work.
Am I doing the right thing by partitioning this?
Probably not. Or at least, you haven't given us the information to
know. Very broadly speaking, well-implemented partitioning makes bulk
loading and removal operations take less IO, but makes normal
operations take more IO, or if lucky leaves it unchanged. There are
exceptions, but unless you can identify a very specific reason to
think you might have one of those exceptions, then you probably don't.
I know you can't believe everything you read, but I thought I saw some
guidance that when a table's size exceeds some fraction of available
RAM, or when it approaches 100 million rows, it becomes a strong
candidate for partitioning.
Do you have a natural partitioning key? That is, is there a column
(or expression) which occurs as a selective component in the where
clause of almost all of your most IO-consuming SQL and DML? If so,
you might benefit from partitioning on it. (But in that case, you
might be able to get most of the benefits of partitioning, without the
headaches of it, just by revamping your indexes to include that
column/expression as their leading field).
If you don't have a good candidate partitioning key, then partitioning
will almost surely make things worse.
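If such a key does exist, the index change described above is mostly a
matter of putting that column first; a small sketch (the table, column,
and index names are assumptions, not taken from this thread):

    -- an index led by the candidate partitioning key; CONCURRENTLY
    -- avoids blocking writes while the index builds
    CREATE INDEX CONCURRENTLY order_detail_order_id_created_idx
        ON order_detail (order_id, created_at);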
The table is a "detail table" to its master records. That is, it's like
an order-details table, where 1-n detail rows join to the master
("order") table on the order-id. So I can partition it based on
the order number pretty easily (which is a bigint, btw).
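For concreteness, a minimal sketch of the inheritance-style setup being
discussed, keyed on that order-id (all table names, column names, and
range boundaries below are illustrative only):

    -- one child per order-id range, each with a CHECK constraint so
    -- constraint exclusion can skip irrelevant children at plan time
    CREATE TABLE order_detail_p00 (
        CHECK (order_id >= 0 AND order_id < 10000000)
    ) INHERITS (order_detail);

    CREATE TABLE order_detail_p01 (
        CHECK (order_id >= 10000000 AND order_id < 20000000)
    ) INHERITS (order_detail);

    -- route inserts against the parent to the right child
    CREATE OR REPLACE FUNCTION order_detail_insert() RETURNS trigger AS $$
    BEGIN
        IF NEW.order_id < 10000000 THEN
            INSERT INTO order_detail_p00 VALUES (NEW.*);
        ELSE
            INSERT INTO order_detail_p01 VALUES (NEW.*);
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER order_detail_insert_trg
        BEFORE INSERT ON order_detail
        FOR EACH ROW EXECUTE PROCEDURE order_detail_insert();

Note that with constraint_exclusion = partition (the default), the
planner only skips children when the query constrains order_id with
values it can compare against the CHECK constraints, which ties back to
the point above about the key appearing in the WHERE clause.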
If so, and I can afford some downtime, is dumping the table via
pg_dump and then loading it back in the best way to do this?
To do efficient bulk loading into a partitioned table, you need to
target each partition directly, rather than relying on the routing
trigger. That pretty much rules out pg_dump, AFAIK, unless you are
going to parse the dump file(s) and rewrite them.
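Concretely, the migration would go straight at the children instead of
through the parent's routing trigger; something along these lines
(names and ranges are again illustrative):

    -- move existing rows directly into a child, bypassing the trigger;
    -- ONLY restricts the scan to the parent's own rows
    INSERT INTO order_detail_p00
        SELECT * FROM ONLY order_detail
        WHERE order_id >= 0 AND order_id < 10000000;

    -- or COPY a per-range extract into each child from psql
    \copy order_detail_p01 from 'order_detail_p01.copy'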
Should I run a cluster or vacuum full after all is done?
Probably not. If a cluster after the partitioning would be
beneficial, there would be a pretty good chance you could do a cluster
*instead* of the partitioning and get the same benefit.
I did try clustering the table on the PK (which is actually 4 columns),
and it appeared to help a bit. I was hoping partitioning was going to
help me even more.
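For reference, that pass is roughly the following (the index name is a
guess at the PK index); note that CLUSTER takes an exclusive lock for
its duration, and the ordering decays as new rows arrive, so it would
need to be repeated periodically:

    CLUSTER order_detail USING order_detail_pkey;
    ANALYZE order_detail;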
If you do some massive deletes from the parent table as part of
populating the children, then a vacuum full of the parent could be
useful. But if you dump the parent table, truncate it, and reload it
as partitioned tables, then vacuum full would probably not be useful.
Really, you need to identify your most resource-intensive queries
before you can make any reasonable decisions.
Is there a major benefit if I can upgrade to 9.2.x in some way
that I haven't realized?
If you have specific queries that are misoptimized and so are
generating more IO than they need to, then upgrading could help. On
the other hand, it could also make things worse, if a currently
well-optimized query picks up a worse plan.
Is there some new feature or optimization you're thinking about with
this comment? If so, could you please just send me a link and/or
feature name and I'll google it myself?
But, instrumentation has improved in 9.2 from 9.0, so upgrading would
make it easier to figure out just which queries are really bad and
have the most opportunity for improvement. A little well informed
optimization might obviate the need for either partitioning or more
hard drives.
This is interesting too. I obviously would like the best available
options to tune the database and the application. Is this detailed in
the release notes somewhere, and what tools could I use to take
advantage of this? (Are there new/improved details included in the
EXPLAIN statement or something?)
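The pieces usually pointed at for 9.2 are pg_stat_statements (which
normalizes queries in 9.2), the new track_io_timing setting, and
EXPLAIN (ANALYZE, BUFFERS) for individual statements; a rough starting
point (the sample query is only illustrative):

    # postgresql.conf
    shared_preload_libraries = 'pg_stat_statements'
    track_io_timing = on            # new in 9.2

    -- in the database
    CREATE EXTENSION pg_stat_statements;

    -- which statements cost the most time and read the most blocks
    SELECT query, calls, total_time, shared_blks_read
      FROM pg_stat_statements
     ORDER BY total_time DESC
     LIMIT 10;

    -- then run EXPLAIN (ANALYZE, BUFFERS) on the worst offenders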
Finally, if anyone has any comments about my settings listed above
that might help improve performance, I thank you in advance.
Your default statistics target seemed low. Without knowing the nature
of your most resource intensive queries or how much memory tomcat is
using, it is hard to say more.
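Raising that is either a config change or a per-column tweak on the big
table; a sketch (the numbers and names here are only examples):

    # postgresql.conf (the default is 100)
    default_statistics_target = 250

    -- or only on the busiest columns of the large table
    ALTER TABLE order_detail ALTER COLUMN order_id SET STATISTICS 500;
    ANALYZE order_detail;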
Tomcat uses 4GB of RAM, plus we have nginx in front using a little,
and some other smaller services running on the server in addition to
the usual gamut of Linux processes.
Cheers,
Jeff