hi...
in these sorts of situations, you have two sides of the coin to look at, both
of which you cover in your email: database design and hardware support....
> 1) I'm using two machines so far for my e-comerce solution. The website is
> usedcars.com, in case you want to go and check the performance, which
> probably sucks right now. :-) One machine is running the web server and java
> same boat again. Whereas if I come up with a distributed solution, I can
> just keep adding those $1000 Linux boxes. :-)
to find a very effective high-availability solution on linux that will allow
you to just add "those $1000 linux boxes" (well, $2000+ might be more realistic
=) check out the Linux Virtual Server project... it's used in production
environments all over the world, from large ISP web hosting sites to
manufacturing plant automation services...
http://www.linuxvirtualserver.org/
it allows you to have several boxes in the back-end that look like one big
machine to the outside world.. it can scale to 100+ machines =) redhat just
put out support for it with "piranha", a graphical tool for managing LVS,
though piranha is really primitive right now and you're still stuck with
the command line and editing routing tables by hand to do anything useful...
there are also rumblings that LVS might actually make it into the linux
kernel proper somewhere in 2.4-or-after-land...
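as a taste of what driving LVS from the command line looks like today, here's a
minimal sketch with ipvsadm, the LVS admin tool (the virtual IP 192.168.0.1 and
the real-server addresses are placeholder assumptions, and this needs an
LVS-patched kernel and root):

```shell
# minimal LVS-NAT sketch: 192.168.0.1 is the virtual IP the world sees,
# 10.0.0.2 and 10.0.0.3 are the real web servers behind it (all made up)

# define the virtual service on port 80, scheduling requests round-robin
ipvsadm -A -t 192.168.0.1:80 -s rr

# add the real servers, masqueraded (NAT) so replies route back through LVS
ipvsadm -a -t 192.168.0.1:80 -r 10.0.0.2:80 -m
ipvsadm -a -t 192.168.0.1:80 -r 10.0.0.3:80 -m

# list the current virtual server table to verify
ipvsadm -L -n
```

adding another "$1000 linux box" is then just one more `ipvsadm -a` line.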
the problem comes in deciding how to let a farm access your database...
you have four choices that i'm aware of:
1) a single server that serves the database (same bottlenecks as currently)
2) a server that exports the database's filesystem to the other servers. this
would require some cooperation between the various pgsql servers, which would
in turn require extensions to the server in a write environment... however, if
it's a read-only situation, this might work well..
3) use a special filesystem that can be shared across a network, transparently
to the applications....
some places to look:
Parallel Virtual File System: http://www.parl.clemson.edu/pvfs/
xFS: http://now.cs.berkeley.edu/Xfs/xfs.html
Global File System: http://www.globalfilesystem.org/
i haven't used any of these myself yet, so i can't comment on which one is
best... but they all take slightly different paths to a similar end... they
will allow several pgsql databases to access a set of files as if they were
local...
4) keep several copies of the database on several machines.. since it's mostly
read-only, this should be easy to maintain.. if you need to do writes, but they
are far fewer than reads, then set one (or more) machine aside specifically for
that task.. i.e. they just do writes... so you have the static content to be
read on several machines, and one or more machines storing new data from the
clients. updating the datasets by replicating files/filesystems can be done by
any number of means in unix...
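as one crude sketch of that replication-by-copying idea (the hostnames and
database name here are made up, and this assumes each read-only box can be
reloaded while it's out of the rotation):

```shell
# reload each read-only box from the write machine's current dataset
# (web1..web3 and carsdb are placeholder names)
# pg_dump -c emits DROP statements before the CREATEs, so the
# target database is cleaned out before the fresh data goes in
for host in web1 web2 web3; do
    pg_dump -c carsdb | psql -h "$host" carsdb
done
```

run it from cron overnight and the read-only farm stays a day behind the
write machine at worst.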
> 2) One table gets 99% of the queries, it has about 300K rows, by watching
> the log, it seems to get about 120 queries per minute on average, peaking
> at 480 queries per minute during peak hours. Not sure what a join is, do
> you mean union? If so, then yes, almost every query to this table is of
> the form
> select a.*, b.1, b.2 where a.oid = b.id and a.int4 < x and a.str =
> 'String' union select a.*, c.1, c.2 where a.oid = c.id and a.int4 < x and
> a.str = 'String' limit 100;
> I defined indexes on all the fields which appear in a WHERE clause, using
> either hash or btree, as recommended by the docs. The only operator I
> couldn't find was for a money type.
> I don't use order by, although I'd like to, because the docs said that
> indexes are not used on order by queries. I use limit also.
i was under the impression that indexes really don't help all that much in a
complex where clause like yours.. e.g. if the clause would need more than one
index, postgres doesn't use the indices available. i could be wrong (this could
be outdated information... i don't use the CVS version of the database)
secondly, i'd move some of your select logic outside of the database to
whatever access method you are using (servlets?)... i'd do it like this:
select oid, * from a where a.int4 < x and a.str = 'String' order by
<whatever> limit 100;
then, using a for loop or some similar construct, store the oids one by one in
a variable and do selects on b and c.. this way you can build an index on b.id
and c.id and get use out of them =)
or you might get better performance if you do a mass select on b and c with the
range of oids returned.. *shrug* that's an answer to be found by
experimentation.
i'd be willing to bet you see a good speed improvement right there,
however... i've broken out complex sql queries into the middleware several
times in the project i work on these days, with vast improvements in speed =)
this is due to the fact that simple database queries are fast and most
middleware is blindingly fast at messing around with the resulting data... use
each tool where it is best suited...
> I think writing a Distribution/Replication engine using Java servlets
> should be a medium size task. Making it fault-tolerant and dealing with
> crash and error recovery issues could make it larger. If I end up writing
> it from scratch, I'll post the source for others to use.
for a read-only type database, i don't think this is a HUGE job, and it can
also be accomplished with existing unix tools (if not tools specific to
pgsql)... a read/write environment, however... well, that's more complicated
and infinitely more usable...
--
Aaron J. Seigo
Sys Admin
************