Simon Riggs wrote:

Sean Shanny
Hardware: Apple G5 dual 2.0 with 8GB memory attached via dual fibre
channel to a fully loaded 3.5TB XRaid. The XRaid is configured as two


7


disk hardware based RAID5 sets software striped to form a RAID50 set.
The DB, WALS, etc are all on that file set. Running OSX journaled


file


system Running postgres 7.4.1. OSX Server 10.3.2 Postgres is


compiled


locally with '--enable-recode' '--enable-multibyte=UNICODE'
'CFLAGS=-mcpu=970 -mtune=970 -mpowerpc64 -O3'



Have you tried altering the blocksize to a higher value? Say 32K?


That is on our to do list. We had made that change while running on BSD 5.1on a Dell 2650 with 4GB and 5 10K SCSI drive in RAID 0. Did not see a huge improvement.



max_connections = 100



Why have you set this to 100 when you have typically 1-3 users?


Have already addressed that by lowering this to 50. Will drop it lower as time goes on.



sort_mem = 256000 # min 64, size in KB



If you have only 1-3 users, then that value seems reasonable.




The query is

SELECT t1.id, t2.md5, t2.url FROM referral_temp t2 LEFT OUTER JOIN
d_referral t1 ON t2.md5 = t1.referral_md5;


\d d_referral id | integer | not null referral_md5 | text | not null referral_raw_url | text | not null referral_host | text | referral_path | text | referral_query | text | job_control_number | integer | not null


\d referral_temp
md5 | text |
url | text |



Have you looked at using reversed indexes, as per recent postings in
[performance]? These seemed to help considerably with lookup speed when
using a large URL database, which seems to be your situation here.


We create an MD5 of the URL and store it as referral_md5. This is our key for lookup. We ran into problems with the URL as the index. The postgres indexing code was complaining about the URL being too long, hence the MD5 which thought longer to compute during the ETL phase is much quicker to match on.

...


Jeff Boes writes
We have a large (several million row) table with a field containing
URLs. Now, funny thing about URLs: they mostly start with a common
substring ("http://www.";). But not all the rows start with this, so we
can't just lop off the first N characters. However, we noticed some


time


ago that an index on this field wasn't as effective as an index on the
REVERSE of the field. So ...

CREATE OR REPLACE FUNCTION fn_urlrev(text) returns text as '
return reverse(lc($_[0]))
' language 'plperl' with (iscachable,isstrict);

and then

CREATE UNIQUE INDEX ix_links_3 ON links
(fn_urlrev(path_base));



You have 2 CPUs: have you tried splitting your input data file into two
tables, then executing the same query simultaneously, to split the
processing? If you get the correct plan, you should use roughly the same
I/O but use all of the available CPU power.


Have not considered that.

I'm sure we'd all be interested in your further results!


I will post things as I discover them.

--sean

Best Regards, Simon Riggs





---------------------------(end of broadcast)--------------------------- TIP 8: explain analyze is your friend

Reply via email to