Re: [PERFORM] RAID Stripe size

2005-09-19 Thread John A Meinel
bm\mbn wrote:
> Hi Everyone
>
> The machine is IBM x345 with ServeRAID 6i 128mb cache and 6 SCSI 15k
> disks.
>
> 2 disks are in RAID1 and hold the OS, SWAP & pg_xlog
> 4 disks are in RAID10 and hold the Cluster itself.
>
> the DB will have two major tables 1 with 10 million rows and one with
> 100 million rows.
> All the activities against this tables will be SELECT.

What type of SELECTs will you be doing? Mostly sequential reads of a
bunch of data, or indexed lookups of random pieces?

>
> Currently the strip size is 8k. I read in many place this is a poor
> setting.

From what I've heard of RAID, if you are doing large sequential
transfers, larger stripe sizes (128k, 256k) generally perform better.
For postgres, though, when you are writing, having the stripe size be
around the same size as your page size (8k) could be advantageous, as
when postgres reads a page, it only reads a single stripe. So if it were
reading a series of pages, each one would come from a different disk.

I may be wrong about that, though.

John
=:->

>
> Am i right ?





Re: [PERFORM] How many tables is too many tables?

2005-09-19 Thread John A Meinel
[EMAIL PROTECTED] wrote:
> I have a database of hundreds of millions of web links (between sites)
> in Postgres.  For each link, we record the url, the referer, and the
> most recent date the link exists.  I'm having some serious performance
> issues when it comes to writing new data into the database.
>
> One machine is simply not going to be able to scale with the quantities
> of links we hope to store information about and we want to move to some
> kind of cluster.  Because of the quantities of data, it seems to make
> sense to go for a cluster setup such that in a 4 machine cluster, each
> machine has a quarter of the data (is this "Share nothing," or, "Share
> everything"?).  To that end, we figured a good first step was to
> partition the data on one machine into multiple tables defining the
> logic which would find the appropriate table given a piece of data.
> Then, we assumed, adding the logic to find the appropriate machine and
> database in our cluster would only be an incremental upgrade.

In a database app, you generally don't win by going to a cluster,
because you are almost always bound by your I/O. Which means that a
single machine, just with more disks, is going to outperform a group of
machines.

As Tom mentioned, your schema is not very good. So lets discuss what a
better schema would be, and also how you might be able to get decent
performance with a cluster.

First, 200rows * 400,000 tables = 80M rows. Postgres can handle this in
a single table without too much difficulty. It all depends on the
selectivity of your indexes, etc.

I'm not sure how you are trying to normalize your data, but it sounds
like a url table, so that each entry can be referenced by a simple
integer rather than the full path, would help, considering that you are
likely to have a bunch of repeated information.

This makes your main table something like 2 integers, plus the
interesting stuff (from url, to url, data).
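
To make that concrete, a rough sketch of the kind of layout I mean
(table and column names are just for illustration):

-- one row per distinct URL, referenced everywhere else by integer
CREATE TABLE urls (
    url_id  SERIAL PRIMARY KEY,
    url     TEXT NOT NULL UNIQUE
);

-- the main link table: two integers plus the interesting stuff
CREATE TABLE links (
    from_url   INTEGER NOT NULL REFERENCES urls (url_id),
    to_url     INTEGER NOT NULL REFERENCES urls (url_id),
    last_seen  DATE NOT NULL,
    PRIMARY KEY (from_url, to_url)
);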

If you are finding you are running into I/O problems, you probably could
use this layout to move your indexes off onto their own spindles, and
maybe separate the main table from the url tables.

What is your hardware? What are you trying to do that you don't think
will scale?

If you were SELECT bound, then maybe a cluster would help you, because
you could off-load the SELECTs onto slave machines, and leave your
primary machine available for INSERTs and replication.

>
...

>
> At this point, the primary performance bottleneck is in adding
> additional data to the database.  Our loader program (we load text
> files of link information) is currently getting about 40 rows a second,
> which is nowhere near the performance we need to be seeing.  In theory,
> we want to be able to re-write our entire archive of data within on a
> 1-2 month cycle, so this is a very heavy write application (though
> we're also constantly generating reports from the data, so its not
> write only).

Are you VACUUMing enough? If you are rewriting all of the data, postgres
needs you to clean up afterwards. It is pessimistic, and leaves old rows
in their place.
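
For example, after each big load/rewrite pass you would want something
along the lines of (table name is just illustrative):

-- reclaim dead row versions and refresh planner statistics
VACUUM ANALYZE links;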

>
> Is the total number of tables prohibitively affecting our write speed
> or is that an IO problem that can only be addressed by better drive
> partitioning (all data is on one drive, which I've already read is a
> problem)?  Is this approach to data partitioning one which makes any
> sense for performance, or should we move to a more normal distribution
> of links across fewer tables which house more rows each?

If all data is on a single drive, you are nowhere near needing a cluster
to improve your database. What you need is a 14-drive RAID array. It's
probably cheaper than 4x powerful machines, and will provide you with
much better performance. And put all of your tables back into one.

John
=:->

>
> Thanks in advance for your advice.
>
> -matt
>
>
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>choose an index scan if your joining column's datatypes do not
>match
>





Re: [PERFORM] Battery Backed Cache for RAID

2005-09-14 Thread John A Meinel
Alvaro Herrera wrote:
> On Wed, Sep 14, 2005 at 11:28:43AM -0700, Jeffrey W. Baker wrote:
>
>>On Wed, 2005-09-14 at 11:25 -0700, Peter Darley wrote:
>>
>>> I'm getting a new server for our database, and I have a quick question
>>>about RAID controllers with a battery backed cache.  I understand that the
>>>cache will allow the cache to be written out if the power fails to the box,
>>>which allows it to report a write as committed safely when it's not actually
>>>committed.
>>
>>Actually the cache will just hold its contents while the power is out.
>>When the power is restored, the RAID controller will complete the writes
>>to disk.  If the battery does not last through the outage, the data is
>>lost.
>
>
> Just curious: how long are the batteries supposed to last?
>

The recent *cheap* version of a ramdisk had battery backup for 16 hours.
(Very expensive ramdisks actually have enough battery power to power a
small hard-drive to dump the contents into).

I'm guessing for a RAID controller, the time would be in the max 1 day
range.

John
=:->





Re: [PERFORM] Advise about how to delete entries

2005-09-11 Thread John A Meinel
Kevin wrote:
> Arnau wrote:
>
>> Hi all,
>>
>> >
>> > COPY FROM a file with all the ID's to delete, into a temporary
>> table, and  do a joined delete to your main table (thus, only one query).
>>
>>
>>   I already did this, but I don't have idea about how to do this join,
>> could you give me a hint ;-) ?
>>
>> Thank you very much
>
>
> maybe something like this:
>
> DELETE FROM statistics_sasme s
> LEFT JOIN temp_table t ON (s.statistic_id = t.statistic_id)
> WHERE t.statistic_id IS NOT NULL
>

Why can't you do:
DELETE FROM statistics_sasme s JOIN temp_table t ON (s.statistic_id =
t.statistic_id);

Or possibly:

DELETE FROM statistics_sasme s
  WHERE s.statistic_id IN (SELECT t.statistic_id FROM temp_table t);

I'm not sure how delete exactly works with joins, but the IN form should
be approximately correct.
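
For reference, here is a sketch of the two forms I would actually try.
The joined delete uses DELETE ... USING, which IIRC is only available in
8.1 and later; both assume statistic_id exists in both tables:

-- joined delete (8.1+): remove rows that have a match in the temp table
DELETE FROM statistics_sasme
 USING temp_table t
 WHERE statistics_sasme.statistic_id = t.statistic_id;

-- portable form using IN
DELETE FROM statistics_sasme
 WHERE statistic_id IN (SELECT statistic_id FROM temp_table);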

John
=:->





Re: [PERFORM] Prepared statement not using index

2005-09-11 Thread John A Meinel
Guido Neitzer wrote:
> Hi.
>
> I have an interesting problem with the JDBC drivers. When I use a
> select like this:
>
> "SELECT t0.aktiv, t0.id, t0.ist_teilnehmer, t0.nachname, t0.plz,
> t0.vorname FROM public.dga_dienstleister t0 WHERE t0.plz  like
> ?::varchar(256) ESCAPE '|'" withBindings: <1:"53111"(plz)>
>
> the existing index on the plz column is not used.
>
> When I run the same select with a concrete value, the index IS used.
>
> I use PostgreSQL 8.0.3 on Mac OS X and the JDBC driver 8.0-312 JDBC 3.
>
> After a lot of other things, I tried using a 7.4 driver and with  this,
> the index is used in both cases.
>
> Why can this happen? Is there a setting I might have not seen?
> Something I do wrong?
>
> cug

I've had this problem in the past. In my case, the issue was that the
column I was searching had a mixed blend of possible values. For
example, with 1M rows, the number 3 occurred 100 times, but the number
18 occurred 700,000 times.

So when I manually did a search for 3, it naturally realized that it
could use an index scan, because it had the statistics to say it was
very selective. If I manually did a search for 18, it switched to
sequential scan, because it was not very selective (both are the correct
plans).

But if you create a prepared statement, parameterized on this number,
postgres has no way of knowing ahead of time whether you will be asking
about 3 or 18, so when the query is prepared it has to be pessimistic
and avoid worst-case behavior, so it chooses to always use a sequential scan.

The only way I got around this was with writing a plpgsql function which
used the EXECUTE syntax to dynamically re-plan part of the query.
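
Roughly, the approach looks something like this (using the table from
your query; treat it as an untested sketch):

CREATE OR REPLACE FUNCTION dienstleister_by_plz(p_plz varchar)
RETURNS SETOF public.dga_dienstleister AS $$
DECLARE
    r public.dga_dienstleister%ROWTYPE;
BEGIN
    -- EXECUTE plans the query at call time, so the planner sees the
    -- actual value and can choose the index when it is selective
    FOR r IN EXECUTE 'SELECT * FROM public.dga_dienstleister WHERE plz LIKE '
                     || quote_literal(p_plz)
    LOOP
        RETURN NEXT r;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;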

Hope this makes sense. This may or may not be your problem, without
knowing more about your setup. But the symptoms seem similar.

John
=:->




Re: [PERFORM] shared buffers

2005-09-11 Thread John A Meinel
Martin Nickel wrote:
> Chris,
> Would you say that 30,000 pages is a good maximum for a Postgres install?
> We're running 8.0.3 on 64-bit SUSE on a dual Opteron box with 4G and have
> shared_buffers set at 120,000.  I've moved it up and down (it was 160,000
> when I got here) without any measurable performance difference.

From what I've read on the mailing list, the sweet spot is usually
around 10k pages. 120k seems far too high.

I believe that the major fixes to the buffer manager are in 8.1 rather
than 8.0, so you probably are hitting some problems. (The biggest
problem was that there were places that required doing a complete scan
through shared memory looking for dirty pages, or some such.)

>
> The reason I ask is because I occasionally see large-ish queries take
> forever (like cancel-after-12-hours forever) and wondered if this could
> result from shared_buffers being too large.

There are lots of possibilities for why these take so long; perhaps you
could post them, and we can try to help.
For instance, if you have a foreign key reference from one table to
another and don't have indexes on both sides, then deleting from the
referenced table will cause a sequential scan on the referring table
for *each* deleted row (IIRC).
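
In other words, make sure the referencing column has its own index,
something like (names made up for illustration):

-- without this, every delete on the referenced (parent) table does a
-- sequential scan of the referring (child) table for the FK check
CREATE INDEX child_table_parent_id_idx ON child_table (parent_id);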

John
=:->

>
> Thanks for your (and anyone else's) help!
> Martin Nickel




Re: [PERFORM] Caching by Postgres

2005-08-23 Thread John A Meinel
gokulnathbabu manoharan wrote:
> Hi all,
>
> I like to know the caching policies of Postgresql.
> What parameter in the postgresql.conf affects the
> cache size used by the Postgresql?  As far as I have
> searched my knowledge of the parameters are

In general, you don't. The OS handles caching based on file usage.
So if you are using the files, the OS should cache them. Just like it
does with any other program.

>
> 1. shared_buffers - Sets the limit on the amount of
> shared memory used.  If I take this is as the cache
> size then my performance should increase with the
> increase in the size of shared_buffers.  But it seems
> it is not the case and my performance actually
> decreases with the increase in the shared_buffers.  I
> have a RAM size of 32 GB.  The table which I use more
> frequently has around 68 million rows.  Can I cache
> this entire table in RAM?

There is a portion of this which is used for caching. But I believe
before 8.1 there was code that went linearly through all of the
shared_buffers and checked for dirty/clean pages. So there was a
tradeoff that the bigger you make it, the longer that search goes. So
you got diminishing returns, generally around 10k shared buffers.
I think it is better in 8.1, but if the OS is going to cache it anyway
(since it does), then having a large Postgres cache is just wasting
memory, and not letting the OS cache as much.

So I'm guessing that with 8.1 there would be 2 sweet spots. Low
shared_buffers (<= 10k), and really high shared buffers (like all of
available ram).
But because postgres has been tuned for the former I would stick with it
(I don't think shared_buffers can go >2GB, but that might just be
work_mem/maintenance_work_mem).

>
> 2. work_mem - It is the amount of memory used by an
> operation.  My guess is once the operation is complete
> this is freed and hence has nothing to do with the
> caching.
>
> 3. effective_cache_size - The parameter used by the
> query planner and has nothing to do with the actual
> caching.

This is important from a planner issue. Because the planner can then
expect that the OS is doing its job and caching the tables, so index
scans are cheaper than they would be otherwise.

John
=:->

>
> So kindly help me in pointing me to the correct
> parameter to set.
>
> It will be great if you can point me to the docs that
> explains the implementation of caching in Postgresql
> which will help me in understanding things much
> clearly.
>
> Thanks in advance.
> Gokul.
>





Re: [PERFORM] extremly low memory usage

2005-08-23 Thread John A Meinel
Jeremiah Jahn wrote:
> On Sun, 2005-08-21 at 16:13 -0400, Ron wrote:
>
>>At 10:54 AM 8/21/2005, Jeremiah Jahn wrote:
>>

...

>>So you have 2 controllers each with 2 external slots?  But you are
>>currently only using 1 controller and only one external slot on that
>>controller?
>
>
> Sorry, no. I have one dual channel card in the system and two
> controllers on the array. Dell PowerVault 220S w/ PERC4eDC-PCI Express
>
>

...

>>Regardless of the above, each of these controllers should still be
>>good for about 80-85% of 640MB/s, or ~510-540 MB/s apiece when doing
>>raw sequential IO if you plug 3-4 fast enough HD's into each SCSI
>>channel.  Cheetah 15K.4's certainly are fast enough.  Optimal setup
>>is probably to split each RAID 1 pair so that one HD is on each of
>>the SCSI channels, and then RAID 0 those pairs.  That will also
>>protect you from losing the entire disk subsystem if one of the SCSI
>>channels dies.
>
> I like this idea, but how exactly does one bond the two channels
> together? Won't this cause me to have both an /dev/sdb and an /dev/sdc?
>

Well, even if you did, you could always use software RAID or LVM to
turn it into a single volume.

It also depends what the controller card bios would let you get away
with. Some cards would let you setup 4 RAID1's (one drive from each
channel), and then create a RAID0 of those pairs. Software raid should
do this without any problem. And can even be done such that it can be
grown in the future, as well as work across multiple cards (though the
latter is supported by some cards as well).

>
>
>>That 128MB of buffer cache may very well be too small to keep the IO
>>rate up, and/or there may be a more subtle problem with the LSI card,
>>and/or you may have a configuration problem, but _something(s)_ need
>>fixing since you are only getting raw sequential IO of ~100-150MB/s
>>when it should be above 500MB/s.
>
>
> It looks like there's a way to add more memory to it.

This memory probably helps more in writing than reading. If you are
reading the same area over and over, it might end up being a little bit
of extra cache for that (but it should already be cached in system RAM,
so you don't really get anything).

...

>>Initial reads are only going to be as fast as your HD subsystem, so
>>there's a reason for making the HD subsystem faster even if all you
>>care about is reads.  In addition, I'll repeat my previous advice
>>that upgrading to 16GB of RAM would be well worth it for you.
>
>
> 12GB is my max. I may run with it for a while and see.

If your working set truly is 10GB, then you can get a massive
performance increase even at 12GB. If your working set is 10GB and you
have 6GB of RAM, it probably is always swapping out what it just read
for the new stuff, even though you will read that same thing again in a
few seconds. So rather than just paying for the 4GB that can't be
cached, you pay for the whole 10.

John
=:->

>
>
>>Hope this helps,
>>Ron Peacetree
>>
>>
>>
>>---(end of broadcast)---
>>TIP 9: In versions below 8.0, the planner will ignore your desire to
>>   choose an index scan if your joining column's datatypes do not
>>   match





Re: [PERFORM] extremly low memory usage

2005-08-21 Thread John A Meinel

Jeremiah Jahn wrote:

On Sat, 2005-08-20 at 21:32 -0500, John A Meinel wrote:


Ron wrote:


At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:


Well, since you can get a read of the RAID at 150MB/s, that means that 
it is actual I/O speed. It may not be cached in RAM. Perhaps you could 
try the same test, only using say 1G, which should be cached.



[EMAIL PROTECTED] pgsql]# time dd if=/dev/zero of=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m8.885s
user    0m0.299s
sys     0m6.998s
[EMAIL PROTECTED] pgsql]# time dd of=/dev/null if=testfile bs=1024 count=1000000
1000000+0 records in
1000000+0 records out

real    0m1.654s
user    0m0.232s
sys     0m1.415s



The write time seems about the same (but you only have 128MB of write 
cache), but your read jumped up to 620MB/s. So your drives do seem to be 
giving you 150MB/s.






...

I'm actually curious about PCI bus saturation at this point. Old 32-bit 
33MHz PCI could only push about 1Gbit/s, i.e. on the order of 100-130MB/s. 
Now, I'm guessing that this is a higher performance system. But I'm really 
surprised that your write speed is that close to your read speed. 
(100MB/s write, 150MB/s read).



The raid array I have is currently set up to use a single channel. But I
have dual controllers In the array. And dual external slots on the card.
The machine is brand new and has pci-e backplane.  





Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them 
doing raw sequential IO like this should be capable of at
~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's,  ~7*79MB/s= 553MB/s 


BTW I'm using Seagate Cheetah 15K.4's



Now, are the numbers that Ron is quoting in megabytes or megabits? I'm 
guessing he knows what he is talking about, and is doing megabytes. 
80MB/s sustained seems rather high for a hard-disk.


Though this page:
http://www.storagereview.com/articles/200411/20041116ST3146754LW_2.html

Does seem to agree with that statement. (Between 56 and 93MB/s)

And since U320 is a 320MB/s bus, it doesn't seem like anything there 
should be saturating. So why the low performance?




_IF_ the controller setup is high powered enough to keep that kind of IO 
rate up.  This will require a controller or controllers providing dual 
channel U320 bandwidth externally and quad channel U320 bandwidth 
internally.  IOW, it needs a controller or controllers talking 64b 
133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized 
IO buffer as well.


AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic 
MegaRAID controllers.  What I don't know is which exact one yours is, 
nor do I know if it (or any of the MegaRAID controllers) are high 
powered enough.



PERC4eDC-PCI Express, 128MB Cache, 2-External Channels


Do you know which card it is? Does it look like this one:
http://www.lsilogic.com/products/megaraid/megaraid_320_2e.html

Judging by the 320 speed, and 2 external controllers, that is my guess.
They at least claim a theoretical max of 2GB/s.

Which makes you wonder why reading from RAM is only able to get 
throughput of 600MB/s. Did you run it multiple times? On my windows 
system, I get just under 550MB/s for what should be cached, copying from 
/dev/zero to /dev/null I get 2.4GB/s (though that might be a no-op).


On a similar linux machine, I'm able to get 1200MB/s for a cached file. 
(And 3GB/s for a zero=>null copy).


John
=:->




Talk to your HW supplier to make sure you have controllers adequate to 
your HD's.


...and yes, your average access time will be in the 5.5ms - 6ms range 
when doing a physical seek.
Even with RAID, you want to minimize seeks and maximize sequential IO 
when accessing them.

Best to not go to HD at all ;-)


Well, certainly, if you can get more into RAM, you're always better off. 
For writing, a battery-backed write cache, and for reading lots of 
system RAM.



I'm not really worried about the writing, it's the reading
that needs to be faster. 




Hope this helps,
Ron Peacetree



John
=:->






Re: [PERFORM] extremly low memory usage

2005-08-20 Thread John A Meinel

Ron wrote:

At 02:53 PM 8/20/2005, Jeremiah Jahn wrote:


On Fri, 2005-08-19 at 16:03 -0500, John A Meinel wrote:
> Jeremiah Jahn wrote:
> > On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
> >

>
> > it's cached alright. I'm getting a read rate of about 150MB/sec. I 
would
> > have thought is would be faster with my raid setup. I think I'm 
going to
> > scrap the whole thing and get rid of LVM. I'll just do a straight 
ext3

> > system. Maybe that will help. Still trying to get suggestions for a
> > stripe size.
> >


Well, since you can get a read of the RAID at 150MB/s, that means that 
it is actual I/O speed. It may not be cached in RAM. Perhaps you could 
try the same test, only using say 1G, which should be cached.



>
> I don't think 150MB/s is out of the realm for a 14 drive array.
> How fast is time dd if=/dev/zero of=testfile bs=8192 count=1000000
>
time dd if=/dev/zero of=testfile bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    1m24.248s
user    0m0.381s
sys     0m33.028s


> (That should create an 8GB file, which is too big to cache everything)
> And then how fast is:
> time dd if=testfile of=/dev/null bs=8192 count=1000000

time dd if=testfile of=/dev/null bs=8192 count=1000000
1000000+0 records in
1000000+0 records out

real    0m54.139s
user    0m0.326s
sys     0m8.916s


and on a second run:

real    0m55.667s
user    0m0.341s
sys     0m9.013s


>
> That should give you a semi-decent way of measuring how fast the RAID
> system is, since it should be too big to cache in ram.

about 150MB/sec. Is there any way to make this go faster...?


I'm actually curious about PCI bus saturation at this point. Old 32-bit 
33MHz PCI could only push about 1Gbit/s, i.e. on the order of 100-130MB/s. 
Now, I'm guessing that this is a higher performance system. But I'm really 
surprised that your write speed is that close to your read speed. 
(100MB/s write, 150MB/s read).




Assuming these are U320 15Krpm 147GB HDs, a RAID 10 array of 14 of them 
doing raw sequential IO like this should be capable of at
 ~7*75MB/s= 525MB/s using Seagate Cheetah 15K.4's,  ~7*79MB/s= 553MB/s 
if using Fujitsu MAU's, and ~7*86MB/s= 602MB/s if using Maxtor Atlas 15K 
II's to devices external to the RAID array.


I know I thought these were SATA drives, over 2 controllers. I could be 
completely wrong, though.




_IF_ the controller setup is high powered enough to keep that kind of IO 
rate up.  This will require a controller or controllers providing dual 
channel U320 bandwidth externally and quad channel U320 bandwidth 
internally.  IOW, it needs a controller or controllers talking 64b 
133MHz PCI-X, reasonably fast DSP/CPU units, and probably a decent sized 
IO buffer as well.


AFAICT, the Dell PERC4 controllers use various flavors of the LSI Logic 
MegaRAID controllers.  What I don't know is which exact one yours is, 
nor do I know if it (or any of the MegaRAID controllers) are high 
powered enough.


Talk to your HW supplier to make sure you have controllers adequate to 
your HD's.


...and yes, your average access time will be in the 5.5ms - 6ms range 
when doing a physical seek.
Even with RAID, you want to minimize seeks and maximize sequential IO 
when accessing them.

Best to not go to HD at all ;-)


Well, certainly, if you can get more into RAM, you're always better off. 
For writing, a battery-backed write cache, and for reading lots of 
system RAM.




Hope this helps,
Ron Peacetree



John
=:->




Re: [PERFORM] extremly low memory usage

2005-08-20 Thread John A Meinel

Jeremiah Jahn wrote:

I'm just watching gnome-system-monitor. Which, after careful
consideration...and looking at dstat...means I'm on CRACK. GSM isn't
showing cached memory usage...I assume that the cache memory usage is
where data off of the disks would be cached...?



Well a simple "free" also tells you how much has been cached.
I believe by reading the _cach line, it looks like you have 4.6G cached. 
So you are indeed using memory.


I'm still concerned why it seems to be taking 3-4ms per index lookup, 
when things should already be cached in RAM.
Now, I may be wrong about whether the indexes are cached, but I sure 
would expect them to be.
What is the time for a cached query on your system (with normal nested 
loops)? (give the EXPLAIN ANALYZE for the *second* run, or maybe the 
fourth).


I'm glad that we aren't seeing something weird with your kernel, at least.

John
=:->





memory output from dstat is this for  a few seconds:

---procs--- --memory-usage- ---paging-- --disk/sdadisk/sdb- swap--- total-cpu-usage
run blk new|_used _buff _cach _free|__in_ _out_|_read write:_read write|_used _free|usr sys idl wai hiq siq
  0   0   0|1336M   10M 4603M   17M| 490B  833B|3823B 3503k:1607k 4285k| 160k 2048M|  4   1  89   7   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   0 0 :   0   464k| 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   0 0 :   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   048k:   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   0 0 :   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   0   132k:   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   036k:   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   0 0 :   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   012k:   0 0 | 160k 2048M| 25   0  75   0   0   0
  1   0   0|1337M   10M 4600M   18M|   0 0 |   0 0 :   0 0 | 160k 2048M| 25   0  75   0   0   0
  2   0   0|1353M   10M 4585M   18M|   0 0 |   0 0 :   0 0 | 160k 2048M| 25   1  75   0   0   0
  1   0   0|1321M   10M 4616M   19M|   0 0 |   0 0 :   0 0 | 160k 2048M| 18   8  74   0   0   0
  1   0   0|1326M   10M 4614M   17M|   0 0 |   0 0 :4096B0 | 160k 2048M| 16  10  74   1   0   0
  1   0   0|1330M   10M 4609M   17M|   0 0 |   012k:4096B0 | 160k 2048M| 17   9  74   0   0   0
  0   1   0|1343M   10M 4596M   17M|   0 0 |   0 0 :   0   316M| 160k 2048M|  5  10  74  11   0   1
  0   1   0|1339M   10M 4596M   21M|   0 0 |   0 0 :   0 0 | 160k 2048M|  0   0  74  25   0   1
  0   2   0|1334M   10M 4596M   25M|   0 0 |   0  4096B:   0 0 | 160k 2048M|  0   0  54  44   0   1
  1   0   0|1326M   10M 4596M   34M|   0 0 |   0 0 :   0   364k| 160k 2048M|  4   1  60  34   0   1
  1   0   0|1290M   10M 4596M   70M|   0 0 |   012k:   0 0 | 160k 2048M| 24   1  75   0   0   0
  1   0   0|1301M   10M 4596M   59M|   0 0 |   020k:   0 0 | 160k 2048M| 21   4  75   0   0   0
  1   0   0|1312M   10M 4596M   48M|   0 0 |   0 0 :   0 0 | 160k 2048M| 22   4  75   0   0   0
  1   0   0|1323M   10M 4596M   37M|   0 0 |   0 0 :   024k| 160k 2048M| 21   4  75   0   0   0
  1   0   0|1334M   10M 4596M   25M|   0 0 |   0 0 :   056k| 160k 2048M| 21   4  75   0   0   0



On Fri, 2005-08-19 at 16:07 -0500, John A Meinel wrote:


Jeremiah Jahn wrote:


Rebuild in progress with just ext3 on the raid array...will see if this
helps the access times. If it doesn't I'll mess with the stripe size. I
have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index >
/dev/null'; none of this seems to have helped, or even increased my
memory usage. argh! The only thing about this new system that I'm
unfamiliar with is the array setup and LVM, which is why I think that's
where the issue is. clustering and indexing as well as vacuum etc are
things that I do and have been aware of for sometime. Perhaps slony is a
factor, but I really don't see it causing problems on index read speed
esp. when it's not running.

thanx for your help, I really appreciate it.
-jj-



By the way, how are you measuring memory usage? Can you give the output
of that command, just to make sure you are reading it correctly.

John
=:->







Re: [PERFORM] extremly low memory usage

2005-08-19 Thread John A Meinel
Jeremiah Jahn wrote:
> Rebuild in progress with just ext3 on the raid array...will see if this
> helps the access times. If it doesn't I'll mess with the stripe size. I
> have REINDEXED, CLUSTERED, tablespaced and cached with 'cat table/index >
> /dev/null'; none of this seems to have helped, or even increased my
>
> memory usage. argh! The only thing about this new system that I'm
> unfamiliar with is the array setup and LVM, which is why I think that's
> where the issue is. clustering and indexing as well as vacuum etc are
> things that I do and have been aware of for sometime. Perhaps slony is a
> factor, but I really don't see it causing problems on index read speed
> esp. when it's not running.
>
> thanx for your help, I really appreciate it.
> -jj-
>

By the way, how are you measuring memory usage? Can you give the output
of that command, just to make sure you are reading it correctly.

John
=:->





Re: [PERFORM] extremly low memory usage

2005-08-19 Thread John A Meinel
Jeremiah Jahn wrote:
> On Fri, 2005-08-19 at 12:18 -0500, John A Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>


...

>>
>>Well, in general, 3ms for a single lookup seems really long. Maybe your
>>index is bloated by not vacuuming often enough. Do you tend to get a lot
>>of updates to litigant_details?
>
> I have vacuumed this already. I get lots of updates, but this data is
> mostly unchanging.
>
>
>>There are a couple possibilities at this point. First, you can REINDEX
>>the appropriate index, and see if that helps. However, if this is a test
>>box, it sounds like you just did a dump and reload, which wouldn't have
>>bloat in an index.
>
>
> I loaded it using slony

I don't know that slony versus pg_dump/pg_restore really matters. The
big thing is that Updates wouldn't be trashing your index.
But if you are saying that you cluster once/wk your index can't be that
messed up anyway. (Unless CLUSTER messes up the non-clustered indexes,
but that would make cluster much less useful, so I would have guessed
this was not the case)

>
>
>>Another possibility. Is this the column that you usually use when
>>pulling information out of litigant_details? If so, you can CLUSTER
>>litigant_details on the appropriate index. This will help things be
>>close together that should be, which decreases the index lookup costs.
>
> clustering on this right now. Most of the other things are already
> clustered. name and case_data

Just as a reality check, they are clustered on the columns in question,
right? (I don't know if this column is a primary key or not, but any
index can be used for clustering).

>
>
>>However, if this is not the common column, then you probably will slow
>>down whatever other accesses you may have on this table.
>>
>>After CLUSTER, the current data will stay clustered, but new data will
>>not, so you have to continually CLUSTER, the same way that you might
>>VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
>>expensive as a VACUUM FULL. Be aware of this, but it might vastly
>>improve your performance, so it would be worth it.
>
> I generally re-cluster once a week.
>
>>>
>>>>So there is no need for preloading your indexes on the identity table.
>>>>It is definitely not the bottleneck.
>>>>
>>>>So a few design bits, which may help your database.
>>>>Why is "actor_id" a text field instead of a number?
>>>
>>>This is simply due to the nature of the data.
>>>
>>
>>I'm just wondering if changing into a number, and using a number->name
>>lookup would be faster for you. It may not be. In general, I prefer to
>>use numbers for references. I may be over paranoid, but I know that some
>>locales are bad with string -> string comparisons. And since the data in
>>your database is stored as UNICODE, I'm not sure if it has to do any
>>translating or not. Again, something to consider, it may not make any
>>difference.
>
> I don't believe so. I initialize the DB as 'lang=C'. I used to have the
> problem where things were being inited as en_US. This would prevent any
> text based index from working. This doesn't seem to be the case here, so
> I'm not worried about it.
>

Sorry, I think I was confusing you with someone else who posted SHOW ALL.

>
>
>>

...

> it's cached alright. I'm getting a read rate of about 150MB/sec. I would
> have thought is would be faster with my raid setup. I think I'm going to
> scrap the whole thing and get rid of LVM. I'll just do a straight ext3
> system. Maybe that will help. Still trying to get suggestions for a
> stripe size.
>

I don't think 150MB/s is out of the realm for a 14 drive array.
How fast is
time dd if=/dev/zero of=testfile bs=8192 count=1000000
(That should create an 8GB file, which is too big to cache everything)
And then how fast is:
time dd if=testfile of=/dev/null bs=8192 count=1000000

That should give you a semi-decent way of measuring how fast the RAID
system is, since it should be too big to cache in ram.

>
>>I can point you to REINDEX and CLUSTER, but if it is caching in ram, I
>>honestly can't say why the per loop would be that much slower.
>>Are both systems running the same postgres version? It sounds like it is
>>different (since you say something about switching to 8.0).
>
> These had little or no effect.
> The production machine is running 7.4 while the devel machine is running
> 8.0
>

Well, my concern is that maybe some portion of the 8.0 code actually
slowed things down for you. You could try reverting to 7.4 on the devel
box and see whether the difference goes away.

Re: [PERFORM] extremly low memory usage

2005-08-19 Thread John A Meinel
Ron wrote:
> At 01:18 PM 8/19/2005, John A Meinel wrote:
>
>> Jeremiah Jahn wrote:
>> > Sorry about the formatting.
>> >
>> > On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>> >
>> >>Jeremiah Jahn wrote:
>> >>
>> >>
>>
>> ...
>>
>> >>The expensive parts are the 4915 lookups into the litigant_details
>> (each
>> >>one takes approx 4ms for a total of ~20s).
>> >>And then you do it again on case_data (average 3ms each * 4906 loops =
>> >>~15s).
>> >
>> > Is there some way to avoid this?
>> >
>>
>> Well, in general, 3ms for a single lookup seems really long. Maybe your
>> index is bloated by not vacuuming often enough. Do you tend to get a lot
>> of updates to litigant_details?
>
>
> Given that the average access time for a 15Krpm HD is in the 5.5-6ms
> range (7.5-8ms for a 10Krpm HD), having an average of 3ms for a single
> lookup implies that ~1/2 (the 15Krpm case) or ~1/3 (the 10Krpm case)
> table accesses is requiring a seek.
>


Well, from what he has said, the total indexes are < 1GB and he has 6GB
of ram. So everything should fit. Not to mention he is only accessing
5000 out of several million rows.


> This implies a poor match between physical layout and access pattern.

This seems to be the case. But since this is not the only query, it may
be that other access patterns are more important to optimize for.

>
> If I understand correctly, the table should not be very fragmented given
> that this is a reasonably freshly loaded DB?  That implies that the
> fields being looked up are not well sorted in the table compared to the
> query pattern.
>
> If the entire table could fit in RAM, this would be far less of a
> consideration.  Failing that, the physical HD layout has to be improved
> or the query pattern has to be changed to reduce seeks.
>
>

...

>> After CLUSTER, the current data will stay clustered, but new data will
>> not, so you have to continually CLUSTER, the same way that you might
>> VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
>> expensive as a VACUUM FULL. Be aware of this, but it might vastly
>> improve your performance, so it would be worth it.
>
>
> CLUSTER can be a very large maintenance overhead/problem if the table(s)
> in question actually need to be "continually" re CLUSTER ed.
>
> If there is no better solution available, then you do what you have to,
> but it feels like there should be a better answer here.
>
> Perhaps the DB schema needs examining to see if it matches up well with
> its real usage?
>
> Ron Peacetree
>

I certainly agree that CLUSTER is expensive, and is an on-going
maintenance issue. If it is the normal access pattern, though, it may be
worth it.

I also wonder, though, if his table is properly normalized. Which, as
you mentioned, might lead to improved access patterns.

John
=:->




Re: [PERFORM] extremly low memory usage

2005-08-19 Thread John A Meinel
Jeremiah Jahn wrote:
> Sorry about the formatting.
>
> On Thu, 2005-08-18 at 12:55 -0500, John Arbash Meinel wrote:
>
>>Jeremiah Jahn wrote:
>>
>>

...

>>The expensive parts are the 4915 lookups into the litigant_details (each
>>one takes approx 4ms for a total of ~20s).
>>And then you do it again on case_data (average 3ms each * 4906 loops =
>>~15s).
>
> Is there some way to avoid this?
>

Well, in general, 3ms for a single lookup seems really long. Maybe your
index is bloated by not vacuuming often enough. Do you tend to get a lot
of updates to litigant_details?

There are a couple possibilities at this point. First, you can REINDEX
the appropriate index, and see if that helps. However, if this is a test
box, it sounds like you just did a dump and reload, which wouldn't have
bloat in an index.

Another possibility. Is this the column that you usually use when
pulling information out of litigant_details? If so, you can CLUSTER
litigant_details on the appropriate index. This will help things be
close together that should be, which decreases the index lookup costs.

However, if this is not the common column, then you probably will slow
down whatever other accesses you may have on this table.

After CLUSTER, the current data will stay clustered, but new data will
not, so you have to continually CLUSTER, the same way that you might
VACUUM. *However*, IIRC CLUSTER grabs an Exclusive lock, so it is as
expensive as a VACUUM FULL. Be aware of this, but it might vastly
improve your performance, so it would be worth it.
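
For reference, the command itself is just the following, using the index
from your EXPLAIN output (run it in a quiet period, since it takes an
exclusive lock):

-- reorder litigant_details physically by the index the nested loop uses
CLUSTER lit_actor_speed ON litigant_details;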

>
>
>>So there is no need for preloading your indexes on the identity table.
>>It is definitely not the bottleneck.
>>
>>So a few design bits, which may help your database.
>>Why is "actor_id" a text field instead of a number?
>
> This is simply due to the nature of the data.
>

I'm just wondering if changing into a number, and using a number->name
lookup would be faster for you. It may not be. In general, I prefer to
use numbers for references. I may be over paranoid, but I know that some
locales are bad with string -> string comparisons. And since the data in
your database is stored as UNICODE, I'm not sure if it has to do any
translating or not. Again, something to consider, it may not make any
difference.


>
>>You could try creating an index on "litigant_details (actor_id,
>>count_ori)" so that it can do just an index lookup, rather than an index
>>+ filter.
>
> I have one, but it doesn't seem to like to use it. Don't really need it
> though, I can just drop the court_id out of the query. It's redundant,
> since each actor_id is also unique in litigant details. I had run vac
> full and analyze but I ran them again anyway and the planning improved.
> However, my 14 disk raid 10 array is still slower than my 3 disk raid 5
> on my production box. 46sec vs 30sec (with live traffic on the
> production) One of the strange things is that when I run the cat command
> on my index and tables that are "HOT" it has no effect on memory usage.
> Right now I'm running ext3 on LVM. I'm still in a position to redo the
> file system and everything. Is this a good way to do it or should I
> switch to something else? What about stripe and extent sizes...? kernel
> parameters to change?

Well, the plans are virtually identical. There is one small difference
as to whether it joins against case_data or court first. But 'court' is
very tiny (small enough to use a seqscan instead of an index scan). I'm a
little surprised, with court being this small, that it doesn't do
something like a hash aggregation, but court takes no time anyway.

The real problem is that your nested loop index time is *much* slower.

Devel:
->  Index Scan using lit_actor_speed on litigant_details
(cost=0.00..3.96 rows=1 width=81)
(actual time=4.788..4.812 rows=1 loops=5057)

Production:
->  Index Scan using lit_actor_speed on litigant_details
(cost=0.00..5.63 rows=1 width=81)
(actual time=3.355..3.364 rows=1 loops=5057)

Devel:
->  Index Scan using case_speed on case_data
(cost=0.00..3.46 rows=1 width=26)
(actual time=4.222..4.230 rows=1 loops=5052)

Production:
->  Index Scan using case_data_pkey on case_data
(cost=0.00..5.31 rows=1 width=26)
(actual time=1.897..1.904 rows=1 loops=5052)

Notice that the actual per-row cost on production is as little as half
of what it is on your devel box.

As a test, can you do "time cat $index_file >/dev/null" a couple of
times. And then determine the MB/s.
Alternatively run vmstat in another shell. If the read/s doesn't change,
then you know the "cat" is being served from RAM, and thus it really is
cached.

I can point you to REINDEX and CLUSTER, but if it is caching in ram, I
honestly can't say why the per loop would be that much slower.
Are both systems running the same postgres version? It sounds like it is
different (since you say something about switching to 8.0).
I doubt it, but you might try an 8.1devel version.

...

>>Do you regularly vacuum analyze your tables?

Re: [PERFORM] LEFT JOIN ON vs. LEFT JOIN USING performance

2005-08-19 Thread John A Meinel
Diego de Lima wrote:
> Hi list,
>  
> I´m using Pg 8.0.3 on Linux FC2.
>  
> This question may have a very simple answer (I hope), but I´m having
> lots of trouble solving it, and I counldn´t find any other post about it
> or anything in the pg docs.
>  
> I have some very complex select statements on 4 million rows
> tables. When using LEFT JOIN ON, some select statements takes about 2
> minutes. When I write exactly the same statement but with LEFT JOIN
> USING, it takes only 1 minute. Comparing to Oracle, the same statement
> takes 1 minute also, but with LEFT JOIN ON.
>  
> Sometimes tables have the same column names and I can use LEFT JOIN
> USING, but in some other cases I MUST use LEFT JOIN ON, because the
> tables have different column names.
>  
> So my question is: is there a way to make LEFT JOIN ON uses the same
> plan of LEFT JOIN USING?
>  
> Thanks,
>  
> Diego de Lima
>  
>  

I'm guessing that ON/USING isn't the specific problem. It's probably
more an issue of how the planner is deciding to do the joins (merge
join, hash join, nested loop, etc.)
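
Just to illustrate what I mean by equivalent (hypothetical tables):

-- the two spellings express the same join condition; USING merely
-- merges the join column in the output
SELECT a.id, b.val FROM a LEFT JOIN b ON (a.id = b.id);
SELECT a.id, b.val FROM a LEFT JOIN b USING (id);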

Can you send the results of EXPLAIN ANALYZE ?

Also, any sort of join where you have to join against millions of rows
is going to be slow. I don't know your specific design, but likely you
could change the design to be more selective at an earlier level, which
means that you can cut the size of the join by a lot. If you post your
query, a lot of times people here can help optimize it. (But
make sure to explain what you are trying to do, so the optimizations
make sense.)

John
=:->






Re: [PERFORM] Performance pb vs SQLServer.

2005-08-18 Thread John A Meinel
Qingqing Zhou wrote:
> "Alvaro Herrera" <[EMAIL PROTECTED]> writes
>
>>Interesting; do they use an overwriting storage manager like Oracle, or
>>a non-overwriting one like Postgres?
>>
>
>
> They call this MVCC "RLV(row level versioning)". I think they use rollback
> segment like Oracle (a.k.a "version store" or tempdb in SQL Server).  Some
> details are explained in their white paper:"Database concurrency and row
> level versioning in SQL Server 2005".
>
> Regards,
> Qingqing
>

I found the paper here:
http://www.microsoft.com/technet/prodtechnol/sql/2005/cncrrncy.mspx

And it does sound like they are doing it the Oracle way:

When a record in a table or index is updated, the new record is stamped
with the transaction sequence_number of the transaction that is doing
the update. The previous version of the record is stored in the version
store, and the new record contains a pointer to the old record in the
version store. Old records in the version store may contain pointers to
even older versions. All the old versions of a particular record are
chained in a linked list, and SQL Server may need to follow several
pointers in a list to reach the right version. Version records need to
be kept in the version store only as long as there are
operations that might require them.

John
=:->




Re: [PERFORM] extremly low memory usage

2005-08-17 Thread John A Meinel
Jeremiah Jahn wrote:
> I just put together a system with 6GB of ram on a 14 disk raid 10 array.
> When I run my usual big painful queries, I get very little to no
> memory usage. My production box (raid 5 4GB ram) hovers at 3.9GB used
> most of the time. the new devel box sits at around 250MB.
>
> I've switched to an 8.0 system on the new devel box, but the .conf
> really didn't change. Index usage is the same. Something seems wrong and
> I'm not sure why.
>

How big is your actual database on disk? And how much of it is actually
touched by your queries?

It seems that your tough queries might only be exercising a portion of
the database. If you really want to make memory usage increase try
something like:
find . -type f -print0 | xargs -0 cat >/dev/null
Which should read all the files. After doing that, does the memory usage
increase?

>
> any thoughts,
> -jj-
>
>
> shared_buffers = 32768  # min 16, at least max_connections*2, 8KB each
> work_mem = 2097151  # min 64, size in KB

This seems awfully high. 2GB per sort? This might actually be flushing
some of your ram, since it would get allocated and filled, and then
freed when finished. Remember, depending on what you are doing, this
amount can get allocated more than once per query.

> maintenance_work_mem = 819200   # min 1024, size in KB
> max_fsm_pages = 8   # min max_fsm_relations*16, 6 bytes each
> checkpoint_segments = 30# in logfile segments, min 1, 16MB each
> effective_cache_size = 360   <-this is a little out of control, but 
> would it have any real effect?

It should just tell the planner that it is more likely to have buffers
in cache, so index scans are slightly cheaper than they would otherwise be.

> random_page_cost = 2# units are one sequential page fetch cost
> log_min_duration_statement = 1 # -1 is disabled, in milliseconds.
> lc_messages = 'C'   # locale for system error message strings
> lc_monetary = 'C'   # locale for monetary formatting
> lc_numeric = 'C'# locale for number formatting
> lc_time = 'C'   # locale for time formatting
>

John
=:->




Re: [PERFORM] Insert performance (OT?)

2005-08-17 Thread John A Meinel
Manfred Koizar wrote:
> On Tue, 19 Jul 2005 11:51:51 +0100, Richard Huxton 
> wrote:
>
>>You could get away with one query if you converted them to left-joins:
>>INSERT INTO ...
>>SELECT * FROM upload LEFT JOIN ... WHERE f3 IS NULL
>>UNION
>>SELECT * FROM upload LEFT JOIN ... WHERE f4 IS NULL
>
>
> For the archives:  This won't work.  Each of the two SELECTs
> eliminates rows violating one of the two constraints but includes rows
> violating the other constraint.  After the UNION you are back to
> violating both constraints :-(

Couldn't you use "INTERSECT" then? To only get the rows that *both*
queries return?
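
Something like this, with made-up table and column names since I only
saw the abbreviated query:

-- INTERSECT keeps only rows returned by *both* branches, i.e. rows
-- where the f3 lookup AND the f4 lookup both came up empty
INSERT INTO main_table
SELECT u.* FROM upload u LEFT JOIN ref3 r3 ON u.f3 = r3.f3 WHERE r3.f3 IS NULL
INTERSECT
SELECT u.* FROM upload u LEFT JOIN ref4 r4 ON u.f4 = r4.f4 WHERE r4.f4 IS NULL;
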
John
=:->

>
> Servus
>  Manfred
>
>
> ---(end of broadcast)---
> TIP 5: don't forget to increase your free space map settings
>





Re: [PERFORM] Data Selection Slow From VB 6.0

2005-08-17 Thread John A Meinel
Mahesh Shinde wrote:
> Hi
...

> To connect To the PostgreSQL I am using **PostgreSQL Win32 ODBC and
> OLEDB client drivers 1.0.0.2**
>  

...

> Since in the database I have one major table, the Debtor table, which
> is the master table, having around 55 lakh (5.5 million) records. I have
> set debtorId as a primary key having an index on it. I am developing a
> search screen to
> search a specific debtor info using this table.
>  
> When I fire a query to search a debtor id,  it took around 5 seconds to
> return an answer for a query whether entered debtor id is present in the
> database or not using ODBC. Where as when Explian  the query on the
> database
>  Index Scan using tbmstban_debtorid on tbmstbandetails  (cost=0.00..6.01
> rows=2 width=143)
>Index Cond: ((debtorid)::text = '234'::text)

Are you checking this from the VB App? Or just going onto the server and
running psql? (I'm guessing there is some way to run a flat query using
VB. In which case you can just have the query run EXPLAIN ANALYZE, the
return value is just the text, one line after another.)

What I'm thinking is that it might be a locale/encoding issue.
What is the encoding of your database? And what is the default locale
and the locale that you are connecting as?

Can you give us the "EXPLAIN ANALYZE" output so that we can see if the
planner is doing what it thinks it is?

It certainly sounds like either it is always doing a sequential scan, or
something else is going on. 5 sec is a really long time for the type of
query you are doing.

Oh, and can you run the win32 psql client to see if it might be ODBC
which is causing the problem?

John
=:->


> 
> Query for the search criteria is
>  *select * from tbmstdebtordetails where debtorid ='234'*
>  
>  Whereas when I am using a like query to search for a record starting with
> a debtor id prefix of a few characters, it took around 10-15 sec to return a
> record set of matching records.
> query is 
> *select * from tbmstdebtordetails where debtorid like '234%'*
>  
> Explain output on the database
>  Index Scan using tbmstban_debtorid on tbmstbandetails  (cost=0.00..6.01
> rows=2 width=143)
>Index Cond: ((debtorid)::text = '234%'::text)
> Thanks & regards,
> Mahesh Shinde
> --
> Codec Communications (I) Pvt. Ltd.
> PUNE (INDIA)
> T # 91-20-24221460/70(Ext 43)
> Desk No. 25143
> Email – [EMAIL PROTECTED] 






Re: [PERFORM] PG8 Tuning

2005-08-16 Thread John A Meinel
Tom Lane wrote:
> John A Meinel <[EMAIL PROTECTED]> writes:
>
>>Alvaro Herrera wrote:
>>
>>>I've been asked this a couple of times and I don't know the answer: what
>>>happens if you give XLog a single drive (unmirrored single spindle), and
>>>that drive dies?  So the question really is, should you be giving two
>>>disks to XLog?
>
>
>>I can propose a simple test. Create a test database. Run postgres,
>>insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog
>>directory. Start postgres again, what does it do?
>
>
> That test would really be completely unrelated to the problem.
>
> If you are able to shut down the database cleanly, then you do not need
> pg_xlog anymore --- everything is on disk in the data area.  You might
> have to use pg_resetxlog to get going again, but you won't lose anything
> by doing so.

So pg_xlog is really only needed for a dirty shutdown. So what about the
idea of having pg_xlog on a ramdisk that is synchronized periodically to
a real disk?

I'm guessing you would get corruption of the database, or at least you
don't know what is clean and what is dirty, since there would be no WAL
entry for some of the things that completed, but also no WAL entry for
things that were not completed.

So what is actually written to the WAL? Is it something like:
"I am writing these pages, and when page X has a certain value, I am
finished"

I'm just curious, because I don't believe you write to the WAL when you
complete writing the data; you only make a note about what you are
going to do before you do it. So there needs to be a way to detect if
you actually finished (which would be in the actual data).

John
=:->

>
> The question of importance is: if the xlog drive dies while the database
> is running, are you going to be able to get the postmaster to shut down
> cleanly?  My suspicion is "no" --- if the kernel is reporting write
> failures on WAL, that's going to prevent writes to the data drives (good
> ol' WAL-before-data rule).  You could imagine failure modes where the
> drive is toast but isn't actually reporting any errors ... but one hopes
> that's not a common scenario.
>
> In a scenario like this, it might be interesting to have a shutdown mode
> that deliberately ignores writing to WAL and just does its best to get
> all the dirty pages down onto the data drives.
>
> In the meantime ... use a mirrored drive for WAL.
>
>   regards, tom lane
>





Re: [PERFORM] choosing RAID level for xlogs

2005-08-16 Thread John A Meinel
Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan
>

To give a shorter statement after my long one...
If you have enough cache that the controller can write out big chunks to
the disk at a time, you can get very good sequential RAID5 performance,
because the stripe size is large (so it can do a parallel write to all
disks).

But for small chunk writes, you suffer the penalty of the read before
write, and possible multi-disk read (depends on what is in cache).

RAID10 generally handles small writes better, and I would guess that
4 disks would perform almost identically to 6 disks, since you aren't
usually writing enough data to span multiple stripes.

If your battery-backed cache is big enough that you don't fill it, they
probably perform about the same (superfast) since the cache hides the
latency of the disks.

If you start filling up your cache, RAID5 probably can do better because
of the parallelization.

But small writes followed by an fsync do favor RAID10 over RAID5.

John
=:->




Re: [PERFORM] choosing RAID level for xlogs

2005-08-16 Thread John A Meinel
Anjan Dave wrote:
> Yes, that's true, though, I am a bit confused because the Clariion array
> document I am reading talks about how the write cache can eliminate the
> RAID5 Write Penalty for sequential and large IOs...resulting in better
> sequential write performance than RAID10.
>
> anjan
>

Well, if your stripe size is 128k, and you have N disks in the RAID (N
must be even and >= 4 for RAID10).

With RAID5 you have a stripe across N-1 disks, and 1 parity entry.
With RAID10 you have a stripe across N/2 disks, replicated on the second
set.

So if the average write size is >128k*N/2, then you will generally be
using all of the disks during a write, and you can expect a maximum
scale-up of about N/2 for RAID10.

If your average write size is >128k*(N-1), then you can again write an
entire stripe at a time, and even the parity, since you already know all
of the information and don't have to do any reading. So you can get a
maximum speed up of N-1.
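
To put numbers on that, take a 6-disk array with a 128k stripe size:
RAID10 stripes the data over N/2 = 3 disks, so a full stripe is
128k * 3 = 384k, and writes of ~384k or more keep every spindle busy
(maximum speed up ~3x). RAID5 stripes the data over N-1 = 5 disks, so a
full stripe is 128k * 5 = 640k, and only writes of ~640k or more can
skip the read-before-write on the parity (maximum speed up ~5x).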

If you are doing infrequent smallish writes, it can be buffered by the
write cache, and isn't disk limited at all. And the controller can write
it out when it feels like it. So it should be able to do more buffered
all-at-once writes.

If you are writing a little bit more often (such that the cache fills
up), depending on your write pattern, it is possible that all of the
stripes are already in the cache, so again there is little penalty for
the parity stripe.

I suppose the worst case is if you were writing lots of very small
chunks, all over the disk in random order. In which case each write
encounters a 2x read penalty for a smart controller, or a Nx read
penalty if you are going for more safety than speed. (You can read the
original value, and the parity, and re-compute the parity with the new
value (2x  read penalty), but if there is corruption it would not be
detected, so you might want to read all of the stripes in the block, and
recompute the parity with the new data (Nx read penalty)).

I think the issue for Postgres is that it writes 8k pages, which is
quite small relative to the stripe size. So you don't tend to build up
big buffers to write out the entire stripe at once.

So if you aren't filling up your write buffer, RAID5 can do quite well
with bulk loads.
I also don't know about the penalties for a read followed immediately by
a write. Since you will be writing to the same location, you know that
you have to wait for the disk to spin back to the same location. At 10k
rpm that is a 6ms wait time. For 7200rpm disks, it is 8.3ms.

Just to say that there are some specific extra penalties when you are
reading the location that you are going to write right away. Now a
really smart controller with lots of data to write could read the whole
circle on the disk, and then start writing out the entire circle, and
not have any spin delay. But you would have to know the size of the
circle, and that depends on what block you are on, and the heads
arrangement and everything else.
Though since hard-drives also have small caches in them, you could hide
some of the spin delay, but not a lot, since you have to leave the head
there until you are done writing, so while the current command would
finish quickly, the next command couldn't start until the first actually
finished.

Writing large buffers hides all of these seek/spin based latencies, so
you can get really good throughput. But a lot of DB action is small
buffers randomly distributed, so you really do need low seek time, of
which RAID10 is probably better than RAID5.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] PG8 Tuning

2005-08-16 Thread John A Meinel
Alvaro Herrera wrote:
> On Tue, Aug 16, 2005 at 09:12:31AM -0700, Josh Berkus wrote:
>
>
>>However, you are absolutely correct in that it's *relative* advice, not
>>absolute advice.   If, for example, you're using a $100,000 EMC SAN as your
>>storage you'll probably be better off giving it everything and letting its
>>controller and cache handle disk allocation etc.   On the other hand, if
>>you're dealing with the 5 drives in a single Dell 6650, I've yet to encounter
>>a case where a separate xlog disk did not benefit an OLTP application.
>
>
> I've been asked this a couple of times and I don't know the answer: what
> happens if you give XLog a single drive (unmirrored single spindle), and
> that drive dies?  So the question really is, should you be giving two
> disks to XLog?
>

I can propose a simple test. Create a test database. Run postgres,
insert a bunch of stuff. Stop postgres. Delete everything in the pg_xlog
directory. Start postgres again, what does it do?

I suppose to simulate more of a failure mode, you could kill -9 the
postmaster (and all children processes) perhaps during an insert, and
then delete pg_xlog.

But I would like to hear from the postgres folks what they *expect*
would happen if you ever lost pg_xlog.

What about something like keeping pg_xlog on a ramdisk, and then
rsyncing it to a hard-disk every 5 minutes. If you die in the middle,
does it just restore back to the 5-minutes ago point, or does it get
more thoroughly messed up?
For some people, a 5-minute old restore would be okay, as long as you
still have transaction safety, so that you can figure out what needs to
be restored.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Need for speed

2005-08-16 Thread John A Meinel
Ulrich Wisser wrote:
> Hello,
>
> one of our services is click counting for on line advertising. We do
> this by importing Apache log files every five minutes. This results in a
> lot of insert and delete statements. At the same time our customers
> shall be able to do on line reporting.

What are you deleting? I can see having a lot of updates and inserts,
but I'm trying to figure out what the deletes would be.

Is it just that you completely refill the table based on the apache log,
rather than doing only appending?
Or are you deleting old rows?

>
> We have a box with
> Linux Fedora Core 3, Postgres 7.4.2
> Intel(R) Pentium(R) 4 CPU 2.40GHz
> 2 scsi 76GB disks (15.000RPM, 2ms)
>
> I did put pg_xlog on another file system on other discs.
>
> Still when several users are on line the reporting gets very slow.
> Queries can take more then 2 min.

If it only gets slow when you have multiple clients it sounds like your
select speed is the issue, more than conflicting with your insert/deletes.

>
> I need some ideas how to improve performance in some orders of
> magnitude. I already thought of a box with the whole database on a ram
> disc. So really any idea is welcome.

How much ram do you have in the system? It sounds like you only have 1
CPU, so there is a lot you can do to make the box scale.

A dual Opteron (possibly a dual motherboard with dual core (but only
fill one for now)), with 16GB of ram, and an 8-drive RAID10 system would
perform quite a bit faster.

How big is your database on disk? Obviously it isn't very large if you
are thinking to hold everything in RAM (and only have 76GB of disk
storage to put it in anyway).

If your machine only has 512M, an easy solution would be to put in a
bunch more memory.

In general, your hardware is pretty low in overall specs. So if you are
willing to throw money at the problem, there is a lot you can do.

Alternatively, turn on statement logging, and then post the queries that
are slow. This mailing list is pretty good at fixing poor queries.
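
For example, something like this in postgresql.conf will log just the slow
ones (a sketch only -- the 1 second threshold is arbitrary):

log_min_duration_statement = 1000   # log statements taking longer than 1000ms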

One thing you are probably hitting is a lot of sequential scans on the
main table.

If you are doing mostly inserting, make sure you are in a transaction,
and think about doing a COPY.
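
A minimal sketch of both ideas (the table and file names here are made up,
not taken from your setup):

-- many inserts inside one transaction, instead of one transaction each:
BEGIN;
INSERT INTO clicks (url, referer, seen_at) VALUES ('...', '...', now());
-- ... thousands more inserts ...
COMMIT;

-- or, usually much faster for bulk loads, COPY from a flat file:
COPY clicks (url, referer, seen_at) FROM '/tmp/clicks.dat';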

There is a lot more that can be said, we just need to have more
information about what you want.

John
=:->

>
> Ulrich
>
>
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Odd Locking Problem

2005-08-11 Thread John A Meinel
Alvaro Herrera wrote:
> On Thu, Aug 11, 2005 at 03:36:31PM -0500, John A Meinel wrote:
>

...

>
> This is a known problem, solved in 8.1.  A workaround for previous
> releases is to defer FK checks until commit:

So I don't know exactly what the fix was, but I just tested, and my
problem is indeed fixed with the latest CVS head. It no longer blocks.

>
> create table b (a int references a initially deferred);
>

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Mostly read performance

2005-08-11 Thread John A Meinel
Jeffrey Tenny wrote:
> I have a largely table-append-only application where most transactions
> are read-intensive and many are read-only.  The transactions may span
> many tables, and in some cases might need to pull 70 MB of data out of a
> couple of the larger tables.
>
>
> In 7.3, I don't seem to see any file system or other caching that helps
> with repeated reads of the 70MB of data.  Secondary fetches are pretty
> much as slow as the first fetch. (The 70MB in this example might take
> place via 2000 calls to a parameterized statement via JDBC).
>
> Were there changes after 7.3 w.r.t. caching of data? I read this list
> and see people saying that 8.0 will use the native file system cache to
> good effect.  Is this true? Is it supposed to work with 7.3?  Is there
> something I need to do to get postgresql to take advatage of large ram
> systems?
>
> Thanks for any advice.
>

Well, first off, the general recommendation is probably that 7.3 is
really old, and you should try to upgrade to at least 7.4, though
recommended to 8.0.

The bigger questions: How much RAM do you have? How busy is your system?

8.0 doesn't really do anything to do make the system cache the data.
What kernel are you using?

Also, if your tables are small enough, and your RAM is big enough, you
might already have everything cached.

One way to flush the caches, is to allocate a bunch of memory, and then
scan through it. Or maybe mmap a really big file, and access every byte.
But if your kernel is smart enough, it could certainly deallocate pages
after you stopped accessing them, so I can't say for sure that you can
flush the memory cache. Usually, I believe these methods are sufficient.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Odd Locking Problem

2005-08-11 Thread John A Meinel
Alvaro Herrera wrote:
> On Thu, Aug 11, 2005 at 03:36:31PM -0500, John A Meinel wrote:
>
>>I'm having an odd case where my system is locking such that if I insert
>>into a table during a transaction, if I start a new connection and
>>transaction, it blocks while trying to do a similar insert until the
>>first transaction is committed or rolled back.
>
>
> Are there foreign keys here?  I can duplicate the problem easily with
> them:
>
> -- session 1
> create table a (a serial primary key);
> create table b (a int references a);
> insert into a values (1);
>
> begin;
> insert into b values (1);
>
>
> -- session 2
> insert into b values (1);
> -- hangs
>

Actually, there are but the insert is occurring into table 'a' not table
'b'.
'a' refers to other tables, but these should not be modified.

>
> If I commit on session 1, session 2 is unlocked.
>
> This is a known problem, solved in 8.1.  A workaround for previous
> releases is to defer FK checks until commit:
>
> create table b (a int references a initially deferred);

I'll try one of the CVS entries and see if it happens there. Good to
hear there has been work done.

John
=:->

>



signature.asc
Description: OpenPGP digital signature


[PERFORM] Odd Locking Problem

2005-08-11 Thread John A Meinel
I'm having an odd case where my system is locking such that if I insert
into a table during a transaction, if I start a new connection and
transaction, it blocks while trying to do a similar insert until the
first transaction is committed or rolled back.

The schema is rather complex (currently 157 tables, 200 views), and I
still haven't been able to create a small test case. Everything I've
tried so far just works.

The data is private, but the schema is open source, so I probably could
work with someone on it. When I look at the pg_locks table, I seem to be
blocked on:
SELECT * FROM pg_locks WHERE granted = false;
 relation | database | transaction |  pid  |   mode   | granted
----------+----------+-------------+-------+----------+---------
          |          |     1525932 | 30175 | ShareLock | f
...

Which if I understand correctly, means that the current transaction is
intentionally blocking waiting for the other transaction to finish.
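
In case it helps anyone reproduce this, a slightly more informative variant
of that query (a sketch; it assumes stats_command_string is enabled so that
current_query is populated):

SELECT l.pid, l.relation, l.transaction, l.mode, l.granted, a.current_query
  FROM pg_locks l
  JOIN pg_stat_activity a ON a.procpid = l.pid
 ORDER BY l.granted, l.pid;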

I'm currently running 8.0.3, but the database was first created under
7.4.? I confirmed this behavior on both systems.

Under what circumstances would this occur?

To try and outline the situation there is a main object table, which is
the root object. It contains a group column which is used for access
rights. There is a groupref table, which keeps track of the group rights
for each user. (Each user has specific insert,update,select rights per
group).

The select rights are enforced by views (the tables are not publicly
accessible, the views join against the groupref table to check for
select permission).
Insert and update rights are validated by BEFORE INSERT triggers.
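
Roughly, the views look something like this (a simplified sketch -- the
groupref column names here are made up; the real schema is in the dump
linked below):

CREATE VIEW object_v AS
  SELECT o.*
    FROM object o
    JOIN groupref g ON g.vgroup = o.vgroup
   WHERE g.username = current_user
     AND g.can_select;   -- hypothetical "may select" flag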

Most tables reference the object table. Basically it is OO, but doesn't
use the postgres inheritance (in our testing postgres inheritance didn't
scale well for deep inheritance, and wasn't able to enforce uniqueness
anyway.) The views present an OO appearance, and behind the scenes
direct table foreign keys maintain referential integrity.

I have checked using RAISE NOTICE and the BEFORE INSERT trigger gets all
the way to the RETURN statement before things hang, so I haven't figured
out what is actually hanging.

I have a bzip'd version of the schema and just enough data to be useful
available here:
http://www.arbash-meinel.com/extras/schema_and_data.sql.bz2

This is the commands to replicate the locking:

-- Connect as postgres

-- Required before any inserts, so that the TEMP env table is
-- created and filled out.
select mf_setup_env();

-- Begin a transaction and insert some data
BEGIN;
INSERT INTO object(vgroup,otype,oname) VALUES ('test',1,'test');

-- Start a new shell, and connect again and do exactly the same thing
-- as the above.
-- It should hang until you either do END/ROLLBACK in the first
-- connection.

Thanks for any help,
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Speedier count(*)

2005-08-10 Thread John A Meinel
Dan Harris wrote:
> I have a web page for my customers that shows them count of records  and
> some min/max date ranges in each table of a database, as this is  how we
> bill them for service.  They can log in and check the counts  at any
> time.  I'd like for the counts to be as fresh as possible by  keeping
> this dynamic, but I will use a periodic 'snapshot'/cron job  if that is
> the only option to speed this up.   I have thought about  using the
> table statistics, but the estimate error is probably  unacceptable
> because of the billing purposes.
>
> For some reason, the SQL Server we migrated the app from can return
> count(*) in a split second on multi-million row tables, even though  it
> is a MUCH slower box hardware-wise, but it's now taking many  seconds to
> run. I have read in the archives the problems MVCC brings  into the
> count(*) dilemma forcing Pg to run a seq scan to get  counts.  Does
> SQLServer not use MVCC or have they found another  approach for arriving
> at this number?  Compounding all the min/max  and counts from other
> tables and all those queries take about a  minute to run. The tables
> will contain anywhere from 1 million to 40  million rows.

I believe SQL Server doesn't use MVCC in the same way. At the very
least, it stores some row information in the index, so it can get some
info from just an index, without having to go to the actual page (MVCC
requires a main page visit to determine visibility.)

Depending on how much it impacts performance, you can create an
INSERT/UPDATE trigger so that whenever a new entry is added, it
automatically updates a statistics table. It would be maintained as you
go, rather than periodically like a cron job.

I would go Cron if things can be slightly out of date (like 1 hour at
least), and you need updates & inserts to not be slowed down.
Otherwise I think the trigger is nicer, since it doesn't do redundant
work, and means everything stays up-to-date.
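
A rough sketch of the trigger approach (table and column names are
placeholders, not your schema):

CREATE TABLE row_counts (table_name text PRIMARY KEY, n bigint NOT NULL);
-- seed it once with the current count, e.g.:
-- INSERT INTO row_counts SELECT 'billed_records', count(*) FROM billed_records;

CREATE OR REPLACE FUNCTION maintain_row_count() RETURNS trigger AS '
BEGIN
  IF TG_OP = ''INSERT'' THEN
    UPDATE row_counts SET n = n + 1 WHERE table_name = TG_RELNAME;
  ELSE  -- DELETE
    UPDATE row_counts SET n = n - 1 WHERE table_name = TG_RELNAME;
  END IF;
  RETURN NULL;  -- AFTER trigger, return value is ignored
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER billed_records_count AFTER INSERT OR DELETE ON billed_records
  FOR EACH ROW EXECUTE PROCEDURE maintain_row_count();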


>
> Also, I am using "select ... group by ... order by .. limit 1" to get
> the min/max since I have already been bit by the issue of min() max()
> being slower.
>
>
> -Dan

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Table locking problems?

2005-08-09 Thread John A Meinel

Dan Harris wrote:


On Aug 9, 2005, at 3:51 PM, John A Meinel wrote:


Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure  even if
RAID1 is on top of RAID0.

Well, you tell me if I stated incorrectly.  There are two raid   
enclosures with 7 drives in each.  Each is on its own bus on a  dual- 
channel controller.  Each box has a stripe across its drives  and 
the  enclosures are mirrors of each other.  I understand the  
controller  could be a single point of failure, but I'm not sure I  
understand  your concern about the RAID structure itself.




In this configuration, if you have a drive fail on both  controllers, 
the entire RAID dies. Lets label them A1-7, B1-7,  because you stripe 
within a set, if a single one of A dies, and a  single one of B dies, 
you have lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and  
then stripe above that. Since you are using 2 7-disk enclosures,  I'm 
not sure how you can do it well, since it is not an even number  of 
disks. Though if you are using software RAID, there should be no  
problem.


The difference is that in this scenario, *all* of the A drives can  
die, and you haven't lost any data. The only thing you can't lose  is 
a matched pair (eg losing both A1 and B1 will cause complete  data loss)


I believe the correct notation for this last form is RAID 1 + 0  
(RAID10) since you have a set of RAID1 drives, with a RAID0 on-top  of 
them.




I have read up on the difference now. I don't understand why it's a  
"single point of failure".  Technically any array could be a "single  
point" depending on your level of abstraction.   In retrospect, I  
probably should have gone 8 drives in each and used RAID 10 instead  for 
the better fault-tolerance,  but it's online now and will require  some 
planning to see if I want to reconfigure that in the future.  I  wish 
HP's engineer would have promoted that method instead of 0+1..


I wouldn't say that it is a single point of failure, but I *can* say 
that it is much more likely to fail. (2 drives rather than on average n 
drives)


If your devices will hold 8 drives, you could simply do 1 8-drive, and 
one 6-drive. And then do RAID1 with pairs, and RAID0 across the 
resultant 7 RAID1 sets.


I'm really surprised that someone promoted RAID 0+1 over RAID10. I think 
I've heard that there is a possible slight performance improvement, but 
really the failure mode makes it a poor tradeoff.


John
=:->



-Dan



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Table locking problems?

2005-08-09 Thread John A Meinel

Dan Harris wrote:


On Aug 10, 2005, at 12:49 AM, Steve Poe wrote:


Dan,

Do you mean you did RAID 1 + 0 (RAID 10) or RAID 0 + 1? Just a
clarification, since RAID 0 is still a single-point of failure even if
RAID1 is on top of RAID0.



Well, you tell me if I stated incorrectly.  There are two raid  
enclosures with 7 drives in each.  Each is on its own bus on a dual- 
channel controller.  Each box has a stripe across its drives and the  
enclosures are mirrors of each other.  I understand the controller  
could be a single point of failure, but I'm not sure I understand  your 
concern about the RAID structure itself.


In this configuration, if you have a drive fail on both controllers, the 
entire RAID dies. Lets label them A1-7, B1-7, because you stripe within 
a set, if a single one of A dies, and a single one of B dies, you have 
lost your entire mirror.


The correct way of doing it, is to have A1 be a mirror of B1, and then 
stripe above that. Since you are using 2 7-disk enclosures, I'm not sure 
how you can do it well, since it is not an even number of disks. Though 
if you are using software RAID, there should be no problem.


The difference is that in this scenario, *all* of the A drives can die, 
and you haven't lost any data. The only thing you can't lose is a 
matched pair (eg losing both A1 and B1 will cause complete data loss)


I believe the correct notation for this last form is RAID 1 + 0 (RAID10) 
since you have a set of RAID1 drives, with a RAID0 on-top of them.






How many users are connected when your update / delete queries are
hanging? Have you done an analyze verbose on those queries?



Most of the traffic is from programs we run to do analysis of the  data 
and managing changes.  At the time I noticed it this morning,  there 
were 10 connections open to the database.  That rarely goes  above 20 
concurrent.  As I said in my other response, I believe that  the log 
will only contain the query at the point the query finishes,  so if it 
never finishes...




Have you made changes to the postgresql.conf? kernel.vm settings? IO
scheduler?



I set shmmax appropriately for my shared_buffers setting, but that's  
the only kernel tweak.




If you're not doing so already, you may consider running sar  (iostat) to
monitor when the hanging occurs if their is a memory / IO bottleneck
somewhere.



I will try that.  Thanks



When you discover that an update is hanging, can you get into the 
database, and see what locks currently exist? (SELECT * FROM pg_locks)


That might help you figure out what is being locked and possibly 
preventing your updates.


It is also possible that your UPDATE query is trying to do something 
funny (someone just recently was talking about an UPDATE that wanted to 
do a hash join against 12M rows). Which probably meant that it had to 
spill to disk, where a merge join would have worked better.


John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Slow update statement

2005-08-06 Thread John A Meinel
Patrick Hatcher wrote:
> [Reposted from General section with updated information]
> Pg 7.4.5
>
> I'm running an update statement on about 12 million records using the
> following query:
>
> Update table_A
> set F1 = b.new_data
> from table_B b
> where b.keyfield = table_A.keyfield
>
> both keyfields are indexed, all other keys in table_A were dropped, yet
> this job has been running over 15 hours.  Is
> this normal?

Can you do an EXPLAIN UPDATE so that we can have an idea what the
planner is trying to do?

My personal concern is that it is doing something like pulling in all rows
from b, and then updating table_A one by one, but as it is going, it
can't retire any dead rows, because you are still in a transaction. So
it keeps pulling in a lot of old row versions just to discover they are
dead.

How many rows are in table_B?

I can see that possibly doing it in smaller chunks might be faster, as
would inserting into another table. But I would do more of a test and
see what happens.
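
For the "insert into another table" idea, something along these lines (a
sketch -- only keyfield and F1 are known from your query, so the rest of
table_A's columns are elided):

BEGIN;
CREATE TABLE table_A_new AS
  SELECT a.keyfield,
         COALESCE(b.new_data, a.F1) AS F1
         -- , ... the remaining columns of table_A ...
    FROM table_A a
    LEFT JOIN table_B b ON b.keyfield = a.keyfield;
-- then recreate the indexes on table_A_new and rename/swap the tables
COMMIT;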

John
=:->

>
> I stopped the process the first time after 3 hours of running due to
> excessive log rotation and reset the conf file to these settings:
>
>
> wal_buffers = 64# min 4, 8KB each
>
> # - Checkpoints -
>
> checkpoint_segments = 128   # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 1800   # range 30-3600, in seconds
> #checkpoint_warning = 30# 0 is off, in seconds
> #commit_delay = 0   # range 0-100000, in microseconds
> #commit_siblings = 5   # range 1-1000
>
>
> Would it just be quicker to run a JOIN statement to a temp file and then
> reinsert?
> TIA Patrick
>
>
> ---(end of broadcast)---
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>   choose an index scan if your joining column's datatypes do not
>   match
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
> John A Meinel wrote:
>
>
>>Surely this isn't what you have. You have *no* loop here, and you have
>>stuff like:
>>  AND
>>(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
>>
>>I'm guessing this isn't your last version of the function.
>>
>>As far as putting the CREATE TEMP TABLE inside the function, I think the
>>problem is that the first time it runs, it compiles the function, and
>>when it gets to the UPDATE/INSERT with the temporary table name, at
>>compile time it hard-codes that table id.
>>
>>I tried getting around it by using "EXECUTE" which worked, but it made
>>the function horribly slow. So I don't recommend it.
>>
>>Anyway, if you want us to evaluate it, you really need to send us the
>>real final function.
>>
>>John
>>=:->
>
>
> It is the final function.  It doesn't need a loop because of the
> bayes_token_tmp function I added.  The array is passed to it and it
> returns a record set so I can work off of it like it's a table.  So the
> function works the same way it before, but instead of using SELECT
> intoken from TEMPTABLE, you use SELECT bayes_token_tmp from
> bayes_token_tmp(intokenary).
>
> I think this is more efficient than the create table overhead,
> especially because the incoming record set won't be to big.
>
> Thanks,
>
> schu
>
>

Well, I would at least recommend that you change the "WHERE
bayes_token_tmp NOT IN (SELECT token FROM bayes_token)"
with a
"WHERE NOT EXISTS (SELECT token FROM bayes_token WHERE
token=bayes_token_tmp)"

You might try experimenting with the differences, but on my system the
NOT IN has to do a full sequential scan on bayes_token and load all
entries into a list, while NOT EXISTS can do effectively a nested loop.

The nested loop requires that there is an index on bayes_token(token),
but I'm pretty sure there is anyway.

Again, in my testing, it was a difference of 4200ms versus 180ms. (800k
rows in my big table, 2k in the temp one)

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
> Matthew Schumacher wrote:
>
>>Tom Lane wrote:
>>
>>
>>
>>>I don't really see why you think that this path is going to lead to
>>>better performance than where you were before.  Manipulation of the
>>>temp table is never going to be free, and IN (sub-select) is always
>>>inherently not fast, and NOT IN (sub-select) is always inherently
>>>awful.  Throwing a pile of simple queries at the problem is not
>>>necessarily the wrong way ... especially when you are doing it in
>>>plpgsql, because you've already eliminated the overhead of network
>>>round trips and repeated planning of the queries.
>>>
>>> regards, tom lane
>>
>>
>>The reason why I think this may be faster is because I would avoid
>>running an update on data that needs to be inserted which saves
>>searching though the table for a matching token.
>>
>>Perhaps I should do the insert first, then drop those tokens from the
>>temp table, then do my updates in a loop.
>>
>>I'll have to do some benchmarking...
>>
>>schu
>
>
> Tom, I think your right, whenever I do a NOT IN it does a full table
> scan against bayes_token and since that table is going to get very big
> doing the simple query in a loop that uses an index seems a bit faster.
>
> John, thanks for your help, it was worth a try, but it looks like the
> looping is just faster.
>
> Here is what I have so far in case anyone else has ideas before I
> abandon it:

Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
  AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);

I'm guessing this isn't your last version of the function.

As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.

I tried getting around it by using "EXECUTE" which worked, but it made
the function horribly slow. So I don't recommend it.

Anyway, if you want us to evaluate it, you really need to send us the
real final function.

John
=:->


>
> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
>  intokenary BYTEA[],
>  inspam_count INTEGER,
>  inham_count INTEGER,
>  inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
>   _token BYTEA;
> BEGIN
>
>   UPDATE
> bayes_token
>   SET
> spam_count = greatest_int(spam_count + inspam_count, 0),
> ham_count = greatest_int(ham_count + inham_count , 0),
> atime = greatest_int(atime, inatime)
>   WHERE
> id = inuserid
>   AND
> (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));
>
>   UPDATE
> bayes_vars
>   SET
> token_count = token_count + (
>   SELECT
> count(bayes_token_tmp)
>   FROM
> bayes_token_tmp(intokenary)
>   WHERE
> bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
> newest_token_age = greatest_int(newest_token_age, inatime),
> oldest_token_age = least_int(oldest_token_age, inatime)
>   WHERE
> id = inuserid;
>
>   INSERT INTO
> bayes_token
> SELECT
>   inuserid,
>   bayes_token_tmp,
>   inspam_count,
>   inham_count,
>   inatime
> FROM
>   bayes_token_tmp(intokenary)
> WHERE
>   (inspam_count > 0 OR inham_count > 0)
> AND
>   (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
>
>   RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
> SETOF bytea AS
> '
> BEGIN
>   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>   LOOP
> return next intokenary[i];
>   END LOOP;
>   RETURN;
> end
> '
> language 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
>  LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
>  LANGUAGE SQL;
>
> ---(end of broadcast)---
> TIP 4: Have you searched our list archives?
>
>http://archives.postgresql.org
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Tom Lane wrote:
> Matthew Schumacher <[EMAIL PROTECTED]> writes:
>
>>  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>>  LOOP
>>_token := intokenary[i];
>>INSERT INTO bayes_token_tmp VALUES (_token);
>>  END LOOP;
>
>
>>  UPDATE
>>bayes_token
>>  SET
>>spam_count = greatest_int(spam_count + inspam_count, 0),
>>ham_count = greatest_int(ham_count + inham_count , 0),
>>atime = greatest_int(atime, 1000)
>>  WHERE
>>id = inuserid
>>  AND
>>(token) IN (SELECT intoken FROM bayes_token_tmp);
>
>
> I don't really see why you think that this path is going to lead to
> better performance than where you were before.  Manipulation of the
> temp table is never going to be free, and IN (sub-select) is always
> inherently not fast, and NOT IN (sub-select) is always inherently
> awful.  Throwing a pile of simple queries at the problem is not
> necessarily the wrong way ... especially when you are doing it in
> plpgsql, because you've already eliminated the overhead of network
> round trips and repeated planning of the queries.

So for an IN (sub-select), does it actually pull all of the rows from
the other table, or is the planner smart enough to stop once it finds
something?

Is IN (sub-select) about the same as EXISTS (sub-select WHERE x=y)?

What about NOT IN (sub-select) versus NOT EXISTS (sub-select WHERE x=y)

I would guess that the EXISTS/NOT EXISTS would be faster, though it
probably would necessitate using a nested loop (at least that seems to
be the way the query is written).

I did some tests on a database with 800k rows, versus a temp table with
2k rows. I did one sequential test (1-2000, with 66 rows missing), and
one sparse test (1-200, 100000-100200, 200000-200200, ... with 658 rows
missing).

I found that NOT IN did indeed have to load the whole table. IN was
smart enough to do a nested loop.
EXISTS and NOT EXISTS did a sequential scan on my temp table, with a
SubPlan filter (which looks a whole lot like a Nested Loop).

What I found was that IN performed about the same as EXISTS (since they
are both effectively doing a nested loop), but that NOT IN took 4,000ms
while NOT EXISTS was the same speed as EXISTS at around 166ms.

Anyway, so it does seem like NOT IN is not a good choice, but IN seems
to be equivalent to EXISTS, and NOT EXISTS is also very fast.

Is this generally true, or did I just get lucky on my data?

John
=:->



>
>   regards, tom lane
>
> ---(end of broadcast)---
> TIP 3: Have you checked our extensive FAQ?
>
>http://www.postgresql.org/docs/faq
>

mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id IN (SELECT id FROM 
object_t);
   QUERY PLAN
-
 Nested Loop IN Join  (cost=0.00..9851.68 rows=2140 width=4) (actual 
time=0.085..183.889 rows=1351 loops=1)
   ->  Seq Scan on ids  (cost=0.00..31.40 rows=2140 width=4) (actual 
time=0.014..24.032 rows=2009 loops=1)
   ->  Index Scan using object_t_pkey on object_t  (cost=0.00..4.58 rows=1 
width=4) (actual time=0.071..0.071 rows=1 loops=2009)
 Index Cond: ("outer".id = object_t.id)
 Total runtime: 184.823 ms
(5 rows)

Time: 186.931 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE EXISTS (SELECT id FROM 
object_t o WHERE o.id = ids.id);
 QUERY PLAN
-
 Seq Scan on ids  (cost=0.00..9824.93 rows=1070 width=4) (actual 
time=0.086..165.053 rows=1351 loops=1)
   Filter: (subplan)
   SubPlan
 ->  Index Scan using object_t_pkey on object_t o  (cost=0.00..4.58 rows=1 
width=4) (actual time=0.025..0.025 rows=1 loops=2009)
   Index Cond: (id = $0)
 Total runtime: 165.995 ms
(6 rows)

Time: 167.795 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE id NOT IN (SELECT id FROM 
object_t);
QUERY PLAN
---
 Seq Scan on ids  (cost=36410.51..36447.26 rows=1070 width=4) (actual 
time=4168.247..4172.080 rows=658 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
 ->  Seq Scan on object_t  (cost=0.00..34381.81 rows=811481 width=4) 
(actual time=0.044..2464.296 rows=811481 loops=1)
 Total runtime: 4210.784 ms
(5 rows)

Time: 4212.276 ms
mifar07=# EXPLAIN ANALYZE SELECT id FROM ids WHERE NOT EXISTS (SELECT id FROM 
object_t o WHERE o.id = ids.id);
 QUERY PLAN
-
 Seq Scan o

Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

2005-08-04 Thread John A Meinel
Matthew Schumacher wrote:
> Okay,
>
> Here is the status of the SA updates and a question:
>
> Michael got SA changed to pass an array of tokens to the proc so right
> there we gained a ton of performance due to connections and transactions
> being grouped into one per email instead of one per token.
>
> Now I am working on making the proc even faster.  Since we have all of
> the tokens coming in as an array, it should be possible to get this down
> to just a couple of queries.
>
> I have the proc using IN and NOT IN statements to update everything at
> once from a temp table, but it progressively gets slower because the
> temp table is growing between vacuums.  At this point it's slightly
> slower than the old update or else insert on every token.

I recommend that you drop and re-create the temp table. There is no
reason to have it around, considering you delete and re-add everything.
That means you never have to vacuum it, since it always only contains
the latest rows.

>
> What I really want to do is have the token array available as a record
> so that I can query against it, but not have it take up the resources of
> a real table.  If I could copy from an array into a record then I can
> even get rid of the loop.  Anyone have any thoughts on how to do this?
>

My one question here: are the inspam_count and inham_count *always* the
same for all tokens? I would have thought each token has its own count.
Anyway, there are a few lines I would change:

>
> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
>  intokenary BYTEA[],
>  inspam_count INTEGER,
>  inham_count INTEGER,
>  inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
>   _token BYTEA;
> BEGIN
>

-- create the table at the start of the procedure
CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
-- You might also add primary key if you are going to be adding
-- *lots* of entries, but it sounds like you are going to have
-- less than 1 page, so it doesn't matter

>   for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
>   LOOP
> _token := intokenary[i];
> INSERT INTO bayes_token_tmp VALUES (_token);
>   END LOOP;
>
>   UPDATE
> bayes_token
>   SET
> spam_count = greatest_int(spam_count + inspam_count, 0),
> ham_count = greatest_int(ham_count + inham_count , 0),
> atime = greatest_int(atime, 1000)
>   WHERE
> id = inuserid
>   AND

--(token) IN (SELECT intoken FROM bayes_token_tmp);
  EXISTS (SELECT token FROM bayes_token_tmp
   WHERE intoken=token LIMIT 1);

-- I would also avoid your intoken (NOT) IN (SELECT token FROM
-- bayes_token) There are a few possibilities, but to me
-- as your bayes_token table becomes big, this will start
-- to be the slow point

-- Rather than doing 2 NOT IN queries, it *might* be faster to do
   DELETE FROM bayes_token_tmp
WHERE NOT EXISTS (SELECT token FROM bayes_token
   WHERE token=intoken);


>
>   UPDATE
> bayes_vars
>   SET

-- token_count = token_count + (SELECT count(intoken) FROM
-- bayes_token_tmp WHERE intoken NOT IN (SELECT token FROM bayes_token)),
   token_count = token_count + (SELECT count(intoken)
  FROM bayes_token_tmp)

-- You don't need the where NOT IN, since we already removed those rows

> newest_token_age = greatest_int(newest_token_age, inatime),
> oldest_token_age = least_int(oldest_token_age, inatime)
>   WHERE
> id = inuserid;
>
>   INSERT INTO
> bayes_token
> SELECT
>   inuserid,
>   intoken,
>   inspam_count,
>   inham_count,
>   inatime
> FROM
>   bayes_token_tmp
> WHERE
>   (inspam_count > 0 OR inham_count > 0)

-- AND
--   (intoken) NOT IN (SELECT token FROM bayes_token);

-- You don't need either of those lines, again because we already
-- filtered

--   delete from bayes_token_tmp;
--   And rather than deleting all of the entries just
 DROP TABLE bayes_token_tmp;

>
>   RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
>  LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
>  RETURNS INTEGER
>  IMMUTABLE STRICT
>  AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
>  LANGUAGE SQL;
>

So to clarify, here is my finished function:

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
 intokenary BYTEA[],
 inspam_count INTEGER,
 inham_count INTEGER,
 inatime INTEGER)
RETURNS VOID AS '
DECLARE
  _token BYTEA;
BEGIN

  CREATE TEMP TABLE bayes_token_tmp (intoken bytea);
  for i

Re: [PERFORM] Is There A Windows Version of Performance Tuning Documents?

2005-08-03 Thread John A Meinel
Lane Van Ingen wrote:
> I have in my possession some performance tuning documents authored by Bruce
> Momjian, Josh Berkus, and others. They give good information on utilities to
> use (like ipcs, sar, vmstat, etc) to evaluate disk, memory, etc. performance
> on Unix-based systems.
>
> Problem is, I have applications running on Windows 2003, and have worked
> mostly on Unix before. Was wondering if anyone knows where there might be a
> Windows performance document that tells what to use / where to look in
> Windows for some of this data. I am thinking that I may not seeing what I
> need
> in perfmon or the Windows task manager.
>
> Want to answer questions like:
>   How much memory is being used for disk buffer cache?
>   How do I lock shared memory for PostgreSQL (if possible at all)?
>   How to determine if SWAP (esp. page-in) activity is hurting me?
>   Does Windows use a 'unified buffer cache' or not?
>   How do I determine how much space is required to do most of my sorts in
> RAM?
>

I don't know of any specific documentation. I would mention the
TaskManager as the first place I would look (Ctrl+Shift+Esc, or right
click on the task bar).
You can customize the columns that it shows in the process view, so you
can get an idea if something is paging, how much I/O it is using, etc.

I'm sure there are other better tools, but this one is pretty easy to
get to, and shows quite a bit.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Two queries are better than one?

2005-07-28 Thread John A Meinel
Karim Nassar wrote:
> I ran into a situation today maintaining someone else's code where the
> sum time running 2 queries seems to be faster than 1. The original code
> was split into two queries. I thought about joining them, but
> considering the intelligence of my predecessor, I wanted to test it.
>
> The question is, which technique is really faster? Is there some hidden
> setup cost I don't see with explain analyze?

Yes, the time it takes your user code to parse the result, and create
the new query. :)

It does seem like you are taking an extra 0.1ms for the combined query,
but that means you don't have another round trip to the database. So
that would mean one less context switch, and you don't need to know what
the cli_code is before you can get the cli_name.

I would guess the overhead is the time for postgres to parse out the
text, place another index query, and then combine the rows. It seems
like this shouldn't take 0.1ms, but then again, that isn't very long.

Also, did you run it *lots* of times to make sure that this isn't just
noise?

John
=:->

>
> Postgres 7.4.7, Redhat AES 3
>
> Each query individually:
>
> test=> explain analyze
> test-> select * from order  WHERE ord_batch='343B' AND ord_id='12-645';
>  QUERY PLAN
> 
>  Index Scan using order_pkey on order  (cost=0.00..6.02 rows=1 width=486) 
> (actual time=0.063..0.066 rows=1 loops=1)
>Index Cond: ((ord_batch = '343B'::bpchar) AND (ord_id = '12-645'::bpchar))
>  Total runtime: 0.172 ms
> (3 rows)
>
>
> test=> explain analyze
> test-> select cli_name from client where cli_code='1837';
>  QUERY PLAN
> -
>  Index Scan using client_pkey on client  (cost=0.00..5.98 rows=2 width=39) 
> (actual time=0.043..0.047 rows=1 loops=1)
>Index Cond: (cli_code = '1837'::bpchar)
>  Total runtime: 0.112 ms
> (3 rows)
>
> Joined:
>
> test=> explain analyze
> test->SELECT cli_name,order.*
> test->   FROM order
> test->   JOIN client ON (ord_client = cli_code)
> test->  WHERE ord_batch='343B' AND ord_id='12-645';
> QUERY PLAN
> --
>  Nested Loop  (cost=0.00..12.00 rows=2 width=525) (actual time=0.120..0.128 
> rows=1 loops=1)
>->  Index Scan using order_pkey on order  (cost=0.00..6.02 rows=1 
> width=486) (actual time=0.064..0.066 rows=1 loops=1)
>  Index Cond: ((ord_batch = '343B'::bpchar) AND (ord_id = 
> '12-645'::bpchar))
>->  Index Scan using client_pkey on client  (cost=0.00..5.98 rows=1 
> width=51) (actual time=0.023..0.026 rows=1 loops=1)
>  Index Cond: ("outer".ord_client = client.cli_code)
>  Total runtime: 0.328 ms
> (6 rows)
>
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread John A Meinel

Matthew Nuzum wrote:

On 7/26/05, Dan Harris <[EMAIL PROTECTED]> wrote:


I am working on a process that will be inserting tens of million rows
and need this to be as quick as possible.

The catch is that for each row I could potentially insert, I need to
look and see if the relationship is already there  to prevent
multiple entries.  Currently I am doing a SELECT before doing the
INSERT, but I recognize the speed penalty in doing to operations.  I
wonder if there is some way I can say "insert this record, only if it
doesn't exist already".  To see if it exists, I would need to compare
3 fields instead of just enforcing a primary key.



I struggled with this for a while. At first I tried stored procedures
and triggers, but it took very long (over 24 hours for my dataset).
After several iterations of rewritting it, first into C# then into
Python I got the whole process down to under 30 min.

My scenario is this:
I want to normalize log data. For example, for the IP address in a log
entry, I need to look up the unique id of the IP address, or if the IP
address is new, insert it and then return the newly created entry.
Multiple processes use the data, but only one process, run daily,
actually changes it. Because this one process knows that the data is
static, it selects the tables into in-memory hash tables (C#) or
Dictionaries (Python) and then does the lookups there. It is *super*
fast, but it uses a *lot* of ram. ;-)

To limit the ram, I wrote a version of the python code that uses gdbm
files instead of Dictionaries. This requires a newer version of Python
(to allow a gdbm db to work just like a dictionary) but makes life
easier in case someone is using my software on a lower end machine.
This doubled the time of the lookups from about 15 minutes to 30,
bringing the whole process to about 45 minutes.



Did you ever try the temp table approach? You could:

-- COPY all records into temp_table, leaving the ip_id column NULL
-- Get any entries which already exist
UPDATE temp_table SET ip_id =
(SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
 WHERE EXISTS (SELECT ip_id FROM ipaddress
               WHERE ipaddress.add = temp_table.add);
-- Create new entries
INSERT INTO ipaddress(add) SELECT add FROM temp_table
  WHERE ip_id IS NULL;
-- Update the rest
UPDATE temp_table SET ip_id =
(SELECT ip_id FROM ipaddress WHERE ipaddress.add = temp_table.add)
 WHERE ip_id IS NULL AND
EXISTS (SELECT ip_id FROM ipaddress
        WHERE ipaddress.add = temp_table.add);

This would let the database do all of the updating work in bulk on its
side, rather than you pulling all the data out and doing it locally.


An alternative would be something like:

CREATE TEMP TABLE new_ids (address text, ip_id int);
COPY all potentially new addresses into that table.
-- Delete all entries which already exist
DELETE FROM new_ids WHERE EXISTS
(SELECT ip_id FROM ipaddresses
  WHERE add=new_ids.address);
-- Now create the new entries
INSERT INTO ipaddresses(add) SELECT address FROM new_ids;

-- At this point you are guaranteed to have all addresses existing in
-- the database

If you then insert your full data into the final table, leaving just the
ip_id column NULL, then with a partial index on the rows where ip_id IS
NULL, you could use the command:


UPDATE final_table SET ip_id =
(SELECT ip_id FROM ipaddresses WHERE add=final_table.add)
WHERE ip_id IS NULL;

You could also do this in a temporary table, before bulk inserting into 
the final table.


I don't know what you have tried, but I know that for Dan, he easily has 
> 36M rows. So I don't think he wants to pull that locally and create an
in-memory hash just to insert 100 rows or so.


Also, for your situation, if you do keep a local cache, you could 
certainly save the cache between runs, and use a temp table to determine 
what new ids you need to add to it. Then you wouldn't have to pull the 
complete set each time. You just pull new values for entries you haven't 
added yet.


John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Cheap RAM disk?

2005-07-26 Thread John A Meinel

Alex Turner wrote:

Also seems pretty silly to put it on a regular SATA connection, when
all that can manage is 150MB/sec.  If you made it connection directly
to 66/64-bit PCI then it could actualy _use_ the speed of the RAM, not
to mention PCI-X.

Alex Turner
NetEconomist



Well, the whole point is to have it look like a normal SATA drive, even 
to the point that you can boot off of it, without having to load a 
single driver.


Now, you could offer that you could recreate a SATA controller on the 
card, with a SATA bios, etc. And then you could get the increased speed, 
and still have bootable functionality.


But it is a version 1.0 of a product, and I'm sure they tried to make it 
as cheap as possible (and within their own capabilities.)


John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Cheap RAM disk?

2005-07-26 Thread John A Meinel

Luke Lonergan wrote:

Yup - interesting and very niche product - it seems like it's only obvious
application is for the Postgresql WAL problem :-)


Well, you could do it for any journaled system (XFS, JFS, ext3, reiserfs).

But yes, it seems specifically designed for a battery backed journal. 
Though the article reviews it for very different purposes.


Though it was a Windows review, and I don't know of any way to make NTFS 
use a separate device for a journal. (Though I expect it is possible 
somehow).


John
=:->




The real differentiator is the battery backup part.  Otherwise, the
filesystem caching is more effective, so put the RAM on the motherboard.

- Luke






signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread John A Meinel

Dan Harris wrote:
I am working on a process that will be inserting tens of million rows  
and need this to be as quick as possible.


The catch is that for each row I could potentially insert, I need to  
look and see if the relationship is already there  to prevent  multiple 
entries.  Currently I am doing a SELECT before doing the  INSERT, but I 
recognize the speed penalty in doing to operations.  I  wonder if there 
is some way I can say "insert this record, only if it  doesn't exist 
already".  To see if it exists, I would need to compare  3 fields 
instead of just enforcing a primary key.


Even if this could be a small increase per record, even a few percent  
faster compounded over the whole load could be a significant reduction.


Thanks for any ideas you might have.

-Dan



You could insert all of your data into a temporary table, and then do:

INSERT INTO final_table SELECT * FROM temp_table WHERE NOT EXISTS
(SELECT info FROM final_table WHERE final_table.id = temp_table.id
   AND final_table.path = temp_table.path AND final_table.y = temp_table.y);


Or you could load it into the temporary table, and then:
DELETE FROM temp_table WHERE EXISTS (SELECT 1 FROM final_table WHERE id...);

And then do a plain INSERT INTO.

I can't say what the specific performance increases would be, but 
temp_table could certainly be an actual TEMP table (meaning it only 
exists during the connection), and you could easily do a COPY into that 
table to load it up quickly, without having to check any constraints.
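
A minimal sketch of that load step (the columns here just mirror the
example above -- adjust to the real table):

CREATE TEMP TABLE temp_table (id int, path text, y int, info text);
COPY temp_table FROM '/tmp/new_rows.dat';
-- then run the INSERT ... WHERE NOT EXISTS above against final_table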


Just a thought,
John
=:->



signature.asc
Description: OpenPGP digital signature


[PERFORM] Cheap RAM disk?

2005-07-26 Thread John A Meinel
I saw a review of a relatively inexpensive RAM disk over at 
anandtech.com, the Gigabyte i-RAM

http://www.anandtech.com/storage/showdoc.aspx?i=2480

Basically, it is a PCI card, which takes standard DDR RAM, and has a 
SATA port on it, so that to the system, it looks like a normal SATA drive.


The card costs about $100-150, and you fill it with your own ram, so for 
a 4GB (max size) disk, it costs around $500. Looking for solid state 
storage devices, the cheapest I found was around $5k for 2GB.


Gigabyte claims that the battery backup can last up to 16h, which seems 
decent, if not really long (the $5k solution has a built-in harddrive so 
that if the power goes out, it uses the battery power to copy the 
ramdisk onto the harddrive for more permanent storage).


Anyway, would something like this be reasonable as a drive for storing 
pg_xlog? With 4GB you could have as many as 256 checkpoint segments.


I'm a little leary as it is definitely a version 1.0 product (it is 
still using an FPGA as the controller, so they were obviously pushing to 
get the card into production).


But it seems like this might be a decent way to improve insert 
performance, without setting fsync=false.


Probably it should see some serious testing (as in power spikes/pulled 
plugs, etc). I know the article made some claim that if you actually 
pull out the card it goes into "high consumption mode" which is somehow 
greater than if you leave it in the slot with the power off. Which to me 
seems like a lot of bull, and really means the 16h is only under 
best-case circumstances. But even 1-2h is sufficient to handle a simple 
power outage.


And if you had a UPS with detection of power failure, you could always 
sync the ramdisk to a local partition before the power goes out. Though 
you could do that with a normal in-memory ramdisk (tmpfs) without having 
to buy the card. Though it does give you up-to an extra 4GB of ram, for 
machines which have already maxed out their slots.


Anyway, I thought I would mention it to the list, to see if anyone else 
has heard of it, or has any thoughts on the matter. I'm sure there are 
some people who are using more expensive ram disks, maybe they have some 
ideas about what this device is missing. (other than costing about 
1/10th the price)


John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] "Vacuum Full Analyze" taking so long

2005-07-25 Thread John A Meinel
Tomeh, Husam wrote:
>
> Nothing was running except the job. The server did not look stressed out
> looking at top and vmstat. We have seen slower query performance when
> performing load tests, so I run the re-index on all application indexes
> and then issue a full vacuum. I ran the same thing on a staging server
> and it took less than 12 hours. Is there a possibility the DB pages are
> corrupted. Is there a command to verify that. (In Oracle, there's a
> dbverify command that checks for corruption on the data files level).
>
> The other question I have. What would be the proper approach to rebuild
> indexes. I re-indexes and then run vacuum/analyze. Should I not use the
> re-index approach, and instead, drop the indexes, vacuum the tables, and
> then create the indexes, then run analyze on tables and indexes??

I *think* if you are planning on dropping the indexes anyway, just drop
them, VACUUM ANALYZE, and then recreate them. I don't think you have to
re-analyze after you have recreated them.
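
Something like this, roughly (index and column names are placeholders):

DROP INDEX some_table_idx1;
DROP INDEX some_table_idx2;
VACUUM ANALYZE some_table;
CREATE INDEX some_table_idx1 ON some_table (col_a);
CREATE INDEX some_table_idx2 ON some_table (col_b);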

John
=:->

>
> Thanks,
>
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] COPY insert performance

2005-07-25 Thread John A Meinel
Chris Isaacson wrote:
> I need COPY via libpqxx to insert millions of rows into two tables.  One
> table has roughly have as many rows and requires half the storage.  In
> production, the largest table will grow by ~30M rows/day.  To test the
> COPY performance I split my transactions into 10,000 rows.  I insert
> roughly 5000 rows into table A for every 10,000 rows into table B.
>
> Table A has one unique index:
>
> "order_main_pk" UNIQUE, btree (cl_ord_id)
>
> Table B has 1 unique index and 2 non-unique indexes:
>
> "order_transition_pk" UNIQUE, btree (collating_seq)
> "order_transition_ak2" btree (orig_cl_ord_id)
> "order_transition_ak3" btree (exec_id)

Do you have any foreign key references?
If you are creating a table for the first time (or loading a large
fraction of the data), it is common to drop the indexes and foreign keys
first, do the insert/copy, and then re-create them afterwards.

Is InnoDB the backend with referential integrity, and true transaction
support? I believe the default backend does not support either (so it is
"cheating" to give you speed, which may be just fine for your needs,
especially since you are willing to run fsync=false).

I think moving pg_xlog to a dedicated drive (set of drives) could help
your performance. As well as increasing checkpoint_segments.

I don't know if you gain much by changing the bg_writer settings, if you
are streaming everything in at once, you probably want to have it
written out right away. My understanding is that bg_writer settings are
for the case where you have mixed read and writes going on at the same
time, and you want to make sure that the reads have time to execute (ie
the writes are not saturating your IO).

Also, is any of this tested under load? Having a separate process issue
queries while you are loading in data. Traditionally MySQL is faster
with a single process inserting/querying for data, but once you have
multiple processes hitting it at the same time, its performance
degrades much faster than postgres.

You also seem to be giving MySQL 512M of ram to work with, while only
giving 2M/200M to postgres. (re)creating indexes uses
maintenance_work_mem, but updating indexes could easily use work_mem.
You may be RAM starved.

John
=:->


>
> My testing environment is as follows:
> -Postgresql 8.0.1
> -libpqxx 2.5.0
> -Linux 2.6.11.4-21.7-smp x86_64
> -Dual Opteron 246
> -System disk (postgres data resides on this SCSI disk) -  Seagate
> (ST373453LC) - 15K, 73 GB
> (http://www.seagate.com/cda/products/discsales/marketing/detail/0,1081,549,00.html)
> -2nd logical disk - 10K, 36GB IBM SCSI (IC35L036UCDY10-0) - WAL reside
> on this disk
> -NO RAID
>
> *PostgreSQL*
> Here are the results of copying in 10M rows as fast as possible:
> (10K/transaction)
> Total Time:          1129.556 s
> Rows/sec:            9899.922
> Transactions > 1.2s: 225
> Transactions > 1.5s: 77
> Transactions > 2.0s: 4
> Max Transaction:     2.325 s
>
> **MySQL**
> **I ran a similar test with MySQL 4.1.10a (InnoDB) which produced these
> results: (I used MySQL's INSERT INTO x VALUES
> (1,2,3)(4,5,6)(...,...,...) syntax) (10K/transaction)
> Total Time:          860.000 s
> Rows/sec:            11627.91
> Transactions > 1.2s: 0
> Transactions > 1.5s: 0
> Transactions > 2.0s: 0
> Max Transaction:     1.175 s
>
> Considering the configurations shown below, can anyone offer advice to
> close the 15% gap and the much worse variability I'm experiencing.  Thanks
>
> My *postgresql.conf* has the following non-default values:
> # -
> # PostgreSQL configuration file
> # -
> listen_addresses = '*' # what IP interface(s) to listen on;
> max_connections = 100
> #---
> # RESOURCE USAGE (except WAL)
> #---
> shared_buffers = 65536  # min 16, at least max_connections*2, 8KB each
> work_mem = 2048   # min 64, size in KB
> maintenance_work_mem = 204800 # min 1024, size in KB
> max_fsm_pages = 2250000  # min max_fsm_relations*16, 6 bytes each
> bgwriter_delay = 200  # 10-10000 milliseconds between rounds
> bgwriter_percent = 10  # 0-100% of dirty buffers in each round
> bgwriter_maxpages = 1000 # 0-1000 buffers max per round
> #---
> # WRITE AHEAD LOG
> #---
> fsync = false   # turns forced synchronization on or off
> wal_buffers = 64  # min 4, 8KB each
> checkpoint_segments = 40 # in logfile segments, min 1, 16MB each
> checkpoint_timeout = 600 # range 30-3600, in seconds
> #---
> # QUERY TUNING
> #---
> effective_cache_size = 65536 # typically 8KB each
> random_page_cost = 2  # units are one sequential page fetch

Re: [PERFORM] Optimizer seems to be way off, why?

2005-07-20 Thread John A Meinel
Dirk Lutzebäck wrote:
> Richard Huxton wrote:
> 
>> Dirk Lutzebäck wrote:
>>
>>> Hi,
>>>
>>> I do not under stand the following explain output (pgsql 8.0.3):
>>>
>>> explain analyze
>>> select b.e from b, d
>>> where b.r=516081780 and b.c=513652057 and b.e=d.e;
>>>
>>> QUERY PLAN
>>> 
>>>
>>> Nested Loop  (cost=0.00..1220.09 rows=1 width=4) (actual
>>> time=0.213..2926.845 rows=324503 loops=1)
>>>   ->  Index Scan using b_index on b  (cost=0.00..1199.12 rows=1
>>> width=4) (actual time=0.104..17.418 rows=3293 loops=1)
>>> Index Cond: (r = 516081780::oid)
>>> Filter: (c = 513652057::oid)
>>>   ->  Index Scan using d_e_index on d  (cost=0.00..19.22 rows=140
>>> width=4) (actual time=0.009..0.380 rows=99 loops=3293)
>>> Index Cond: ("outer".e = d.e)
>>> Total runtime: 3638.783 ms
>>> (7 rows)
>>>
>>> Why is the rows estimate for b_index and the nested loop 1? It is
>>> actually 3293 and 324503.
>>
>>
>>
>> I'm guessing (and that's all it is) that b.r and b.c have a higher
>> correlation than the planner is expecting. That is, it expects the
>> b.c=... to reduce the number of matching rows much more than it is.
>>
>> Try a query just on WHERE b.r=516081780 and see if it gets the
>> estimate right for that.
>>
>> If it's a common query, it might be worth an index on (r,c)
>>
>> -- 
>>   Richard Huxton
>>   Archonet Ltd
>>
> 
> Thanks Richard, dropping the join for b.c now gives better estimates (it
> also uses a different index now) although not accurate (off by factor
> 10). This query is embedded in a larger query which now got a 1000 times
> speed up (!) because I can drop b.c because it is redundant.

Well, part of the problem is that the poorly estimated row count is not
on 'b.e' but on 'b.r': it expects to find only one matching row, and
instead finds 3293.

Now, that *could* be because it mis-estimates the selectivity of b.r & b.c.

It actually estimated the join with d approximately correctly. (It
thought that for each row it would find 140, and it averaged 99).

> 
> Though, why can't the planner see this correlation? I think somebody
> said the planner does not know about multiple column correlations, does it?

The planner does not maintain cross-column statistics, so you are
correct. I believe it assumes the distributions are independent, so that
if r=R is 10% selective and c=C is 20% selective, the estimated
selectivity of r=R AND c=C is 2%. I could be wrong on this, but I think
it is approximately correct.

Now if you created an index on b(r,c), the planner would have a much
better idea of how selective that combination is. At the very least, it
could scan the index on (r,c) rather than scanning the index on (r) and
filtering by (c).
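Something like this, for instance (a sketch -- the index name is made
up, check it against your schema before using it):

  CREATE INDEX b_r_c_idx ON b (r, c);
  ANALYZE b;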

Also, if you have very skewed data (where you have 1 value 100k times,
and 50 values only 10times each), the planner can overestimate the low
values, and underestimate the high one. (It uses random sampling, so it
kind of depends where the entries are.)

Have you tried increasing the statistics targets on b.r and/or b.c? Do
you have an index on b.c or just b.r?
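Raising the statistics target would look roughly like this (a sketch --
100 is just a commonly used value, the default is 10):

  ALTER TABLE b ALTER COLUMN r SET STATISTICS 100;
  ALTER TABLE b ALTER COLUMN c SET STATISTICS 100;
  ANALYZE b;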

To see what the planner thinks, you might try:

EXPLAIN ANALYZE
select count(*) from b where r=516081780;

That would tell you how selective the planner thinks the r= is.
> 
> Regards,
> 
> Dirk
> 
John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Looking for tips

2005-07-19 Thread John A Meinel

Oliver Crosby wrote:

Hi,
I'm running Postgres 7.4.6 on a dedicated server with about 1.5gigs of ram.
Running scripts locally, it takes about 1.5x longer than mysql, and the
load on the server is only about 21%.
I upped the sort_mem to 8192 (kB), and shared_buffers and
effective_cache_size to 65536 (512MB), but neither the timing nor the
server load have changed at all. FYI, I'm going to be working on data
sets in the order of GB.

I think I've gone about as far as I can with google.. can anybody give
me some advice on how to improve the raw performance before I start
looking at code changes?

Thanks in advance.


First, try to post in plain-text rather than html, it is easier to read. :)

Second, if you can determine what queries are running slow, post the
result of EXPLAIN ANALYZE on them, and we can try to help you tune
them/postgres to better effect.

Just a blanket question like this is hard to answer. Your new
shared_buffers are probably *way* too high. They should be at most
around 10% of ram. Since this is a dedicated server effective_cache_size
should be probably ~75% of ram, or close to 1.2GB.

There are quite a few things that you can tweak, so the more information
you can give, the more we can help.

For instance, if you are loading a lot of data into a table, if
possible, you want to use COPY not INSERT.
If you have a lot of indexes and are loading a significant portion, it
is sometimes faster to drop the indexes, COPY the data in, and then
rebuild the indexes.

For tables with a lot of inserts/updates, you need to watch out for
foreign key constraints. (Generally, you will want an index on both
sides of the foreign key. One is required, the other is recommended for
faster update/deletes).
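For example (a sketch with made-up table and column names, since I
don't know your schema):

  -- the referenced side must already be UNIQUE or PRIMARY KEY (required)
  CREATE TABLE parent (id integer PRIMARY KEY);
  CREATE TABLE child (
      id        serial PRIMARY KEY,
      parent_id integer REFERENCES parent (id)
  );
  -- the index on the referencing side is not created automatically:
  CREATE INDEX child_parent_id_idx ON child (parent_id);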

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow joining very large table to smaller ones

2005-07-15 Thread John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 10:12 PM, John A Meinel wrote:
>
>>
>> My biggest question is why the planner things the Nested Loop would be
>> so expensive.
>> Have you tuned any of the parameters? It seems like something is  out of
>> whack. (cpu_tuple_cost, random_page_cost, etc...)
>>
>
> here's some of my postgresql.conf.  Feel free to blast me if I did
> something idiotic here.
>
> shared_buffers = 5
> effective_cache_size = 1348000
> random_page_cost = 3
> work_mem = 512000

Unless you are the only person connecting to this database, your
work_mem is very high. And if you haven't modified maintenance_work_mem
it is probably very low. work_mem might be causing postgres to think it
can fit all of a merge into ram, making it faster, I can't say for sure.

> max_fsm_pages = 8

This seems high, but it depends how many updates/deletes you get
in-between vacuums. It may not be too excessive. VACUUM [FULL] VERBOSE
replies with how many free pages are left, if you didn't use that
already for tuning. Though it should be tuned based on a steady state
situation. Not a one time test.

> log_min_duration_statement = 6
> fsync = true ( not sure if I'm daring enough to run without this )
> wal_buffers = 1000
> checkpoint_segments = 64
> checkpoint_timeout = 3000
>

These seem fine to me.

Can you include the output of EXPLAIN SELECT both with and without SET
join_collapselimit? Since your tables have grown, I can't compare the
estimated number of rows, and costs very well.

EXPLAIN without ANALYZE is fine, since I am wondering what the planner
is thinking things cost.

John
=:->

>
> # FOR PG_AUTOVACUUM --#
> stats_command_string = true
> stats_row_level = true
>


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 7:15 PM, John A Meinel wrote:
>
>>
>>
>> Is the distribution of your rows uneven? Meaning do you have more rows
>> with a later id than an earlier one?
>>
>
> There are definitely some id's that will have many times more than  the
> others.  If I group and count them, the top 10 are fairly  dominant in
> the table.

That usually skews the estimates. Since the estimate is more of an
average (unless the statistics are higher).

>
>>>
>>
>> Hmm.. How to do it permanently? Well you could always issue "SET
>> join_collapse_limit TO 1; SELECT * FROM ..."
>> But obviously that isn't what you prefer. :)
>>
>> I think there are things you can do to make merge join more expensive
>> than a nested loop, but I'm not sure what they are.
>
>
> Maybe someone else has some ideas to encourage this behavior for  future
> work?  Setting it on a per-connection basis is doable, but  would add
> some burden to us in code.

My biggest question is why the planner thinks the Nested Loop would be
so expensive.
Have you tuned any of the parameters? It seems like something is out of
whack. (cpu_tuple_cost, random_page_cost, etc...)

>
>>
>> What I really don't understand is that the estimates dropped as well.
>> The actual number of estimate rows drops to 3k instead of > 1M.
>> The real question is why does the planner think it will be so  expensive?
>>
>>
>>> select count(*) from k_b join k_r using (incidentid) where k_b.id=107
>>> and k_r.id=94;
>>> count
>>> ---
>>>373
>>>
>>>
>>
>> Well, this says that they are indeed much more selective.
>> Each one has > 1k rows, but together you end up with only 400.
>>
>
> Is this a bad thing?  Is this not "selective enough" to make it much
> faster?

Yes, being more selective is what makes it faster. But the planner
doesn't seem to notice it properly.

>
> Overall, I'm much happier now after seeing the new plan come about,  if
> I can find a way to make that join_collapse behavior permanent, I  can
> certainly live with these numbers.
>

I'm sure there are pieces to tune, but I've reached my limits of
parameters to tweak :)

> Thanks again for your continued efforts.
>
> -Dan
>

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] lots of updates on small table

2005-07-14 Thread John A Meinel
Alison Winters wrote:
> Hi,
>
>
>>>Our application requires a number of processes to select and update rows
>>>from a very small (<10 rows) Postgres table on a regular and frequent
>>>basis.  These processes often run for weeks at a time, but over the
>>>space of a few days we find that updates start getting painfully slow.
>>>We are running a full vacuum/analyze and reindex on the table every day,
>>
>>Full vacuum, eh?  I wonder if what you really need is very frequent
>>non-full vacuum.  Say, once in 15 minutes (exact rate depending on dead
>>tuple rate.)
>>
>
> Is there a difference between vacuum and vacuum full?  Currently we have
> a cron job going every hour that does:
>
> VACUUM FULL VERBOSE ANALYZE plc_fldio
> REINDEX TABLE plc_fldio

VACUUM FULL exclusively locks the table (so that nothing else can
happen) and then compacts it as much as it can.
You almost certainly want a plain VACUUM every 15min, and VACUUM FULL at
most once a day.

VACUUM FULL is more for when you haven't been VACUUMing often enough, or
have made major changes to your table.
Basically, VACUUM marks dead rows as free and available for reuse, while
VACUUM FULL also removes the empty space (but requires a full lock,
because it moves rows around).

If anything, I would estimate that VACUUM FULL would be hurting your
performance. But it may happen fast enough not to matter.
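So something closer to this is probably what you want (a sketch -- the
table name is from your output, the frequencies are just starting points):

  -- every 15 minutes or so:
  VACUUM ANALYZE plc_fldio;

  -- at most once a day, in a quiet period, if you keep it at all:
  VACUUM FULL VERBOSE ANALYZE plc_fldio;
  REINDEX TABLE plc_fldio;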

>
> The most recent output was this:
>
> INFO:  --Relation public.plc_fldio--
> INFO:  Pages 1221: Changed 3, reaped 256, Empty 0, New 0; Tup 108137: Vac 
> 4176, Keep/VTL 108133/108133, UnUsed 19, MinLen 84, MaxLen 84; Re-using: 
> Free/Avail. Space 445176/371836; EndEmpty/Avail. Pages 0/256.
> CPU 0.04s/0.14u sec elapsed 0.18 sec.
> INFO:  Index plcpage_idx: Pages 315; Tuples 108137: Deleted 4176.
> CPU 0.03s/0.04u sec elapsed 0.14 sec.
> INFO:  Rel plc_fldio: Pages: 1221 --> 1221; Tuple(s) moved: 0.
> CPU 0.03s/0.04u sec elapsed 0.36 sec.
> INFO:  Analyzing public.plc_fldio
> VACUUM
> REINDEX
>
> We'll up it to every 15 minutes, but i don't know if that'll help
> because even with the current vacuuming the updates are still getting
> slower and slower over the course of several days.  What really puzzles
> me is why restarting the processes fixes it.  Does PostgreSQL keep some
> kind of backlog of transactions all for one database connection?  Isn't
> it normal to have processes that keep a single database connection open
> for days at a time?

I believe that certain locks are grabbed per session. Or at least there
is some command that you can run, which you don't want to run in a
maintained connection. (It might be VACUUM FULL, I don't remember which
one it is).

But the fact that your application works at all suggests that it isn't
holding any long-lived locks.

I know VACUUM cannot clean up any rows that are still visible to an
open transaction; I don't know whether merely keeping a connection open
(without an open transaction) has the same effect.

>
> Regarding the question another poster asked: all the transactions are
> very short.  The table is essentially a database replacement for a
> shared memory segment - it contains a few rows of byte values that are
> constantly updated byte-at-a-time to communicate data between different
> industrial control processes.
>
> Thanks for the thoughts everyone,
>
> Alison
>

Is it possible to have some sort of timer that would recognize it has
been connected for too long, drop the database connection, and
reconnect? I don't know that it would solve anything, but it would be
something you could try.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Tom Lane wrote:
> John A Meinel <[EMAIL PROTECTED]> writes:
>
>>What I don't understand is that the planner is actually estimating that
>>joining against the new table is going to *increase* the number of
>>returned rows.
>
>
> It evidently thinks that incidentid in the k_r table is pretty
> nonunique.  We really need to look at the statistics data to
> see what's going on.
>
>   regards, tom lane
>

Okay, sure. What about doing this, then:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT DISTINCT incidentid FROM k_r JOIN k_b USING (incidentid)
 WHERE k_r.id = ?? AND k_b.id = ??)
 USING (incidentid)
;

Since I assume that eventactivity is the only table with "recordtext",
and that you don't need any columns from k_r and k_b, it would be
pointless to return duplicate incidentids.

I may be misunderstanding what the query is trying to do, but depending
on what is in k_r and k_b, is it possible to use a UNIQUE INDEX rather
than just an index on incidentid?

There is also the possibility of
EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
  JOIN (SELECT incidentid FROM k_r WHERE k_r.id = ??
 UNION SELECT incidentid FROM k_b WHERE k_b.id = ??)
 USING (incidentid)
;

But both of these would mean that you don't actually want columns from
k_r or k_b, just a unique list of incident ids.

But first, I agree, we should make sure the pg_stats values are reasonable.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:
>
>>
>>
>> You might try giving it a little bit more freedom with:
>>
>> EXPLAIN ANALYZE
>> SELECT recordtext FROM eventactivity, k_r, k_b
>>  WHERE eventactivity.incidentid = k_r.incidentid
>>AND eventactivity.incidentid = k_b.incidentid
>>AND k_r.id = 94
>>AND k_b.id = 107
>> -- AND k_r.incidentid = k_b.incidentid
>> ;
>>
>> I'm pretty sure that would give identical results, just let the  planner
>> have a little bit more freedom about how it does it.
>> Also the last line is commented out, because I think it is redundant.
>>
>
> Ok, I tried this one.  My ssh keeps getting cut off by a router
> somewhere between me and the server due to inactivity timeouts, so  all
> I know is that both the select and explain analyze are taking  over an
> hour to run.  Here's the explain select for that one, since  that's the
> best I can get.
>
> explain select recordtext from eventactivity,k_r,k_b where
> eventactivity.incidentid = k_r.incidentid and  eventactivity.incidentid
> = k_b.incidentid and k_r.id = 94 and k_b.id  = 107;
>   QUERY PLAN
> 
> --
> Merge Join  (cost=9624.61..4679590.52 rows=151009549 width=35)
>Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>->  Merge Join  (cost=4766.92..4547684.26 rows=16072733 width=117)
>  Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>  ->  Index Scan using eventactivity1 on eventactivity
> (cost=0.00..4186753.16 rows=46029271 width=49)
>  ->  Sort  (cost=4766.92..4771.47 rows=1821 width=68)
>Sort Key: (k_b.incidentid)::text
>->  Index Scan using k_b_idx on k_b   (cost=0.00..4668.31
> rows=1821 width=68)
>  Index Cond: (id = 107)
>->  Sort  (cost=4857.69..4862.39 rows=1879 width=68)
>  Sort Key: (k_r.incidentid)::text
>  ->  Index Scan using k_r_idx on k_r  (cost=0.00..4755.52
> rows=1879 width=68)
>Index Cond: (id = 94)
> (13 rows)
>

If anything, the estimates have gotten worse: now it thinks around 1800
rows will be returned from each of k_b and k_r, whereas you were
expecting fewer than 100 in total.

Since you didn't say, you did VACUUM ANALYZE recently, right?

>

...

>>
>> You can also try disabling merge joins, and see how that changes  things.
>>
>
> Are there any negative sideaffects of doing this?

If the planner is estimating things correctly, you want to give it the
most flexibility of plans to pick from, because sometimes a merge join
is faster (postgres doesn't pick things because it wants to go slower).
The only reason for the disable flags is that sometimes the planner
doesn't estimate correctly. Usually disabling a method is not the final
solution, but a way to try out different methods, and see what happens
to the results.

Using: SET enable_mergejoin TO off;
You can disable it just for the current session (not for the entire
database). Which is the recommended way if you have a query that
postgres is messing up on. (Usually it is correct elsewhere).


>>
>> Well, postgres is estimating around 500 rows each, is that way off?  Try
>> just doing:
>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

Once again, do this and post the results. We might just need to tweak
your settings so that it estimates the number of rows correctly, and we
don't need to do anything else.

>>
>> And see if postgres estimates the number of rows properly.
>>
>> I assume you have recently VACUUM ANALYZEd, which means you might need
>> to update the statistics target (ALTER TABLE k_b ALTER COLUMN
>> incidientid SET STATISTICS 100) default is IIRC 10, ranges from  1-1000,
>> higher is more accurate, but makes ANALYZE slower.
>>


...

>> EXPLAIN ANALYZE
>> SELECT recordtext FROM eventactivity
>>  JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
>> WHERE k_r.id = 94 AND k_b.id = 107)
>> USING (incidentid);
>>
>
> This one looks like the same plan as the others:
>
> explain select recordtext from eventactivity join ( select incidentid
> from k_r join k_b using (incidentid) where k_r.id = 94 and k_b.id =  107
> ) a  using (incidentid );

Well, the planner is powerful enough to flatten nested selects. To make
it less "intelligent" you can do:

Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:
>
> On Jul 14, 2005, at 9:42 AM, John A Meinel wrote:

...
Did you try doing this to see how good the planners selectivity
estimates are?

>> Well, postgres is estimating around 500 rows each, is that way off?  Try
>> just doing:
>> EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
>> EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

These should be fast queries.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow joining very large table to smaller ones

2005-07-14 Thread John A Meinel
Dan Harris wrote:
> I'm trying to improve the speed of this query:
>
> explain select recordtext from eventactivity inner join ( select
> incidentid from k_r where id = 94 ) a using ( incidentid ) inner join  (
> select incidentid from k_b where id = 107 ) b using ( incidentid );

You might try giving it a little bit more freedom with:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity, k_r, k_b
 WHERE eventactivity.incidentid = k_r.incidentid
   AND eventactivity.incidentid = k_b.incidentid
   AND k_r.id = 94
   AND k_b.id = 107
-- AND k_r.incidentid = k_b.incidentid
;

I'm pretty sure that would give identical results, just let the planner
have a little bit more freedom about how it does it.
Also the last line is commented out, because I think it is redundant.

You might also try:
EXPLAIN ANALYZE
SELECT recordtext
  FROM eventactivity JOIN k_r USING (incidentid)
  JOIN k_b USING (incidentid)
 WHERE k_r.id = 94
   AND k_b.id = 107
;

Also, if possible give us the EXPLAIN ANALYZE so that we know if the
planner is making accurate estimates. (You might send an EXPLAIN while
waiting for the EXPLAIN ANALYZE to finish)

You can also try disabling merge joins, and see how that changes things.

>   QUERY PLAN
> 
> --
> Merge Join  (cost=2747.29..4249364.96 rows=11968693 width=35)
>Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>->  Merge Join  (cost=1349.56..4230052.73 rows=4413563 width=117)
>  Merge Cond: (("outer".incidentid)::text = "inner"."?column2?")
>  ->  Index Scan using eventactivity1 on eventactivity
> (cost=0.00..4051200.28 rows=44519781 width=49)
>  ->  Sort  (cost=1349.56..1350.85 rows=517 width=68)
>Sort Key: (k_b.incidentid)::text
>->  Index Scan using k_b_idx on k_b   (cost=0.00..1326.26
> rows=517 width=68)
>  Index Cond: (id = 107)
>->  Sort  (cost=1397.73..1399.09 rows=542 width=68)
>  Sort Key: (k_r.incidentid)::text
>  ->  Index Scan using k_r_idx on k_r  (cost=0.00..1373.12
> rows=542 width=68)
>Index Cond: (id = 94)
> (13 rows)
>
>
> There are many millions of rows in eventactivity.  There are a few
> ten-thousand rows in k_r and k_b.  There is an index on 'incidentid'  in
> all three tables.  There should only be less than 100 rows matched  in
> k_r and k_b total.  That part on its own is very very fast.  But,  it
> should have those 100 or so incidentids extracted in under a  second and
> then go into eventactivity AFTER doing that.  At least,  that's my
> intention to make this fast.

Well, postgres is estimating around 500 rows each, is that way off? Try
just doing:
EXPLAIN ANALYZE SELECT incidentid FROM k_b WHERE id = 107;
EXPLAIN ANALYZE SELECT incidentid FROM k_r WHERE id = 94;

And see if postgres estimates the number of rows properly.

I assume you have recently VACUUM ANALYZEd, which means you might need
to update the statistics target (ALTER TABLE k_b ALTER COLUMN
incidientid SET STATISTICS 100) default is IIRC 10, ranges from 1-1000,
higher is more accurate, but makes ANALYZE slower.

>
> Right now, it looks like pg is trying to sort the entire  eventactivity
> table for the merge join which is taking several  minutes to do.  Can I
> rephrase this so that it does the searching  through k_r and k_b FIRST
> and then go into eventactivity using the  index on incidentid?  It seems
> like that shouldn't be too hard to  make fast but my SQL query skills
> are only average.

To me, it looks like it is doing an index scan (on k_b.id) through k_b
first, sorting the results by incidentid, then merge joining that with
eventactivity.

I'm guessing you actually want it to merge k_b and k_r to get extra
selectivity before joining against eventactivity.
I think my alternate forms would let postgres realize this. But if not,
you could try:

EXPLAIN ANALYZE
SELECT recordtext FROM eventactivity
 JOIN (SELECT incidentid FROM k_r JOIN k_b USING (incidentid)
WHERE k_r.id = 94 AND k_b.id = 107)
USING (incidentid);

I don't know how selective your keys are, but one of these queries
should probably structure it better for the planner. It depends a lot on
how selective your query is.
If you have 100M rows, the above query looks like it expects k_r to
restrict it to 44M rows, and k_r + k_b down to 11M rows, which really
should be a seq scan (> 10% of the rows = seq scan). But if you are
saying the selectivity is mis-estimated it could be different.

John
=:->
>
> Thanks
> -Dan
>
> ---(end of broadcast)---
> TIP 3: Have you checked our extensive FAQ?
>
>   http://www.postgresql.org/docs/faq
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Quad Opteron stuck in the mud

2005-07-13 Thread John A Meinel

Dan Harris wrote:

Gurus,



> even the  explain never

finishes when I try that.


Just a short bit. If "EXPLAIN SELECT" doesn't return, there seems to be
a very serious problem, because EXPLAIN doesn't actually run the query,
it only runs the planner. And the query planner shouldn't ever get
heavily stuck.

I might be wrong, but there may be something much more substantially
wrong than slow I/O.
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Sorting on longer key is faster ?

2005-07-11 Thread John A Meinel
jobapply wrote:
> The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER BY x..
>
> How can that be possible?
>
> Btw: x and x||t are same ordered
>
> phoeniks=> explain analyze SELECT * FROM test WHERE i<20 ORDER BY x || t;
> QUERY PLAN

I also thought of another possibility. Are there a lot of similar
entries in X? Meaning that the same value is repeated over and over? It
is possible that the sort code has a weakness when sorting equal values.

For instance, if it was doing a Hash aggregation, you would have the
same hash repeated. (It isn't doing one here; I'm just mentioning a case
where equal values might affect something.)

If it is creating a tree representation, it might cause some sort of
pathological worst-case behavior, where all entries keep adding to the
same side of the tree, rather than being more balanced.

I don't know the internals of postgresql sorting, but just some ideas.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Sorting on longer key is faster ?

2005-07-11 Thread John A Meinel
Chris Travers wrote:
> John A Meinel wrote:
>
>> jobapply wrote:
>>
>>
>>> The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER
>>> BY x..
>>>
>>> How can that be possible?
>>>
>>> Btw: x and x||t are same ordered
>>>
>>> phoeniks=> explain analyze SELECT * FROM test WHERE i<20 ORDER BY x
>>> || t;
>>>QUERY PLAN
>>>
>>>
>>
>>
>> What types are x and t, I have the feeling "x || t" is actually a
>> boolean, so it is only a True/False sort, while ORDER BY x has to do
>> some sort of string comparison (which might actually be a locale
>> depended comparison, and strcoll can be very slow on some locales)
>>
>>
>>
> Am I reading this that wrong?  I would think that x || t would mean
> "concatenate x  and t."

Sorry, I think you are right. I was getting my operators mixed up.
>
> This is interesting.  I never through of writing a multicolumn sort this
> way

I'm also surprised that the sort is faster with the concatenation. Are
you using UNICODE as the database encoding? I'm just wondering if it is
doing something funny like casting it to an easier-to-sort type.

>
> Best Wishes,
> Chris Travers
> Metatron Technology Consulting

PS> Don't forget to Reply All so that your messages go back to the list.


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Sorting on longer key is faster ?

2005-07-11 Thread John A Meinel
jobapply wrote:
> The 2 queries are almost same, but ORDER BY x||t is FASTER than ORDER BY x..
>
> How can that be possible?
>
> Btw: x and x||t are same ordered
>
> phoeniks=> explain analyze SELECT * FROM test WHERE i<20 ORDER BY x || t;
> QUERY PLAN
>

What types are x and t? I have the feeling "x || t" is actually a
boolean, so it is only a True/False sort, while ORDER BY x has to do
some sort of string comparison (which might actually be a
locale-dependent comparison, and strcoll can be very slow with some locales).

John
=:->

> 
> --
>  Sort  (cost=2282.65..2284.92 rows=907 width=946) (actual
> time=74.982..79.114 rows=950 loops=1)
>Sort Key: (x || t)
>->  Index Scan using i_i on test  (cost=0.00..2238.09 rows=907 width=946)
> (actual time=0.077..51.015 rows=950 loops=1)
>  Index Cond: (i < 20)
>  Total runtime: 85.944 ms
> (5 rows)
>
> phoeniks=> explain analyze SELECT * FROM test WHERE i<20 ORDER BY x;
>QUERY PLAN
> 
> -
>  Sort  (cost=2280.38..2282.65 rows=907 width=946) (actual
> time=175.431..179.239 rows=950 loops=1)
>Sort Key: x
>->  Index Scan using i_i on test  (cost=0.00..2235.82 rows=907 width=946)
> (actual time=0.024..5.378 rows=950 loops=1)
>  Index Cond: (i < 20)
>  Total runtime: 183.317 ms
> (5 rows)
>
>
>
>
>
> phoeniks=> \d+ test
> Table "public.test"
>  Column |  Type   | Modifiers | Description
> +-+---+-
>  i  | integer |   |
>  t  | text|   |
>  x  | text|   |
> Indexes:
> "i_i" btree (i)
> "x_i" btree (xpath_string(x, 'data'::text))
> "x_ii" btree (xpath_string(x, 'movie/characters/character'::text))
> Has OIDs: no
>
>
> ---(end of broadcast)---
> TIP 3: Have you checked our extensive FAQ?
>
>http://www.postgresql.org/docs/faq
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] join and query planner

2005-07-11 Thread John A Meinel
Dario Pudlo wrote:
> (first at all, sorry for my english)
> Hi.
>- Does "left join" restrict the order in which the planner must join
> tables? I've read about join, but i'm not sure about left join...
>- If so: Can I avoid this behavior? I mean, make the planner resolve the
> query, using statistics (uniqueness, data distribution) rather than join
> order.
>
>   My query looks like:
>   SELECT ...
>   FROM a, b,
>   LEFT JOIN c ON (c.key = a.key)
>   LEFT JOIN d on (d.key=a.key)
>   WHERE (a.key = b.key)  AND (b.column <= 100)
>
>   b.column has a lot better selectivity, but planner insist on resolve
> first c.key = a.key.
>
>   Of course, I could rewrite something like:
>   SELECT ...
>   FROM
>(SELECT ...
> FROM a,b
> LEFT JOIN d on (d.key=a.key)
> WHERE (b.column <= 100)
> )
> as aa
>   LEFT JOIN c ON (c.key = aa.key)
>
>   but this is query is constructed by an application with a "multicolumn"
> filter. It's dynamic.
>   It means that a user could choose to look for "c.column = 1000". And
> also, combinations of filters.
>
>   So, I need the planner to choose the best plan...

Probably forcing the other join earlier could help:
SELECT ...
  FROM a JOIN b ON (a.key = b.key)
  LEFT JOIN c ON (c.key = a.key)
...

I think the problem is that postgresql can't easily reorder explicit
JOIN syntax. But you can make the JOIN happen earlier.

John
=:->
>
> I've already change statistics, I clustered tables with cluster, ran vacuum
> analyze, changed work_mem, shared_buffers...
>
> Greetings. TIA.
>
>
> ---(end of broadcast)---
> TIP 2: Don't 'kill -9' the postmaster
>



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] CURSOR slowes down a WHERE clause 100 times?

2005-07-06 Thread John A Meinel
Niccolo Rigacci wrote:

>Hi to all,
>
>I have a performace problem with the following query:
>
>  BEGIN;
>  DECLARE mycursor BINARY CURSOR FOR
>SELECT
>  toponimo,
>  wpt
>  FROM wpt_comuni_view
>WHERE (
>  wpt &&
>  setSRID('BOX3D(4.83 36, 20.16 47.5)'::BOX3D, 4326)
>);
>  FETCH ALL IN mycursor;
>  END;
>
>I get the results in about 108 seconds (8060 rows).
>
>If I issue the SELECT alone (without the CURSOR) I get the
>same results in less than 1 second.
>
>The wpt_comuni_view is a VIEW of a 3 tables JOIN, and the "wpt"
>field is a PostGIS geometry column. The "&&" is the PostGIS
>"overlaps" operator.
>
>If I CURSOR SELECT from a temp table instead of the JOIN VIEW the
>query time 1 second.
>
>If I omit the WHERE clause the CURSOR fetches results in 1
>second.
>
>Can the CURSOR on JOIN affects so heavly the WHERE clause? I
>suspect that - with the CURSOR - a sequential scan is performed
>on the entire data set for each fetched record...
>
>Any idea?
>
>
What does it say if you do "EXPLAIN ANALYZE SELECT..." both with and
without the cursor?
It may not say much for the cursor, but I think you can explain analyze
the fetch statements.

It is my understanding that cursors generally favor a fast-startup
style plan, which usually means using an index, because the planner
expects that you won't actually fetch all of the data. A seqscan is not
always slower, especially if you need to go through most of the data.

Without an explain analyze it's hard to say what the planner is thinking
and doing.

>This is the definition of the VIEW:
>
>  CREATE VIEW wpt_comuni_view AS
>SELECT istat_wpt.oid, istat_wpt.id, istat_wpt.toponimo,
>  istat_comuni.residenti, istat_wpt.wpt
>  FROM istat_comuni
>  JOIN istat_comuni2wpt
>USING (idprovincia, idcomune)
>  JOIN istat_wpt
>ON (idwpt = id);
>
>Thank you for any hint.
>
>
>
You might also try comparing your CURSOR to a prepared statement. There
are a few rare cases where preparing is worse than issuing the query
directly, depending on your data layout.

John
=:->




signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Planner constants for RAM resident databases

2005-07-01 Thread John A Meinel

Emil Briggs wrote:


I just mentioned random_page_cost, but you should also tune
effective_cache_size, since that is effectively most of your RAM. It
depends what else is going on in the system, but setting it as high as
say 12-14GB is probably reasonable if it is a dedicated machine. With
random_page_cost 1.5-2, and higher effective_cache_size, you should be
doing pretty well.
John
=:->




I tried playing around with these and they had no effect. It seems the only
thing that makes a difference is cpu_tuple_cost.




I'm surprised. I know cpu_tuple_cost can affect it as well, but usually
the recommended way to get indexed scans is the above two parameters.

When you do "explain analyze" of a query that you have difficulties
with, how good are the planner's estimates? Is the estimated number of
rows about equal to the actual number of rows?
If the planner is mis-estimating, there is a whole different set of
tuning to do to help it estimate correctly.

John
=:->

PS> Use reply-all so that your comments go to the list.


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Planner constants for RAM resident databases

2005-07-01 Thread John A Meinel

Emil Briggs wrote:


I'm working with an application where the database is entirely resident in RAM
(the server is a quad opteron with 16GBytes of memory). It's a web
application and handles a high volume of queries. The planner seems to be
generating poor  plans for some of our queries which I can fix by raising
cpu_tuple_cost. I have seen some other comments in the archives saying that
this is a bad idea  but is that necessarily the case when the database is
entirely resident in RAM?

Emil





Generally, the key knob to twiddle when everything fits in RAM is
random_page_cost. If you truly have everything in RAM you could set it
almost to 1. A value of 1 means that it costs exactly the same to go
randomly through the data as it does to go sequentially. I would guess
that even in RAM it is faster to go sequentially (since you still have
to page and deal with the L1/L2/L3 caches, etc). But the default
random_page_cost of 4 is probably too high for you.
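The easy way to experiment is per session, before touching
postgresql.conf (a sketch -- the exact value is just a starting point):

  SET random_page_cost = 1.5;
  -- re-run EXPLAIN ANALYZE on the problem queries and compare the plans
  SET random_page_cost = 4;   -- back to the default if it doesn't help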

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] planner picking more expensive plan

2005-07-01 Thread John A Meinel

Sam Mason wrote:


Hi,

I've just been referred here after a conversion on IRC and everybody
seemed to think I've stumbled upon some strangeness.

The planner (in PG version 8.0.2) is choosing what it thinks is a more
expensive plan.  I've got a table of animals (about 3M rows) and their
movements (about 16M rows), and I'm trying to execute this query:

 SELECT a.birthlocnid, m.locnid
 FROM animals a
   LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid=0)
 LIMIT 10;




Why are you using LIMIT without an ORDER BY?
What are you actually trying to get out of this query? Is it just trying
to determine where the 'home' locations are?
It just seems like this query isn't very useful as written: it doesn't
restrict by animal id, and it just gets 10 arbitrarily chosen animals
where m.mtypeid=0.
And why a LEFT JOIN instead of a normal join?
Anyway, the general constraints you are applying seem kind of confusing.
What happens if you change the plan to:

 SELECT a.birthlocnid, m.locnid
 FROM animals a
   LEFT JOIN movements m ON (a.animalid = m.animalid AND m.mtypeid=0)
 ORDER BY a.animalid LIMIT 10;


I would guess that this would help the planner realize it should try to
use an index, since it can realize that it wants only a few rows by
a.animalid in order.
Though I also recognize that you aren't returning a.animalid so you
don't really know which animals you are returning.

I get the feeling you are trying to ask something like "do animals stay
at their birth location", or at least "how are animals moving around". I
don't know what m.mtypeid = 0 means, but I'm guessing it is something
like where their home is.

Anyway, I would say you need to put a little bit more restriction in, so
the planner can figure out how to get only 10 rows.

John
=:->


If I have "work_mem" set to something small (1000) it uses this plan:

   QUERY PLAN

Limit  (cost=0.00..202.52 rows=10 width=8) (actual time=0.221..0.600 rows=10 
loops=1)
  ->  Merge Left Join  (cost=0.00..6628.30 rows=3302780 width=8) (actual 
time=0.211..0.576 rows=10 loops=1)
Merge Cond: ("outer".animalid = "inner".animalid)
->  Index Scan using animals_pkey on animals a  (cost=0.00..10198983.91 
rows=3302780 width=8) (actual time=0.112..0.276 rows=10 loops=1)
->  Index Scan using movement_animal on movements m  
(cost=0.00..56642740.73 rows=3107737 width=8) (actual time=0.088..0.235 rows=10 
loops=1)
  Filter: (mtypeid = 0)
Total runtime: 0.413 ms

But if I increase "work_mem" to 1 it uses this plan:

   QUERY PLAN

Limit  (cost=565969.42..566141.09 rows=10 width=8) (actual 
time=27769.047..27769.246 rows=10 loops=1)
  ->  Merge Right Join  (cost=565969.42..57264070.77 rows=3302780 width=8) 
(actual time=27769.043..27769.228 rows=10 loops=1)
Merge Cond: ("outer".animalid = "inner".animalid)
->  Index Scan using movement_animal on movements m  
(cost=0.00..56642740.73 rows=3107737 width=8) (actual time=0.022..0.154 rows=10 
loops=1)
  Filter: (mtypeid = 0)
->  Sort  (cost=565969.42..574226.37 rows=3302780 width=8) (actual 
time=27768.991..27769.001 rows=10 loops=1)
  Sort Key: a.animalid
  ->  Seq Scan on animals a  (cost=0.00..77086.80 rows=3302780 
width=8) (actual time=0.039..5620.651 rows=3303418 loops=1)
Total runtime: 27851.097 ms


I've tried playing with the statistics as people suggested on IRC but to
no effect.  There was some discussion about why it would be doing this,
but nothing obvious came out of it.

SHOW ALL output is at the end of this mail but it should be pretty
standard apart from:

 shared_buffers = 1
 work_mem = 8192
 max_connections = 100
 effective_cache_size = 1

Hope that's enough information to be useful.

Thanks.

  Sam





signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


On Tue, Jun 28, 2005 at 12:02:09 -0400,
 Merlin Moncure <[EMAIL PROTECTED]> wrote:



Confirmed.  Hats off to you, the above some really wicked querying.
IIRC I posted the same question several months ago with no response and
had given up on it.  I think your solution (smallest X1 not in X) is a
good candidate for general bits, so I'm passing this to varlena for
review :)

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;



You need to rework this to check to see if row '1' is missing. The
above returns the start of the first gap after the first row that
isn't missing.




Correct.

In fact, I left out a detail in my original request in that I had a
starting value (easily supplied with where clause)...so what I was
really looking for was a query which started at a supplied value and
looped forwards looking for an empty slot.  John's supplied query is a
drop in replacement for a plpgsql routine which does exactly this.

The main problem with the generate_series approach is that there is no
convenient way to determine a supplied upper bound.  Also, in some
corner cases of my problem domain the performance was not good.

Merlin



Actually, if you already have a lower bound, then you can change it to:

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE t1.id > id_min
AND NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;

This would actually really help performance if you have a large table
and the empty entries start late.

On my system, where the first entry is 64k, doing where id > 6
speeds it up back to 80ms instead of 1000ms.
John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


John Meinel wrote:



See my follow up post, which enables an index scan. On my system with
90k rows, it takes no apparent time.
(0.000ms)
John
=:->




Confirmed.  Hats off to you, the above some really wicked querying.
IIRC I posted the same question several months ago with no response and
had given up on it.  I think your solution (smallest X1 not in X) is a
good candidate for general bits, so I'm passing this to varlena for
review :)

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;

Merlin



Just be aware that as your table fills its holes, this query gets
slower and slower.
I've been doing some testing, and it starts at 0.00 when the first empty
entry is something like 3, but by the time the first gap is up around
16k it starts taking more like 200 ms.

So it kind of depends how your table fills (and empties I suppose).

The earlier query was slower overall (since it took 460ms to read in the
whole table).
I filled up the table such that 63713 is the first empty space, and it
takes 969ms to run.
So actually if your table is mostly full, the first form is better.

But if you are going to have 100k rows, with basically random
distribution of empties, then the NOT EXISTS works quite well.

Just be aware of the tradeoff. I'm pretty sure the WHERE NOT EXISTS will
always use a looping structure, and go through the index in order.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] read block size

2005-06-28 Thread John A Meinel

Michael Stone wrote:


Is it possible to tweak the size of a block that postgres tries to read
when doing a sequential scan? It looks like it reads in fairly small
blocks, and I'd expect a fairly significant boost in i/o performance
when doing a large (multi-gig) sequential scan if larger blocks were
used.

Mike Stone



I believe postgres reads in one database page at a time, which defaults
to 8k IIRC. If you want bigger, you could recompile and set the default
page size to something else.

There has been discussion about changing the reading/writing code to be
able to handle multiple pages at once (using something like vread()),
but I don't know that it has been implemented.

Also, this would hurt cases where you can terminate a sequential scan
early. And if the OS is doing its job right, it will already do some
read-ahead for you.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

John A Meinel wrote:


Merlin Moncure wrote:


I need a fast way (sql only preferred) to solve the following problem:

I need the smallest integer that is greater than zero that is not in the
column of a table.  In other words, if an 'id' column has values
1,2,3,4,6 and 7, I need a query that returns the value of 5.

I've already worked out a query using generate_series (not scalable) and
pl/pgsql.  An SQL only solution would be preferred, am I missing
something obvious?

Merlin




Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
   WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.

I don't know how big you want to scale to.

You might try something like:
SELECT id+1 as id_new FROM t
   WHERE (id+1) NOT IN (SELECT id FROM t)
   ORDER BY id LIMIT 1;

John
=:->


Well, I was able to improve it to using appropriate index scans.
Here is the query:

SELECT t1.id+1 as id_new FROM id_test t1
   WHERE NOT EXISTS
   (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
   ORDER BY t1.id LIMIT 1;

I created a test table which has 90k randomly inserted rows. And this is
what EXPLAIN ANALYZE says:

  QUERY PLAN


Limit  (cost=0.00..12.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 
loops=1)
  ->  Index Scan using id_test_pkey on id_test t1  (cost=0.00..544423.27 
rows=45000 width=4) (actual time=0.000..0.000 rows=1 loops=1)
Filter: (NOT (subplan))
SubPlan
  ->  Index Scan using id_test_pkey on id_test t2  (cost=0.00..6.01 
rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=15)
Index Cond: (id = ($0 + 1))
Total runtime: 0.000 ms
(7 rows)

The only thing I have is a primary key index on id_test(id);

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
   WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.




it does, but it is still faster than generate_series(), which requires
both a seqscan and a materialization of the function.




I don't know how big you want to scale to.




big. :)

merlin




See my follow up post, which enables an index scan. On my system with
90k rows, it takes no apparent time.
(0.000ms)
John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tricky query

2005-06-28 Thread John A Meinel

Merlin Moncure wrote:


I need a fast way (sql only preferred) to solve the following problem:

I need the smallest integer that is greater than zero that is not in the
column of a table.  In other words, if an 'id' column has values
1,2,3,4,6 and 7, I need a query that returns the value of 5.

I've already worked out a query using generate_series (not scalable) and
pl/pgsql.  An SQL only solution would be preferred, am I missing
something obvious?

Merlin




Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
   WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.

I don't know how big you want to scale to.

You might try something like:
SELECT id+1 as id_new FROM t
   WHERE (id+1) NOT IN (SELECT id FROM t)
   ORDER BY id LIMIT 1;

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Too slow querying a table of 15 million records

2005-06-28 Thread John A Meinel

Tobias Brox wrote:


[EMAIL PROTECTED] - Tue at 08:33:58PM +0200]



I use FreeBSD 4.11 with PostGreSQL 7.3.8.



(...)



database=> explain select date_trunc('hour', time),count(*) as total from
test where p1=53 and time > now() - interval '24 hours' group by
date_trunc order by date_trunc ;




I haven't looked through all your email yet, but this phenomena have been up
at the list a couple of times.  Try replacing "now() - interval '24 hours'"
with a fixed time stamp, and see if it helps.

pg7 will plan the query without knowledge of what "now() - interval '24
hours'" will compute to.  This should be fixed in pg8.




The grandparent was a mailing list double send. Notice the date is 1
week ago. It has already been answered (though your answer is still
correct).

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Speed with offset clause

2005-06-24 Thread John A Meinel

Yves Vindevogel wrote:


Hi again all,

My queries are now optimised. They all use the indexes like they should.
However, there's still a slight problem when I issue the "offset" clause.

We have a table that contains 600.000 records
We display them by 25 in the webpage.
So, when I want the last page, which is: 600k / 25 = page 24000 - 1 = 
23999, I issue the offset of 23999 * 25
This take a long time to run, about 5-10 seconds whereas offset below 
100 take less than a second.


Can I speed this up ?


Met vriendelijke groeten,
Bien à vous,
Kind regards,

*Yves Vindevogel*
*Implements*

Postgres has the optimization that it will plan a query, and once it 
reaches the limit, it can stop even though there is more data available.
The problem you are having is that it has to go through "offset" rows 
first, before it can apply the limit.
If you can, (as mentioned in the other post), try to refine your index 
so that you can reverse it for the second half of the data.


This is probably tricky, as you may not know how many rows you have (or 
the amount might be changing).
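For pages near the end, one way to "reverse" it is to sort in the
opposite direction and count the offset from the other end, roughly like
this (a sketch -- I'm assuming the sort column is table1.date as in the
example below, and that you know the total row count):

  -- last page, counted from the end instead of the beginning;
  -- for the Nth page from the end, use OFFSET (N-1)*25
  SELECT * FROM table1, table2
   WHERE table1.id = table2.id
   ORDER BY table1.date DESC
   OFFSET 0 LIMIT 25;

The application then reverses those 25 rows before displaying them.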


A potentially better approach: if you have an index you are using, you
could use a subselect so that the only part of the query that has to
step over all of those offset rows is a single narrow column.


Maybe an example:
Instead of saying:

SELECT * FROM table1, table2 WHERE table1.id = table2.id ORDER BY 
table1.date OFFSET x LIMIT 25;


You could do:

SELECT * FROM
   (SELECT id FROM table1 ORDER BY date OFFSET x LIMIT 25) as subselect
   JOIN table1 ON subselect.id = table1.id
   , table2
   WHERE table1.id = table2.id;

That means that the culling process only has to walk a single narrow
column of one table, and the real join work is then done on only 25 rows.


It really depends on you query, though, as what rows you are sorting on 
has a big influence on how well this will work.


John
=:->




signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] max_connections / shared_buffers / effective_cache_size

2005-06-24 Thread John A Meinel

Puddle wrote:


Hello, I'm a Sun Solaris sys admin for a start-up
company.  I've got the UNIX background, but now I'm
having to learn PostgreSQL to support it on our
servers :)

Server Background:

Solaris 10 x86
PostgreSQL 8.0.3
Dell PowerEdge 2650 w/4gb ram.
This is running JBoss/Apache as well (I KNOW the bad
juju of running it all on one box, but it's all we
have currently for this project). I'm dedicating 1gb
for PostgreSQL alone.

So, far I LOVE it compared to MySQL it's solid.

The only things I'm kind of confused about (and I've
been searching for answers on lot of good perf docs,
but not too clear to me) are the following:

1.) shared_buffers I see lot of reference to making
this the size of available ram (for the DB).  However,
I also read to make it the size of pgdata directory.

I notice when I load postgres each daemon is using the
amount of shared memory (shared_buffers).  Our current
dataset (pgdata) is 85mb in size.  So, I'm curious
should this size reflect the pgdata or the 'actual'
memory given?

I currently have this at 128mb



You generally want shared_buffers to be no more than 10% of available
RAM. Postgres expects the OS to do its own caching. 128M/4G = 3% seems
reasonable to me. I would certainly never set it to 100% of RAM.


2.) effective_cache_size - from what I read this is
the 'total' allowed memory for postgresql to use
correct? So, if I am willing to allow 1GB of memory
should I make this 1GB?



This is the effective amount of caching between the actual postgres
buffers, and the OS buffers. If you are dedicating this machine to
postgres, I would set it to something like 3.5G. If it is a mixed
machine, then you have to think about it.

This does not change how postgres uses RAM, it changes how postgres
estimates whether an Index scan will be cheaper than a Sequential scan,
based on the likelihood that the data you want will already be cached in
Ram.

If you dataset is only 85MB, and you don't think it will grow, you
really don't have to worry about this much. You have a very small database.
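Concretely, for a 4GB box that is also running JBoss/Apache, something
in this neighborhood seems sane to me as a starting point (a sketch
only, not tuned values; both settings are in 8KB pages):

  shared_buffers = 16384          # 128MB, what you have now
  effective_cache_size = 131072   # ~1GB, the share you said the DB can have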


3.) max_connections, been trying to figure 'how' to
determine this #.  I've read this is buffer_size+500k
per a connection.

ie.  128mb(buffer) + 500kb = 128.5mb per connection?



Max connections is just how many concurrent connections you want to
allow. If you can get away with fewer, do so.  Mostly this is to prevent
connections * work_mem from getting bigger than your real working memory
and causing you to swap.


I was curious about 'sort_mem' I can't find reference
of it in the 8.0.3 documentation, has it been removed?



sort_mem changed to work_mem in 8.0, same thing with vacuum_mem ->
maintenance_work_mem.


work_mem and max_stack_depth set to 4096
maintenance_work_mem set to 64mb



Depends how much space you want to give per connection. 4M is pretty
small for a machine with 4G of RAM, but if your DB is only 85M it might
be plenty.
work_mem is how much memory a sort/hash/etc will use before it spills to
disk. So look at your queries. If you tend to sort most of your 85M db
in a single query, you might want to make it a little bit more. But if
all of your queries are very selective, 4M could be plenty.

I would make maintenance_work_mem more like 512M. It is only used for
CREATE INDEX, VACUUM, etc. -- things that are not generally done by more
than one process at a time. And it's nice for them to have plenty of
room to run fast.
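In postgresql.conf terms, something like this (a sketch; both values
are in KB on 8.0 and are only starting points):

  work_mem = 4096                 # 4MB per sort/hash operation
  maintenance_work_mem = 524288   # 512MB for CREATE INDEX, VACUUM, etc.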


Thanks for any help on this.  I'm sure bombardment of
newbies gets old :)

-William



Good luck,
John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Another question on indexes (drop and recreate)

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


I only add records, and most of the values are "random"
Except the columns for dates, 


I doubt that you would need to recreate indexes. That really only needs
to be done in pathological cases, most of which have been fixed in the
latest postgres.

If you are only inserting (never updating or deleting), the index can
never bloat, since you are only adding new stuff.
(You cannot get dead items to bloat your index if you never delete
anything.)

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Another question on indexes (drop and recreate)

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


And, after let's say a week, would that index still be optimal or
would it be a good idea to drop it in the weekend and recreate it.



It depends a little bit on the postgres version you are using. If you
are only ever adding to the table, and you are not updating it or
deleting from it, I think the index is always optimal.
Once you start deleting from it there are a few cases where older
versions would not properly re-use the empty entries, requiring a
REINDEX. (Deleting low numbers and always adding high numbers was one of
the cases)

However, I believe that as long as you vacuum often enough, so that the
system knows where the unused entries are, you don't ever have to drop
and re-create the index.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Another question on indexes (drop and recreate)

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


Hi,

I have another question regarding indexes.

I have a table with a lot of indexes on it. Those are needed to 
perform my searches.

Once a day, a bunch of records is inserted in my table.

Say, my table has 1.000.000 records and I add 10.000 records (1% new)
What would be faster.

1) Dropping my indexes and recreating them after the inserts
2) Just inserting it and have PG manage the indexes

Met vriendelijke groeten,
Bien à vous,
Kind regards,

*Yves Vindevogel*
*Implements*



I'm guessing for 1% new that (2) would be faster.
John
=:->




signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] slow growing table

2005-06-21 Thread John A Meinel

Jone C wrote:


On second thought... Does a VACUUM FULL help? If so, you might want to
increase your FSM settings.




Thank you for the reply, sorry for delay I was on holiday.

I tried that it had no effect. I benchmarked 2x before, peformed
VACUUM FULL on the table in question post inserts, then benchmarked 2x
after. Same results...

Should I try your suggestion on deleting the indexes? This table needs
to be accessible for reads at all times however though...

thank you kindly




I believe dropping an index inside a transaction is only visible to that
transaction. (Can someone back me up on this?)
Which means if you did:

BEGIN;
DROP INDEX <index_name>;
CREATE INDEX <index_name> ON <table> (<column>);
COMMIT;

The only problem is that if you are using a unique or primary key index,
a foreign key which is referencing that index would have to be dropped
and re-created as well. So you could have a pretty major cascade effect.

A better thing to do if your table only has one (or at least only a few)
indexes, would be to CLUSTER, which is effectively a VACUUM FULL + a
REINDEX (plus sorting the rows so that they are in index order). It
holds a full lock on the table, and takes a while, but when you are
done, things are cleaned up quite a bit.
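
For example (index and table names made up; CLUSTER takes the name of the
index to sort by):

CLUSTER mytable_pkey ON mytable;   -- rewrites the table in index order and rebuilds its indexes
ANALYZE mytable;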

You might also try just a REINDEX on the indexes in question, but this
also holds a full lock on the table. (My DROP + CREATE might also as
well, I'm not really sure, I just think of it as a way to recreate
without losing it for other transactions)

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Limit clause not using index

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


rvponp=# explain analyze select * from tblPrintjobs order by
loginuser, desceventdate, desceventtime ;
QUERY PLAN


Sort (cost=345699.06..347256.49 rows=622972 width=203) (actual
time=259438.952..268885.586 rows=622972 loops=1)
Sort Key: loginuser, desceventdate, desceventtime
-> Seq Scan on tblprintjobs (cost=0.00..25596.72 rows=622972
width=203) (actual time=21.155..8713.810 rows=622972 loops=1)
Total runtime: 271583.422 ms
(4 rows)



Can you post it with the limit? I realize the query takes a long time,
but that is the more important query to look at.

Also, just as a test, if you can, try dropping most of the indexes
except for the important one. It might be that the planner is having a
hard time because there are too many permutations to try.
I believe if you drop the indexes inside a transaction, they will still
be there for other queries, and if you rollback instead of commit, you
won't lose anything.

BEGIN;
DROP INDEX ...
EXPLAIN ANALYZE SELECT *...
ROLLBACK;

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Do Views execute underlying query everytime ??

2005-06-21 Thread John A Meinel

Amit V Shah wrote:


After I sent out this email, I found this article from google

http://jonathangardner.net/PostgreSQL/materialized_views/matviews.html

Looks like we can control as to when the views refresh... I am still kind of
confused, and would appreciate help !!

The create/drop table does sound a solution that can work, but the thing is
I want to get manual intervention out, and besides, my work flow is very
complex so this might not be an option for me :-(

Thanks,
Amit



Just to make it clear, a view is not the same as a materialized view.
A view is just a set of rules to the planner so that it can simplify
interactions with the database. A materialized view is a query which has
been saved into a table.

To set it up properly, really depends on what your needs are.

  1. How much time can elapse between an update to the system, and an
 update to the materialized views?
  2. How many updates / (sec, min, hour, month) do you expect. Is
 insert performance critical, or secondary.

For instance, if you get a lot of updates, but you can have a 1 hour lag
between the time a new row is inserted and the view is updated, you can
just create a cron job that runs every hour to regenerate the
materialized view.

If you don't get many updates, but you need them to show up right away,
then you can add triggers to the affected tables, such that
inserting/updating to a specific table causes an update to the
materialized view.
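
As a rough sketch of the cron-style approach (the table, column and function
names here are all invented, not from your schema or the article):

-- assumes the plpgsql language is already installed (createlang plpgsql)
CREATE TABLE sales_summary (
    product_id  integer,
    total_sales numeric
);

CREATE FUNCTION refresh_sales_summary() RETURNS void AS '
BEGIN
    DELETE FROM sales_summary;
    INSERT INTO sales_summary
        SELECT product_id, sum(amount)
          FROM sales
         GROUP BY product_id;
    RETURN;
END;
' LANGUAGE plpgsql;

-- cron (or a trigger) then just runs:  SELECT refresh_sales_summary();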

There are quite a few potential tradeoffs. Rather than doing a
materialized view, you could just improve your filters. If you are doing
a query to show people the results, you generally have some sort of
upper bound on how much data you can display. Humans don't like reading
more than 100 or 1000 rows. So create your normal query, and just tack
on a LIMIT 100 at the end. If you structure your query properly, and
have appropriate indexes, you should be able to make the LIMIT count,
and allow you to save a lot of overhead of generating rows that you
don't use.

I would probably start by posting the queries you are currently using,
along with an EXPLAIN ANALYZE, and a description of what you actually
need from the query. Then this list can be quite helpful in
restructuring your query to make it faster.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Limit clause not using index

2005-06-21 Thread John A Meinel

Yves Vindevogel wrote:


Hi,

I have a very simple query on a big table. When I issue a "limit" 
and/or "offset" clause, the query is not using the index.

Can anyone explain me this ?


You didn't give enough information. What does your index look like that
you are expecting it to use?

Generally, you want to have matching columns. So you would want
CREATE INDEX blah ON tblprintjobs(loginuser, desceventdate, desceventtime);

Next, you should post EXPLAIN ANALYZE instead of regular explain, so we 
can have an idea if the planner is actually making correct estimations.


John
=:->



rvponp=# explain select * from tblprintjobs order by loginuser, 
desceventdate, desceventtime offset 25 limit 25 ;

QUERY PLAN
------------------------------------------------------------------------
Limit (cost=349860.62..349860.68 rows=25 width=206)
-> Sort (cost=349860.56..351416.15 rows=622236 width=206)
Sort Key: loginuser, desceventdate, desceventtime
-> Seq Scan on tblprintjobs (cost=0.00..25589.36 rows=622236 width=206)
(4 rows)

rvponp=# explain select * from tblprintjobs order by loginuser, 
desceventdate, desceventtime ;

QUERY PLAN
------------------------------------------------------------------------
Sort (cost=349860.56..351416.15 rows=622236 width=206)
Sort Key: loginuser, desceventdate, desceventtime
-> Seq Scan on tblprintjobs (cost=0.00..25589.36 rows=622236 width=206)
(3 rows)

Met vriendelijke groeten,
Bien à vous,
Kind regards,

*Yves Vindevogel*
*Implements*






signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] How does the transaction buffer work?

2005-06-16 Thread John A Meinel

Veikko Mäkinen wrote:


Hey,

How does Postgres (8.0.x) buffer changes to a database within a 
transaction? I need to insert/update more than a thousand rows (maybe 
even more than 1 rows, ~100 bytes/row) in a table but the changes 
must not be visible to other users/transactions before every row is 
updated. One way of doing this that I thought of was start a 
transaction, delete everything and then just dump new data in (copy 
perhaps). The old data would be usable to other transactions until I 
commit my insert. This would be the fastest way, but how much memory 
would this use? Will this cause performance issues on a heavily loaded 
server with too little memory even to begin with :)


Postgres does indeed keep track of who can see what. Such that changes 
won't be seen until a final commit.

If you are trying to insert bulk data, definitely consider COPY.

But UPDATE should also be invisible until the commit. So if you are only 
changing data, there really isn't any reason to do a DELETE and INSERT. 
Especially since you'll have problems with foreign keys at the DELETE stage.
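
Just as a sketch of the visibility point (the table name is invented):

BEGIN;
UPDATE prices SET price = price * 1.05;   -- or a bulk COPY into the table
-- other sessions still see the old rows here
COMMIT;                                   -- all of the changes appear at once

Nothing in that transaction is visible to anyone else until the COMMIT,
whether the rows went in via UPDATE, INSERT or COPY.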


John
=:->



-veikko






signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] PostgreSQL using the wrong Index

2005-06-13 Thread John A Meinel

Alex Stapleton wrote:


Oh, we are running 7.4.2 btw. And our random_page_cost = 1


Which is only correct if your entire db fits into memory. Also, try
updating to a later 7.4 version if at all possible.


On 13 Jun 2005, at 14:02, Alex Stapleton wrote:


We have two index's like so

l1_historical=# \d "N_intra_time_idx"
   Index "N_intra_time_idx"
 Column |            Type
--------+-----------------------------
 time   | timestamp without time zone
btree


Just so you are aware, writing this as: "We have an index on
N_intra(time) and one on N_Intra(symbol, time)" is a lot more succinct.



l1_historical=# \d "N_intra_pkey"
 Index "N_intra_pkey"
 Column |            Type
--------+-----------------------------
 symbol | text
 time   | timestamp without time zone
unique btree (primary key)

and on queries like this

select * from "N_intra" where symbol='SOMETHING WHICH DOESNT EXIST'
order by time desc limit 1;

PostgreSQL takes a very long time to complete, as it effectively
scans the entire table, backwards. And the table is huge, about 450
million rows. (btw, there are no triggers or any other exciting
things like that on our tables in this db.)

but on things where the symbol does exist in the table, it's more  or
less fine, and nice and fast.



What happens if you do:
SELECT * FROM "N_intra" WHERE symbol='doesnt exist' ORDER BY symbol,
time DESC LIMIT 1;

Yes, symbol is constant, but it frequently helps the planner realize it
can use an index scan if you include all terms in the index in the ORDER
BY clause.



Whilst the option the planner has taken might be faster most of the
time, the worst case scenario is unacceptable for obvious reasons.
I've googled for trying to force the use of a specific index, but
can't find anything relevant. Does anyone have any suggestions on
getting it to use an index which hopefully will have better worst
case performance?



Try the above first. You could also create a new index on symbol
   CREATE INDEX "N_intra_symbol_idx" ON "N_intra"(symbol);

Then the WHERE clause should use the symbol index, which means it can
know quickly that an entry doesn't exist. I'm not sure how many entries
you have per symbol, though, so this might cause problems in the ORDER
BY time portion.

I'm guessing what you really want is to just do the ORDER BY symbol, time.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Index not being used

2005-06-13 Thread John A Meinel

Kevin Grittner wrote:


It sure would be nice if the optimizer would consider that it had the
leeway to add any column which was restricted to a single value to any
point in the ORDER BY clause.  Without that, the application programmer
has to know what indexes are on the table, rather than being able to
just worry about the set of data they want.  Obviously, if a column can
have only one value in the result set, adding to any point in the ORDER
BY can't change anything but performance.  That sure sounds like
something which should fall within the scope of an optimizer.

It really should be a DBA function to add or drop indexes to tune the
performance of queries, without requiring application programmers to
modify the queries for every DBA adjustment.  (When you have a database
with over 350 tables and thousands of queries, you really begin to
appreciate the importance of this.)



I agree that having a smarter optimizer, which can recognize when an
index can be used for ORDER BY would be useful.

I don't know if there are specific reasons why not, other than just not
being implemented yet. It might be tricky to get it correct (for
instance, how do you know which columns can be added, which ones will be
constant) Perhaps you could just potentially add the WHERE items if they
have an equality constraint with a constant. But I'm guessing there are
more cases than that where the optimization could be performed.

Also, the more options you give the planner, the longer it takes on
average to plan any single query. Yes, it is beneficial for this use
case, but does that balance out slowing down all the other queries by a
tiny bit?

I'm guessing the optimization wasn't as important as some of the others
that have been done, so it hasn't been implemented yet.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] faster search

2005-06-10 Thread John A Meinel
Clark Slater wrote:
> hmm, i'm baffled.  i simplified the query
> and it is still taking forever...
> 
> 
>           test
> ---------------+------------------------
>  id            | integer
>  partnumber    | character varying(32)
>  productlistid | integer
>  typeid        | integer
> 
> 
> Indexes:
> "test_productlistid" btree (productlistid)
> "test_typeid" btree (typeid)
> "test_productlistid_typeid" btree (productlistid, typeid)
> 
> 
> explain analyze select * from test where (productlistid=3 and typeid=9);
> 
>   QUERY PLAN
> ---
>  Seq Scan on test  (cost=0.00..96458.27 rows=156194 width=725) (actual
> time=516.459..41930.250 rows=132528 loops=1)
>Filter: ((productlistid = 3) AND (typeid = 9))
>  Total runtime: 41975.154 ms
> (3 rows)
> 
> 

This query is still going to take a long time, because you have to scan
the whole table. Your WHERE clause is not very specific (it takes 25% of
the table). Convention says that any time you want > 5-10% of a table, a
sequential scan is better, because it does it in order.

Now if you did:

explain analyze select * from test where (productlistid=3 and typeid=9)
limit 15;

I think that would be very fast.

I am a little surprised that it is taking 40s to scan only 400k rows,
though. On an older machine of mine (with only 256M ram and dual 450MHz
Celerons), I have a table with 74k rows which takes about .5 sec. At
those numbers it should take more like 4s not 40.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] faster search

2005-06-10 Thread John A Meinel
Clark Slater wrote:
> Hi-
> 
> Would someone please enlighten me as
> to why I'm not seeing a faster execution
> time on the simple scenario below?
> 
> there are 412,485 rows in the table and the
> query matches on 132,528 rows, taking
> almost a minute to execute.  vaccuum
> analyze was just run.

Well, if you are matching 130k out of 400k rows, then a sequential scan
is certainly preferred to an index scan. And then you have to sort those
130k rows by partnumber. This *might* be spilling to disk depending on
what your work_mem/sort_mem is set to.

I would also say that what you would really want is some way to get the
whole thing from an index. And I think the way to do that is:

CREATE INDEX test_partnum_listid_typeid_idx ON
test(partnumber, productlistid, typeid);

VACUUM ANALYZE test;

EXPLAIN ANALYZE SELECT * FROM test
WHERE productlistid=3 AND typeid=9
ORDER BY partnumber, productlistid, typeid
LIMIT 15
;

The trick is that you have to match the order by exactly with the index,
so the planner realizes it can do an indexed lookup to get the information.

You could also just create an index on partnumber, and see how that
affects your original query. I think the planner could use an index
lookup on partnumber to get the ordering correct. But it will have to do
filtering after the fact based on productlistid and typeid.
With my extended index, I think the planner can be smarter and lookup
all 3 by the index.

> 
> Thanks!
> Clark

Good luck,
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Postgresql on an AMD64 machine

2005-06-07 Thread John A Meinel
Neil Conway wrote:
> Tom Arthurs wrote:
> 
>> Yes, shared buffers in postgres are not used for caching
> 
> 
> Shared buffers in Postgres _are_ used for caching, they just form a
> secondary cache on top of the kernel's IO cache. Postgres does IO
> through the filesystem, which is then cached by the kernel. Increasing
> shared_buffers means that less memory is available for the kernel to
> cache IO -- increasing shared_buffers has been shown to be a net
> performance loss beyond a certain point. Still, there is value in
> shared_buffers as it means we can avoid a read() system call for hot
> pages. We can also do better buffer replacement in the PG shared buffer
> than the kernel can do (e.g. treating IO caused by VACUUM specially).
> 

As I recall, one of the performance problems with a large shared_buffers
is that there are some commands which require looking at *all* of the
shared buffer space. So the larger it gets, the longer those functions take.

>> My biggest challenge with solaris/sparc is trying to reduce context
>> switching.
> 
> 
> It would be interesting to see if this is improved with current sources,
> as Tom's bufmgr rewrite should have hopefully have reduced this problem.
> 

These might be what was fixed with Tom's rewrite. I don't really know.

John
=:->

> -Neil
> 



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Postgresql and Software RAID/LVM

2005-06-06 Thread John A Meinel
Marty Scholes wrote:
>> Has anyone ran Postgres with software RAID or LVM on a production box?
>> What have been your experience?
> 
> Yes, we have run for a couple years Pg with software LVM (mirroring)
> against two hardware RAID5 arrays.  We host a production Sun box that
> runs 24/7.
> 
> My experience:
> * Software RAID (other than mirroring) is a disaster waiting to happen.
>  If the metadata for the RAID set gives out for any reason (CMOS
> scrambles, card dies, power spike, etc.) then you are hosed beyond
> belief.  In most cases it is almost impossible to recover.  With
> mirroring, however, you can always boot and operate on a single mirror,
> pretending that no LVM/RAID is underway.  In other words, each mirror is
> a fully functional copy of the data which will operate your server.

Isn't this actually more of a problem for the meta-data to give out in a
hardware situation? I mean, if the card you are using dies, you can't
just get another one.
With software raid, because the meta-data is on the drives, you can pull
it out of that machine, and put it into any machine that has a
controller which can read the drives, and a similar kernel, and you are
back up and running.
> 
> * Hardware RAID5 is a terrific way to boost performance via write
> caching and spreading I/O across multiple spindles.  Each of our
> external arrays operates 14 drives (12 data, 1 parity and 1 hot spare).
>  While RAID5 protects against single spindle failure, it will not hedge
> against multiple failures in a short time period, SCSI contoller
> failure, SCSI cable problems or even wholesale failure of the RAID
> controller.  All of these things happen in a 24/7 operation.  Using
> software RAID1 against the hardware RAID5 arrays hedges against any
> single failure.

No, it hedges against *more* than one failure. But you can also do a
RAID1 over a RAID5 in software. But if you are honestly willing to
create a full RAID1, just create a RAID1 over RAID0. The performance is
much better. And since you have a full RAID1, as long as both drives of
a pairing don't give out, you can lose half of your drives.

If you want the space, but you feel that RAID5 isn't redundant enough,
go to RAID6, which uses 2 parity locations, each with a different method
of storing parity, so not only is it more redundant, you have a better
chance of finding problems.

> 
> * Software mirroring gives you tremendous ability to change the system
> while it is running, by taking offline the mirror you wish to change and
> then synchronizing it after the change.
>

That certainly is a nice ability. But remember that LVM also has the
idea of "snapshot"ing a running system. I don't know the exact details,
just that there is a way to have some processes see the filesystem as it
existed at an exact point in time. Which is also a great way to handle
backups.

> On a fully operational production server, we have:
> * restriped the RAID5 array
> * replaced all RAID5 media with higher capacity drives
> * upgraded RAID5 controller
> * moved all data from an old RAID5 array to a newer one
> * replaced host SCSI controller
> * uncabled and physically moved storage to a different part of data center
> 
> Again, all of this has taken place (over the years) while our machine
> was fully operational.
> 
So you are saying that you were able to replace the RAID controller
without turning off the machine? I realize hot-swappable PCI cards do
exist, but I think you are overstating what you mean
by "fully operational". For instance, it's not like you can access your
data while it is being physically moved.

I do think you had some nice hardware. But I know you can do all of this
in software as well. It is usually a price/performance tradeoff. You
spend quite a bit to get a hardware RAID card that can keep up with a
modern CPU. I know we have an FC raid box at work which has a full 512MB
of cache on it, but it wasn't that much cheaper than buying a dedicated
server.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance nightmare with dspam (urgent) (resolved)

2005-06-06 Thread John A Meinel
Michael Stone wrote:
> On Mon, Jun 06, 2005 at 10:08:23AM -0500, John A Meinel wrote:
> 
>> I don't know if you can do it, but it would be nice to see this be 1
>> RAID1 for OS, 1 RAID10 for pg_xlog, 
> 
> 
> That's probably overkill--it's a relatively small sequential-write
> partition with really small writes; I don't see how pg_xlog would
> benefit from raid10 as opposed to raid1.
> Mike Stone
> 

pg_xlog benefits from being super fast. Because it has to be fully
synced before the rest of the data can be committed. Yes they are small,
but if you can make it fast, you eliminate that overhead. It also
benefits from having its own spindle, because you eliminate the seek
time. (Since it is always appending)

Anyway, my point is that pg_xlog isn't necessarily tiny. Many people
seem to set checkpoint_segments as high as 100-200, and each segment is 16MB.

But one other thing to consider is to make pg_xlog on a battery backed
ramdisk. Because it really *can* use the extra speed. I can't say that a
ramdisk is more cost effective than faster db disks. But if you aren't
using many checkpoint_segments, it seems like you could get a 1GB
ramdisk, and probably have a pretty good performance boost. (I have not
tested this personally, though).

Since he is using the default settings (mostly) for dspam, he could
probably get away with something like a 256MB ramdisk.

The only prices I could find with a few minutes of googling were:
http://www.cenatek.com/store/category.cfm?Category=15
Which is $1.6k for 2GB.

But there is also a product that is being developed, which claims $60
for the PCI card, you supply the memory. It has 4 DDR slots
http://www.engadget.com/entry/1234000227045399/
And you can get a 128MB SDRAM ECC module for around $22
http://www.newegg.com/Product/Product.asp?Item=N82E16820998004
So that would put the total cost of a 512MB battery backed ramdisk at
$60 + 4*$22 = $148, call it $150.

That certainly seems less than what you would pay for the same speed in
hard-drives.
Unfortunately the Giga-byte iRam seems to just be in the demo stage. But
if they aren't lying in the press releases, it would certainly be
something to keep an eye on.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Performance nightmare with dspam (urgent) (resolved)

2005-06-06 Thread John A Meinel
Casey Allen Shobe wrote:
> On Wednesday 01 June 2005 20:19, Casey Allen Shobe wrote:
> 
...
> Long-term, whenever we hit the I/O limit again, it looks like we really don't 
> have much of a solution except to throw more hardware (mainly lots of disks 
> in RAID0's) at the problem. :(  Fortunately, with the above two changes I/O 
> usage on the PG data disk is a quarter of what it was, so theoretically we 
> should be able to quadruple the number of users on current hardware.
> 

Be very careful in this situation. If any disks in a RAID0 fails, the
entire raid is lost. You *really* want a RAID10. It takes more drives,
but then if anything dies you don't lose everything.

If you are running RAID0 and you *really* want performance, and aren't
concerned about safety (at all), you could also set fsync=false. That
should also speed things up. But you are really risking corruption/data
loss on your system.

> Our plan forward is to increase the number of disks in the two redundant mail 
> servers, so that each has a single ultra320 disk for O/S and pg_xlog, and a 
> 3-disk RAID0 for the data.  This should triple our current capacity.

I don't know if you can do it, but it would be nice to see this be 1
RAID1 for OS, 1 RAID10 for pg_xlog, and another RAID10 for data. That is
the recommended performance layout. It takes quite a few drives (minimum
of 10). But it means your data is safe, and your performance should be
very good.

> 
> The general opinion of the way dspam uses the database among people I've 
> talked to on #postgresql is not very good, but of course the dspam folk blame 
> PostgreSQL and say to use MySQL if you want reasonable performance.  Makes it 
> real fun to be a DSpam+PostgreSQL user when limits are reached, since 
> everyone denies responsibility.  Fortunately, PostgreSQL people are pretty 
> helpful even if they think the client software sucks. :)
> 

I can't say how dspam uses the database. But they certainly could make
assumptions about how certain actions are done by the db, which are not
quite true with postgres. (For instance MySQL can use an index to return
information, because Postgres supports transactions, it cannot, because
even though a row is in the index, it may not be visible to the current
transaction.)

They also might be doing stuff like "select max(row)" instead of "select
row ORDER BY row DESC LIMIT 1". In postgres the former will be a
sequential scan, the latter will be an index scan. Though I wonder about
"select max(row) ORDER BY row DESC LIMIT 1". to me, that should still
return the right answer, but I'm not sure.
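
For instance (table and column names invented here):

SELECT max(msg_id) FROM messages;                          -- sequential scan in postgres
SELECT msg_id FROM messages ORDER BY msg_id DESC LIMIT 1;  -- can walk an index on msg_id backwards

Both return the same value (when the table isn't empty), but only the
second form lets the planner use the index.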

> Cheers,

Good luck,
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Moving pg_xlog

2005-06-01 Thread John A Meinel

Tom Lane wrote:
...


Now that I think about it, you were (if I understood your layout
correctly) proposing to put the xlog on your system's root disk.
This is probably a bad idea for performance, because there will always
be other traffic to the root disk.  What you are really trying to
accomplish is to make sure the xlog is on a disk spindle that has no
other traffic besides xlog, so that the disk heads never have to move
off the current xlog file.  The xlog traffic is 100% sequential writes
and so you can cut the seeks involved to near nil if you can dedicate
a spindle to it.



I certainly agree with what you wrote. But my understanding is that if
you only have 2 arrays, then moving xlog onto the array that does not
hold the database is better than keeping it with the database. It isn't optimum,
but it is better. Because that way there isn't as much contention
between the database and xlog.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] postgresql-8.0.1 performance tuning

2005-05-31 Thread John A Meinel

Martin Fandel wrote:


Hi @ all,

i'm trying to tune my postgresql-db but i don't know if the values are
right
set.

I use the following environment for the postgres-db:

# Hardware 
cpu: 2x P4 3Ghz
ram: 1024MB DDR 266Mhz

partitions:
/dev/sda3  23G  9,6G   13G  44% /
/dev/sda1  11G  156M  9,9G   2% /var
/dev/sdb1  69G   13G   57G  19% /var/lib/pgsql

/dev/sda is in raid 1  (2x 35GB / 1upm / sca)
/dev/sdb is in raid 10 (4x 35GB / 1upm / sca)
# /Hardware 


You probably want to put the pg_xlog file onto /dev/sda rather than
having it in /dev/sdb. Having it separate from the data usually boosts
performance a lot. I believe you can just mv it to a different
directory, and then recreate it as a symlink. (Stop the database first :)



# Config 
/etc/sysctl.conf:
kernel.shmall = 786432000
kernel.shmmax = 786432000


Not really sure about these two.


/etc/fstab:
/dev/sdb1 /var/lib/pgsql reiserfs
acl,user_xattr,noatime,data=writeback 1 2


Seems decent.


/var/lib/pgsql/data/postgresql.conf
superuser_reserved_connections  = 2
shared_buffers  = 3000
work_mem= 131072
maintenance_work_mem= 131072


These both seem pretty large. But it depends on how many concurrent
connections doing sorting/hashing/etc you expect. If you are only
expecting 1 connection, these are probably fine. Otherwise with 1GB of
RAM I would probably make work_mem more like 4096/8192.
Remember, running out of work_mem means postgres will spill to disk,
slowing that query. Running out of RAM causes the system to swap, making
everything slow.


max_stack_depth = 2048
max_fsm_pages   = 2
max_fsm_relations   = 1000
max_files_per_process   = 1000
vacuum_cost_delay   = 10
vacuum_cost_page_hit= 1
vacuum_cost_page_miss   = 10
vacuum_cost_page_dirty  = 20
vacuum_cost_limit   = 200
bgwriter_delay  = 200
bgwriter_percent= 1
bgwriter_maxpages   = 100
fsync   = true
wal_sync_method = fsync
wal_buffers = 64
commit_delay= 0
commit_siblings = 5
checkpoint_segments = 256
checkpoint_timeout  = 900
checkpoint_warning  = 30
effective_cache_size= 1
random_page_cost= 4
cpu_tuple_cost  = 0.01
cpu_index_tuple_cost= 0.001
cpu_operator_cost   = 0.0025
geqo= true
geqo_threshold  = 12
geqo_effort = 5
geqo_pool_size  = 0
geqo_generations= 0
geqo_selection_bias = 2.0
deadlock_timeout= 1000
max_locks_per_transaction   = 64
# /Config 

# Transactions 
we have about 115-300 transactions/min in about 65 tables.
# /Transactions 

I'm really new at using postgres. So i need some experience to set this
parameters in the postgresql- and the system-config. I can't find standard
calculations for this. :/ The postgresql-documentation doesn't help me to
set the best values for this.

The database must be high-availble. I configured rsync to sync the
complete
/var/lib/pgsql-directory to my hot-standby. On the hotstandby i will
make the
dumps of the database to improve the performance of the master-db.


I don't think an rsync of a running database directory is completely
safe. Probably you should look more into Slony.
http://slony.info

It is a single-master asynchronous replication system. I believe it is
pretty easy to setup, and does what you really want.


In my tests the synchronization works fine. I synchronised the whole
directory
and restarted the database of the hotstandby. While restarting,
postgresql turned
back the old (not archived) wals and the database of my hotstandby was
consistent. Is this solution recommended? Or must i use archived wal's
with
real system-snapshots?

best regards,

Martin Fandel


John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Inner join on two OR conditions dont use index

2005-05-25 Thread John A Meinel

Jocelyn Turcotte wrote:


Hi all
i dont know if this is normal, but if yes i would like to know why and
how I could do it another way other than using unions.




The only thing that *might* work is if you used an index on both keys.
So if you did:

CREATE INDEX rt_edge_start_end_node ON rt_edge(start_node_id,end_node_id);

The reason is that in an "OR" construct, you have to check both for being true. 
So in the general case where you don't know the correlation between the columns, you have 
to check all of the entries, because even if you know the status of one side of the OR, 
you don't know the other.

Another possibility would be to try this index:

CREATE INDEX rt_edge_start_or_end ON rt_edge(start_node_id OR end_node_id);

I'm not sure how smart the planner can be, though.
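
For completeness, the union rewrite you mentioned would look roughly like
this sketch (the node id 20 and the single-column indexes on start_node_id
and end_node_id are assumptions on my part):

SELECT * FROM rt_edge WHERE start_node_id = 20
UNION
SELECT * FROM rt_edge WHERE end_node_id = 20;

Each arm can use its own single-column index, and plain UNION (not UNION
ALL) removes the duplicate rows where both columns match.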

John
=:->




signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Can anyone explain this: duplicate dbs.

2005-05-24 Thread John A Meinel

SpaceBallOne wrote:


What version of postgres?



8.0.2 ... but I think I've seen this before on 7.3 ...


There are a few possibilities. If you are having a lot of updates to the
table, you can get index bloat. And vacuum doesn't fix indexes. You have
to "REINDEX" to do that. Though REINDEX has the same lock that VACUUM
FULL has, so you need to be a little careful with it.




Probably better is to do CLUSTER, as it does a REINDEX and a sort, so
your table ends up nicer when you are done.



Thanks, will try those next time this problem crops up (i just deleted
/ recreated the database to speed things for its users in the office
... probably should have held off to see if I could find a solution
first!).

Yes, the database / table-in-question does have a lot of updates,
deletes, and new rows (relatively speaking for a small business).

Would CLUSTER / REINDEX still have an effect if our queries were done
via sequential scan? This is a old database (as in built by me when i
was just starting to learn unix / postgres) so the database design is
pretty horrible (little normalisation, no indexes).


Well, my first recommendation is to put in some indexes. :) They are
relatively easy to setup and can drastically improve select performance.

What version of postgres are you using?
What does it say at the end of "VACUUM FULL ANALYZE VERBOSE", that
should tell you how many free pages were reclaimed and how big your free
space map should be.

If you only did 1 VACUUM FULL, you might try another, as it sounds like
your tables aren't properly filled. I'm pretty sure vacuum only removes
empty pages and marks locations in the free space map so they can be
re-used, while vacuum full will move entries around to create free pages.

It sounds like it didn't do it properly.

But even so, CLUSTER is still your friend, as it allows you to "presort"
the rows in your tables.



Have taken Chris's advice onboard too and setup cron to do a vacuumdb
hourly instead of my weekly vacuum.

Cheers,

Dave.



John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Can anyone explain this: duplicate dbs.

2005-05-24 Thread John A Meinel

SpaceBallOne wrote:


Wondering if someone could explain a pecularity for me:

We have a database which takes 1000ms to perform a certain query on.

If I pg_dump that database then create a new database (e.g. "tempdb")
and upload the dump file (thus making a duplicate) then the same query
only takes 190ms !!
Vacuum, vacuum analyse, and vacuum full analyse does not seem to have
an impact on these times.

Can anyone explain why this may be occurring and how I might be able
to keep the original database running at the same speed as "tempdb"?

Thanks in advance,

Dave.


What version of postgres?

There are a few possibilities. If you are having a lot of updates to the
table, you can get index bloat. And vacuum doesn't fix indexes. You have
to "REINDEX" to do that. Though REINDEX has the same lock that VACUUM
FULL has, so you need to be a little careful with it.

Probably better is to do CLUSTER, as it does a REINDEX and a sort, so
your table ends up nicer when you are done.

Also, older versions of postgres had a worse time with index bloat. One
thing that caused a lot of problems is a table that you insert into over
time, so that all the values are incrementing. If you are deleting older
entries, that area won't be re-used because they fall at the back end. I
believe newer versions have been fixed.

By the way, I think doing:

CREATE DATABASE tempdb WITH TEMPLATE = originaldb;

Is a much faster way of doing dump and load. I *think* it would recreate
indexes, etc. If it just does a copy it may not show the dump/restore
improvement.

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Select performance vs. mssql

2005-05-24 Thread John A Meinel

Michael Stone wrote:


On Tue, May 24, 2005 at 04:35:14PM -0700, Josh Berkus wrote:


Pretty much.  There has been discussion about allowing index-only
access to "frozen" tables, i.e. archive partitions.  But it all sort
of hinges on someone implementing it and testing 



Is there any way to expose the planner estimate? For some purposes it's
enough to just give a rough ballpark (e.g., a google-esque "results 1-10
of approximately 1000") so a user knows whether its worth even
starting to page through.

Mike Stone



Well, you could always do:

EXPLAIN SELECT ...

And then parse out the rows= in the first line.
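
For instance (table name and numbers invented), the estimate shows up in
the first line of the plan:

EXPLAIN SELECT * FROM bigtable WHERE flag = true;
 Seq Scan on bigtable  (cost=0.00..22163.00 rows=99834 width=120)
   Filter: (flag = true)

The rows=99834 there is the planner's guess, which is usually close enough
for a google-style "about N results".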

John
=:->



signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] seqential vs random io

2005-05-23 Thread John A Meinel
David Parker wrote:
> I just got a question from one our QA guys who is configuring a RAID 10
> disk that is destined to hold a postgresql database. The disk
> configuration procedure is asking him if he wants to optimize for
> sequential or random access. My first thought is that random is what we
> would want, but then I started wondering if it's not that simple, and my
> knowledge of stuff at the hardware level is, well, limited.
>  
> If it were your QA guy, what would you tell him?
> 
> - DAP

Random. Sequential is always pretty fast, it is random that hurts.

The only time I would say sequential is if you were planning on
streaming large files (like iso images) with low load.

But for a DB, even a sequential scan will probably not be that much data.

At least, that's my 2c.

John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Optimize complex join to use where condition before

2005-05-13 Thread John A Meinel
Greg Stark wrote:
Sebastian Hennebrueder <[EMAIL PROTECTED]> writes:

User-Agent: Mozilla Thunderbird 1.0 (Windows/20041206)
...
"Nested Loop  (cost=1349.13..1435.29 rows=1 width=2541) (actual
time=1640.000..3687.000 rows=62 loops=1)"
"  Join Filter: ("inner".fid = "outer".faufgaben_id)"
"  ->  Index Scan using idx_tauf_mit_mitid on taufgaben_mitarbeiter am
(cost=0.00..80.65 rows=35 width=4) (actual time=0.000..0.000 rows=765
loops=1)"

Is it really Mozilla Thunderbird that's causing this new craptastic mangling
of plans in people's mails? I was assuming it was some new idea of how to mess
up people's mail coming out of Exchange or Lotus or some other such "corporate
messaging" software that only handled SMTP mail as an afterthought. This is,
uh, disappointing.
Are you talking about the quotes, or just the fact that it is wrapped?
I don't know where the quotes came from, but in Thunderbird if you are
writing in text mode (not html) it defaults to wrapping the text at
something like 78 characters. That includes copy/paste text.
If you want it to *not* wrap, it turns out that "Paste as quotation"
will not wrap, but then you have to remove the "> " from the beginning
of every line.
In html mode, it also defaults to wrapping, but if you switch to
PREFORMAT text first, it doesn't wrap.
At least, those are the tricks that I've found. Safest bet is to just
use an attachment, though.
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Recommendations for set statistics

2005-05-12 Thread John A Meinel
Sebastian Hennebrueder wrote:
Hello,
I could not find any recommendations for the level of set statistics and
what a specific level does actually mean.
What is the difference between 1, 50 and 100? What is recommended for a
table or column?
Default I believe is 10. The higher the number, the more statistics are
kept, with a maximum of 1000.
The default is a little bit low for columns used in foreign keys, though
frequently it is okay.
When problems start, try setting them to 100 or 200. Higher is more
accurate, but takes longer to compute, *and* takes longer when planning
the optimal query method. It can be worth it, though.
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] Partitioning / Clustering

2005-05-12 Thread John A Meinel
Alex Turner wrote:
Ok - my common sense alarm is going off here...
There are only 6.446 billion people worldwide.  100 Billion page views
would require every person in the world to view 18 pages of yahoo
every day.  Not very likely.
http://www.internetworldstats.com/stats.htm
suggests that there are around 1 billion people actualy on the internet.
That means each and every person on the internet has to view 100 pages
per day of yahoo.
pretty unlikely IMHO.  I for one don't even use Yahoo ;)
100 million page views per day suggests that 1 in 100 people on the
internet each viewed 10 pages of a site.  Thats a pretty high
percentage if you ask me.
In general I think your point is valid. Just remember that it probably
also matters how you count page views. Because technically images are a
separate page (and this thread did discuss serving up images). So if
there are 20 graphics on a specific page, that is 20 server hits just
for that one page.
I could easily see an image heavy site getting 100 hits / page. Which
starts meaning that if 1M users hit 10 pages, then you get 1M*10*100 = 1G.
I still think 100G views on a single website is a lot, but 100M is
certainly possible.
John
=:->


signature.asc
Description: OpenPGP digital signature


Re: [PERFORM] tuning Postgres for large data import (using Copy from)

2005-05-12 Thread John A Meinel
Marc Mamin wrote:
Hello,
I'm not an expert, but I'll give some suggestions.
I'd like to tune Postgres for large data import (using Copy from).
I believe that COPY FROM <file> is supposed to be faster than COPY FROM
STDIN, but the <file> must be available to the backend process. If you can
do it, you should think about it, as it eliminates the communication
between the client and the backend.
here are a few steps already done:

1) use 3 different disks for:
-1: source data
-2: index tablespaces
-3: data tablespaces

Make sure pg_xlog is on it's own filesystem. It contains the
write-ahead-log, and putting it by itself keeps the number of seeks
down. If you are constrained, I think pg_xlog is more important than
moving the index tablespaces.

2) define all foreign keys as initially deferred
3) tune some parameters:

max_connections =20
shared_buffers =3
work_mem = 8192
maintenance_work_mem = 32768
checkpoint_segments = 12
(I also modified the kernel accordingly)
Don't forget to increase your free space map if you are going to be
doing deletes frequently.

4) runs VACUUM regularly
The server runs RedHat and has 1GB RAM
In the production (which may run on a better server), I plan to:
- import a few millions rows per day,
- keep up to ca 100 millions rows in the db
- delete older data

I've seen a few posting on hash/btree indexes, which say that hash index do
not work very well in Postgres;
currently, I only use btree indexes. Could I gain performances whole using
hash indexes as well ?
I doubt it.
How does Postgres handle concurrent copy from on: same table / different
tables ?
I think it is better with different tables. If using the same table, and
there are indexes, it has to grab a lock for updating the index, which
causes contention between 2 processes writing to the same table.
I'd be glad on any further suggestion on how to further increase my
performances.
Since you are deleting data often, and copying often, I might recommend
using a partition scheme with a view to bind everything together. That
way you can just drop the old table rather than doing a delete. I don't
know how this would affect foreign key references.
But basically you can create a new table, and do a copy without having
any indexes, then build the indexes, analyze, update the view.
And when deleting you can update the view, and drop the old table.
Something like this:
CREATE TABLE table_2005_05_11 AS (blah);
COPY FROM ... ;
CREATE INDEX blah ON table_2005_05_11(blah);
CREATE OR REPLACE VIEW table AS
SELECT * FROM table_2005_05_10
UNION ALL SELECT * FROM table_2005_05_11;
VACUUM ANALYZE table_2005_05_11;
...
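
And the delete side of the rotation would be roughly (same made-up naming
as above, just a sketch):

CREATE OR REPLACE VIEW table AS
SELECT * FROM table_2005_05_11;     -- the view no longer includes the old day
DROP TABLE table_2005_05_10;        -- the "delete" becomes a cheap drop
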
John
=:->

Marc





signature.asc
Description: OpenPGP digital signature


  1   2   >