Hi John,

It sounds like a disk-bound operation, so cpu is not maxed out. I'm not clear on all the details of your operation but it sounds like you're using Java to do row-by-row based inserts, selects and updates within a transaction, from a file. This can be a very slow process if you have many rows. The OS stats you describe fits that theory (but not conclusively).

If you are using (psuedo-)code such as:

Open file {
  Read line {
select from Pg: "select from [other_table] where val = [line[colN]]" exec to Pg: "insert into [table] (col1, col2, ...) values (line[col1], line[col2]..."
  }
}

You can radically speed up such a system by using the "copy" (http://www.postgresql.org/docs/8.2/interactive/sql-copy.html) command to load all the data at once from the file into Pg and then do post-processing with Java/SQL to get all the fields looking right. Doing a bulk update with a join across several tables is so much faster than looping through them with a wrapper in Java (or other lang) you won't believe it.

I hope this helps and is on-topic for you.

Steve

At 09:38 AM 4/3/2007, [EMAIL PROTECTED] wrote:
Date: Tue, 03 Apr 2007 22:16:13 +0800
From: John Summerfield <[EMAIL PROTECTED]>
To: pgsql-sql@postgresql.org
Subject: A long-running transaction
Message-ID: <[EMAIL PROTECTED]>

I have a Java (java 1.1) program that I wrote some years ago, to read
records from a text file and insert it into a ostgresql database.

One of the assumptions I made was that one file contained one day's
data, maybe as many as 1500 records, and I coded it to do the whole lot as one transaction so either a single file was loaded in its entirity,
or none of its data was.

I lost the Java code, but revived the idea and I've collected about two years' data using (Linux) shell scripts, and loading the data using psql.

Then, I found the Java code on a disused hard disk:-)

I made the necessary changes for it to build in java 1.5, and used psql to extract data from my new database in the correct format for the old program. This time, I have a little more data than I ever loaded at once
before:
[EMAIL PROTECTED]:~$ wc -l testdata
6242217 testdata
[EMAIL PROTECTED]:~$ \ls -hl testdata
-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
[EMAIL PROTECTED]:~$

Now, it wouldn't surprise me if postgresql used lots of memory - but how
much could it possibly need? My laptop, where I first tried this, has
1.25 Gbytes, so I could allow it some.

It wouldn't surprise me a lot if it used lots of memory and caused all sorts of problems, but no, it's chugging away, still using no more RAM
than it could have had on my old Pentium 133 all those years ago.

In the process of checking it out, I've set it running on a machine with
a AMD Sempron(tm)   2400+ running Kubuntu 6.10 (kernel is
2.6.17-6-server-xen0) and 512 Mbytes of RAM.

This is the java program:-)
summer   pts/6    :0.0             Thu20    5days  1:07   1:07
/usr/bin/gij-4.1 -cp /usr/s
It's been running five days so far, and I can see where it's up to by
attaching strace. It's reading 2k of the input file every few seconds.

Okay, clearly something's wrong, and I don't think it's all my crddu code.
No probs swapping:
[EMAIL PROTECTED]:~$ free
total used free shared buffers cached Mem: 460800 456472 4328 0 860 262164
-/+ buffers/cache:     193448     267352
Swap:      1461872        284    1461588
[EMAIL PROTECTED]:~$

It is hitting the disk pretty hard now on this machine, but the laptop's still going too, and the disk seems to run about half the time, part of
a second running, part idle (but the intervals are getting shorter).

It struck me as fairly curious that neither postgresql nor the
application was hogging the CPU.

Perhaps the laptop is more interesting: look at the size of the buffer pool:
[EMAIL PROTECTED]:~> free
total used free shared buffers cached Mem: 1295528 1268548 26980 0 3976 392388
-/+ buffers/cache:     872184     423344
Swap:      1941496      32656    1908840
[EMAIL PROTECTED]:~>
Again, no problem with over-use of RAM, and I'm logged on using KDE too
and that's running fine.

It's been running a little longer here:
summer   pts/23    28Mar07  5days 25:12  25:11  java -cp
/home/summer/Classes/:/usr/share/p

This is Sun's Java 1.5 on OpenSUSE 10.2.


This is what suggested I should write:
[EMAIL PROTECTED]:~> procinfo
Linux 2.6.18.8-0.1-default ([EMAIL PROTECTED]) (gcc 4.1.2 20061115) #1
1CPU [Echidna.]

Memory:      Total        Used        Free      Shared     Buffers
Mem:       1295528     1271720       23808           0        3716
Swap:      1941496       32656     1908840

Bootup: Tue Mar 27 18:50:19 2007 Load average: 2.21 2.65 2.69 2/243 19305

user  :   1d  3:17:04.03  16.0%  page in :  131097310  disk 1:
3079516r20087664w
nice  :       0:05:39.64   0.1%  page out:  197016649
system:   2d 20:38:37.13  40.1%  page act:   87906251
IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135
hw irq:       0:44:46.71   0.4%  page flt:  306255213
sw irq:       0:50:04.69   0.5%  swap in :       4026
idle  :   1d  0:36:29.73  14.4%  swap out:       9552
uptime:   7d  2:59:20.97         context :  702502442

irq  0: 153880209 timer                 irq  7:         0 parport0
irq  1:     69402 i8042                 irq  8:         2 rtc
irq  2:         0 cascade [4]           irq  9:   1696942 acpi
irq  3:         4                       irq 10:         1
irq 4: 4 irq 11: 71842329 ehci_hcd:usb1,
uhci_
irq  5:  28545863 Intel 82801DB-ICH4    irq 12:    467432 i8042
irq  6:         1                       irq 14:  25021586 ide0

[EMAIL PROTECTED]:~>


Look at that line beginning "System:" two days 20 hours in the Linux
kernel. It's my guess that the Linux kernel is spending a great deal of
time manipulating that buffer pool.

This shows postgresql taking 60% CPU:
[EMAIL PROTECTED]:~> ps xaru
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
summer   20237  0.2  1.2 263716 15988 pts/23   Rl+  Mar28  25:11 java
-cp /home/summer/Class
postgres 19321  0.1  0.3  19844  3984 ?        D    21:50   0:00
postgres: summer stocksshar
summer 19344 0.0 0.0 2484 852 pts/25 R+ 21:52 0:00 ps xaru
[EMAIL PROTECTED]:~> ps xaru
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 20248 60.0  0.8  20760 11536 ?        D    Mar28 5167:27
postgres: summer sharetrad
postgres 19321  0.1  0.3  19844  4008 ?        D    21:50   0:00
postgres: summer stocksshar
summer 19348 0.0 0.0 2480 848 pts/25 R+ 21:52 0:00 ps xaru
[EMAIL PROTECTED]:~>

Note that pid 20248 is the one, the other has nothing to do with the
problem.

This is the SQL I used to create the table:
       String createTableStatement = "create table TJS0102_trades ("
                   + "   trade date"
                   + ",  ttime int2"
                   + ",  ASX varchar(7)"
                   + ",  thigh int4"
                   + ",  tlow int4"
                   + ",  tclose int4"
                   + ",  topen int4"
                   + ",  tvolume int4"
                   + ",  tvalue int4"
//                 + ")"
                   + ",  unique(ASX,trade,ttime)"
                   + ");";


For each record, I update a non-key field in another table; the source
data for that other table is less than a megabyte.


I have changed the program so as to load each day's data as a single
transaction; it ran on a Pentium IV 2.8 Ghz with HT, 512 Mbytes in about
three and an half hours. This is more-or-less what I expected.

If you agree with me that postgresql should do better, now is a good
time to ask for this as a test case.

Note, you could also get suitable test data from float.com.au - it's not
where mine came from, but it should be substantially the same.


---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq

Reply via email to