I have a Java (java 1.1) program that I wrote some years ago, to read records from a text file and insert it into a ostgresql database.

One of the assumptions I made was that one file contained one day's data, maybe as many as 1500 records, and I coded it to do the whole lot as one transaction so either a single file was loaded in its entirity, or none of its data was.

I lost the Java code, but revived the idea and I've collected about two years' data using (Linux) shell scripts, and loading the data using psql.

Then, I found the Java code on a disused hard disk:-)

I made the necessary changes for it to build in java 1.5, and used psql to extract data from my new database in the correct format for the old program. This time, I have a little more data than I ever loaded at once before:
[EMAIL PROTECTED]:~$ wc -l testdata
6242217 testdata
[EMAIL PROTECTED]:~$ \ls -hl testdata
-rw-r--r-- 1 summer summer 285M 2007-03-28 22:32 testdata
[EMAIL PROTECTED]:~$

Now, it wouldn't surprise me if postgresql used lots of memory - but how much could it possibly need? My laptop, where I first tried this, has 1.25 Gbytes, so I could allow it some.

It wouldn't surprise me a lot if it used lots of memory and caused all sorts of problems, but no, it's chugging away, still using no more RAM than it could have had on my old Pentium 133 all those years ago.

In the process of checking it out, I've set it running on a machine with a AMD Sempron(tm) 2400+ running Kubuntu 6.10 (kernel is 2.6.17-6-server-xen0) and 512 Mbytes of RAM.

This is the java program:-)
summer pts/6 :0.0 Thu20 5days 1:07 1:07 /usr/bin/gij-4.1 -cp /usr/s It's been running five days so far, and I can see where it's up to by attaching strace. It's reading 2k of the input file every few seconds.

Okay, clearly something's wrong, and I don't think it's all my crddu code.
No probs swapping:
[EMAIL PROTECTED]:~$ free
             total       used       free     shared    buffers     cached
Mem:        460800     456472       4328          0        860     262164
-/+ buffers/cache:     193448     267352
Swap:      1461872        284    1461588
[EMAIL PROTECTED]:~$

It is hitting the disk pretty hard now on this machine, but the laptop's still going too, and the disk seems to run about half the time, part of a second running, part idle (but the intervals are getting shorter).

It struck me as fairly curious that neither postgresql nor the application was hogging the CPU.

Perhaps the laptop is more interesting: look at the size of the buffer pool:
[EMAIL PROTECTED]:~> free
             total       used       free     shared    buffers     cached
Mem:       1295528    1268548      26980          0       3976     392388
-/+ buffers/cache:     872184     423344
Swap:      1941496      32656    1908840
[EMAIL PROTECTED]:~>
Again, no problem with over-use of RAM, and I'm logged on using KDE too and that's running fine.

It's been running a little longer here:
summer pts/23 28Mar07 5days 25:12 25:11 java -cp /home/summer/Classes/:/usr/share/p

This is Sun's Java 1.5 on OpenSUSE 10.2.


This is what suggested I should write:
[EMAIL PROTECTED]:~> procinfo
Linux 2.6.18.8-0.1-default ([EMAIL PROTECTED]) (gcc 4.1.2 20061115) #1 1CPU [Echidna.]

Memory:      Total        Used        Free      Shared     Buffers
Mem:       1295528     1271720       23808           0        3716
Swap:      1941496       32656     1908840

Bootup: Tue Mar 27 18:50:19 2007    Load average: 2.21 2.65 2.69 2/243 19305

user : 1d 3:17:04.03 16.0% page in : 131097310 disk 1: 3079516r20087664w
nice  :       0:05:39.64   0.1%  page out:  197016649
system:   2d 20:38:37.13  40.1%  page act:   87906251
IOwait:   2d  0:46:37.33  28.5%  page dea:   16218135
hw irq:       0:44:46.71   0.4%  page flt:  306255213
sw irq:       0:50:04.69   0.5%  swap in :       4026
idle  :   1d  0:36:29.73  14.4%  swap out:       9552
uptime:   7d  2:59:20.97         context :  702502442

irq  0: 153880209 timer                 irq  7:         0 parport0
irq  1:     69402 i8042                 irq  8:         2 rtc
irq  2:         0 cascade [4]           irq  9:   1696942 acpi
irq  3:         4                       irq 10:         1
irq 4: 4 irq 11: 71842329 ehci_hcd:usb1, uhci_
irq  5:  28545863 Intel 82801DB-ICH4    irq 12:    467432 i8042
irq  6:         1                       irq 14:  25021586 ide0

[EMAIL PROTECTED]:~>


Look at that line beginning "System:" two days 20 hours in the Linux kernel. It's my guess that the Linux kernel is spending a great deal of time manipulating that buffer pool.

This shows postgresql taking 60% CPU:
[EMAIL PROTECTED]:~> ps xaru
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
summer 20237 0.2 1.2 263716 15988 pts/23 Rl+ Mar28 25:11 java -cp /home/summer/Class postgres 19321 0.1 0.3 19844 3984 ? D 21:50 0:00 postgres: summer stocksshar
summer   19344  0.0  0.0   2484   852 pts/25   R+   21:52   0:00 ps xaru
[EMAIL PROTECTED]:~> ps xaru
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres 20248 60.0 0.8 20760 11536 ? D Mar28 5167:27 postgres: summer sharetrad postgres 19321 0.1 0.3 19844 4008 ? D 21:50 0:00 postgres: summer stocksshar
summer   19348  0.0  0.0   2480   848 pts/25   R+   21:52   0:00 ps xaru
[EMAIL PROTECTED]:~>

Note that pid 20248 is the one, the other has nothing to do with the problem.

This is the SQL I used to create the table:
      String createTableStatement = "create table TJS0102_trades ("
                  + "   trade date"
                  + ",  ttime int2"
                  + ",  ASX varchar(7)"
                  + ",  thigh int4"
                  + ",  tlow int4"
                  + ",  tclose int4"
                  + ",  topen int4"
                  + ",  tvolume int4"
                  + ",  tvalue int4"
//                 + ")"
                  + ",  unique(ASX,trade,ttime)"
                  + ");";


For each record, I update a non-key field in another table; the source data for that other table is less than a megabyte.


I have changed the program so as to load each day's data as a single transaction; it ran on a Pentium IV 2.8 Ghz with HT, 512 Mbytes in about three and an half hours. This is more-or-less what I expected.

If you agree with me that postgresql should do better, now is a good time to ask for this as a test case.

Note, you could also get suitable test data from float.com.au - it's not where mine came from, but it should be substantially the same.




---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match

Reply via email to