Mark,
You should definitely not be doing this sort of thing, I believe:
CREATE TABLE orders (
    o_orderkey INTEGER,
    o_custkey INTEGER,
    o_orderstatus CHAR(1),
    o_totalprice REAL,
    o_orderdate DATE,
    o_orderpriority CHAR(15),
    o_clerk CHAR(15),
    o_shippriority INTEGER,
    o_comment VARCHAR(79),
    PRIMARY KEY (o_orderkey));
Create the table with no constraints, load the data, then set up primary keys
and whatever other constraints you want using ALTER TABLE. Last time I did a
load like this (albeit 2 orders of magnitude smaller) I saw a 50% speedup from
deferring constraint creation.
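For example, the approach I mean looks roughly like this (just a sketch; the
file path and delimiter are placeholders, assuming a plain '|'-delimited text
file):

-- Create the table bare, with no constraints or indexes.
CREATE TABLE orders (
    o_orderkey INTEGER,
    o_custkey INTEGER,
    o_orderstatus CHAR(1),
    o_totalprice REAL,
    o_orderdate DATE,
    o_orderpriority CHAR(15),
    o_clerk CHAR(15),
    o_shippriority INTEGER,
    o_comment VARCHAR(79)
);

-- Bulk load while there is nothing to maintain per row.
COPY orders FROM '/path/to/orders.tbl' WITH DELIMITER '|';

-- Build the primary key (and any other constraints) afterwards, in one pass.
ALTER TABLE orders ADD PRIMARY KEY (o_orderkey);

Building the index once at the end is generally much cheaper than maintaining
it for every row inserted during the load.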
cheers
andrew
Mark Wong wrote:
Hi Alon,
Yeah, that helps. I just need to break up my scripts a little so they only
load the data and don't build the indexes.
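Roughly the split I have in mind (just a sketch; the file path and index name
below are made up rather than taken from the real script):

-- load step: only the COPY commands, no index builds
COPY lineitem FROM '/path/to/lineitem.tbl' WITH DELIMITER '|';

-- post-load step: build indexes once the data is in
CREATE INDEX i_l_orderkey ON lineitem (l_orderkey);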
Is the following information enough for you to make a guess about the data
I'm loading, if you don't mind? ;) Here's a link to the script I use to
create the tables:
http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
File sizes:
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
-rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
-rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
-rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
-rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
-rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
-rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
-rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
Number of rows:
# wc -l *.tbl
15000000 customer.tbl
600037902 lineitem.tbl
25 nation.tbl
150000000 orders.tbl
20000000 part.tbl
80000000 partsupp.tbl
5 region.tbl
1000000 supplier.tbl
Thanks,
Mark
On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
Hi Mark,
I improved the data *parsing* capabilities of COPY, and didn't touch the
data conversion or data insertion parts of the code. The parsing improvement
will vary greatly depending on the ratio of parsing to converting and
inserting.
Therefore, the speed increase really depends on the nature of your data.
A 100GB file with:
- long data rows (lots of parsing)
- a small number of columns (few attribute conversions per row)
- fewer rows (fewer tuple insertions)
will show the best performance improvements.

However, the same 100GB file with:
- short data rows (minimal parsing)
- a large number of columns (many attribute conversions per row)
and/or
- more rows (more tuple insertions)
will show improvements, but not as significant ones.
In general I'd estimate a 40%-95% improvement in load speed for the first
case and 10%-40% for the second. But that also depends on the hardware,
disk speed, etc. This is for the TEXT format. As for CSV, it may be faster
too, but not by as much as I specified here. BINARY will stay the same as
before.
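For reference, the three variants I'm talking about look roughly like this
(file paths are placeholders):

COPY lineitem FROM '/path/to/lineitem.tbl' WITH DELIMITER '|';  -- TEXT (default)
COPY lineitem FROM '/path/to/lineitem.csv' WITH CSV;            -- CSV
COPY lineitem FROM '/path/to/lineitem.bin' WITH BINARY;         -- BINARY

The parsing changes affect the first two; BINARY input doesn't go through
that text parsing path, which is why it stays the same as before.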
HTH
Alon.
On 7/19/05 12:54 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
On Thu, 14 Jul 2005 17:22:18 -0700
"Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
I revisited my patch, removed the code duplication that was there, and added
support for CSV with buffered input, so CSV now runs faster too (although it
is not as optimized as the TEXT format parsing). So now TEXT, CSV and BINARY
are all parsed in CopyFrom(), as in the original file.
Hi Alon,
I'm curious, what kind of system are you testing this on? I'm trying to load
100GB of data for our dbt3 workload on a 4-way Itanium 2, and I'm interested
in what results you would expect.
Mark