Hi Alon,

Yeah, that helps.  I just need to break up my scripts a little so they
load the data first and build the indexes afterward.

Is the following information enough for you to make a guess about the
data I'm loading, if you don't mind? ;)  Here's a link to my
table-creation script:
http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
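
For reference, the load-then-index split I mean looks roughly like this
(a sketch only -- the database name, file path, and index name below are
made up; the real DDL is in create_tables.sh.in above):

  # Bulk-load with COPY before any indexes exist, then index afterward.
  psql -d dbt3 -c "COPY lineitem FROM '/tmp/dss/lineitem.tbl' WITH DELIMITER '|'"
  # ...repeat the COPY for the other seven tables...
  psql -d dbt3 -c "CREATE INDEX i_l_orderkey ON lineitem (l_orderkey)"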

File sizes:
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
-rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
-rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
-rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
-rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
-rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
-rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
-rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl

Number of rows:
# wc -l *.tbl
    15000000 customer.tbl
   600037902 lineitem.tbl
          25 nation.tbl
   150000000 orders.tbl
    20000000 part.tbl
    80000000 partsupp.tbl
           5 region.tbl
     1000000 supplier.tbl
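
For scale, a back-of-the-envelope: lineitem works out to roughly 130
bytes per row (74G over 600M rows), and the other large tables land in
the same 120-165 bytes/row neighborhood.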

Thanks,
Mark

On Tue, 19 Jul 2005 14:05:56 -0700
"Alon Goldshuv" <[EMAIL PROTECTED]> wrote:

> Hi Mark,
> 
> I improved the data *parsing* capabilities of COPY and didn't touch the
> data conversion or data insertion parts of the code. The parsing
> improvement will vary greatly depending on the ratio of parsing to
> converting and inserting.
> 
> Therefore, the speed increase really depends on the nature of your data:
> 
> A 100GB file with
>   - long data rows (lots of parsing),
>   - a small number of columns (few attribute conversions per row), and
>   - fewer rows (fewer tuple insertions)
> 
> will show the best performance improvement.
> 
> However, a file of the same 100GB size with
>   - short data rows (minimal parsing),
>   - a large number of columns (many attribute conversions per row), and/or
>   - more rows (more tuple insertions)
> 
> will show an improvement, but not as significant.
> In general I'd estimate a 40%-95% improvement in load speed for the 1st
> case and 10%-40% for the 2nd, but that also depends on the hardware, disk
> speed, etc. This is for TEXT format. CSV may be faster too, but not by as
> much as specified here. BINARY will stay the same as before.
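> 
> As a rough worked example (made-up numbers): if parsing is 50% of the
> total COPY time for a file and the new code cuts that part by 80%, the
> load takes 0.5 + 0.5*0.2 = 0.6 of the old time (~67% faster). If parsing
> is only 20% of the time, the same cut gives 0.8 + 0.2*0.2 = 0.84 of the
> old time (~19% faster).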
> 
> HTH
> Alon.
> 
> On 7/19/05 12:54 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
> 
> > On Thu, 14 Jul 2005 17:22:18 -0700
> > "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
> > 
> >> I revisited my patch, removed the code duplication that was there, and
> >> added support for CSV with buffered input, so CSV now runs faster too
> >> (although it is not as optimized as the TEXT format parsing). So now
> >> TEXT, CSV and BINARY are all parsed in CopyFrom(), as in the original
> >> file.
> > 
> > Hi Alon,
> > 
> > I'm curious: what kind of system are you testing this on?  I'm trying
> > to load 100GB of data for our dbt3 workload on a 4-way Itanium 2, and
> > I'm interested in what results you would expect.
> > 
> > Mark
> > 
> 
