Mark,

Thanks for the info.

Yes, taking the indexes out of the picture is a good idea for this purpose.
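For example, something like this (a rough sketch only; the table names are
from your TPC-H schema, but the database name, file paths, delimiter and
index name are guesses based on standard dbgen output, so adjust to your
setup):

  # Load the data into bare tables first...
  psql -d dbt3 -c "COPY customer FROM '/data/customer.tbl' WITH DELIMITER '|'"
  psql -d dbt3 -c "COPY orders   FROM '/data/orders.tbl'   WITH DELIMITER '|'"
  psql -d dbt3 -c "COPY partsupp FROM '/data/partsupp.tbl' WITH DELIMITER '|'"
  # ...and so on for the other tables.

  # ...and only build the indexes after all the loads are done, e.g.:
  psql -d dbt3 -c "CREATE INDEX i_c_custkey ON customer (c_custkey)"

That way the timings you collect reflect COPY alone and not index maintenance.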

I can't really guess how fast the load rate should be. I don't know how your
system is configured or what the hardware characteristics are (and even if I
knew all that, I might not be able to guess...). I am pretty confident that
the load will be faster than before; I'll risk that ;-)
Looking at your TPC-H sizes and metadata, I'd estimate that partsupp,
customer, and orders will show the most significant increase in load rate.
You could start with those.

I guess the only way to really know is to try... Load several times with the
existing PG COPY, then load several times with the patched COPY, and compare.
I'll be curious to hear your results.
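Something along these lines would do for the comparison (again just a sketch;
the database name, path and run count are placeholders). Run it against a
server with the stock COPY and again against one with the patch applied, on
otherwise idle hardware, and compare the wall-clock times:

  #!/bin/sh
  # Time several loads of one table; truncate between runs so every
  # load starts from an empty table.
  for run in 1 2 3; do
      psql -d dbt3 -c "TRUNCATE partsupp"
      /usr/bin/time -p psql -d dbt3 \
          -c "COPY partsupp FROM '/data/partsupp.tbl' WITH DELIMITER '|'"
  done

Keep the WAL/checkpoint settings the same between the two setups so you are
really measuring the COPY change.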

Thx,
Alon.

 


On 7/19/05 2:37 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:

> Hi Alon,
> 
> Yeah, that helps.  I just need to break up my scripts a little to just
> load the data and not build indexes.
> 
> Is the following information good enough to give a guess about the data
> I'm loading, if you don't mind? ;)  Here's a link to my script to create
> tables:
> http://developer.osdl.org/markw/mt/getfile.py?id=eaf16b7831588729780645b2bb44f7f23437e432&path=scripts/pgsql/create_tables.sh.in
> 
> File sizes:
> -rw-r--r--  1 markw 50 2.3G Jul  8 15:03 customer.tbl
> -rw-r--r--  1 markw 50  74G Jul  8 15:03 lineitem.tbl
> -rw-r--r--  1 markw 50 2.1K Jul  8 15:03 nation.tbl
> -rw-r--r--  1 markw 50  17G Jul  8 15:03 orders.tbl
> -rw-r--r--  1 markw 50 2.3G Jul  8 15:03 part.tbl
> -rw-r--r--  1 markw 50  12G Jul  8 15:03 partsupp.tbl
> -rw-r--r--  1 markw 50  391 Jul  8 15:03 region.tbl
> -rw-r--r--  1 markw 50 136M Jul  8 15:03 supplier.tbl
> 
> Number of rows:
> # wc -l *.tbl
>     15000000 customer.tbl
>    600037902 lineitem.tbl
>           25 nation.tbl
>    150000000 orders.tbl
>     20000000 part.tbl
>     80000000 partsupp.tbl
>            5 region.tbl
>      1000000 supplier.tbl
> 
> Thanks,
> Mark
> 
> On Tue, 19 Jul 2005 14:05:56 -0700
> "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
> 
>> Hi Mark,
>> 
>> I improved the data *parsing* capabilities of COPY, and didn't touch the
>> data conversion or data insertion parts of the code. The parsing improvement
>> will vary greatly depending on the ratio of parsing to converting and
>> inserting.
>> 
>> Therefore, the speed increase really depends on the nature of your data:
>> 
>> A 100GB file with
>> long data rows (lots of parsing),
>> a small number of columns (few attr conversions per row), and
>> fewer rows (fewer tuple insertions)
>> 
>> will show the best performance improvement.
>> 
>> However, the same 100GB file size with
>> short data rows (minimal parsing),
>> a large number of columns (many attr conversions per row),
>> and/or
>> more rows (more tuple insertions)
>> 
>> will show improvements, but not as significant ones.
>> In general I'd estimate a 40%-95% improvement in load speed for the 1st case
>> and 10%-40% for the 2nd. But that also depends on the hardware, disk speed,
>> etc... This is for TEXT format. CSV may be faster as well, but not by as
>> much as I specified here. BINARY will stay the same as before.
>> 
>> HTH
>> Alon.
>> 
>> On 7/19/05 12:54 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
>> 
>>> On Thu, 14 Jul 2005 17:22:18 -0700
>>> "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
>>> 
>>>> I revisited my patch and removed the code duplications that were there, and
>>>> added support for CSV with buffered input, so CSV now runs faster too
>>>> (although it is not as optimized as the TEXT format parsing). So now
>>>> TEXT, CSV and BINARY are all parsed in CopyFrom(), like in the original
>>>> file.
>>> 
>>> Hi Alon,
>>> 
>>> I'm curious, what kind of system are you testing this on?  I'm trying to
>>> load 100GB of data in our dbt3 workload on a 4-way itanium2.  I'm
>>> interested in the results you would expect.
>>> 
>>> Mark
>>> 
>> 
> 



