Thanks for the info.
Yes, taking indexes out of the picture is a good idea for this purpose.
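Something like the following is what I have in mind -- a minimal sketch only, with placeholder index/table/path names, not taken from your actual schema:

```sql
-- Drop the index before the bulk load so COPY measures parse/insert
-- speed only, then rebuild it afterwards. Names here are hypothetical.
DROP INDEX i_c_nationkey;
COPY customer FROM '/data/dbt3/customer.tbl' WITH DELIMITER '|';
CREATE INDEX i_c_nationkey ON customer (c_nationkey);
```

Rebuilding the index once after the load is usually much cheaper than maintaining it row-by-row during the COPY, and it keeps the index cost out of the measurement.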
I can't really guess how fast the load rate should be. I don't know how
your system is configured or what its hardware characteristics are (and
even if I knew all that, I might not be able to guess...). I am pretty
confident that the load will be faster than before; I'll risk that ;-)
Looking at your TPC-H sizes and metadata, I estimate that partsupp,
customer, and orders will show the most significant increase in load
rate. You could start with those.
I guess the only way to really know is to try... Load several times with the
existing PG-COPY and then load several times with the patched COPY and
compare. I'll be curious to hear your results.
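A rough sketch of such a comparison from psql, using \timing to report elapsed time per statement (table and file names below are placeholders):

```sql
-- \timing prints the wall-clock time of each statement that follows.
-- Truncate between runs, repeat a few times per COPY build
-- (stock vs. patched), and compare the averages.
\timing
TRUNCATE TABLE customer;
COPY customer FROM '/data/dbt3/customer.tbl' WITH DELIMITER '|';
```

Running each load several times and averaging helps smooth out OS cache effects between the two builds.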
On 7/19/05 2:37 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
> Hi Alon,
> Yeah, that helps. I just need to break up my scripts a little so they
> load the data without building indexes.
> Is the following information good enough for a guess about the data
> I'm loading, if you don't mind? ;) Here's a link to my script to create
> File sizes:
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
> -rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
> -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
> -rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
> -rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
> -rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
> -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
> Number of rows:
> # wc -l *.tbl
> 15000000 customer.tbl
> 600037902 lineitem.tbl
> 25 nation.tbl
> 150000000 orders.tbl
> 20000000 part.tbl
> 80000000 partsupp.tbl
> 5 region.tbl
> 1000000 supplier.tbl
> On Tue, 19 Jul 2005 14:05:56 -0700
> "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
>> Hi Mark,
>> I improved the data *parsing* capabilities of COPY, and didn't touch the
>> data conversion or data insertion parts of the code. The parsing improvement
>> will vary largely depending on the ratio of parsing to converting and
>> inserting. Therefore, the speed increase really depends on the nature of your data:
>> A 100GB file with
>> - long data rows (lots of parsing),
>> - a small number of columns (few attribute conversions per row), and
>> - fewer rows (fewer tuple insertions)
>> will show the best performance improvement.
>> However, a 100GB file with
>> - short data rows (minimal parsing),
>> - a large number of columns (many attribute conversions per row), and
>> - more rows (more tuple insertions)
>> will show improvements, but not as significant ones.
>> In general, I'd estimate a 40%-95% improvement in load speed for the 1st case
>> and 10%-40% for the 2nd, but that also depends on the hardware, disk speed,
>> etc... This is for the TEXT format. CSV may also be faster, but not by as
>> much as specified here. BINARY will stay the same as before.
>> On 7/19/05 12:54 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
>>> On Thu, 14 Jul 2005 17:22:18 -0700
>>> "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
>>>> I revisited my patch, removed the code duplication that was there, and
>>>> added support for CSV with buffered input, so CSV now runs faster too
>>>> (although it is not as optimized as the TEXT format parsing). So now
>>>> TEXT, CSV, and BINARY are all parsed in CopyFrom(), like in the original
>>> Hi Alon,
>>> I'm curious, what kind of system are you testing this on? I'm trying to
>>> load 100GB of data for our dbt3 workload on a 4-way Itanium 2. I'm
>>> interested in what results you would expect.