Thanks for the info.
Yes, taking indexes out of the picture is a good idea for this purpose.
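Something like the following is what I have in mind -- a minimal sketch only, with placeholder index/table/path names, not taken from your actual schema:

```sql
-- Drop the index before the bulk load so COPY measures parse/insert
-- speed only, then rebuild it afterwards. Names here are hypothetical.
DROP INDEX i_c_nationkey;
COPY customer FROM '/data/dbt3/customer.tbl' WITH DELIMITER '|';
CREATE INDEX i_c_nationkey ON customer (c_nationkey);
```

Rebuilding the index once after the load is usually much cheaper than maintaining it row-by-row during the COPY, and it keeps the index cost out of the measurement.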
I can't really guess how fast the load rate should be. I don't know how
your system is configured or what its hardware characteristics are (and
even if I knew all that, I might not be able to guess...). I am pretty
confident that the load will be faster than before; I'll risk that ;-)
Looking at your TPC-H sizes and metadata, I estimate that partsupp,
customer, and orders will show the most significant increase in load
rate. You could start with those.
I guess the only way to really know is to try... Load several times with the
existing PG-COPY and then load several times with the patched COPY and
compare. I'll be curious to hear your results.
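A rough sketch of such a comparison from psql, using \timing to report elapsed time per statement (table and file names below are placeholders):

```sql
-- \timing prints the wall-clock time of each statement that follows.
-- Truncate between runs, repeat a few times per COPY build
-- (stock vs. patched), and compare the averages.
\timing
TRUNCATE TABLE customer;
COPY customer FROM '/data/dbt3/customer.tbl' WITH DELIMITER '|';
```

Running each load several times and averaging helps smooth out OS cache effects between the two builds.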
On 7/19/05 2:37 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
> Hi Alon,
> Yeah, that helps. I just need to break up my scripts a little so they
> load the data without building indexes.
> Is the following information good enough for a guess about the data
> I'm loading, if you don't mind? ;) Here's a link to my script to create
> File sizes:
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 customer.tbl
> -rw-r--r-- 1 markw 50 74G Jul 8 15:03 lineitem.tbl
> -rw-r--r-- 1 markw 50 2.1K Jul 8 15:03 nation.tbl
> -rw-r--r-- 1 markw 50 17G Jul 8 15:03 orders.tbl
> -rw-r--r-- 1 markw 50 2.3G Jul 8 15:03 part.tbl
> -rw-r--r-- 1 markw 50 12G Jul 8 15:03 partsupp.tbl
> -rw-r--r-- 1 markw 50 391 Jul 8 15:03 region.tbl
> -rw-r--r-- 1 markw 50 136M Jul 8 15:03 supplier.tbl
> Number of rows:
> # wc -l *.tbl
> 15000000 customer.tbl
> 600037902 lineitem.tbl
> 25 nation.tbl
> 150000000 orders.tbl
> 20000000 part.tbl
> 80000000 partsupp.tbl
> 5 region.tbl
> 1000000 supplier.tbl
> On Tue, 19 Jul 2005 14:05:56 -0700
> "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
>> Hi Mark,
>> I improved the data *parsing* capabilities of COPY, and didn't touch the
>> data conversion or data insertion parts of the code. The parsing improvement
>> will vary largely depending on the ratio of parsing to converting and
>> inserting. Therefore, the speed increase really depends on the nature of your data:
>> A 100GB file with
>> - long data rows (lots of parsing),
>> - a small number of columns (few attribute conversions per row), and
>> - fewer rows (fewer tuple insertions)
>> will show the best performance improvement.
>> However, a 100GB file with
>> - short data rows (minimal parsing),
>> - a large number of columns (many attribute conversions per row), and
>> - more rows (more tuple insertions)
>> will show improvements, but not as significant ones.
>> In general, I'd estimate a 40%-95% improvement in load speed for the 1st case
>> and 10%-40% for the 2nd, but that also depends on the hardware, disk speed,
>> etc... This is for the TEXT format. CSV may also be faster, but not by as
>> much as specified here. BINARY will stay the same as before.
>> On 7/19/05 12:54 PM, "Mark Wong" <[EMAIL PROTECTED]> wrote:
>>> On Thu, 14 Jul 2005 17:22:18 -0700
>>> "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:
>>>> I revisited my patch, removed the code duplication that was there, and
>>>> added support for CSV with buffered input, so CSV now runs faster too
>>>> (although it is not as optimized as the TEXT format parsing). So now
>>>> TEXT, CSV, and BINARY are all parsed in CopyFrom(), like in the original
>>> Hi Alon,
>>> I'm curious, what kind of system are you testing this on? I'm trying to
>>> load 100GB of data for our dbt3 workload on a 4-way Itanium 2. I'm
>>> interested in what results you would expect.