On Wed, Apr 2, 2014 at 5:11 PM, Rob Sargent <robjsarg...@gmail.com> wrote:
> On 04/02/2014 04:36 PM, Jeff Janes wrote: > > On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <robjsarg...@gmail.com>wrote: > >> I'm playing with various data models to compare performance and >> practicalities and not sure if I should be surprised by the numbers I'm >> getting. I hope this report isn't too wishy-washy for reasoned comment. >> >> One model says a genotype is defined as follows: >> >> Table "public.oldstyle" >> +-------------+--------------+-----------+ >> | Column | Type | Modifiers | >> +-------------+--------------+-----------+ >> | id | uuid | not null | >> | sample_name | text | not null | >> | marker_name | text | not null | >> | allele1 | character(1) | | >> | allele2 | character(1) | | >> +-------------+--------------+-----------+ >> (0. id is a Primary Key) >> (1. Take what you will from the table name.) >> (2. I hadn't thought of "char" type at this point) >> (3. Ultimately the names would become ids, RI included) >> (4. We're loading 39 samples and ~950K markers) >> >> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours >> (800+ records/sec). Then I tried COPY and killed that after 11.25 hours >> when I realised that I had added on non-unque index on the name fields >> after the first load. By that point is was on line 28301887, so ~0.75 done >> which implies it would have take ~15hours to complete. >> >> Would the overhead of the index likely explain this decrease in >> throughput? >> > > Absolutely. > > >> >> Impatience got the better of me and I killed the second COPY. This time >> it had done 54% of the file in 6.75 hours, extrapolating to roughly 12 >> hours to do the whole thing. >> > > Are you sure you actually dropped the indices? (And the primary key?) > > I get about 375,000 lines per second with no indexes, triggers, > constraints. > > perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t", > "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate > oldstyle; copy oldstyle from stdin;' > > (More if I truncate it in the same transaction as the copy) > > If you can't drop the pk constraint, can you at least generate the > values in sort-order? > > Cheers, > > Jeff > > I restarted the java-based loading. Still with pk in place, (and > actually generating the UUIDs etc) I got ~1.1M rows in one minute. And one > "LOG: > checkpoints are occurring too frequently (24 seconds apart)" with > checkpoint_segment now at 6 (v. default 3). In plotting the records v. > time in that minute at least it's pretty linear. The time per batch is > relatively constant with a few hiccups. (.5 sec per 10K lines). So I've > improved things immensely but not entirely sure that the simple config > change is the reason. If this continued I would be done inside 40 minutes. > > With copy interuptus I now get 2.9M records per minute so the load would > take only 12 or so minutes. I did get four reports of too-frequent > checkpoints 2 at 15 seconds 2 at 9 seconds. > > I'll need to let each on go to completion. > > If these numbers are at all accurate and realistic, I'm still impressed > with jOOQ, though COPY is rightfully back to its proper place as fastest > way to load. > JOOQ is probably hooking into the same API that COPY uses (or if not, it should be), so it isn't surprising that they perform similarly. Cheers, Jeff