Re: [GENERAL] COPY v. java performance comparison

Jeff Janes Thu, 03 Apr 2014 12:01:49 -0700

On Wed, Apr 2, 2014 at 5:11 PM, Rob Sargent <[email protected]> wrote:


>  On 04/02/2014 04:36 PM, Jeff Janes wrote:
>
>  On Wed, Apr 2, 2014 at 12:37 PM, Rob Sargent <[email protected]>wrote:
>
>>  I'm playing with various data models to compare performance and
>> practicalities and not sure if I should be surprised by the numbers I'm
>> getting. I hope this report isn't too wishy-washy for reasoned comment.
>>
>> One model says a genotype is defined as follows:
>>
>>          Table "public.oldstyle"
>> +-------------+--------------+-----------+
>> |   Column    |     Type     | Modifiers |
>> +-------------+--------------+-----------+
>> | id          | uuid         | not null  |
>> | sample_name | text         | not null  |
>> | marker_name | text         | not null  |
>> | allele1     | character(1) |           |
>> | allele2     | character(1) |           |
>> +-------------+--------------+-----------+
>> (0. id is a Primary Key)
>> (1. Take what you will from the table name.)
>> (2. I hadn't thought of "char" type at this point)
>> (3. Ultimately the names would become ids, RI included)
>> (4. We're loading 39 samples and ~950K markers)
>>
>> I loaded 37M+ records using jOOQ (batching every 1000 lines) in 12+ hours
>> (800+ records/sec).  Then I tried COPY and killed that after 11.25 hours
>> when I realised that I had added on non-unque index on the name fields
>> after the first load. By that point is was on line 28301887, so ~0.75 done
>> which implies it would have take ~15hours to complete.
>>
>> Would the overhead of the index likely explain this decrease in
>> throughput?
>>
>
>  Absolutely.
>
>
>>
>> Impatience got the better of me and I killed the second COPY.  This time
>> it had done 54% of the file in 6.75 hours, extrapolating to roughly 12
>> hours to do the whole thing.
>>
>
>  Are you sure you actually dropped the indices?  (And the primary key?)
>
>  I get about 375,000 lines per second with no indexes, triggers,
> constraints.
>
>  perl -le 'my $x="000000000000"; foreach(1..37e6) {$x++; print join "\t",
> "a0eebc99-9c0b-4ef8-bb6d-$x",$_,$_,"A","T"}'|time psql -c 'truncate
> oldstyle; copy oldstyle from stdin;'
>
>  (More if I truncate it in the same transaction as the copy)
>
>  If you can't drop the pk constraint, can you at least generate the
> values in sort-order?
>
>  Cheers,
>
>  Jeff
>
> I restarted the java-based loading.  Still with pk in place,  (and
> actually generating the UUIDs etc) I got ~1.1M rows in one minute. And one 
> "LOG:
> checkpoints are occurring too frequently (24 seconds apart)" with
> checkpoint_segment now at 6 (v. default 3). In  plotting the records v.
> time in that minute at least it's pretty linear.  The time per batch is
> relatively constant with a few hiccups. (.5 sec per 10K lines).  So I've
> improved things immensely but not entirely sure that the simple config
> change is the reason.  If this continued I would be done inside 40 minutes.
>
> With copy interuptus I now get 2.9M records per minute so the load would
> take only 12 or so minutes. I did get four reports of too-frequent
> checkpoints 2 at 15 seconds 2 at 9 seconds.
>
> I'll need to let each on go to completion.
>
> If these numbers are at all accurate and realistic, I'm still impressed
> with jOOQ, though COPY is rightfully back to its proper place as fastest
> way to load.
>

JOOQ is probably hooking into the same API that COPY uses (or if not, it
should be), so it isn't surprising that they perform similarly.

Cheers,

Jeff

Re: [GENERAL] COPY v. java performance comparison

Reply via email to