We had the same performance problems when testing with more than 100 
billion weather observations. We now have a solution where we can push between 
300 and 400 million weather observations per second into the database.

We download data from NetCDF files. The data are joined based on time and 
geolocation, so data from many different NetCDF files should end up in the same 
row and table.

Doing this with database joins and table updates was taking a long time, as 
you have also noticed.

To get high performance we ended up with this solution:

- Use OS commands like awk, sed, join, cut, ... to prepare the CSV files for the 
copy command.

- Use multiple threads.

- Insert data directly into the child tables.

- No triggers, constraints or indexes on the working tables.

- Don't update rows.

- Use unlogged tables.

We first download the NetCDF files and build CSV files that fit the copy 
command perfectly, with a complete file for each child table they are created 
for; this is a time-consuming operation.
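A minimal sketch of that preparation step, using sort and join as described in the list above. The file names, delimiter and columns here are invented for illustration; our real layout differs:

```shell
# Join a temperature extract and a pressure extract on a shared
# "time,lat,lon" key, so each output row holds all observations for one
# point in time and space, ready for COPY. Sample data is made up.
printf '2018-01-01T00;59.9;10.7;-4.2\n2018-01-01T00;60.1;10.8;-3.9\n' > temp.csv
printf '2018-01-01T00;59.9;10.7;1013\n2018-01-01T00;60.1;10.8;1009\n' > pres.csv

# Collapse the three key fields into a single join key with awk, then sort,
# because join(1) requires both inputs sorted on the join field.
awk -F';' '{print $1","$2","$3";"$4}' temp.csv | sort > temp.keyed
awk -F';' '{print $1","$2","$3";"$4}' pres.csv | sort > pres.keyed

# Join on field 1 (everything before the first ';'), one row per location/time.
join -t';' temp.keyed pres.keyed > joined.csv
cat joined.csv
```

Each output line then carries the key plus one column per source file, which maps directly onto one row of the target child table.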

So before the copy into the database we just truncate the selected child 
table. We are then able to insert between 300 and 400 million weather 
observations per second. We have 11 observations per row, so that means around 
35 million rows per second. We have one child table for each year and month.
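In SQL terms the load of one monthly child looks roughly like this (table and file names are illustrative, not our real schema):

```sql
-- One child table per year and month; refresh it wholesale before loading.
TRUNCATE observations_2018_01;
COPY observations_2018_01
    FROM '/data/csv/observations_2018_01.csv'
    WITH (FORMAT csv);
```

Because the CSV file is prepared to match the child table exactly, COPY never has to route rows or fire triggers.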

The database server we are working on has 16 dual-core CPUs and SSD disks. 
When testing I was running 11 threads in parallel.
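A rough sketch of that fan-out, driving up to 11 loaders at a time with xargs. Here `wc -l` stands in for the real per-file loader (which would be a psql "\copy" into the matching child table), and the directory and file names are invented, so the sketch runs anywhere:

```shell
# Run one loader per prepared CSV, up to 11 in parallel, mirroring the 11
# threads used in the test. Replace "wc -l" with the real loader command,
# e.g. a psql \copy into the child table the file was prepared for.
mkdir -p /tmp/obs_csv
printf 'a,1\nb,2\n' > /tmp/obs_csv/obs_2018_01.csv
printf 'c,3\n'      > /tmp/obs_csv/obs_2018_02.csv
ls /tmp/obs_csv/*.csv | xargs -P 11 -n 1 wc -l
```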

Indexes and constraints are added later, based on need.
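For example, once a month's data is fully loaded and will no longer change, an index can be built on just that child (table and column names illustrative):

```sql
-- Build indexes only after the bulk load, and only on children that need them.
CREATE INDEX ON observations_2018_01 (obs_time, lat, lon);
```

Building the index once over finished data is far cheaper than maintaining it row by row during the load.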

How can we take the chance of using something like unlogged tables?

Linux systems are quite stable, and if we keep a copy of the CSV files it 
does not take long to re-insert the data after a crash.

You can also change your table to logged later if you need to secure the data 
in the database.
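Since PostgreSQL 9.5 that switch is a single command (table definition is illustrative):

```sql
-- Load without WAL overhead, then turn logging back on once the data is in.
CREATE UNLOGGED TABLE observations_2018_01 (
    obs_time timestamptz,
    lat      real,
    lon      real,
    value    real
);
-- ... bulk COPY here ...
ALTER TABLE observations_2018_01 SET LOGGED;
```

Note that SET LOGGED rewrites the table and writes it to the WAL, so it is best done once per child after its load is complete.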


From: Mariel Cherkassky <mariel.cherkas...@gmail.com>
Sent: 28 January 2018 10:11
To: PostgreSQL mailing lists
Subject: copy csv into partitioned table with unique index

I configured a master table that is called "year_2018":
create table year_2018(a int, b int, c date);

The master table has a unique constraint on those 3 columns so that I won't have 
any duplicated rows. Moreover, I configured a before-insert trigger on that 
table that creates a child table for each day in the year. Each child should 
include all the data related to that specific day.

Now, every day I get a csv file that I need to load into the year table. I must 
load the data as fast as possible, but I have 2 problems:
1) I must load the data in bulk, via the copy command. However, the copy 
command fails because sometimes I have duplicated rows.
2) I tried to use the pgloader extension, but it fails because I have a trigger 
before each insert.

- I can't load all the data into a temp table and then run "insert into 
year_2018 select * from temp" because it takes too much time.

Any ideas?
