We had the same problems with performance when testing with more than 100
billion weather observations. We now have a solution where we can push between
300 and 400 million weather observations pr. Second into the database.
We download date from NetCDF files. The date are joined based on time and
geolocation, so data from many different NetCDF should end up in the same row
Doing this based on database join and table update was taking a long time as
you also have noticed.
To get high performance we ended up this solution
- Uses os commands like awk,sed,join,cut,.. to prepare CSV file for the
- Use multithreads.
- Insert data directly into child tables.
- No triggers, constraints and indexes on working table.
- Don't update rows.
- Unlogged tables.
We first download NetCDF and make CSV files that fits in perfect for the copy a
command and with complete files for each child tables it's created for, this is
a time consuming operation.
So before the copy in into database we just do a truncate on the selected
table. We are then able to insert between 300 and 400 mill. weather
observations pr. Second. We have 11 observations pr row so it means around 35
mill rows pr second. We have one child table for each year and month.
The database we working on have 16 dual core CPU's and SSD discs. When testing
I was running 11 threads in parallel.
Indexes and constraints are added later based on needs.
How can you take on chance om using something like unlogged tables?
Linux system are quite stable and if we keep the a copy of the CVS files it
does not take long time to insert data after crash.
You can also change your table to logged later if you need to secure your data
in the database.
Fra: Mariel Cherkassky <mariel.cherkas...@gmail.com>
Sendt: 28. januar 2018 10:11
Til: PostgreSQL mailing lists
Emne: copy csv into partitioned table with unique index
I configured a master table that is called "year_2018" :
create table year_2018(a int,b int c date);
The master table has a unique constraint on those 3 columns so that I wont have
any duplicated rows. Moreover, I configured a before insert trigger on that
table that creates a child table for each day in the year. The child should
include all the data related to that specific day.
Now, every day I got a csv file that I need to load to the year table. I must
load the data as fast as possible but I have 2 problems :
1)I must load the data as a bulk - via the copy command. However, the copy
command fails because sometimes I have duplicated rows.
2)I tried to use the pgloader extension but it fails because I have a trigger
before each insert.
-I cant load all the data into a temp table and then run insert into year_2018
select * from temp because it takes too much time.
Any idea ?