I have a database of hundreds of millions of web links (between sites)
in Postgres.  For each link, we record the url, the referer, and the
most recent date the link exists.  I'm having some serious performance
issues when it comes to writing new data into the database.

One machine is simply not going to be able to scale with the quantities
of links we hope to store information about and we want to move to some
kind of cluster.  Because of the quantities of data, it seems to make
sense to go for a cluster setup such that in a 4 machine cluster, each
machine has a quarter of the data (is this "Share nothing," or, "Share
everything"?).  To that end, we figured a good first step was to
partition the data on one machine into multiple tables defining the
logic which would find the appropriate table given a piece of data.
Then, we assumed, adding the logic to find the appropriate machine and
database in our cluster would only be an incremental upgrade.

We implemented a partitioning scheme that segments the data by the
referring domain of each link.  This is clearly not the most regular
(in terms of even distribution) means of partitioning, but the data in
each table is most likely related to each other, so queries would hit
the smallest number of tables.  We currently have around 400,000 tables
and I would estimate that the vast majority of these tables are
relatively small (less than 200 rows).

Our queries use UNION ALL to combine data from multiple tables (when
that's applicable, never more than 1000 tables at once, usually much
fewer).  When writing to the database, the table for the referring
domain is locked while data is added and updated for the whole
referring domain at once.  We only store one copy of each link, so when
loading we have to do a SELECT (for the date) then INSERT or UPDATE
where applicable for each link.

At this point, the primary performance bottleneck is in adding
additional data to the database.  Our loader program (we load text
files of link information) is currently getting about 40 rows a second,
which is nowhere near the performance we need to be seeing.  In theory,
we want to be able to re-write our entire archive of data within on a
1-2 month cycle, so this is a very heavy write application (though
we're also constantly generating reports from the data, so its not
write only).

Is the total number of tables prohibitively affecting our write speed
or is that an IO problem that can only be addressed by better drive
partitioning (all data is on one drive, which I've already read is a
problem)?  Is this approach to data partitioning one which makes any
sense for performance, or should we move to a more normal distribution
of links across fewer tables which house more rows each?

Thanks in advance for your advice.


---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not

Reply via email to