[GENERAL] Optimizing data layout for reporting in postgres

Doug El Thu, 24 Dec 2009 13:08:05 -0800

Hi,

I have to summarize some data to be queried and there are about 5 million raw 
records a day I need to summarize. In a nutshell I don't think I'm laying it 
out in an optimal fashion, or not taking advantage of some postgres features 
perhaps, I'm looking for feedback.


The raw incoming data is in the form of

ip string uint uint uint uint

So for any given record say:

8.8.8.8 helloworld 1 2 3 4

First, I need to be able to query how many total and how many unique requests 
there were (unique by ip), over given time frame.

So for the below data on the same day that's total two, but one unique

8.8.8.8 helloworld 1 2 3 4
8.8.8.8 helloworld 1 2 3 4

Further for all fields (but ip which is not stored) I need to be able to query 
and get total/unique counts based off any combination of criteria.

So if I refer to them as columns A-E

A               B               C               D               E
string  uint    uint    uint    uint

I need to be able and say how many where col A = 'helloworld' and say col C = 4.
Or  perhaps col E = 4 and col c < 3 etc, any combination.

The only way I could see to do this was to take the 5 million daily raw 
records, sort them, then summarize that list with total and unique counts as so:

A               B                       C               D               E       
        F               G               H
date    stringid        uint    uint    uint    uint    total   unique

Primary key is A-F (date stringid uint uint uint uint)

This gives me a summary of about 900k records a day from the 4 million raw.

I have things organized with monthly tables and yearly schemas. The string 
column also has its own monthly lookup table, so there's just a string id 
that's looked up.

The database however is still quite huge and grows very fast, even simple daily 
queries are fairly slow even on a fast server. I have a few indexes on what I 
know are common columns queried against but again, any combination of data can 
be queried, and  indexes do increase db size of course.

I feel like there's got to be some better way to organize this data and make it 
searchable.  Overall speed is more important than disk space usage for this 
application. 

Perhaps there are some native features in postgres I'm not taking advantage of 
here, that would tip the scales in my favor. I've done a fair amount of 
research on the configuration file settings and feel like I have a fairly 
optimized config for it as far as that goes, and have done the things mentioned 
here: http://wiki.postgresql.org/wiki/SlowQueryQuestions

Very much appreciate any suggestions, thank you in advance.

Doug


      


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[GENERAL] Optimizing data layout for reporting in postgres

Reply via email to