On 24 Dec 2009, at 21:01, Doug El wrote:

> Hi,
> 
> I have to summarize some data to be queried and there are about 5 million raw 
> records a day I need to summarize. In a nutshell I don't think I'm laying it 
> out in an optimal fashion, or not taking advantage of some postgres features 
> perhaps, I'm looking for feedback.
> 
> The raw incoming data is in the form of
> 
> ip string uint uint uint uint
> 
> So for any given record say:
> 
> 8.8.8.8 helloworld 1 2 3 4
> 
> First, I need to be able to query how many total and how many unique requests 
> there were (unique by ip), over given time frame.
> 
> So for the below data on the same day that's total two, but one unique
> 
> 8.8.8.8 helloworld 1 2 3 4
> 8.8.8.8 helloworld 1 2 3 4
> 
> Further for all fields (but ip which is not stored) I need to be able to 
> query and get total/unique counts based off any combination of criteria.
> 
> So if I refer to them as columns A-E
> 
> A       B       C       D       E
> string  uint    uint    uint    uint

That's going to be a sequential scan no matter how you write it - even if you 
follow Scott's advice, although in that case the scan happens at a more 
convenient time and only once. To optimise this, first minimise the number of 
scans you need to get the desired results, and then minimise the amount of 
work the database has to perform per row.

I think a good start would be:
SELECT COUNT(*) AS total, COUNT(DISTINCT A||B::text||C::text||D::text||E::text)
  FROM table;

Be wary of NULL values as those will make the result of the concatenation NULL 
as well. Coalesce() is your friend if that happens.
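
Something along these lines, for example (untested; raw_data stands in for 
your table name and '~' for a value that can't occur in the real data). A 
separator between the fields also avoids accidental collisions such as 
'1'||'23' versus '12'||'3':

SELECT COUNT(*) AS total,
       COUNT(DISTINCT COALESCE(A, '~') || '|'
                   || COALESCE(B::text, '~') || '|'
                   || COALESCE(C::text, '~') || '|'
                   || COALESCE(D::text, '~') || '|'
                   || COALESCE(E::text, '~')) AS uniq
  FROM raw_data;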

To further speed up the query you could pre-generate the concatenation of those 
columns, either when you insert the data or with a nightly cron-job. In the 
latter case make sure it runs after all the data has arrived and before anyone 
queries that column or there will be NULLs in it.
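
Roughly like this for the nightly-update variant (untested, names made up; 
"abcde" is the pre-built key column):

ALTER TABLE raw_data ADD COLUMN abcde text;

-- nightly cron job, run after the day's data has arrived:
UPDATE raw_data
   SET abcde = COALESCE(A, '~') || '|'
            || COALESCE(B::text, '~') || '|'
            || COALESCE(C::text, '~') || '|'
            || COALESCE(D::text, '~') || '|'
            || COALESCE(E::text, '~')
 WHERE abcde IS NULL;

-- after which the summary query reduces to:
SELECT COUNT(*) AS total, COUNT(DISTINCT abcde) AS uniq
  FROM raw_data;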

> I need to be able to say how many where col A = 'helloworld' and say col C = 
> 4, or perhaps col E = 4 and col C < 3, etc. - any combination.

I don't see what the problem is here? Aren't those just queries like:
SELECT COUNT(*) FROM table WHERE A = 'helloworld' AND C = 4;
SELECT COUNT(*) FROM table WHERE E = 4 AND C < 3;

If you know beforehand which conditions you want to query you can do them all 
in one go - it will result in a sequential scan though:

SELECT SUM((A = 'helloworld' AND C = 4)::int), SUM((E = 4 AND C < 3)::int), ...
  FROM table;

This makes use of the fact that a boolean result cast to int yields 0 for 
false and 1 for true.
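
A quick way to see that for yourself:

SELECT true::int AS t, false::int AS f;  -- returns 1 and 0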

If the conditions you want to summarise are fixed (never change) then you could 
store their respective values in a separate column using a bit-field or 
something of the kind, or in multiple columns with descriptive names.
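
For instance (untested, column names invented), with one boolean column per 
fixed condition, filled in at insert time or by the nightly job:

ALTER TABLE raw_data
  ADD COLUMN is_helloworld_c4 boolean,
  ADD COLUMN is_e4_c_lt3 boolean;

UPDATE raw_data
   SET is_helloworld_c4 = (A = 'helloworld' AND C = 4),
       is_e4_c_lt3      = (E = 4 AND C < 3)
 WHERE is_helloworld_c4 IS NULL;

-- counting then becomes a cheap sum:
SELECT SUM(is_helloworld_c4::int) AS helloworld_c4,
       SUM(is_e4_c_lt3::int)      AS e4_c_lt3
  FROM raw_data;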

> The database however is still quite huge and grows very fast, even simple 
> daily queries are fairly slow even on a fast server. I have a few indexes on 
> what I know are common columns queried against but again, any combination of 
> data can be queried, and  indexes do increase db size of course.

Indexes don't help for queries where a sequential scan is used, so you don't 
need to worry about intricate index designs for such queries. They certainly 
do help (a lot!) for queries that touch only a minority of the total rows.
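
For example (index and table names made up), if 'helloworld' only matches a 
small fraction of the rows, an index like this lets the planner skip the 
sequential scan:

CREATE INDEX raw_data_a_idx ON raw_data (A);

SELECT COUNT(*) FROM raw_data WHERE A = 'helloworld';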

> I feel like there's got to be some better way to organize this data and make 
> it searchable.  Overall speed is more important than disk space usage for 
> this application. 

Anything that you can calculate at a convenient time before you need to query 
it is a win to store, as long as querying the results isn't more work than 
querying the original data.
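
A minimal sketch of that idea for the daily total/unique counts (untested; 
assumes a timestamp column named "ts" and the pre-built "abcde" key from 
above):

CREATE TABLE daily_summary (
    day   date PRIMARY KEY,
    total bigint NOT NULL,
    uniq  bigint NOT NULL
);

-- nightly job, run once yesterday's data is complete:
INSERT INTO daily_summary (day, total, uniq)
SELECT ts::date, COUNT(*), COUNT(DISTINCT abcde)
  FROM raw_data
 WHERE ts >= CURRENT_DATE - 1 AND ts < CURRENT_DATE
 GROUP BY ts::date;

Queries over a time frame then hit the small summary table instead of the raw 
data.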

I've used both cron jobs and triggers for such cases, successfully and 
unsuccessfully - the risk is that the derived data has to stay in sync with 
the data it was derived from, or you end up with wrong results. At a previous 
job we maintained a materialised view over about ten tables using triggers, 
but we noticed the results didn't always match what we expected. Figuring out 
where it went wrong took more time than we could afford, so we ended up using 
a nightly cron-job instead.

In short: use triggers to pre-calculate data in simple cases, but prefer cron 
jobs for the complicated ones - debugging complicated triggers can be 
time-consuming.
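
To illustrate the simple end of that trade-off (untested, all names invented), 
a trigger that maintains a running count per value of A:

CREATE TABLE a_counts (
    a     text PRIMARY KEY,
    total bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_a_count() RETURNS trigger AS $$
BEGIN
    -- note: ignores concurrent-insert races, this is just a sketch
    UPDATE a_counts SET total = total + 1 WHERE a = NEW.A;
    IF NOT FOUND THEN
        INSERT INTO a_counts (a, total) VALUES (NEW.A, 1);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER raw_data_count AFTER INSERT ON raw_data
    FOR EACH ROW EXECUTE PROCEDURE bump_a_count();

Anything much more involved than that is where I'd switch to the cron job.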

Regards,
Alban Hertroys

--
Screwing up is the best way to attach something to the ceiling.




