Hah! Yeah, that might work. Except that I suck at grep. :( Perhaps that's a weakness I should remedy.
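For what it's worth, a minimal sketch of the grep idea from Craig's reply below (the log format is assumed here: tab-separated lines with an ISO-style timestamp field):

```shell
# Hypothetical log lines; the real files would come from the app servers.
printf 'u1\t2012-03-01 10:15:00\tGET\n'  >  activity.log
printf 'u2\t2012-03-01 11:02:00\tGET\n'  >> activity.log
printf 'u3\t2012-03-01 10:59:59\tPOST\n' >> activity.log

# Keep only lines from the 10:00 hour; the result could then be fed
# straight to COPY ... FROM STDIN instead of loading everything first.
grep '2012-03-01 10:' activity.log > activity_10.log
cat activity_10.log   # keeps the u1 and u3 lines, drops u2
```

This only works cleanly if the timestamp string in the raw log already sorts by hour, but for filtering before the COPY that's all that's needed.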
On Thu, Mar 1, 2012 at 10:35 AM, Craig James <cja...@emolecules.com> wrote:
> On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
> <alessan...@path.com> wrote:
> > Hi folks,
> >
> > I have a system that racks up about 40M log lines per day. I'm able to
> > COPY the log files into a PostgreSQL table that looks like this:
>
> Since you're using a COPY command and the table has a simple column
> with exactly the value you want, why not filter it using grep(1) or
> something similar and load the filtered result directly into the
> hourly table?
>
> Craig
>
> > CREATE TABLE activity_unlogged
> > (
> >   user_id character(24) NOT NULL,
> >   client_ip inet,
> >   hr_timestamp timestamp without time zone,
> >   locale character varying,
> >   log_id character(36),
> >   method character varying(6),
> >   server_ip inet,
> >   uri character varying,
> >   user_agent character varying
> > )
> >
> > Now, I want to reduce that data to get the last activity that was
> > performed by each user in any given hour. It should fit into a table
> > like this:
> >
> > CREATE TABLE hourly_activity
> > (
> >   activity_hour timestamp without time zone NOT NULL,
> >   user_id character(24) NOT NULL,
> >   client_ip inet,
> >   hr_timestamp timestamp without time zone,
> >   locale character varying,
> >   log_id character(36),
> >   method character varying(6),
> >   server_ip inet,
> >   uri character varying,
> >   user_agent character varying,
> >   CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour, user_id)
> > )
> >
> > where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the
> > primary key constraint)
> >
> > I am attempting to do that with the following:
> >
> > INSERT INTO hourly_activity
> >   SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
> >          activity_unlogged.user_id,
> >          client_ip, hr_timestamp, locale, log_id, method,
> >          server_ip, uri, user_agent
> >     FROM activity_unlogged,
> >          (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
> >             FROM activity_unlogged
> >            GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity
> >    WHERE activity_unlogged.user_id = last_activity.user_id
> >      AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;
> >
> > I have two problems:
> >
> > It's incredibly slow (like: hours). I assume this is because I am
> > scanning through a huge unindexed table twice. I imagine there is a
> > more efficient way to do this, but I can't think of what it is. If I
> > were doing this in a procedural programming language, it might look
> > something like:
> >
> > for row in activity_unlogged:
> >     if (date_trunc('hour', hr_timestamp), user_id) in hourly_activity[(activity_hour, user_id)]:
> >         if hr_timestamp > hourly_activity[(date_trunc('hour', hr_timestamp), user_id)][hr_timestamp]:
> >             hourly_activity <- row  # UPDATE
> >     else:
> >         hourly_activity <- row  # INSERT
> >
> > I suspect some implementation of this (hopefully my pseudocode is at
> > least somewhat comprehensible) would be very slow as well, but at
> > least it would only go through activity_unlogged once. (Then again,
> > it would have to rescan hourly_activity each time, so it really
> > wouldn't be any faster at all, would it?) I feel like there must be a
> > more efficient way to do this in SQL, though I can't put my finger on it.
> >
> > Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE
> > activity_unlogged.user_id = last_activity.user_id AND
> > activity_unlogged.hr_timestamp = last_activity.last_timestamp leads
> > to multiple records, leading to a primary key collision. In such
> > cases, I don't really care which of the two rows is picked, I just
> > want to make sure that no more than one row is inserted per user per
> > hour. In fact, though I would prefer to get the last row for each
> > hour, I could probably get much the same effect if I just limited it
> > to one per hour. Though I don't know if that really helps at all.
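For the duplicate-timestamp problem quoted above, one possible single-pass rewrite (a sketch only, using PostgreSQL's DISTINCT ON; table and column names are taken from the quoted schema) would be:

```sql
-- One scan of activity_unlogged; DISTINCT ON keeps exactly one row
-- per (hour, user), so the primary key on hourly_activity can't collide.
INSERT INTO hourly_activity
SELECT DISTINCT ON (date_trunc('hour', hr_timestamp), user_id)
       date_trunc('hour', hr_timestamp) AS activity_hour,
       user_id, client_ip, hr_timestamp, locale, log_id, method,
       server_ip, uri, user_agent
FROM activity_unlogged
-- DESC on hr_timestamp makes DISTINCT ON keep the *last* row in each
-- hour; ties on hr_timestamp are broken arbitrarily.
ORDER BY date_trunc('hour', hr_timestamp), user_id, hr_timestamp DESC;
```

Since rows sharing the same hr_timestamp are kept arbitrarily, this matches the "I don't really care which of the two rows is picked" requirement, and it avoids the self-join entirely.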