Hah! Yeah, that might work. Except that I suck at grep. :( Perhaps that's a weakness I should remedy.
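For what it's worth, a minimal sketch of the grep idea from Craig's reply below (the log format is assumed here: tab-separated lines with an ISO-style timestamp field):

```shell
# Hypothetical log lines; the real files would come from the app servers.
printf 'u1\t2012-03-01 10:15:00\tGET\n'  >  activity.log
printf 'u2\t2012-03-01 11:02:00\tGET\n'  >> activity.log
printf 'u3\t2012-03-01 10:59:59\tPOST\n' >> activity.log

# Keep only lines from the 10:00 hour; the result could then be fed
# straight to COPY ... FROM STDIN instead of loading everything first.
grep '2012-03-01 10:' activity.log > activity_10.log
cat activity_10.log   # keeps the u1 and u3 lines, drops u2
```

This only works cleanly if the timestamp string in the raw log already sorts by hour, but for filtering before the COPY that's all that's needed.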
On Thu, Mar 1, 2012 at 10:35 AM, Craig James <cja...@emolecules.com> wrote:
> On Thu, Mar 1, 2012 at 10:27 AM, Alessandro Gagliardi
> <alessan...@path.com> wrote:
> > Hi folks,
> >
> > I have a system that racks up about 40M log lines per day. I'm able to
> > COPY the log files into a PostgreSQL table that looks like this:
>
> Since you're using a COPY command and the table has a simple column
> with exactly the value you want, why not filter it using grep(1) or
> something similar and load the filtered result directly into the
> hourly table?
>
> Craig
>
> > CREATE TABLE activity_unlogged
> > (
> >   user_id character(24) NOT NULL,
> >   client_ip inet,
> >   hr_timestamp timestamp without time zone,
> >   locale character varying,
> >   log_id character(36),
> >   method character varying(6),
> >   server_ip inet,
> >   uri character varying,
> >   user_agent character varying
> > )
> >
> > Now, I want to reduce that data to get the last activity that was
> > performed by each user in any given hour. It should fit into a table
> > like this:
> >
> > CREATE TABLE hourly_activity
> > (
> >   activity_hour timestamp without time zone NOT NULL,
> >   user_id character(24) NOT NULL,
> >   client_ip inet,
> >   hr_timestamp timestamp without time zone,
> >   locale character varying,
> >   log_id character(36),
> >   method character varying(6),
> >   server_ip inet,
> >   uri character varying,
> >   user_agent character varying,
> >   CONSTRAINT hourly_activity_pkey PRIMARY KEY (activity_hour, user_id)
> > )
> >
> > where activity_hour is date_trunc('hour', hr_timestamp); (N.B. the
> > primary key constraint)
> >
> > I am attempting to do that with the following:
> >
> > INSERT INTO hourly_activity
> >   SELECT DISTINCT date_trunc('hour', hr_timestamp) AS activity_hour,
> >          activity_unlogged.user_id,
> >          client_ip, hr_timestamp, locale, log_id, method,
> >          server_ip, uri, user_agent
> >     FROM activity_unlogged,
> >          (SELECT user_id, MAX(hr_timestamp) AS last_timestamp
> >             FROM activity_unlogged
> >            GROUP BY user_id, date_trunc('hour', hr_timestamp)) AS last_activity
> >    WHERE activity_unlogged.user_id = last_activity.user_id
> >      AND activity_unlogged.hr_timestamp = last_activity.last_timestamp;
> >
> > I have two problems:
> >
> > It's incredibly slow (like: hours). I assume this is because I am
> > scanning through a huge unindexed table twice. I imagine there is a
> > more efficient way to do this, but I can't think of what it is. If I
> > were doing this in a procedural programming language, it might look
> > something like:
> >
> > for row in activity_unlogged:
> >     if (date_trunc('hour', hr_timestamp), user_id) in hourly_activity[(activity_hour, user_id)]:
> >         if hr_timestamp > hourly_activity[(date_trunc('hour', hr_timestamp), user_id)][hr_timestamp]:
> >             hourly_activity <- row  # UPDATE
> >     else:
> >         hourly_activity <- row  # INSERT
> >
> > I suspect some implementation of this (hopefully my pseudocode is at
> > least somewhat comprehensible) would be very slow as well, but at
> > least it would only go through activity_unlogged once. (Then again,
> > it would have to rescan hourly_activity each time, so it really
> > wouldn't be any faster at all, would it?) I feel like there must be a
> > more efficient way to do this in SQL, though I can't put my finger on it.
> >
> > Turns out (hr_timestamp, user_id) is not unique. So selecting WHERE
> > activity_unlogged.user_id = last_activity.user_id AND
> > activity_unlogged.hr_timestamp = last_activity.last_timestamp leads
> > to multiple records, leading to a primary key collision. In such
> > cases, I don't really care which of the two rows is picked, I just
> > want to make sure that no more than one row is inserted per user per
> > hour. In fact, though I would prefer to get the last row for each
> > hour, I could probably get much the same effect if I just limited it
> > to one per hour. Though I don't know if that really helps at all.
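For the duplicate-timestamp problem quoted above, one possible single-pass rewrite (a sketch only, using PostgreSQL's DISTINCT ON; table and column names are taken from the quoted schema) would be:

```sql
-- One scan of activity_unlogged; DISTINCT ON keeps exactly one row
-- per (hour, user), so the primary key on hourly_activity can't collide.
INSERT INTO hourly_activity
SELECT DISTINCT ON (date_trunc('hour', hr_timestamp), user_id)
       date_trunc('hour', hr_timestamp) AS activity_hour,
       user_id, client_ip, hr_timestamp, locale, log_id, method,
       server_ip, uri, user_agent
FROM activity_unlogged
-- DESC on hr_timestamp makes DISTINCT ON keep the *last* row in each
-- hour; ties on hr_timestamp are broken arbitrarily.
ORDER BY date_trunc('hour', hr_timestamp), user_id, hr_timestamp DESC;
```

Since rows sharing the same hr_timestamp are kept arbitrarily, this matches the "I don't really care which of the two rows is picked" requirement, and it avoids the self-join entirely.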