It seems a user visit cookie's expiration has been improperly set and so the tables which represent a user's clickstream through the website are grouped improperly. I'd like to run a set-wise query to regroup them based on periods of inactivity greather than 1 hour. So, what we want to do is adjust the visit_id's in the page_view table to point to separate visits whenever the user has been idle for more than 1 hour. The query I've come up with looks like this:

CREATE TABLE visit_clean AS
SELECT id
     ,
       CASE
       WHEN stamp - last_stamp > INTERVAL '1 hour'
    OR last_stamp IS NULL THEN nextval('tracking_cleanup_seq')
       ELSE currval('tracking_cleanup_seq')
       END AS visit_id
     , visit_id AS old_visit_id
     , uri
     , params
     , stamp
     , cindex
     , tindex
     , method
     , source_address
     , server_name
  FROM (
    SELECT id
        , visit_id
        , uri
        , params
        , stamp
        , cindex
        , tindex
        , method
        , source_address
        , (
        SELECT max(pv2.stamp)
          FROM page_view pv2
        WHERE pv2.visit_id            = pv1.visit_id
          AND pv2.stamp               < pv1.stamp
          ) AS last_stamp
        , server_name
      FROM page_view pv1
       ) x;

It works reasonably well on a small data set. But, the plan for this is horrible on the real data.

                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on page_view pv1  (cost=0.00..2634763281.70 rows=3588374 width=239)
   SubPlan
     ->  Result  (cost=367.09..367.10 rows=1 width=0)
           InitPlan
             ->  Limit  (cost=0.00..367.09 rows=1 width=8)
                   ->  Index Scan Backward using page_view_stamp_idx on 
page_view pv2  (cost=0.00..158215.86 rows=431 width=8)
                         Index Cond: (stamp < $1)
                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))
     ->  Result  (cost=367.09..367.10 rows=1 width=0)
           InitPlan
             ->  Limit  (cost=0.00..367.09 rows=1 width=8)
                   ->  Index Scan Backward using page_view_stamp_idx on 
page_view pv2  (cost=0.00..158215.86 rows=431 width=8)
                         Index Cond: (stamp < $1)
                         Filter: ((stamp IS NOT NULL) AND (visit_id = $0))

The page_view table is 829MB in size and has 3,590,185 rows.

I let the query run on my test server for about 26hrs without finishing, so I can't provide explain analyze output yet.

Does anyone have a better method of separating this data out?

---
Jeff Frost, Owner       <[EMAIL PROTECTED]>
Frost Consulting, LLC   http://www.frostconsultingllc.com/
Phone: 650-780-7908     FAX: 650-649-1954


---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly

Reply via email to