>I have an events table that records page views and purchases (type = 'viewed'
>or type='purchased'). I have a query that figures out "people who
>bought/viewed this also bought/viewed that".
>
>It worked fine, taking about 0.1 seconds to complete, until a few hours ago
>when it started taking hours to complete. Vacuum/analyze didn't help. Turned
>out there was one session_id that had 400k rows in the system. Deleting that
>made the query performant again.
>
>Is there anything I can do to make the query work better in cases like that?
>Missing index, or better query?
>
>This is on 9.3.5.
>
>The below is reproduced at the following URL if it's not formatted correctly
>in the email.
>https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
Hello,
here are 2 variations that should be somewhat faster.
It seems you may have duplicate (site_id,session_id,product_id)
which would false the result. In that case you'll need some more logic in the
query.
select
'82503' as product_id,
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
e1.product_id = '82503' and
e2.product_id != '82503'
group by e2.product_id, e2.site_id;
OR:
WITH SALL as(
select
e2.site_id,
e2.product_id,
count(nullif(e2.type='viewed', false)) view_count,
count(nullif(e2.type='purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
where
e1.product_id = '82503'
group by e2.product_id, e2.site_id
)
SELECT
'82503' as product_id_1,
site_id,
product_id,
view_count,
purchase_count
FROM SALL
WHERE product_id != '82503';
regards,
Marc Mamin
>explain select
> e1.product_id,
> e2.site_id,
> e2.product_id,
> count(nullif(e2.type='viewed', false)) view_count,
> count(nullif(e2.type='purchased', false)) purchase_count
> from events e1
> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
> where
> e1.product_id = '82503' and
> e1.product_id != e2.product_id
> group by e1.product_id, e2.product_id, e2.site_id;
> QUERY PLAN
>----------------------------------------------------------------------------------------------------------------------------
> GroupAggregate (cost=828395.67..945838.90 rows=22110 width=19)
> -> Sort (cost=828395.67..840117.89 rows=4688885 width=19)
> Sort Key: e1.product_id, e2.product_id, e2.site_id
> -> Nested Loop (cost=11.85..20371.14 rows=4688885 width=19)
> -> Bitmap Heap Scan on events e1 (cost=11.29..1404.31
> rows=369 width=49)
> Recheck Cond: (product_id = '82503'::citext)
> -> Bitmap Index Scan on events_product_id_site_id_idx
> (cost=0.00..11.20 rows=369 width=0)
> Index Cond: (product_id = '82503'::citext)
> -> Index Scan using events_session_id_type_product_id_idx on
> events e2 (cost=0.56..51.28 rows=12 width=51)
> Index Cond: ((session_id = e1.session_id) AND (type =
> e1.type))
> Filter: (e1.product_id <> product_id)
>(11 rows)
>
>recommender_production=> \d events
> Table "public.events"
> Column | Type | Modifiers
>-------------+--------------------------+-----------------------------------------------------
> id | bigint | not null default
> nextval('events_id_seq'::regclass)
> user_id | citext |
> session_id | citext | not null
> product_id | citext | not null
> site_id | citext | not null
> type | text | not null
> happened_at | timestamp with time zone | not null
> created_at | timestamp with time zone | not null
>Indexes:
> "events_pkey" PRIMARY KEY, btree (id)
> "events_product_id_site_id_idx" btree (product_id, site_id)
> "events_session_id_type_product_id_idx" btree (session_id, type,
> product_id)
>Check constraints:
> "events_session_id_check" CHECK (length(session_id::text) < 255)
> "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text,
> 'viewed'::text]))
> "events_user_id_check" CHECK (length(user_id::text) < 255)
>
>
>
>