2015-01-25 8:20 GMT+01:00 Joe Van Dyk <j...@tanga.com>:

> On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>
>> 2015-01-25 7:38 GMT+01:00 Joe Van Dyk <j...@tanga.com>:
>>
>>> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule <pavel.steh...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> this plan looks well
>>>>
>>>> Regards
>>>>
>>>> Pavel
>>>
>>> Here's one that's not quite as good: http://explain.depesz.com/s/SgT
>>
>> I see a possible issue:
>>
>> (product_id <> '81716'::citext) .. this operation is CPU expensive and maybe unnecessary.
>>
>> product_id should be an integer -- and if it isn't, this will not be extremely fast on 4M rows, especially as citext.
>>
>> Try to force the opposite cast -- you will save a case-insensitive text comparison:
>>
>> product_id::int <> 81716
>
> It might not always be an integer; it just happens to be so here. Should I
> try text instead? I don't have to have the case-insensitive matching.
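[A sketch of Pavel's suggested rewrite, against the events schema quoted below. Assumption: every product_id value is numeric -- the cast raises an error on any non-numeric value, which is why it only works for data like this. The literal 81716 is the product id from the earlier plan.]

```sql
-- Instead of casting the integer literal to citext (which forces a
-- case-insensitive text comparison on every row), cast the citext
-- column to int so the filter becomes a cheap integer comparison.
explain
select *
from events e2
where e2.product_id::int <> 81716;
```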
text can be better

This design is unfortunate, but you probably cannot change it.

> Joe
>
>> Regards
>>
>> Pavel
>>
>>> Joe
>>>
>>>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk <j...@tanga.com>:
>>>>
>>>>> Oops, didn't run vacuum analyze after deleting the events. Here is
>>>>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>>>>
>>>>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk <j...@tanga.com> wrote:
>>>>>
>>>>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <j...@tanga.com> wrote:
>>>>>>
>>>>>>> I have an events table that records page views and purchases (type =
>>>>>>> 'viewed' or type = 'purchased'). I have a query that figures out
>>>>>>> "people who bought/viewed this also bought/viewed that".
>>>>>>>
>>>>>>> It worked fine, taking about 0.1 seconds to complete, until a few
>>>>>>> hours ago when it started taking hours to complete. Vacuum/analyze
>>>>>>> didn't help. It turned out there was one session_id that had 400k
>>>>>>> rows in the system. Deleting that made the query performant again.
>>>>>>>
>>>>>>> Is there anything I can do to make the query work better in cases
>>>>>>> like that? A missing index, or a better query?
>>>>>>>
>>>>>>> This is on 9.3.5.
>>>>>>>
>>>>>>> The below is reproduced at the following URL if it's not formatted
>>>>>>> correctly in the email:
>>>>>>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>>>>>>
>>>>>>> explain select
>>>>>>>   e1.product_id,
>>>>>>>   e2.site_id,
>>>>>>>   e2.product_id,
>>>>>>>   count(nullif(e2.type = 'viewed', false)) view_count,
>>>>>>>   count(nullif(e2.type = 'purchased', false)) purchase_count
>>>>>>> from events e1
>>>>>>> join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>>>>>> where
>>>>>>>   e1.product_id = '82503' and
>>>>>>>   e1.product_id != e2.product_id
>>>>>>> group by e1.product_id, e2.product_id, e2.site_id;
>>>>>>>
>>>>>>>                                                          QUERY PLAN
>>>>>>> ----------------------------------------------------------------------------------------------------------------------------
>>>>>>>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>>>>>>>    ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
>>>>>>>          Sort Key: e1.product_id, e2.product_id, e2.site_id
>>>>>>>          ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
>>>>>>>                ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 rows=369 width=49)
>>>>>>>                      Recheck Cond: (product_id = '82503'::citext)
>>>>>>>                      ->  Bitmap Index Scan on events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>>>>>>>                            Index Cond: (product_id = '82503'::citext)
>>>>>>>                ->  Index Scan using events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 rows=12 width=51)
>>>>>>>                      Index Cond: ((session_id = e1.session_id) AND (type = e1.type))
>>>>>>>                      Filter: (e1.product_id <> product_id)
>>>>>>> (11 rows)
>>>>>>>
>>>>>>> recommender_production=> \d events
>>>>>>>                                      Table "public.events"
>>>>>>>    Column    |           Type           |                      Modifiers
>>>>>>> -------------+--------------------------+-----------------------------------------------------
>>>>>>>  id          | bigint                   | not null default nextval('events_id_seq'::regclass)
>>>>>>>  user_id     | citext                   |
>>>>>>>  session_id  | citext                   | not null
>>>>>>>  product_id  | citext                   | not null
>>>>>>>  site_id     | citext                   | not null
>>>>>>>  type        | text                     | not null
>>>>>>>  happened_at | timestamp with time zone | not null
>>>>>>>  created_at  | timestamp with time zone | not null
>>>>>>> Indexes:
>>>>>>>     "events_pkey" PRIMARY KEY, btree (id)
>>>>>>>     "events_product_id_site_id_idx" btree (product_id, site_id)
>>>>>>>     "events_session_id_type_product_id_idx" btree (session_id, type, product_id)
>>>>>>> Check constraints:
>>>>>>>     "events_session_id_check" CHECK (length(session_id::text) < 255)
>>>>>>>     "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text]))
>>>>>>>     "events_user_id_check" CHECK (length(user_id::text) < 255)
>>>>>>
>>>>>> After removing the session with 400k events, I was able to do an
>>>>>> explain analyze, here is one of them:
>>>>>> http://explain.depesz.com/s/PFNk
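[A sketch of one possible mitigation for the skewed-session case Joe asks about -- not proposed in this thread, just an illustration against the schema above. Assumption: a "legitimate" session has far fewer events than the 400k-row outlier; the threshold 1000 is arbitrary.]

```sql
-- Pre-filter sessions with an implausible number of events before the
-- self-join, so a single runaway session_id cannot explode the nested
-- loop. The HAVING threshold (1000) is an arbitrary illustration.
with sane_sessions as (
  select session_id
  from events
  group by session_id
  having count(*) < 1000
)
select
  e1.product_id,
  e2.site_id,
  e2.product_id,
  count(nullif(e2.type = 'viewed', false)) view_count,
  count(nullif(e2.type = 'purchased', false)) purchase_count
from events e1
join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
join sane_sessions s on s.session_id = e1.session_id
where
  e1.product_id = '82503' and
  e1.product_id != e2.product_id
group by e1.product_id, e2.product_id, e2.site_id;
```

The extra aggregation scan costs something up front, but it bounds the per-session fan-out of the join regardless of how skewed the data becomes.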