2015-01-25 8:20 GMT+01:00 Joe Van Dyk <j...@tanga.com>:

> On Sat, Jan 24, 2015 at 11:14 PM, Pavel Stehule <pavel.steh...@gmail.com>
> wrote:
>
>>
>>
>> 2015-01-25 7:38 GMT+01:00 Joe Van Dyk <j...@tanga.com>:
>>
>>>
>>>
>>> On Sat, Jan 24, 2015 at 10:12 PM, Pavel Stehule <pavel.steh...@gmail.com
>>> > wrote:
>>>
>>>> Hi
>>>>
>>>> this plan looks good
>>>>
>>>> Regards
>>>>
>>>> Pavel
>>>>
>>>
>>> Here's one that doesn't look quite as good: http://explain.depesz.com/s/SgT
>>>
>>
>> I see a possible issue:
>>
>> (product_id <> '81716'::citext) .. this operation is CPU expensive, and
>> maybe unnecessary.
>>
>> product_id should be an integer -- and if it isn't, this comparison will
>> not be fast on 4M rows, especially with citext.
>>
>> Try forcing the opposite cast -- you will avoid a case-insensitive text
>> comparison:
>>
>> product_id::int <> 81716
>>
>
> It might not always be an integer, just happens to be so here. Should I
> try text instead? I don't have to have the case-insensitive matching.
>

text should be better

this design is unfortunate, but you probably cannot change it
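For example, a sketch only of the two options (the column and table names are from your schema above, but I have not tested this against your data -- and the ALTER rewrites the whole table and its indexes, so try it on a copy first):

```sql
-- Option 1: cast in the query, so the join filter compares plain text
-- instead of doing a case-insensitive citext comparison per row
-- (assumes you do not need case-insensitive matching on product_id):
select e1.product_id, e2.site_id, e2.product_id,
       count(nullif(e2.type = 'viewed', false)) view_count,
       count(nullif(e2.type = 'purchased', false)) purchase_count
  from events e1
  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
 where e1.product_id = '82503'
   and e1.product_id::text <> e2.product_id::text
 group by e1.product_id, e2.product_id, e2.site_id;

-- Option 2: change the column type itself:
alter table events alter column product_id type text;
```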



>
> Joe
>
>
>>
>> Regards
>>
>> Pavel
>>
>>
>>
>>
>>>
>>> Joe
>>>
>>>
>>>>
>>>> 2015-01-25 6:45 GMT+01:00 Joe Van Dyk <j...@tanga.com>:
>>>>
>>>>> Oops, didn't run vacuum analyze after deleting the events. Here is
>>>>> another 'explain analyze': http://explain.depesz.com/s/AviN
>>>>>
>>>>> On Sat, Jan 24, 2015 at 9:43 PM, Joe Van Dyk <j...@tanga.com> wrote:
>>>>>
>>>>>> On Sat, Jan 24, 2015 at 9:41 PM, Joe Van Dyk <j...@tanga.com> wrote:
>>>>>>
>>>>>>> I have an events table that records page views and purchases (type =
>>>>>>> 'viewed' or type='purchased'). I have a query that figures out "people 
>>>>>>> who
>>>>>>> bought/viewed this also bought/viewed that".
>>>>>>>
>>>>>>> It worked fine, taking about 0.1 seconds to complete, until a few
>>>>>>> hours ago when it started taking hours to complete. Vacuum/analyze 
>>>>>>> didn't
>>>>>>> help.  Turned out there was one session_id that had 400k rows in the
>>>>>>> system. Deleting that made the query performant again.
>>>>>>>
>>>>>>> Is there anything I can do to make the query work better in cases
>>>>>>> like that? Missing index, or better query?
>>>>>>>
>>>>>>> This is on 9.3.5.
>>>>>>>
>>>>>>> The below is reproduced at the following URL if it's not formatted
>>>>>>> correctly in the email.
>>>>>>> https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt
>>>>>>>
>>>>>>> explain  select
>>>>>>>    e1.product_id,
>>>>>>>    e2.site_id,
>>>>>>>    e2.product_id,
>>>>>>>    count(nullif(e2.type='viewed', false)) view_count,
>>>>>>>    count(nullif(e2.type='purchased', false)) purchase_count
>>>>>>>  from events e1
>>>>>>>  join events e2 on e1.session_id = e2.session_id and e1.type = e2.type
>>>>>>>  where
>>>>>>>    e1.product_id = '82503' and
>>>>>>>    e1.product_id != e2.product_id
>>>>>>>  group by e1.product_id, e2.product_id, e2.site_id;
>>>>>>>                                                          QUERY PLAN
>>>>>>> ----------------------------------------------------------------------------------------------------------------------------
>>>>>>>  GroupAggregate  (cost=828395.67..945838.90 rows=22110 width=19)
>>>>>>>    ->  Sort  (cost=828395.67..840117.89 rows=4688885 width=19)
>>>>>>>          Sort Key: e1.product_id, e2.product_id, e2.site_id
>>>>>>>          ->  Nested Loop  (cost=11.85..20371.14 rows=4688885 width=19)
>>>>>>>                ->  Bitmap Heap Scan on events e1  (cost=11.29..1404.31 
>>>>>>> rows=369 width=49)
>>>>>>>                      Recheck Cond: (product_id = '82503'::citext)
>>>>>>>                      ->  Bitmap Index Scan on 
>>>>>>> events_product_id_site_id_idx  (cost=0.00..11.20 rows=369 width=0)
>>>>>>>                            Index Cond: (product_id = '82503'::citext)
>>>>>>>                ->  Index Scan using 
>>>>>>> events_session_id_type_product_id_idx on events e2  (cost=0.56..51.28 
>>>>>>> rows=12 width=51)
>>>>>>>                      Index Cond: ((session_id = e1.session_id) AND 
>>>>>>> (type = e1.type))
>>>>>>>                      Filter: (e1.product_id <> product_id)
>>>>>>> (11 rows)
>>>>>>>
>>>>>>> recommender_production=> \d events
>>>>>>>                                     Table "public.events"
>>>>>>>    Column    |           Type           |                      Modifiers
>>>>>>> -------------+--------------------------+-----------------------------------------------------
>>>>>>>  id          | bigint                   | not null default 
>>>>>>> nextval('events_id_seq'::regclass)
>>>>>>>  user_id     | citext                   |
>>>>>>>  session_id  | citext                   | not null
>>>>>>>  product_id  | citext                   | not null
>>>>>>>  site_id     | citext                   | not null
>>>>>>>  type        | text                     | not null
>>>>>>>  happened_at | timestamp with time zone | not null
>>>>>>>  created_at  | timestamp with time zone | not null
>>>>>>> Indexes:
>>>>>>>     "events_pkey" PRIMARY KEY, btree (id)
>>>>>>>     "events_product_id_site_id_idx" btree (product_id, site_id)
>>>>>>>     "events_session_id_type_product_id_idx" btree (session_id, type, 
>>>>>>> product_id)
>>>>>>> Check constraints:
>>>>>>>     "events_session_id_check" CHECK (length(session_id::text) < 255)
>>>>>>>     "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 
>>>>>>> 'viewed'::text]))
>>>>>>>     "events_user_id_check" CHECK (length(user_id::text) < 255)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> After removing the session with 400k events, I was able to do an
>>>>>> explain analyze, here is one of them:
>>>>>> http://explain.depesz.com/s/PFNk
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
