On Tue, Sep 27, 2016 at 6:03 PM, Jake Nielsen <jake.k.niel...@gmail.com> wrote:
> On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen <mso...@runbox.com> wrote:
>
>> *From:* Jake Nielsen *Sent:* Tuesday, September 27, 2016 5:22 PM
>>
>> the query
>>
>> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
>> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;
>>
>> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen <jake.k.niel...@gmail.com> wrote:
>>
>> I've got a query that takes a surprisingly long time to run, and I'm
>> having a really rough time trying to figure it out.
>>
>> Before I get started, here are the specifics of the situation.
>>
>> Here is the table that I'm working with (apologies for the spammy
>> indices, I've been throwing shit at the wall):
>>
>>                        Table "public.syncerevent"
>>     Column    |  Type   |                        Modifiers
>> --------------+---------+-----------------------------------------------------------
>>  id           | bigint  | not null default nextval('syncerevent_id_seq'::regclass)
>>  userid       | text    |
>>  event        | text    |
>>  eventid      | text    |
>>  originatorid | text    |
>>  propogatorid | text    |
>>  kwargs       | text    |
>>  conflicted   | integer |
>> Indexes:
>>     "syncerevent_pkey" PRIMARY KEY, btree (id)
>>     "syncereventidindex" UNIQUE, btree (eventid)
>>     "anothersyncereventidindex" btree (userid)
>>     "anothersyncereventidindexwithascending" btree (userid, id)
>>     "asdfasdgasdf" btree (userid, id DESC)
>>     "syncereventuseridhashindex" hash (userid)
>>
>> To provide some context, as per the wiki:
>> There are 3,290,600 rows in this table.
>> It gets added to frequently, but never deleted from.
>> The "kwargs" column often contains mid-size JSON strings (roughly 30K
>> characters on average).
>> As of right now, the table has 53 users in it. About 20% of those have a
>> negligible number of events, but the rest of the users have a fairly
>> even smattering.
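[Editor's note: a quick way to verify the per-user distribution described above, using only the table and columns shown in the `\d` output; this query is a sketch, not from the thread.]

```sql
-- Per-user event counts, to confirm the "fairly even smattering"
-- across the ~53 users:
SELECT userid, count(*) AS events
FROM syncerevent
GROUP BY userid
ORDER BY events DESC;
```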
>> EXPLAIN (ANALYZE, BUFFERS) says:
>>
>>                                                                                                QUERY PLAN
>> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual time=3352.390..3403.572 rows=4000 loops=1)
>>    Buffers: shared hit=120244 read=160198
>>    ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..388147.29 rows=1274560 width=615) (actual time=3352.386..3383.100 rows=4000 loops=1)
>>          Index Cond: (id > 12468)
>>          Filter: ((propogatorid <> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>>          Rows Removed by Filter: 1685801
>>          Buffers: shared hit=120244 read=160198
>>  Planning time: 0.833 ms
>>  Execution time: 3407.633 ms
>> (9 rows)
>>
>> If it matters/interests you, here is my underlying confusion:
>> From some internet sleuthing, I've decided that having a table per user
>> (which would totally make this problem a non-issue) isn't a great idea.
>> Because there is a file per table, having a table per user would not
>> scale. My next thought was partial indexes (which would also totally
>> help), but since there is also a file per index, this really doesn't
>> side-step the problem. My rough mental model says: if there exists a way
>> that a table-per-user scheme would make this more efficient, then there
>> should also exist an index that could achieve the same effect (or close
>> enough to not matter). I would think that "userId =
>> '57dc984f1c87461c0967e228'" could utilize at least one of the two
>> indexes on the userId column, but clearly I'm not understanding
>> something.
>>
>> Any help in making this query more efficient would be greatly
>> appreciated, and any conceptual insights would be extra awesome.
>>
>> Thanks for reading.
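[Editor's note: for reference, the access path the query is shaped for is exactly what a composite btree provides, and the `anothersyncereventidindexwithascending` index above already has that shape. A sketch of the principle; the index name below is illustrative, not from the thread.]

```sql
-- A btree on (userid, id) keeps each user's rows together, already
-- sorted by id, so a query of the form
--   WHERE userid = $1 AND id > $2 ORDER BY id LIMIT 4000
-- can in principle be answered by reading ~4000 index entries:
CREATE INDEX syncerevent_userid_id_idx ON syncerevent (userid, id);

-- The hoped-for plan shape (not what the planner chose here):
--   Limit
--     ->  Index Scan using syncerevent_userid_id_idx on syncerevent
--           Index Cond: ((userid = '57dc984f1c87461c0967e228'::text) AND (id > 12468))
```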
>> -Jake
>>
>> ----------------------
>>
>> This stands out:
>>
>>     WHERE ID > 12468 AND propogatorId NOT IN
>>     ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>>
>> As does this from the analyze:
>>
>>     Rows Removed by Filter: 1685801
>>
>> The propogatorid is practically the only column NOT indexed, and it's
>> used in a "not in". It looks like it's having to do a table scan for all
>> the rows above the id cutoff to see if any meet the filter requirement.
>> "not in" can be very expensive. An index might help on this column. Have
>> you tried that?
>>
>> Your rowcounts aren't high enough to require partitioning or any other
>> changes to your table that I can see right now.
>>
>> Mike Sofen (Synthetic Genomics)
>
> Thanks Mike, that's true, I hadn't thought of non-indexed columns forcing
> a scan. Unfortunately, just to test this out, I tried pulling the more
> suspect parts out of the query, and it still seems to want to do an index
> scan:
>
> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERE userId =
> '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;
>
>                                                                       QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.43..1140.62 rows=4000 width=615) (actual time=2706.365..2732.308 rows=4000 loops=1)
>    Buffers: shared hit=120239 read=161924
>    ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..364982.77 rows=1280431 width=615) (actual time=2706.360..2715.514 rows=4000 loops=1)
>          Filter: (userid = '57dc984f1c87461c0967e228'::text)
>          Rows Removed by Filter: 1698269
>          Buffers: shared hit=120239 read=161924
>  Planning time: 0.131 ms
>  Execution time: 2748.526 ms
> (8 rows)
>
> It definitely looks to me like it's starting at the ID = 12468 row and
> just grinding up the rows.
The filter is (unsurprisingly) false for most of
> the rows, so it ends up having to chew through half the table before it
> actually finds 4000 rows that match.
>
> After creating a partial index using that userId, things go way faster.
> This is more-or-less what I assumed I'd get by having that multi-column
> index of (userId, Id), but alas:
>
> remoteSyncerLogistics=> CREATE INDEX sillyIndex ON syncerevent (ID) WHERE
> userId = '57dc984f1c87461c0967e228';
> CREATE INDEX
>
> remoteSyncerLogistics=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM
> SyncerEvent WHERE userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT
> 4000;
>
>                                                               QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.43..443.21 rows=4000 width=615) (actual time=0.074..13.349 rows=4000 loops=1)
>    Buffers: shared hit=842 read=13
>    ->  Index Scan using sillyindex on syncerevent  (cost=0.43..141748.41 rows=1280506 width=615) (actual time=0.071..5.372 rows=4000 loops=1)
>          Buffers: shared hit=842 read=13
>  Planning time: 0.245 ms
>  Execution time: 25.404 ms
> (6 rows)
>
> remoteSyncerLogistics=> EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM
> SyncerEvent WHERE userId = '57dc984f1c87461c0967e228' AND ID > 12468
> ORDER BY ID LIMIT 4000;
>
>                                                               QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.43..453.34 rows=4000 width=615) (actual time=0.023..13.244 rows=4000 loops=1)
>    Buffers: shared hit=855
>    ->  Index Scan using sillyindex on syncerevent  (cost=0.43..144420.43 rows=1275492 width=615) (actual time=0.020..5.392 rows=4000 loops=1)
>          Index Cond: (id > 12468)
>          Buffers: shared hit=855
>  Planning time: 0.253 ms
>  Execution time: 29.371 ms
> (7 rows)
>
> Any thoughts?
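[Editor's note: a hedged summary of the experiment above. The partial index covers only this one user's rows, with entries already in id order, which is why the LIMIT 4000 scan touches so few buffers. The composite (userid, id) index already on the table should, in principle, deliver the same ordered scan for any user.]

```sql
-- Jake's partial index: roughly 1/53rd of the table, entries already
-- sorted by id for this single user, so LIMIT 4000 reads few pages:
CREATE INDEX sillyindex ON syncerevent (id)
    WHERE userid = '57dc984f1c87461c0967e228';

-- The existing composite index should support the same scan for *any*
-- user, turning the userid predicate into an Index Cond instead of a
-- Filter:
--   Index Cond: ((userid = '57dc984f1c87461c0967e228'::text) AND (id > 12468))
-- If the planner still prefers the pkey scan, the question is one of
-- cost estimation, not of index shape.
```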
> -Jake

Hmmm, here's another unexpected piece of information:

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERE userid = '57dc984f1c87461c0967e228';

                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on syncerevent  (cost=0.00..311251.51 rows=1248302 width=618) (actual time=0.008..4970.507 rows=1259137 loops=1)
   Filter: (userid = '57dc984f1c87461c0967e228'::text)
   Rows Removed by Filter: 2032685
   Buffers: shared hit=108601 read=161523
 Planning time: 0.092 ms
 Execution time: 7662.845 ms
(6 rows)

Looks like even just using userid = 'blah' doesn't actually result in an index being used, despite the fact that there are indexes on the userid column:

    "syncerevent_pkey" PRIMARY KEY, btree (id)
    "syncereventidindex" UNIQUE, btree (eventid)
    "anothersyncereventidindex" btree (userid)
    "anothersyncereventidindexwithascending" btree (userid, id)
    "asdfasdgasdf" btree (userid, id DESC)
    "syncereventuseridhashindex" hash (userid)

-Jake
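[Editor's note: one caveat on this last plan. The estimate (rows=1248302) is very close to the actual (rows=1259137), and for a query returning over a third of a wide table with no LIMIT, a sequential scan may genuinely be the cheapest plan. A hedged way to check whether this is an estimation problem or just costing; `ANALYZE` and `enable_seqscan` are standard PostgreSQL facilities, and the setting change is session-local.]

```sql
-- Rule out stale statistics first:
ANALYZE syncerevent;

-- Then compare the two plans directly by discouraging the seq scan
-- for this session only:
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM syncerevent
WHERE userid = '57dc984f1c87461c0967e228';
RESET enable_seqscan;
```

If the index plan's actual time is not meaningfully better here, the planner's choice of seq scan was correct for the un-LIMITed query, and the earlier LIMIT queries are the ones worth tuning.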