Hello, thank you for your answer. I will give it a try and then post the results here. In the original email I posted the output of \d+ tweet, which contains the indexes and constraints.
Best regards,
Caio

On Mon, Nov 4, 2013 at 8:59 PM, desmodemone <desmodem...@gmail.com> wrote:
> Hello,
> I think you could try an index on the tweet table columns
> (user_id, creation_time) [in this order, because the first column serves
> the equality predicate and the second the range predicate; the index
> tweet_user_id_creation_time_index is not OK because it has the reverse
> order]. The Hash Join between relationship and tweet should then, in
> theory, become a nested loop, and the filter relationship.followed_id =
> t.user_id will be pushed into the new index's search condition together
> with creation_time > ... and creation_time < ... . In this manner you
> will reduce the random I/O of scanning 1759645 rows from tweet, which
> are currently filtered down to 1679 only later, in the hash join.
>
> I hope it will work; if not, please attach the DDL of the table
> (with constraints and indexes) so we can better understand the problem.
>
> Bye
>
>
> 2013/11/4 Caio Casimiro <casimiro.lis...@gmail.com>
>
>> Hi Elliot, thank you for your answer.
>>
>> I tried this query but it still suffers from the index scan on
>> tweet_creation_time_index:
>>
>> "Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual time=37560.938..37562.503 rows=1640 loops=1)"
>> "  Sort Key: tt.tweet_id"
>> "  Sort Method: quicksort  Memory: 97kB"
>> "  Buffers: shared hit=1849 read=32788"
>> "  ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20) (actual time=19151.036..37555.227 rows=1640 loops=1)"
>> "        Buffers: shared hit=1849 read=32788"
>> "        ->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8) (actual time=19099.848..19127.606 rows=597 loops=1)"
>> "              Hash Cond: (relationship.followed_id = t.user_id)"
>> "              Buffers: shared hit=3 read=31870"
>> "              ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721 rows=106 loops=1)"
>> "                    Index Cond: (follower_id = 335093362)"
>> "                    Heap Fetches: 0"
>> "                    Buffers: shared hit=2 read=3"
>> "              ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16) (actual time=19031.916..19031.916 rows=1759645 loops=1)"
>> "                    Buckets: 262144  Batches: 1  Memory Usage: 61863kB"
>> "                    Buffers: shared hit=1 read=31867"
>> "                    ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=48.595..13759.768 rows=1759645 loops=1)"
>> "                          Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))"
>> "                          Buffers: shared hit=1 read=31867"
>> "        ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"
>> "              Recheck Cond: (tweet_id = t.id)"
>> "              Buffers: shared hit=1846 read=918"
>> "              ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"
>> "                    Index Cond: (tweet_id = t.id)"
>> "                    Buffers: shared hit=1763 read=632"
>>
>> You said that I would need B-tree indexes on the fields on which I want
>> the planner to use an index-only scan, and I think I already have them
>> on the tweet table:
>>
>> "tweet_ios_index" btree (id, user_id, creation_time)
>>
>> Shouldn't tweet_ios_index be enough to make the scan over
>> tweet_creation_time_index an index-only scan? And, more importantly,
>> would it really be faster?
>>
>> Thank you very much,
>> Caio
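A minimal sketch of the index desmodemone suggests above, using the table and column names that appear in the plans; the index name itself is hypothetical:

  -- Equality column (user_id) first, range column (creation_time) second,
  -- so a nested-loop lookup can descend the B-tree on user_id and then
  -- range-scan creation_time within that user. Name made up for illustration.
  CREATE INDEX tweet_user_creation_time_idx
      ON tweet (user_id, creation_time);

With this column order the planner should no longer need to read and hash all 1759645 rows in the date range; it can probe the index once per followed user instead.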
>> On Mon, Nov 4, 2013 at 7:22 PM, Elliot <yields.falseh...@gmail.com> wrote:
>>
>>> On 2013-11-04 16:10, Caio Casimiro wrote:
>>>
>>> Hi Neyman, thank you for your answer.
>>>
>>> Unfortunately this query runs in almost the same time:
>>>
>>> Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual time=25820.291..25821.845 rows=1640 loops=1)
>>>   Sort Key: tt.tweet_id
>>>   Sort Method: quicksort  Memory: 97kB
>>>   Buffers: shared hit=1849 read=32788
>>>   ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual time=486.839..25814.120 rows=1640 loops=1)
>>>         Buffers: shared hit=1849 read=32788
>>>         ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8) (actual time=431.654..13209.159 rows=597 loops=1)
>>>               Hash Cond: (t.user_id = relationship.followed_id)
>>>               Buffers: shared hit=3 read=31870
>>>               ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=130.144..10037.764 rows=1759645 loops=1)
>>>                     Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
>>>                     Buffers: shared hit=1 read=31867
>>>               ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual time=94.365..94.365 rows=106 loops=1)
>>>                     Buckets: 1024  Batches: 1  Memory Usage: 3kB
>>>                     Buffers: shared hit=2 read=3
>>>                     ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=74.540..94.101 rows=106 loops=1)
>>>                           Index Cond: (follower_id = 335093362)
>>>                           Heap Fetches: 0
>>>                           Buffers: shared hit=2 read=3
>>>         ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
>>>               Recheck Cond: (tweet_id = t.id)
>>>               Buffers: shared hit=1846 read=918
>>>               ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
>>>                     Index Cond: (tweet_id = t.id)
>>>                     Buffers: shared hit=1763 read=632
>>> Total runtime: 25823.386 ms
>>>
>>> I have noticed that in both queries the index scan on
>>> tweet_creation_time_index is very expensive. Is there anything I can do
>>> to make the planner choose an index-only scan?
>>>
>>>
>>> Yes, because that part of the query is kicking back so many rows, many
>>> of which are totally unnecessary anyway - you're first getting all the
>>> tweets in a particular time range, then limiting them down to just users
>>> that are followed. Here's clarification on the approach I mentioned
>>> earlier. All you should really need are basic (btree) indexes on your
>>> different keys (tweet_topic.tweet_id, tweet.id, tweet.user_id,
>>> relationship.follower_id, relationship.followed_id). I also changed the
>>> left join to an inner join, as somebody pointed out that your logic
>>> amounted to reducing the match to an inner join anyway.
>>>
>>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>>>   FROM tweet_topic AS tt
>>>   JOIN tweet AS t
>>>     ON tt.tweet_id = t.id
>>>   JOIN relationship
>>>     ON t.user_id = relationship.followed_id
>>>  WHERE creation_time BETWEEN 'D1' AND 'D2'
>>>    AND relationship.follower_id = N
>>>  ORDER BY tt.tweet_id;
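On Caio's index-only-scan question: tweet_ios_index btree (id, user_id, creation_time) leads with id, so it cannot be searched by user_id equality plus a creation_time range; column order, not mere column presence, determines whether a B-tree index fits a predicate. Below is a sketch, under the schema shown in the thread, of a covering variant (hypothetical name) that contains every tweet column the query touches, together with one way to re-check the plan; note that index-only scans also require the visibility map to be reasonably current:

  -- Hypothetical covering index: equality key, then range key, then the
  -- join key, so the tweet heap need not be visited at all.
  CREATE INDEX tweet_user_time_id_idx
      ON tweet (user_id, creation_time, id);

  -- Index-only scans fall back to heap fetches for pages not marked
  -- all-visible, so vacuum first.
  VACUUM ANALYZE tweet;

  -- Re-run the query with the values from the plans above and look for
  -- "Index Only Scan ... Heap Fetches: 0" on tweet.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT tt.tweet_id, tt.topic, tt.topic_value
    FROM tweet_topic AS tt
    JOIN tweet AS t ON tt.tweet_id = t.id
    JOIN relationship ON t.user_id = relationship.followed_id
   WHERE t.creation_time BETWEEN '2013-05-05 00:00:00-03'
                             AND '2013-05-06 00:00:00-03'
     AND relationship.follower_id = 335093362
   ORDER BY tt.tweet_id;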