Hello, thank you for your answer. I will give it a try and then post the results here. In the original email I posted the output of \d+ tweet, which contains the indexes and constraints.
Best regards,
Caio

On Mon, Nov 4, 2013 at 8:59 PM, desmodemone <desmodem...@gmail.com> wrote:
> Hello,
> I think you could try an index on the tweet table columns
> (user_id, creation_time) [in this order, because the first column serves
> the equality predicate and the second the range predicate; the index
> tweet_user_id_creation_time_index is not OK because it has the reverse
> order]. The Hash Join between relationship and tweet should then, in
> theory, become a nested loop, and the filter relationship.followed_id =
> t.user_id will be pushed into the new index's search condition together
> with creation_time > ... and creation_time < ... . In this manner you
> will reduce the random I/O of scanning 1759645 rows from tweet, which
> are currently filtered down to 1679 only later, in the hash join.
>
> I hope it will work; if not, please attach the DDL of the table
> (with constraints and indexes) so we can better understand the problem.
>
> Bye
>
>
> 2013/11/4 Caio Casimiro <casimiro.lis...@gmail.com>
>
>> Hi Elliot, thank you for your answer.
>>
>> I tried this query but it still suffers from the index scan on
>> tweet_creation_time_index:
>>
>> "Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual time=37560.938..37562.503 rows=1640 loops=1)"
>> "  Sort Key: tt.tweet_id"
>> "  Sort Method: quicksort  Memory: 97kB"
>> "  Buffers: shared hit=1849 read=32788"
>> "  ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20) (actual time=19151.036..37555.227 rows=1640 loops=1)"
>> "        Buffers: shared hit=1849 read=32788"
>> "        ->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8) (actual time=19099.848..19127.606 rows=597 loops=1)"
>> "              Hash Cond: (relationship.followed_id = t.user_id)"
>> "              Buffers: shared hit=3 read=31870"
>> "              ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721 rows=106 loops=1)"
>> "                    Index Cond: (follower_id = 335093362)"
>> "                    Heap Fetches: 0"
>> "                    Buffers: shared hit=2 read=3"
>> "              ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16) (actual time=19031.916..19031.916 rows=1759645 loops=1)"
>> "                    Buckets: 262144  Batches: 1  Memory Usage: 61863kB"
>> "                    Buffers: shared hit=1 read=31867"
>> "                    ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=48.595..13759.768 rows=1759645 loops=1)"
>> "                          Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))"
>> "                          Buffers: shared hit=1 read=31867"
>> "        ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)"
>> "              Recheck Cond: (tweet_id = t.id)"
>> "              Buffers: shared hit=1846 read=918"
>> "              ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)"
>> "                    Index Cond: (tweet_id = t.id)"
>> "                    Buffers: shared hit=1763 read=632"
>>
>> You said that I would need B-tree indexes on the fields on which I want
>> the planner to use an index-only scan, and I think I already have them
>> on the tweet table:
>>
>> "tweet_ios_index" btree (id, user_id, creation_time)
>>
>> Shouldn't tweet_ios_index be enough to make the scan over
>> tweet_creation_time_index an index-only scan? And, more importantly,
>> would it really be faster?
>>
>> Thank you very much,
>> Caio
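A minimal sketch of the index desmodemone suggests above, using the table and column names that appear in the plans; the index name itself is hypothetical:

  -- Equality column (user_id) first, range column (creation_time) second,
  -- so a nested-loop lookup can descend the B-tree on user_id and then
  -- range-scan creation_time within that user. Name made up for illustration.
  CREATE INDEX tweet_user_creation_time_idx
      ON tweet (user_id, creation_time);

With this column order the planner should no longer need to read and hash all 1759645 rows in the date range; it can probe the index once per followed user instead.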
>> On Mon, Nov 4, 2013 at 7:22 PM, Elliot <yields.falseh...@gmail.com> wrote:
>>
>>> On 2013-11-04 16:10, Caio Casimiro wrote:
>>>
>>> Hi Neyman, thank you for your answer.
>>>
>>> Unfortunately this query runs in almost the same time:
>>>
>>> Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual time=25820.291..25821.845 rows=1640 loops=1)
>>>   Sort Key: tt.tweet_id
>>>   Sort Method: quicksort  Memory: 97kB
>>>   Buffers: shared hit=1849 read=32788
>>>   ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual time=486.839..25814.120 rows=1640 loops=1)
>>>         Buffers: shared hit=1849 read=32788
>>>         ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8) (actual time=431.654..13209.159 rows=597 loops=1)
>>>               Hash Cond: (t.user_id = relationship.followed_id)
>>>               Buffers: shared hit=3 read=31870
>>>               ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=130.144..10037.764 rows=1759645 loops=1)
>>>                     Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
>>>                     Buffers: shared hit=1 read=31867
>>>               ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual time=94.365..94.365 rows=106 loops=1)
>>>                     Buckets: 1024  Batches: 1  Memory Usage: 3kB
>>>                     Buffers: shared hit=2 read=3
>>>                     ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=74.540..94.101 rows=106 loops=1)
>>>                           Index Cond: (follower_id = 335093362)
>>>                           Heap Fetches: 0
>>>                           Buffers: shared hit=2 read=3
>>>         ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
>>>               Recheck Cond: (tweet_id = t.id)
>>>               Buffers: shared hit=1846 read=918
>>>               ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
>>>                     Index Cond: (tweet_id = t.id)
>>>                     Buffers: shared hit=1763 read=632
>>> Total runtime: 25823.386 ms
>>>
>>> I have noticed that in both queries the index scan on
>>> tweet_creation_time_index is very expensive. Is there anything I can do
>>> to make the planner choose an index-only scan?
>>>
>>>
>>> Yes, because that part of the query is kicking back so many rows, many
>>> of which are totally unnecessary anyway - you're first getting all the
>>> tweets in a particular time range, then limiting them down to just users
>>> that are followed. Here's clarification on the approach I mentioned
>>> earlier. All you should really need are basic (btree) indexes on your
>>> different keys (tweet_topic.tweet_id, tweet.id, tweet.user_id,
>>> relationship.follower_id, relationship.followed_id). I also changed the
>>> left join to an inner join, as somebody pointed out that your logic
>>> amounted to reducing the match to an inner join anyway.
>>>
>>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>>>   FROM tweet_topic AS tt
>>>   JOIN tweet AS t
>>>     ON tt.tweet_id = t.id
>>>   JOIN relationship
>>>     ON t.user_id = relationship.followed_id
>>>  WHERE creation_time BETWEEN 'D1' AND 'D2'
>>>    AND relationship.follower_id = N
>>>  ORDER BY tt.tweet_id;
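On Caio's index-only-scan question: tweet_ios_index btree (id, user_id, creation_time) leads with id, so it cannot be searched by user_id equality plus a creation_time range; column order, not mere column presence, determines whether a B-tree index fits a predicate. Below is a sketch, under the schema shown in the thread, of a covering variant (hypothetical name) that contains every tweet column the query touches, together with one way to re-check the plan; note that index-only scans also require the visibility map to be reasonably current:

  -- Hypothetical covering index: equality key, then range key, then the
  -- join key, so the tweet heap need not be visited at all.
  CREATE INDEX tweet_user_time_id_idx
      ON tweet (user_id, creation_time, id);

  -- Index-only scans fall back to heap fetches for pages not marked
  -- all-visible, so vacuum first.
  VACUUM ANALYZE tweet;

  -- Re-run the query with the values from the plans above and look for
  -- "Index Only Scan ... Heap Fetches: 0" on tweet.
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT tt.tweet_id, tt.topic, tt.topic_value
    FROM tweet_topic AS tt
    JOIN tweet AS t ON tt.tweet_id = t.id
    JOIN relationship ON t.user_id = relationship.followed_id
   WHERE t.creation_time BETWEEN '2013-05-05 00:00:00-03'
                             AND '2013-05-06 00:00:00-03'
     AND relationship.follower_id = 335093362
   ORDER BY tt.tweet_id;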