Hello, I think you could try an index on the tweet table columns (user_id, creation_time), in this order [the first column serves the equality predicate and the second the range predicate; the index tweet_user_id_creation_time_index is not suitable because its columns are in the reverse order]. With that index, the Hash Join between relationship and tweet should in theory become a Nested Loop, so the filter relationship.followed_id = t.user_id can be pushed into the index search condition together with creation_time > ... and creation_time < ... . In this manner you would reduce the random I/O: instead of scanning 1759645 rows from tweet and filtering them only later in the hash join, the index would narrow the scan to about 1679 rows up front.
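For example, something like this (the index name below is just a placeholder; adjust it to your naming convention):

    CREATE INDEX tweet_user_id_creation_time_idx
        ON tweet (user_id, creation_time);  -- equality column first, range column second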
I hope it works; if not, could you attach the DDL of the table (with constraints and indexes) so we can understand the problem better? Bye

2013/11/4 Caio Casimiro <casimiro.lis...@gmail.com>

> Hi Elliot, thank you for your answer.
>
> I tried this query but it still suffers from the index scan on
> tweet_creation_time_index:
>
> Sort  (cost=4899904.57..4899913.19 rows=3447 width=20) (actual time=37560.938..37562.503 rows=1640 loops=1)
>   Sort Key: tt.tweet_id
>   Sort Method: quicksort  Memory: 97kB
>   Buffers: shared hit=1849 read=32788
>   ->  Nested Loop  (cost=105592.06..4899702.04 rows=3447 width=20) (actual time=19151.036..37555.227 rows=1640 loops=1)
>         Buffers: shared hit=1849 read=32788
>         ->  Hash Join  (cost=105574.10..116461.68 rows=1679 width=8) (actual time=19099.848..19127.606 rows=597 loops=1)
>               Hash Cond: (relationship.followed_id = t.user_id)
>               Buffers: shared hit=3 read=31870
>               ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=66.102..89.721 rows=106 loops=1)
>                     Index Cond: (follower_id = 335093362)
>                     Heap Fetches: 0
>                     Buffers: shared hit=2 read=3
>               ->  Hash  (cost=83308.25..83308.25 rows=1781234 width=16) (actual time=19031.916..19031.916 rows=1759645 loops=1)
>                     Buckets: 262144  Batches: 1  Memory Usage: 61863kB
>                     Buffers: shared hit=1 read=31867
>                     ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=48.595..13759.768 rows=1759645 loops=1)
>                           Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
>                           Buffers: shared hit=1 read=31867
>         ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=30.774..30.847 rows=3 loops=597)
>               Recheck Cond: (tweet_id = t.id)
>               Buffers: shared hit=1846 read=918
>               ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=23.084..23.084 rows=3 loops=597)
>                     Index Cond: (tweet_id = t.id)
>                     Buffers: shared hit=1763 read=632
>
> You said that I would need B-tree indexes on the fields where I want the
> planner to use an index-only scan, and I think I already have them on the
> tweet table:
>
> "tweet_ios_index" btree (id, user_id, creation_time)
>
> Shouldn't tweet_ios_index be enough to make the scan over
> tweet_creation_time_index an index-only scan? And, more importantly,
> would it really be faster?
>
> Thank you very much,
> Caio
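[A side note on the index-only scan question above: tweet_ios_index leads with id, so a range predicate on creation_time cannot use it as a range scan. For the time-range scan itself to become index-only, the range column would need to come first and the index would have to cover every tweet column the query reads, e.g. something like the following sketch (the index name is hypothetical):]

    CREATE INDEX tweet_creation_time_ios_idx
        ON tweet (creation_time, user_id, id);  -- range column leading; also covers user_id and id, so no heap fetches are needed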
> On Mon, Nov 4, 2013 at 7:22 PM, Elliot <yields.falseh...@gmail.com> wrote:
>
>> On 2013-11-04 16:10, Caio Casimiro wrote:
>>
>> Hi Neyman, thank you for your answer.
>>
>> Unfortunately this query runs in almost the same time:
>>
>> Sort  (cost=4877693.98..4877702.60 rows=3449 width=20) (actual time=25820.291..25821.845 rows=1640 loops=1)
>>   Sort Key: tt.tweet_id
>>   Sort Method: quicksort  Memory: 97kB
>>   Buffers: shared hit=1849 read=32788
>>   ->  Nested Loop  (cost=247.58..4877491.32 rows=3449 width=20) (actual time=486.839..25814.120 rows=1640 loops=1)
>>         Buffers: shared hit=1849 read=32788
>>         ->  Hash Semi Join  (cost=229.62..88553.23 rows=1681 width=8) (actual time=431.654..13209.159 rows=597 loops=1)
>>               Hash Cond: (t.user_id = relationship.followed_id)
>>               Buffers: shared hit=3 read=31870
>>               ->  Index Scan using tweet_creation_time_index on tweet t  (cost=0.57..83308.25 rows=1781234 width=16) (actual time=130.144..10037.764 rows=1759645 loops=1)
>>                     Index Cond: ((creation_time >= '2013-05-05 00:00:00-03'::timestamp with time zone) AND (creation_time <= '2013-05-06 00:00:00-03'::timestamp with time zone))
>>                     Buffers: shared hit=1 read=31867
>>               ->  Hash  (cost=227.12..227.12 rows=154 width=8) (actual time=94.365..94.365 rows=106 loops=1)
>>                     Buckets: 1024  Batches: 1  Memory Usage: 3kB
>>                     Buffers: shared hit=2 read=3
>>                     ->  Index Only Scan using relationship_id on relationship  (cost=0.42..227.12 rows=154 width=8) (actual time=74.540..94.101 rows=106 loops=1)
>>                           Index Cond: (follower_id = 335093362)
>>                           Heap Fetches: 0
>>                           Buffers: shared hit=2 read=3
>>         ->  Bitmap Heap Scan on tweet_topic tt  (cost=17.96..2841.63 rows=723 width=20) (actual time=21.014..21.085 rows=3 loops=597)
>>               Recheck Cond: (tweet_id = t.id)
>>               Buffers: shared hit=1846 read=918
>>               ->  Bitmap Index Scan on tweet_topic_pk  (cost=0.00..17.78 rows=723 width=0) (actual time=15.012..15.012 rows=3 loops=597)
>>                     Index Cond: (tweet_id = t.id)
>>                     Buffers: shared hit=1763 read=632
>> Total runtime: 25823.386 ms
>>
>> I have noticed that in both queries the index scan on
>> tweet_creation_time_index is very expensive. Is there anything I can do
>> to make the planner choose an index-only scan?
>>
>> Yes, because that part of the query is kicking back so many rows, many
>> of which are totally unnecessary anyway - you're first getting all the
>> tweets in a particular time range, then limiting them down to just the
>> users that are followed. Here's clarification on the approach I
>> mentioned earlier. All you should really need are basic (btree) indexes
>> on your different keys (tweet_topic.tweet_id, tweet.id, tweet.user_id,
>> relationship.follower_id, relationship.followed_id). I also changed the
>> left join to an inner join, as somebody pointed out that your logic
>> amounted to reducing the match to an inner join anyway.
>>
>> SELECT tt.tweet_id, tt.topic, tt.topic_value
>> FROM tweet_topic AS tt
>>   JOIN tweet AS t ON tt.tweet_id = t.id
>>   JOIN relationship ON t.user_id = relationship.followed_id
>> WHERE creation_time BETWEEN 'D1' AND 'D2'
>>   AND relationship.follower_id = N
>> ORDER BY tt.tweet_id;
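[To verify whether the suggested (user_id, creation_time) index changes the plan, one could re-run Elliot's query under EXPLAIN (ANALYZE, BUFFERS) -- the form that evidently produced the plans above -- with the 'D1'/'D2'/N placeholders replaced by the literal values visible in those plans:]

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT tt.tweet_id, tt.topic, tt.topic_value
    FROM tweet_topic AS tt
      JOIN tweet AS t ON tt.tweet_id = t.id
      JOIN relationship ON t.user_id = relationship.followed_id
    WHERE t.creation_time BETWEEN '2013-05-05 00:00:00-03' AND '2013-05-06 00:00:00-03'
      AND relationship.follower_id = 335093362
    ORDER BY tt.tweet_id;

If the planner picks the new index up, the hash join on relationship.followed_id = t.user_id should, in theory, give way to a nested loop probing tweet by (user_id, creation_time).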