> On 19 Apr 2017, at 12:58, Samuel Williams <space.ship.travel...@gmail.com> 
> wrote:
> 
> It's interesting you talk about using multiple indexes. In
> MySQL/MariaDB and derivatives, I've never seen the query planner
> consider using multiple indexes. So is it possible that Postgres may
> use multiple indexes if it saves time? Or do you mean doing something
> like manually joining the data and leveraging the different indexes
> explicitly?

PG is capable of combining results from multiple indices by bitmap-ANDing (or 
-ORing) several bitmap index scans and feeding them into a single bitmap heap 
scan, among other things.
Whether that will actually improve performance in this case I don't know, but 
it's worth a try, I think.
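
As a rough illustration (the single-column indexes below are made up for the 
example; whether the planner actually picks a BitmapAnd depends on selectivity 
and costs):

    -- Hypothetical single-column indexes:
    CREATE INDEX user_event_created_at_idx ON user_event (created_at);
    CREATE INDEX user_event_latitude_idx   ON user_event (latitude);

    -- The planner can combine them in a plan shaped roughly like:
    --   Bitmap Heap Scan on user_event
    --     ->  BitmapAnd
    --           ->  Bitmap Index Scan on user_event_created_at_idx
    --           ->  Bitmap Index Scan on user_event_latitude_idx
    EXPLAIN ANALYZE
    SELECT user_id
      FROM user_event
     WHERE created_at >= '2016-04-19 01:23:55'
       AND latitude > -37.03079375089291 AND latitude < -36.67086424910709;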

> The correlation between user_id and location... well, it's somewhat
> temporally related.

So users are constantly moving around but happen to be at the same locations at 
regular intervals?

In my experience, people don't usually move around much, so you should 
certainly be able to pinpoint them mostly to a specific area, right? (Hence my 
suggestions for a country column or for partitioning into squares.)


> On 19 April 2017 at 22:50, Alban Hertroys <haram...@gmail.com> wrote:
>> 
>>> On 19 Apr 2017, at 6:01, Samuel Williams <space.ship.travel...@gmail.com> 
>>> wrote:
>>> 
>>> Hi.
>>> 
>>> We have 400,000,000 records in a table (soon to be 800,000,000), here
>>> is the schema (\d+)
>>> 
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121
>>> 
>>> We want the following kinds of query to be fast:
>>> 
>>> SELECT DISTINCT "user_event"."user_id" FROM "user_event" WHERE
>>> "user_event"."what" IN ('poll', 'location_change',
>>> 'midnight_location') AND ("user_event"."created_at" >= '2016-04-19
>>> 01:23:55') AND (latitude > -37.03079375089291 AND latitude <
>>> -36.67086424910709 AND longitude > 174.6307139779924 AND longitude <
>>> 175.0805140220076);
>>> 
>>> We have a btree index and it appears to be working. However, it's
>>> still pretty slow.
>>> 
>>> EXPLAIN ANALYZE gives the following:
>>> https://gist.github.com/ioquatix/bddda36d9e4ffaceb7a62d7b62259121#gistcomment-2065314
>>> 
>>> I'm thinking that I need to do the following to help:
>>> 
>>> CLUSTER user_event ON index_user_event_for_visits_3 followed by
>>> analyze... Our data is mostly time series but sometimes we get some
>>> dumps with historical records.
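
(As an aside: if you do try CLUSTER, the current form of the command plus the 
follow-up ANALYZE would be something like the below; note that CLUSTER takes an 
exclusive lock and rewrites the whole table, so plan for downtime:)

    -- Rewrite the table in the physical order of that index, then refresh stats:
    CLUSTER user_event USING index_user_event_for_visits_3;
    ANALYZE user_event;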
>> 
>> It seems to me that a large part of the problem is that the server has to 
>> scan all index entries from that date and within those location bounds, only 
>> to find that the distinct user ids in that set are about 114,000 out of the 
>> 1.7M rows matching the selection criteria. If it could stop at the first 
>> location for each user, it would have to scan less than a tenth of the index 
>> entries that it has to scan now...
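
(To expand on that: PG has no native "skip scan", but the DISTINCT-user_id part 
can sometimes be emulated with a recursive CTE, the so-called loose index scan. 
A minimal sketch that ignores the what/created_at/location filters, which would 
have to be pushed into the inner subquery for it to help your actual query, and 
that assumes an index led by user_id:)

    WITH RECURSIVE u AS (
        SELECT min(user_id) AS user_id FROM user_event
        UNION ALL
        -- jump straight to the next distinct user_id via the index
        SELECT (SELECT min(user_id) FROM user_event WHERE user_id > u.user_id)
          FROM u
         WHERE u.user_id IS NOT NULL
    )
    SELECT user_id FROM u WHERE user_id IS NOT NULL;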
>> 
>> How high is the correlation between user id and location? That can probably 
>> be leveraged…
>> Perhaps you'd get better performance if you'd use multiple indices instead 
>> of stuffing everything into a single purpose-specific one? I would suggest 
>> one on (user_id, latitude, longitude) or (latitude, longitude, user_id) and 
>> one on (created_at, user_id), or perhaps (created_at, latitude, longitude). 
>> That approach could also reduce the number of indices you have on that 
>> table, as well as their sizes, making it all fit into cache a little easier. 
>> Then again, additional operations will be required to combine them.
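
(To make that concrete, something along these lines; the index names are just 
placeholders:)

    CREATE INDEX user_event_loc_user_idx  ON user_event (latitude, longitude, user_id);
    CREATE INDEX user_event_time_user_idx ON user_event (created_at, user_id);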
>> 
>> For a different approach: it may be possible to enrich your data with 
>> something that is easy to index and query and that correlates strongly with 
>> (latitude, longitude). That could also be used to partition over. Country 
>> seems a good candidate here, unless all your data points are in New Zealand 
>> like the above?
>> Then again, some countries are a lot larger and more populous than others, 
>> and populations can be highly concentrated (Delhi and Moscow, to name a 
>> couple).
>> Another option is to divide the location space up into squares of a fixed 
>> size, with a partition for each square. About 80% of those squares are 
>> unpopulated though, being at sea.
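
(A crude sketch of the fixed-square idea, using one-degree cells; the column 
name is invented and the cell size is arbitrary:)

    -- Tag each row with the one-degree square it falls in; this column can
    -- serve as an index key or as a partition key.
    ALTER TABLE user_event ADD COLUMN grid_cell text;

    UPDATE user_event
       SET grid_cell = floor(latitude)::int || ':' || floor(longitude)::int;

    CREATE INDEX user_event_grid_cell_idx ON user_event (grid_cell);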
>> 
>> Created_at is a very good candidate for partitioning too, especially if you 
>> don't intend to keep data older than a certain age. Truncating or dropping a 
>> partition that you no longer need is quick and easy.
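
(For illustration, with the inheritance-based partitioning we have today, since 
declarative partitioning only arrives in PostgreSQL 10, a monthly child table 
could look like the below; the names and the monthly granularity are just 
assumptions, and rows still need to be routed into the right child directly or 
via a trigger:)

    CREATE TABLE user_event_2016_04 (
        CHECK (created_at >= '2016-04-01' AND created_at < '2016-05-01')
    ) INHERITS (user_event);

    -- With constraint_exclusion = partition (the default), the planner skips
    -- children whose CHECK constraint cannot match the queried created_at range.
    -- Retiring old data is then just:
    DROP TABLE user_event_2015_01;  -- hypothetical old partition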
>> 
>> With data sets this large, I'd think you would want to partition on multiple 
>> dimensions, creating a matrix of partitions under a single master table. I 
>> don't think PG has a simple way of doing that (yet) though; perhaps it's 
>> possible by abusing multiple levels of inheritance, but that sounds like a 
>> bad idea.
>> 
>> And of course, create your partitions sufficiently coarse to avoid 
>> overburdening the system tables, which would slow down the query planner.
>> 
>> Hopefully there's something useful in my ramblings!
>> 
>> Alban Hertroys
>> --
>> If you can't see the forest for the trees,
>> cut the trees and you'll find there is no forest.
>> 

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
