Hello,
I have a handful of queries that are performing very slowly. I realize that I
will be hitting hardware limits at some point, but I want to make sure I've
squeezed out every bit of performance I can before calling it quits.
Our database is collecting traffic data at the rate of about 3 million rows a
day. The most important columns in the table being queried are timestamp,
volume, occupancy, and speed. I have also denormalized the table by adding
road name, direction, mile marker, and lane type values to eliminate joins with
other tables that contain information about the devices that collect this
information. A typical query will involve segments of roadway (i.e. road
names, directions, and mile marker bounds) over a certain period of time (e.g.
morning rush hour), and will have filters to exclude questionable data
(e.g. speed > 100 MPH). Unfortunately, there are also a few cases in which a
user will query data for many full days on all roadways, essentially querying
everything for a large period of time. One other thing to note is that we only
ever query readings with lane_type = through_lanes, although we are collecting
ramp and reversible lane data to facilitate future reporting needs.
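Given that, one idea I've been toying with (but have not tested yet) is a
partial index that leaves the ramp and reversible lane rows out entirely; the
index name and column list here are just a sketch:

CREATE INDEX vds_detector_data_through_idx
  ON vds_detector_data (road_name, reading_timestamp, mile_marker, road_dir)
  WHERE lane_type = 'through_lanes';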
Table Metadata:
- Volume typically ranges anywhere from 0 to 20 and averages around 4-5. A
small percentage of the rows have null volume.
- Occupancy ranges from 1 to 10, averages around 1 or 2
- Speed is about what you would expect, ranging from 30 to 70 with an average
somewhere near the middle
- There are 17 roads
- There are 2 directions per road
- Mile marker ranges vary by roadway, typical ranges are something like 0 to 40
or 257 to 290
- Most (80-90%+) of the readings have lane_type = through_lanes
- Size of a daily table is about 360MB; a half-month table is 5 to 6 GB
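For reference, I measured table sizes with something like the following (the
daily partition name here is illustrative):

SELECT pg_size_pretty(pg_relation_size('vds_detector_data_20110401'));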
Full Table and Index Schema:
I've experimented with partitioning using a table per day and 2 tables per
month (1st through 15th, 16th to end of month). 2 tables/month was the
original approach to keep the number of tables from growing too rapidly, but
it shows about 3x slower performance. Using daily tables incurs extra
planning overhead as expected, but it isn't all that bad. I'm OK with taking
a 1 second planning hit if my overall query time decreases significantly.
Furthermore, we will only be storing raw data for about a year and can
aggregate older data, so I can use daily tables for raw data and
larger/fewer tables for the older, aggregated data. The table and index
structure is below (a sketch of the daily partition layout follows the
list), and is identical between the daily and half-month tables with a
couple of exceptions:
- Daily tables have a fill factor of 100; half-month tables use the default
- Only the 4-column indexes were created for the daily tables, since the
others never get used
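The daily partitions are set up as standard 8.4 inheritance children with a
CHECK constraint on the day, so constraint_exclusion can prune partitions
outside the queried range; a simplified sketch (partition name illustrative):

CREATE TABLE vds_detector_data_20110401 (
  CHECK (reading_timestamp >= '2011-04-01 00:00:00'
     AND reading_timestamp <  '2011-04-02 00:00:00')
) INHERITS (vds_detector_data);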
CREATE TABLE vds_detector_data
(
  reading_timestamp timestamp without time zone,
  vds_id integer,
  detector_id integer,
  status smallint,
  speed numeric(12,9),
  volume numeric(12,9),
  confidence smallint,
  occupancy numeric(12,9),
  loadid bigint,
  road_name character varying(150),
  road_dir character varying(2),
  mile_marker numeric(7,2),
  lane_number integer,
  lane_type character varying(32),
  CONSTRAINT vds_detector_vdsid_fkey FOREIGN KEY (vds_id, detector_id)
      REFERENCES ref_vds_detector_properties (device_id, detector_id)
      MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT
)
WITH (
  OIDS=FALSE
);
CREATE INDEX vds_detector_data_dir_idx
  ON vds_detector_data USING btree (road_dir);
CREATE INDEX vds_detector_data_lane_idx
  ON vds_detector_data USING btree (lane_number);
CREATE INDEX vds_detector_data_mm_idx
  ON vds_detector_data USING btree (mile_marker);
CREATE INDEX vds_detector_data_occupancy_idx
  ON vds_detector_data USING btree (occupancy);
CREATE INDEX vds_detector_data_road_idx
  ON vds_detector_data USING btree (road_name);
CREATE INDEX vds_detector_data_road_ts_mm_dir_idx
  ON vds_detector_data USING btree (road_name, reading_timestamp, mile_marker, road_dir);
CREATE INDEX vds_detector_data_speed_idx
  ON vds_detector_data USING btree (speed);
CREATE INDEX vds_detector_data_timestamp_idx
  ON vds_detector_data USING btree (reading_timestamp);
CREATE INDEX vds_detector_data_ts_road_mm_dir_idx
  ON vds_detector_data USING btree (reading_timestamp, road_name, mile_marker, road_dir);
CREATE INDEX vds_detector_data_volume_idx
  ON vds_detector_data USING btree (volume);
EXPLAIN ANALYZE:
Query:
select cast(reading_timestamp as date) as date,
       floor(extract(hour from reading_timestamp) / 1.0) * 1.0 as hour,
       floor(extract(minute from reading_timestamp) / 60) * 60 as min,
       count(*),
       sum(vdd.volume) as totalVolume,
       sum(vdd.occupancy*vdd.volume)/sum(vdd.volume) as avgOcc,
       sum(vdd.speed*vdd.volume)/sum(vdd.volume) as avgSpeed,
       avg(vdd.confidence) as avgConfidence,
       min(vdd.detector_id) as detectorId,
       vdd.vds_id as vdsId
from vds_detector_data vdd
where (vdd.reading_timestamp between '2011-04-01 00:00:00.000'
                                 and '2011-04-30 23:59:59.999')
  and vdd.volume != 0
  and ((road_name = '44' and mile_marker between 257.65 and 289.5
        and (road_dir = 'E' or road_dir = 'W'))
    or (road_name = '64' and mile_marker between 0.7 and 40.4
        and (road_dir = 'E' or road_dir = 'W'))
    or (road_name = '55' and mile_marker between 184.8 and 208.1
        and (road_dir = 'N' or road_dir = 'S'))
    or (road_name = '270' and mile_marker between 0.8 and 34.5
        and (road_dir = 'N' or road_dir = 'S')))
  and not (vdd.speed < 0.0 or vdd.speed > 90.0 or vdd.volume = 0.0)
  and vdd.lane_type in ('through_lanes')
group by date, hour, min, vdd.vds_id, mile_marker
having sum(vdd.volume) != 0
order by vdd.vds_id, mile_marker;
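(As an aside, with these bucket values the two time expressions collapse:
floor(x / 1.0) * 1.0 is just x for the integral hour value, and
floor(minute / 60) * 60 is always 0 since minutes run 0 to 59, e.g.:

SELECT floor(extract(hour from now()) / 1.0) * 1.0 AS hour,  -- = extract(hour ...)
       floor(extract(minute from now()) / 60) * 60 AS min;   -- always 0

The generic form presumably stays in the query because the bucket sizes are
parameterized.)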
Daily table explain analyze: http://explain.depesz.com/s/iLY
Half month table explain analyze: http://explain.depesz.com/s/Unt
Postgres version:
PostgreSQL 8.4.8, compiled by Visual C++ build 1400, 32-bit
History:
None, this is a new database and application
Hardware:
- 2 Intel Xeon 2.13GHz processors with 8 cores each
- 8GB RAM
- Disks are in a RAID 5 configuration with a PERC H700 Integrated RAID
Controller 512MB Cache
- 5 disks, Seagate 7200 RPM SAS, 500GB each for a total capacity of about 2TB
- Windows Server 2008 R2 64-bit (but 32-bit postgres)
- Hardware upgrades aren't an option at this point due to budget and time
constraints
Maintenance Setup:
Autovacuum is disabled for these tables since the data is never updated. The
tables that we are testing with at the moment will not grow any larger and have
been both clustered and analyzed. They were clustered on the
vds_detector_data_timestamp_idx index.
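Concretely, that was something along these lines for each test table (the
table name is illustrative, and each table has its own copy of the index):

CLUSTER vds_detector_data_20110401 USING vds_detector_data_timestamp_idx;
ANALYZE vds_detector_data_20110401;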
GUC Settings:
effective_cache_size: 2048MB
work_mem: 512MB
shared_buffers: 64MB, 512MB, and 1024MB were each tried; all yielded the same
query plan and took the same amount of time to execute, give or take a few
seconds
Summary:
The time to get the raw data (before aggregation and sorting) is relatively
similar between the daily and half-month tables. It would appear that the
major difference is the ordering of sort and aggregation: the daily tables
aggregate first, so the amount of data sorted is significantly less.
Since the daily tables are only 360MB, I would hope that an entire table
could be pulled into memory with one large sequential read. Of course this
assumes that the file pieces are stored contiguously, but auto-defrag is
enabled and shows low fragmentation, so I'm trusting (as much as one can)
Windows to do the right thing here. My drives have a 150MB/s sustained max
throughput, and considering that the data is spread across 5 drives, I would
hope to at least reach the single-disk theoretical limit and read an entire
table plus its indexes into memory in about 4 to 5 seconds (360MB of table
data plus a roughly comparable volume of index data is on the order of
700MB, and 700MB / 150MB/s is about 4.7 seconds). Based on the analyze
output, each daily table averages 6 to 7 seconds, so I'm pretty close there;
maybe I'm just limited by disk speed?
In both cases, the row estimates vs. actuals are way off. I've increased the
statistics targets on the reading_timestamp and road_name columns to 100 and
then 1000 with no change, running an ANALYZE after each statistics change.
Should I be upping stats on the non-indexed columns as well? I've read
documentation that says I should be able to set a statistics value for an
entire table as opposed to per column, but I haven't found how to do that. I
guess I was either too lazy to update statistics on each column or just
didn't think it would help much.
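For reference, the per-column changes were along these lines (with 100 and
then 1000 as the target):

ALTER TABLE vds_detector_data ALTER COLUMN reading_timestamp SET STATISTICS 1000;
ALTER TABLE vds_detector_data ALTER COLUMN road_name SET STATISTICS 1000;
ANALYZE vds_detector_data;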
So, any pointers for performance improvement?
Thanks,
Craig
Open Roads Consulting, Inc.
757-546-3401
http://www.openroadsconsulting.com