Hello,
I have a handful of queries that are performing very slowly. I realize that I
will be hitting hardware limits at some point, but I want to make sure I've
squeezed out every bit of performance I can before calling it quits.
Our database is collecting traffic data at the rate of about 3 million rows a
day. The most important columns in the table being queried are timestamp,
volume, occupancy, and speed. I have also denormalized the table by adding
road name, direction, mile marker, and lane type values to eliminate joins with
other tables that contain information about the devices that collect this
information. A typical query will involve segments of roadway (i.e. road
names, directions, and mile marker bounds) over a certain period of time (e.g.
morning rush hour), and will have filters to exclude questionable data
(e.g. speed > 100 MPH). Unfortunately, there are also a few cases in which a
user will query data for many full days on all roadways, essentially querying
everything for a large period of time. One other thing to note is that we only
ever query readings with lane_type = through_lanes, although we are collecting
ramp and reversible lane data to facilitate future reporting needs.
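Given that, one idea I've been toying with (but have not tested yet) is a
partial index that leaves the ramp and reversible lane rows out entirely; the
index name and column list here are just a sketch:

CREATE INDEX vds_detector_data_through_idx
  ON vds_detector_data (road_name, reading_timestamp, mile_marker, road_dir)
  WHERE lane_type = 'through_lanes';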
Table Metadata:
- Volume typically ranges anywhere from 0 to 20 and averages around 4-5. A
small percentage of the rows have null volume.
- Occupancy ranges from 1 to 10, averages around 1 or 2
- Speed is about what you would expect, ranging from 30 to 70 with an average
somewhere near the middle
- There are 17 roads
- There are 2 directions per road
- Mile marker ranges vary by roadway, typical ranges are something like 0 to 40
or 257 to 290
- Most (80-90%+) of the readings have lane_type = through_lanes
- Size of a daily table is about 360MB; a half-month table is 5 to 6 GB
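For reference, I measured table sizes with something like the following (the
daily partition name here is illustrative):

SELECT pg_size_pretty(pg_relation_size('vds_detector_data_20110401'));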
Full Table and Index Schema:
I've experimented with partitioning using a table per day and 2 tables per
month (1st through 15th, 16th to end of month). 2 tables/month was the
original approach to keep the number of tables from growing too rapidly, but
it shows about 3x slower performance. Using daily tables incurs extra
planning overhead as expected, but it isn't all that bad. I'm OK with taking
a 1 second planning hit if my overall query time decreases significantly.
Furthermore, we will only be storing raw data for about a year and can
aggregate older data, so I can use daily tables for raw data and
larger/fewer tables for the older, aggregated data. The table and index
structure is below (a sketch of the daily partition layout follows the
list), and is identical between the daily and half-month tables with a
couple of exceptions:
- Daily tables have a fill factor of 100; half-month tables use the default
- Only the 4-column indexes were created for the daily tables, since the
others never get used
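The daily partitions are set up as standard 8.4 inheritance children with a
CHECK constraint on the day, so constraint_exclusion can prune partitions
outside the queried range; a simplified sketch (partition name illustrative):

CREATE TABLE vds_detector_data_20110401 (
  CHECK (reading_timestamp >= '2011-04-01 00:00:00'
     AND reading_timestamp <  '2011-04-02 00:00:00')
) INHERITS (vds_detector_data);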
CREATE TABLE vds_detector_data
(
  reading_timestamp timestamp without time zone,
  vds_id integer,
  detector_id integer,
  status smallint,
  speed numeric(12,9),
  volume numeric(12,9),
  confidence smallint,
  occupancy numeric(12,9),
  loadid bigint,
  road_name character varying(150),
  road_dir character varying(2),
  mile_marker numeric(7,2),
  lane_number integer,
  lane_type character varying(32),
  CONSTRAINT vds_detector_vdsid_fkey FOREIGN KEY (vds_id, detector_id)
      REFERENCES ref_vds_detector_properties (device_id, detector_id)
      MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT
)
WITH (
  OIDS=FALSE
);
CREATE INDEX vds_detector_data_dir_idx
  ON vds_detector_data USING btree (road_dir);
CREATE INDEX vds_detector_data_lane_idx
  ON vds_detector_data USING btree (lane_number);
CREATE INDEX vds_detector_data_mm_idx
  ON vds_detector_data USING btree (mile_marker);
CREATE INDEX vds_detector_data_occupancy_idx
  ON vds_detector_data USING btree (occupancy);
CREATE INDEX vds_detector_data_road_idx
  ON vds_detector_data USING btree (road_name);
CREATE INDEX vds_detector_data_road_ts_mm_dir_idx
  ON vds_detector_data USING btree (road_name, reading_timestamp, mile_marker, road_dir);
CREATE INDEX vds_detector_data_speed_idx
  ON vds_detector_data USING btree (speed);
CREATE INDEX vds_detector_data_timestamp_idx
  ON vds_detector_data USING btree (reading_timestamp);
CREATE INDEX vds_detector_data_ts_road_mm_dir_idx
  ON vds_detector_data USING btree (reading_timestamp, road_name, mile_marker, road_dir);
CREATE INDEX vds_detector_data_volume_idx
  ON vds_detector_data USING btree (volume);
EXPLAIN ANALYZE:
Query:
select cast(reading_timestamp as date) as date,
       floor(extract(hour from reading_timestamp) / 1.0) * 1.0 as hour,
       floor(extract(minute from reading_timestamp) / 60) * 60 as min,
       count(*),
       sum(vdd.volume) as totalVolume,
       sum(vdd.occupancy*vdd.volume)/sum(vdd.volume) as avgOcc,
       sum(vdd.speed*vdd.volume)/sum(vdd.volume) as avgSpeed,
       avg(vdd.confidence) as avgConfidence,
       min(vdd.detector_id) as detectorId,
       vdd.vds_id as vdsId
from vds_detector_data vdd
where (vdd.reading_timestamp between '2011-04-01 00:00:00.000'
                                 and '2011-04-30 23:59:59.999')
  and vdd.volume != 0
  and ((road_name = '44' and mile_marker between 257.65 and 289.5
        and (road_dir = 'E' or road_dir = 'W'))
    or (road_name = '64' and mile_marker between 0.7 and 40.4
        and (road_dir = 'E' or road_dir = 'W'))
    or (road_name = '55' and mile_marker between 184.8 and 208.1
        and (road_dir = 'N' or road_dir = 'S'))
    or (road_name = '270' and mile_marker between 0.8 and 34.5
        and (road_dir = 'N' or road_dir = 'S')))
  and not (vdd.speed < 0.0 or vdd.speed > 90.0 or vdd.volume = 0.0)
  and vdd.lane_type in ('through_lanes')
group by date, hour, min, vdd.vds_id, mile_marker
having sum(vdd.volume) != 0
order by vdd.vds_id, mile_marker;
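(As an aside, with these bucket values the two time expressions collapse:
floor(x / 1.0) * 1.0 is just x for the integral hour value, and
floor(minute / 60) * 60 is always 0 since minutes run 0 to 59, e.g.:

SELECT floor(extract(hour from now()) / 1.0) * 1.0 AS hour,  -- = extract(hour ...)
       floor(extract(minute from now()) / 60) * 60 AS min;   -- always 0

The generic form presumably stays in the query because the bucket sizes are
parameterized.)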
Daily table explain analyze: http://explain.depesz.com/s/iLY
Half month table explain analyze: http://explain.depesz.com/s/Unt
Postgres version:
PostgreSQL 8.4.8, compiled by Visual C++ build 1400, 32-bit
History:
None, this is a new database and application
Hardware:
- 2 Intel Xeon 2.13GHz processors with 8 cores each
- 8GB RAM
- Disks are in a RAID 5 configuration with a PERC H700 Integrated RAID
Controller 512MB Cache
- 5 disks, Seagate 7200 RPM SAS, 500GB each for a total capacity of about 2TB
- Windows Server 2008 R2 64-bit (but 32-bit postgres)
- Hardware upgrades aren't an option at this point due to budget and time
constraints
Maintenance Setup:
Autovacuum is disabled for these tables since the data is never updated. The
tables that we are testing with at the moment will not grow any larger and have
been both clustered and analyzed. They were clustered on the
vds_detector_data_timestamp_idx index.
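Concretely, that was something along these lines for each test table (the
table name is illustrative, and each table has its own copy of the index):

CLUSTER vds_detector_data_20110401 USING vds_detector_data_timestamp_idx;
ANALYZE vds_detector_data_20110401;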
GUC Settings:
effective_cache_size: 2048MB
work_mem: 512MB
shared_buffers: 64MB, 512MB, and 1024MB were each tried; all yielded the same
query plan and took the same amount of time to execute, give or take a few
seconds
Summary:
The time to get the raw data (before aggregation and sorting) is relatively
similar between the daily and half-month tables. It would appear that the
major difference is the ordering of sort and aggregation: the daily tables
aggregate first, so the amount of data sorted is significantly less.
Since the daily tables are only 360MB, I would hope that an entire table
could be pulled into memory with one large sequential read. Of course this
assumes that the file pieces are stored contiguously, but auto-defrag is
enabled and shows low fragmentation, so I'm trusting (as much as one can)
Windows to do the right thing here. My drives have a 150MB/s sustained max
throughput, and considering that the data is spread across 5 drives, I would
hope to at least reach the single-disk theoretical limit and read an entire
table plus its indexes into memory in about 4 to 5 seconds (360MB of table
data plus a roughly comparable volume of index data is on the order of
700MB, and 700MB / 150MB/s is about 4.7 seconds). Based on the analyze
output, each daily table averages 6 to 7 seconds, so I'm pretty close there;
maybe I'm just limited by disk speed?
In both cases, the row estimates vs. actuals are way off. I've increased the
statistics targets on the reading_timestamp and road_name columns to 100 and
then 1000 with no change, running an ANALYZE after each statistics change.
Should I be upping stats on the non-indexed columns as well? I've read
documentation that says I should be able to set a statistics value for an
entire table as opposed to per column, but I haven't found how to do that. I
guess I was either too lazy to update statistics on each column or just
didn't think it would help much.
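For reference, the per-column changes were along these lines (with 100 and
then 1000 as the target):

ALTER TABLE vds_detector_data ALTER COLUMN reading_timestamp SET STATISTICS 1000;
ALTER TABLE vds_detector_data ALTER COLUMN road_name SET STATISTICS 1000;
ANALYZE vds_detector_data;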
So, any pointers for performance improvement?
Thanks,
Craig
Open Roads Consulting, Inc.
757-546-3401
http://www.openroadsconsulting.com