I work at a ride sharing company and we found a simple scenario where
Postgres has a lot of room for improvement.
After my talk at pgconf.ru Alexander Korotkov encouraged me to share my
story and thoughts in -hackers.

Story setting:

 - Amazon RDS (thus vanilla, unpatchable Postgres), synchronous_commit=off
(we're ok with losing a few seconds of data on crash).

 - Amazon gp2 storage. It's almost like an SSD for small spikes, but has an
IOPS limit if you're doing a lot of sustained operations.

 - Drivers driving around the city and sending their GPS positions every
second; for simplicity, 50k of them at a time.

 - An append-only table of shape (id uuid, ts timestamp, geom geometry,
heading float, speed float, accuracy float, source text), with a btree
index on (id, ts). (A sketch follows after this list.)

 - A service that receives the measurements from the network, batches them
into a 1-second buffer (~N=50k rows), and inserts them via COPY.
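
For reference, a minimal sketch of the setup above (the exact column layout
and the PostGIS geometry type are my assumptions here, not our production
DDL):

create table positions (
    id       uuid,
    ts       timestamp,
    geom     geometry,  -- PostGIS point, assumed
    heading  float,
    speed    float,
    accuracy float,
    source   text
);

create index positions_id_ts_idx on positions (id, ts);

-- the ingestion service buffers ~50k rows and loads them once a second:
copy positions from stdin;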

After someone orders and completes a trip, we get the driver's id and the
trip's time interval from another service, and want to fetch the trip's
route to calculate the bill. Trip durations range from seconds up to 4
hours (N = 4*3600 = 14400 rows; a page typically holds 60 rows).

select * from positions where id = :id and ts between :start_ts and :end_ts;

Data older than 24 hours is not needed in this service, so 70 GB of storage
should be enough. To be on the safe side we're giving it 500 GB, which on
gp2 gives a steady 1500 IOPS.

In development (on synthetic data) the plans use the index and look great,
so we proceed with Postgres and not MySQL. :)

When deployed to production, we found that a single query for an interval
of more than half an hour (1800+ rows) can exhaust all the IOPS.
Data is appended with an increasing time field, which effectively ensures
that no two rows from the same driver ever land in the same heap page. A
4-hour request can degrade the system for 10 seconds: gp2 provides at most
10000 IOPS, and we need to fetch 14400 pages. We need the biggest available
gp2 offering just to read 2 megabytes of data in 1.5 seconds. The best
io-optimized io1 volumes provide 32000 IOPS, which gets us down to 500 ms.
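
The pattern is easy to see with EXPLAIN: since every row of a trip sits on
its own heap page, the Buffers line reports roughly one shared buffer read
per returned row:

explain (analyze, buffers)
select * from positions
where id = :id and ts between :start_ts and :end_ts;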

If the data were perfectly packed, it would fit in just 240 8 kB pages,
which translates to 120 input operations in gp2's 16 kB blocks.
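
The arithmetic, for reference:

-- 14400 rows at ~60 rows/page; a 16 kB gp2 block covers two 8 kB pages
select 14400 / 60          as pages_if_packed,  -- 240
       14400 / 60 * 8 / 16 as gp2_read_ops;     -- 120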

Our options were:

 - Partitioning. Not entirely trivial when your id is a uuid. To get
visible gains, we need to make sure each driver gets their own partition.
That would leave us with 50 000(+) tables, and rumor has it that's what is
done in some bigger taxi service, where the relcache then eats up all the
RAM and the system OOMs.

 - CLUSTER table USING index. Works perfectly on a test stand, but isn't
available as an online operation.

 - Postgres Pro suggested CREATE INDEX .. INCLUDE (on commitfest
https://commitfest.postgresql.org/15/1350/). We can't use that, as it's not
in upstream/Amazon Postgres yet.

 - We decided to live with the overhead of unnecessary sorting by all
fields and of keeping a copy of the heap, and created a btree over all the
fields to utilize Index-Only Scans (the whole sequence is sketched after
this list):

  * Testing went well on a dump of the production database.

  * After we built the indexes on production, we found that performance
was _worse_ than with the simpler index.

  * EXPLAIN (BUFFERS) revealed that pages never get marked all-visible in
the Visibility Map, as autovacuum ignores an append-only, never-updated,
never-deleted table that is only truncated once a day. There is no way to
force autovacuum on such a table.

  * We created a microservice (it's hard to find a spot for crontab in a
distributed system!) that periodically and aggressively runs VACUUM on the
table. It indeed helped with queries, but... while VACUUM skips all-visible
pages in the heap, it always walks over all the pages of the btree, which
is even larger than the heap in our case.
There is a patch to repair this behavior on commitfest
(https://commitfest.postgresql.org/16/952/), but it's not yet in
upstream/Amazon Postgres.

  * We ended up inventing a partitioning scheme that rotates the table
every gigabyte of data, to keep VACUUM run time low. There are now hundreds
of partitions with indexes on all the fields. Finally the system is stable.

  * EXPLAIN (BUFFERS) later showed that each reading query visits all the
indexes of the partitioned table, fetching a page from each index just to
learn that there are 0 rows there. To prune obviously unneeded partitions
we decided to add a constraint on the timestamp after a partition is
finalized. Timestamps are stamped on the client side and sanitized due to
mobile network instability, so we don't know the bounds in advance.
Unfortunately that means we need two seq scans to do it: a first one to get
min(ts) and max(ts), and a second one for ALTER TABLE ADD CONSTRAINT. This
operation also eats up IOPS.
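
To make the bullets above concrete, here is roughly the sequence we ended
up with (the index, partition, and constraint names are mine, for
illustration):

-- all-fields btree, so the query can run as an Index-Only Scan
create index positions_all_idx
    on positions (id, ts, geom, heading, speed, accuracy, source);

-- what the VACUUM microservice effectively runs, so that pages get
-- marked all-visible in the Visibility Map:
vacuum positions;

-- after a partition is finalized: two passes, both costing IOPS
select min(ts), max(ts) from positions_p42;  -- first seq scan
alter table positions_p42
    add constraint positions_p42_ts_check
    -- :min_ts and :max_ts are the bounds from the query above
    check (ts between :min_ts and :max_ts);  -- second seq scan to validate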

We are not a very large company, but we've already bumped into too many
scalability issues on this path. Searching for solutions at every step
turns up other people with tables named like "gpsdata" and "traces", so
we're not alone with this problem. :)

I gave this all some thought, and it looks like none of it would have
happened if Postgres were able to cluster heap insertions by the (id, ts)
index. We're ok with synchronous_commit=off, so the amplified writes won't
immediately hit disk and can cool down in progress. Clustering doesn't
require perfect sorting: we need to minimize the number of pages fetched,
and it's ok if the pages are not consecutive on disk.

I see the option for clustered heap writes as follows (a possible SQL
surface is sketched after the list):

 - on tuple insertion, check whether the table is clustered by some index;

 - if the table is clustered, we start writing not into the last used
page, but instead go into the index and get the page numbers of index
tuples less than or equal to the current one, up to the index page
boundary (or some other exit strategy; at least one heap page is needed);

 - if we can fit the tuple into one of those pages, write it there;

 - if we cannot fit it, consult statistics to see whether we have too much
empty space in the pages. If more than 50% of the space is empty, get
pages via the FSM; create a new page otherwise.
(Looking into the FSM is a safety measure for people who specifically
insert tuples in backward-sorted order.)
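
At the SQL level, the existing clustering marker could be reused as the
switch. The statement below is valid syntax today, but currently it only
affects later CLUSTER runs; giving it this insert-time meaning is exactly
the proposal:

-- hypothetical behavior: after this, new tuples would be placed near
-- existing tuples with adjacent (id, ts) index keys
alter table positions cluster on positions_id_ts_idx;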

This would not require more space than is currently needed to keep the
all-fields index vacuumed, and it would let us omit VACUUM passes over the
relations. A pencil-and-paper simulation shows it to be a good thing. Do
you have thoughts, can you help with the implementation, or can you tell
me why it would be a bad idea and we need to follow our competitor in
moving away to another RDBMS? :)

I encourage everyone to help with at least
https://commitfest.postgresql.org/16/952/, as no good SQL-level workaround
exists for it. After that, enabling autovacuum for tables that have only
inserted and never deleted tuples would be cool to have, so that their
pages get marked all-visible.

Hackers, can you help me keep Postgres in the system? :)

Darafei Praliaskouski,
GIS Engineer / Juno Minsk
