We currently use 4 int32s to store xmin, xmax, cmin, cmax and xvac on every heap tuple. That's a lot of overhead, especially on tables with narrow rows. Reduction in header size would give us considerable space and I/O savings.

I'm thinking of removing cmin and cmax, and keeping that information in backend-private memory instead. cmin and cmax are only interesting to the inserting/deleting transaction, so using precious tuple header space for that is a waste. This has been discussed before, and Manfred Koizar even had a patch for 7.4 but didn't submit it because it was incomplete (http://archives.postgresql.org/pgsql-hackers/2005-09/msg00172.php). BTW: Manfred, do you still have the patch? It'd be interesting to look at, even if it's not finished.


This reduces the tuple header size by 4 bytes, or 8 bytes if we can later get rid of xvac as well.

There's some interesting properties we can exploit in implementing the backend-private storage:

1. in small OLTP transactions that touch few rows, the cmin/cmax information will easily fit in memory.

2. if current commandid == 1, every tuple with xmin == current xid is not visible. Similarly, every tuple with xmax = current xid is visible. So we don't need to store anything for the first command in a transaction.

3. we can forget modifications by command X as soon as there's no live snapshots with curcid <= X.

4. we don't need to record the information, if there's no live snapshots, and we know that the current command is not going to read existing rows.

These optimizations take care of bulk inserts nicely. In particular, pg_restore and similar applications wouldn't need to keep track of inserted records.

By choosing a clever data structure, we might be able to get away without spilling to disk. For example, a hash table works great for small transactions. If a transaction does bulk modifications, a bitmap per relation could be used. And to optimize for the cases when the information is not actually used, we can just collect the information to an array, and turn it into a more lookup-friendly data structure the first time it's needed.

Even if we do have to spill to disk on large transactions that do massive updates and selects, it would still be a win for most databases. And it penalizes the transactions that do massive updates, instead of every transaction in the system as we do now.

Comments?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to [EMAIL PROTECTED] so that your
      message can get through to the mailing list cleanly

Reply via email to