From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> At EnterpriseDB, we (me and some of my colleagues) are working from more
> than a year on the new storage format in which only the latest version of
> the data is kept in main storage and the old versions are moved to an undo
> log. We call this new storage format "zheap". To be clear, this proposal
> is for PG-12.
Wonderful! BTW, what does "z" stand for? Ultimate?
> Robert did much of the basic design work. The design and development of
> various subsystems of zheap have been done by a team comprising of me, Dilip
> Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena Emerson,
> and Amit Khandekar. Thomas Munro wrote the undo storage system. Marc
> Linster has provided unfailing management support, and Andres Freund has
> provided some design input (and criticism). Neha Sharma and Tushar Ahuja
> are helping with the testing of this project.
What a gorgeous star team!
Below are my first questions and comments.
This is a simple question from the user's perspective. For what kinds of
workloads would you recommend zheap and heap, respectively? Are you going to
recommend zheap for all use cases, and will heap be deprecated? I think we
need to be clear on this in the manual, at least before the final release.
My impression is that zheap would be better for update-intensive workloads.
Then, how about insert-and-read-mostly databases like a data warehouse? zheap
seems better for those too, since the database size is reduced. Although data
loading may generate more transaction log for undo, that increase should be
offset by the smaller tuple header in WAL.
zheap allows us to run long-running analytics and reporting queries
simultaneously with updates without worrying about database bloat, so zheap is
a step toward HTAP, right?
Can zheap be used for system catalogs? If yes, we won't be bothered with
system catalog bloat, e.g. as a result of repeated creation and deletion of
> Scenario 1: A 15 minutes simple-update pgbench test with scale factor 100
> shows 5.13% TPS improvement with 64 clients. The performance improvement
> increases as we increase the scale factor; at scale factor 1000, it
> reaches 11.5% with 64 clients.
What was the fillfactor? How would the comparison look when HOT works
effectively for heap?
"Undo logs are not yet crash-safe. Fsync and some recovery details are yet to
"We also want to make FSM crash-safe, since we can’t count on
VACUUM to recover free space that we neglect to record."
Would these directly affect the response time of each transaction? Do you
predict that the performance difference will get smaller when these are
"The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with alignment):
2 bytes each for infomask and infomask2, and one byte for t_hoff. I think we
might be able to squeeze some space from t_infomask, but for now, I have kept
it as two bytes. All transactional information is stored in undo, so fields
that store such information are not needed here."
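To check my reading of the quoted layout, here is a minimal C sketch of the arithmetic. The struct, macro, and function names are hypothetical and not taken from the zheap source; the field order and the 8-byte maximum alignment are my assumptions. It only shows how 2 + 2 + 1 bytes gives a 5-byte header that rounds up to 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical header layout following the quoted description.
 * Field order is an assumption, not the actual zheap definition.
 */
typedef struct SketchZHeapTupleHeader
{
    uint16_t t_infomask2;   /* 2 bytes */
    uint16_t t_infomask;    /* 2 bytes */
    uint8_t  t_hoff;        /* 1 byte: offset to user data */
    uint8_t  t_bits[];      /* null bitmap / data would start here */
} SketchZHeapTupleHeader;

/* Round up to an assumed 8-byte maximum alignment. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~(size_t) 7)

/* Size of the fixed part of the header, excluding trailing padding. */
size_t
sketch_header_size(void)
{
    return offsetof(SketchZHeapTupleHeader, t_bits);
}

/* Header size once rounded up to the assumed alignment boundary. */
size_t
sketch_header_size_aligned(void)
{
    size_t aligned = SKETCH_MAXALIGN(sketch_header_size());

    assert(aligned % 8 == 0);
    return aligned;
}
```

Here sketch_header_size() gives 5 and sketch_header_size_aligned() gives 8, which matches the figures in the quoted text. I measure with offsetof rather than sizeof because sizeof would include the compiler's trailing padding.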
"To check the visibility of a
tuple, we fetch the transaction slot number stored in the tuple header, and
then get the transaction id and undo record pointer from transaction slot."
Where in the tuple header is the transaction slot number stored?
"As of now, we have four transaction slots per
page, but this can be changed. Currently, this is a compile-time option; we
can decide later whether such an option is desirable in general for users."
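To make sure I understand the slot mechanism, here is a minimal C sketch of a page with four transaction slots. All names are hypothetical and the field widths are my assumptions, not the zheap source. The slot number is passed in as a parameter because I do not know where the tuple header stores it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SLOTS_PER_PAGE 4     /* "four transaction slots per page" */
#define SKETCH_INVALID_XID    0u    /* assumption: 0 marks a free slot */

/* One transaction slot: transaction id plus undo record pointer. */
typedef struct SketchTransSlot
{
    uint32_t xid;
    uint64_t undo_ptr;
} SketchTransSlot;

typedef struct SketchPage
{
    SketchTransSlot slots[SKETCH_SLOTS_PER_PAGE];
} SketchPage;

/* Visibility-check step: map a slot number to its xid and undo pointer. */
const SketchTransSlot *
sketch_slot_lookup(const SketchPage *page, int slot_no)
{
    assert(page != NULL);
    if (slot_no < 0 || slot_no >= SKETCH_SLOTS_PER_PAGE)
        return NULL;
    return &page->slots[slot_no];
}

/*
 * Reserve a slot for xid: reuse the slot the transaction already holds,
 * otherwise take the first free one. Returns -1 when every slot is in
 * use, which is the case where a real system would have to wait.
 */
int
sketch_slot_reserve(SketchPage *page, uint32_t xid)
{
    int free_slot = -1;

    assert(page != NULL);
    assert(xid != SKETCH_INVALID_XID);

    for (int i = 0; i < SKETCH_SLOTS_PER_PAGE; i++)
    {
        if (page->slots[i].xid == xid)
            return i;
        if (free_slot < 0 && page->slots[i].xid == SKETCH_INVALID_XID)
            free_slot = i;
    }

    if (free_slot >= 0)
    {
        page->slots[free_slot].xid = xid;
        page->slots[free_slot].undo_ptr = 0;
    }
    return free_slot;
}
```

With four distinct transactions on one page, a fifth call to sketch_slot_reserve() returns -1 in this sketch. How a full page is actually handled is up to the real implementation.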
"The one known problem with the fixed number of slots is that
it can lead to deadlock, so we are planning to add a mechanism to allow the
array of transactions slots to be continued on a separate overflow page. We
also need such a mechanism to support cases where a large number of
transactions acquire SHARE or KEY SHARE locks on a single page."
I hope for this. I was troubled by deadlocks in Oracle and had to tune
INITRANS in CREATE TABLE. A fixed number of slots introduces a new
configuration parameter, which gives the DBA one more thing to worry about
and one more statistic to monitor for tuning.
What index AMs does "indexes which lack delete-marking support" apply to?
Can we be freed from vacuum in a typical use case where only zheap and B-tree
indexes are used?
How does rollback of a transaction work after one of its subtransactions has
already been rolled back? Does the undo of the whole transaction skip the undo
of the subtransaction?
Will the prepare of 2pc transactions be slower, as they have to safely save