From: Amit Kapila [mailto:amit.kapil...@gmail.com]
> At EnterpriseDB, we (me and some of my colleagues) are working from more
> than a year on the new storage format in which only the latest version of
> the data is kept in main storage and the old versions are moved to an undo
> log.  We call this new storage format "zheap".  To be clear, this proposal
> is for PG-12.

Wonderful!  BTW, what does "z" stand for?  Ultimate?

> Credits
> ------------
> Robert did much of the basic design work.  The design and development of
> various subsystems of zheap have been done by a team comprising of me, Dilip
> Kumar, Kuntal Ghosh, Mithun CY, Ashutosh Sharma, Rafia Sabih, Beena Emerson,
> and Amit Khandekar.  Thomas Munro wrote the undo storage system.  Marc
> Linster has provided unfailing management support, and Andres Freund has
> provided some design input (and criticism).  Neha Sharma and Tushar Ahuja
> are helping with the testing of this project.

What a gorgeous star team!

Below are my first questions and comments.

(1)
This is a simple question from the user's perspective.  For what kinds of
workloads would you recommend zheap and heap, respectively?  Are you going to
recommend zheap for all use cases, and will heap be deprecated?  I think we
need to be clear on this in the manual, at least before the final release.

I felt zheap would be better for update-intensive workloads.  Then, how about
insert-and-read-mostly databases like a data warehouse?  zheap seems better
for that too, since the database size is reduced.  Although data loading may
generate more transaction logs for undo, that increase is offset by the
reduction of the tuple header in WAL.

zheap allows us to run long-running analytics and reporting queries
simultaneously with updates without concern about database bloat, so zheap is
a way toward HTAP, right?

(2)
Can zheap be used for system catalogs?  If yes, we won't be bothered by system
catalog bloat, e.g. as a result of repeated creation and deletion of temporary
tables.

(3)
> Scenario 1: A 15 minutes simple-update pgbench test with scale factor 100
> shows 5.13% TPS improvement with 64 clients.  The performance improvement
> increases as we increase the scale factor; at scale factor 1000, it
> reaches 11.5% with 64 clients.

What was the fillfactor?  What would the comparison look like when HOT works
effectively for heap?

(4)
"Undo logs are not yet crash-safe. Fsync and some recovery details are yet to
be implemented."
"We also want to make FSM crash-safe, since we can’t count on VACUUM to
recover free space that we neglect to record."

Would these directly affect the response time of each transaction?  Do you
predict that the performance difference will get smaller when these are
implemented?

(5)
"The tuple header is reduced from 24 bytes to 5 bytes (8 bytes with
alignment): 2 bytes each for infomask and infomask2, and one byte for t_hoff.
I think we might be able to squeeze some space from t_infomask, but for now,
I have kept it as two bytes.  All transactional information is stored in undo,
so fields that store such information are not needed here."
"To check the visibility of a tuple, we fetch the transaction slot number
stored in the tuple header, and then get the transaction id and undo record
pointer from transaction slot."

Where in the tuple header is the transaction slot number stored?

(6)
"As of now, we have four transaction slots per page, but this can be changed.
Currently, this is a compile-time option; we can decide later whether such an
option is desirable in general for users."
"The one known problem with the fixed number of slots is that it can lead to
deadlock, so we are planning to add a mechanism to allow the array of
transactions slots to be continued on a separate overflow page.  We also need
such a mechanism to support cases where a large number of transactions acquire
SHARE or KEY SHARE locks on a single page."

I wish for this.  I was bothered by deadlocks with Oracle and had to tune
INITRANS with CREATE TABLE.  A fixed number of slots introduces a new
configuration parameter, which gives the DBA one more thing to worry about and
one more statistic to monitor for tuning.

(7)
What index AMs does "indexes which lack delete-marking support" apply to?
Can we be freed from vacuum in a typical use case where only zheap and B-tree
indexes are used?

(8)
How does rollback after subtransaction rollback work?  Does the undo of a
whole transaction skip the undo of the subtransaction?

(9)
Will the prepare of 2PC transactions be slower, as they have to safely save
the undo log?

Regards
Takayuki Tsunakawa