On Sat, May 26, 2018 at 6:33 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 4:05 PM, Alexander Korotkov
> <a.korot...@postgrespro.ru> wrote:
>
> It's been a while since we have updated the progress on this project,
> so here is an update.
>
Yet another update.

> This is based on the features that were not
> working (as mentioned in Readme.md) when the branch was published.
> 1. TID Scans are working now.
> 2. Insert .. On Conflict is working now.
> 3. Tuple locking is working with a restriction that if there are more
> concurrent lockers on a page than the number of transaction slots on a
> page, then some of the lockers will wait till others get committed.
> We are working on a solution to extend the number of transaction slots
> on a separate set of pages which exist in heap, but will contain only
> transaction data.
>

Now, we have a working solution for this problem.  The extended transaction slots are stored in TPD pages (which contain only transaction slot arrays) that are interleaved with the regular pages.  For a detailed description, see the comments atop src/backend/access/zheap/tpd.c.  One caveat remains: once TPD pages are pruned (a TPD page can be pruned when all of its transaction slots are old enough to no longer matter), they are not added to the FSM for reuse.  We are working on a patch for this, which we expect to finish in a week or so.

Toast tables are working now; the toast data is stored in zheap as well.  Apart from the consistency of storing toast data in the same storage engine as the main data, this has the advantage of early cleanup, which means the space for deleted rows can be reclaimed as soon as the transaction commits.  This is particularly good for toast tables, as each update of a toast value is a DELETE+INSERT.

The alignment of tuples is changed such that we don't have alignment padding between the tuple header and the tuple data, as we always make a copy of the tuple to support in-place updates.  Likewise, we ideally don't need any alignment padding between tuples.  However, there are places in the zheap code where we access the tuple header directly from the page (e.g. zheap_delete, zheap_update, etc.), for which we want the tuples to be aligned at a two-byte boundary.  We omit all alignment padding for pass-by-value types.  Even in the current heap, we never point directly to such values, so the alignment padding doesn't help much; it lets us fetch the value using a single instruction, but that is all.  Pass-by-reference types will work as they do in the heap.  We can't directly access unaligned values; instead, we need to use memcpy.  We believe that the space savings will more than pay for the additional CPU costs.
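To make the memcpy point concrete, here is a minimal, self-contained sketch (not the actual zheap code; the byte layout and all names are made up purely for illustration) of fetching a pass-by-value attribute that is stored without alignment padding:

/*
 * Illustrative sketch only, not zheap code: fetch a 4-byte pass-by-value
 * datum that starts at an unaligned offset inside raw tuple data.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Pretend tuple data: a 1-byte column followed immediately (no padding)
 * by a 4-byte integer column holding 42 on a little-endian machine.
 */
static const unsigned char tuple_data[] = {0x01, 0x2A, 0x00, 0x00, 0x00};

int
main(void)
{
    const unsigned char *attp = tuple_data + 1;  /* unaligned start of the int */
    int32_t     val;

    /*
     * A direct dereference like "val = *(const int32_t *) attp;" may fault
     * or be slow on some platforms because attp is not 4-byte aligned, so
     * copy the bytes out instead.
     */
    memcpy(&val, attp, sizeof(val));

    printf("fetched value: %d\n", (int) val);    /* prints 42 on little-endian */
    return 0;
}

The real code would of course work through the tuple descriptor and attribute offsets; the sketch only shows why an unaligned pass-by-value datum is copied out with memcpy rather than dereferenced in place.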
Vacuum full is implemented in such a way that we don't copy the information required for MVCC-aware scans.  We copy only the LIVE tuples into the new heap and freeze them before storing them there.  This is not ideal, as we lose all the visibility information of the tuples; OTOH, that information can't simply be copied from the original tuples because it is maintained in undo, and we don't have a facility to modify undo records.  We could either allow undo records to be modified or write a special kind of undo record that captures the required visibility information.  I think it will be tricky to do this, and I am not sure it is worth a whole lot of effort before the basic things work; another point is that with zheap, the need for vacuum will anyway be reduced to a good extent.

Serializable isolation is also supported; we didn't need to make any major changes except for making it understand ZheapTuple (we used the TID in the required APIs).  I think this part will need some changes after integration with the pluggable storage API.  We have special handling for tuples which have been updated in place or whose latest modifying transaction got aborted.  In that case, we check whether the latest committed transaction that modified that tuple is a concurrent transaction, and based on that we decide whether there is a serialization conflict.

In zheap, we don't need to generate a new xid for sub-transactions, as the visibility information for a particular tuple is present in undo, and on Rollback To Savepoint we apply the required undo to restore the tuples to the state they were in before the subtransaction.  This gives us a performance/scalability boost when sub-transactions are involved, as we don't need to acquire XIDGenLock for each subtransaction.  Apart from the above benefits, we need this for zheap anyway, as otherwise the undo chain for each transaction won't be linear; we also save allocating additional transaction slots at the page level for each such transaction id.  (A toy sketch of the rollback-to-savepoint idea appears at the end of this mail.)

Undo workers and transaction rollbacks are working now.  My colleague Dilip has posted a separate patch [1] for this, as it can have some use cases without zheap as well, and Thomas has just posted a patch [2] using that facility.

Some other features, like row movement for an update of the partition key, are also handled.  In short, most of the user-visible features are now working.

make installcheck for zheap has 12 failures, mostly due to plan or stats changes, because zheap has additional meta pages (the meta page and TPD pages) and because of in-place updates.  So in most cases either an additional ORDER BY needs to be added or some minor tweak to the query is required.  The isolation test suite has one failure, which again is due to in-place updates and seems to be a valid case, but it needs a bit more investigation.  We have yet to support JIT for zheap, so the corresponding tests also fail.

Some of the main things that are not working:

Logical decoding - I am not sure at this stage whether it is a must for the first version of zheap.  Surely, we can have a basic design ready.

Snapshot too old - This feature allows data in heap pages to be removed even in the presence of old transactions.  It is going to work differently for zheap, as we want the undo needed by older snapshots to go away, rather than working in terms of heap pages as we do for the current heap.  One can argue that we should make it similar to the current heap, but I see a lot less value in that, as this new heap works entirely differently and we can have a better implementation for it.

Delete marking in indexes - This will allow in-place updates even when index columns are updated, and additionally, with this we can avoid the need for a dedicated vacuum process to perform retail deletes.  This is a feature we definitely want to do separately from the main heap work, because the current indexes work with zheap without any major changes.

You can find the latest code at https://github.com/EnterpriseDB/zheap

I would again like to highlight that this is not my work alone.  Dilip Kumar, Kuntal Ghosh, Rafia Sabih, Mithun C Y and Amit Khandekar have worked along with me to make this progress.  Feedback is welcome.

[1] - https://www.postgresql.org/message-id/flat/cafitn-syq8r8anjwfykxvfnxgxylrfvbx9ee4sxo9ns-obb...@mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAEepm%3D0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG%2BXpssKqmezug%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
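As referenced above, here is a toy, self-contained sketch of the rollback-to-savepoint idea: changes made after a savepoint are reverted by walking the transaction's undo chain backwards (the sketch assumes the savepoint simply records a position in that chain), so no separate subtransaction XID is consumed.  All structures and names below are invented for illustration; this is not the actual zheap/undo code.

#include <stdio.h>

/* Toy undo record: restores one "tuple" to its previous value. */
typedef struct UndoRecord
{
    int     prev;       /* index of previous undo record of this xact, -1 if none */
    int     tuple_no;   /* which tuple this record restores */
    int     old_value;  /* value the tuple had before the change */
} UndoRecord;

/* Current "table" state: all three tuples were updated by one transaction. */
static int tuples[3] = {11, 21, 31};

/* Undo chain: record 0 was written before the savepoint, 1 and 2 after it. */
static UndoRecord undo_log[] = {
    {-1, 0, 10},
    { 0, 1, 20},
    { 1, 2, 30},
};

/*
 * Apply undo records newest-first, stopping once we pass the position that
 * was remembered when the savepoint was established.
 */
static void
rollback_to_savepoint(int latest, int savepoint_start)
{
    for (int cur = latest; cur >= savepoint_start; cur = undo_log[cur].prev)
        tuples[undo_log[cur].tuple_no] = undo_log[cur].old_value;
}

int
main(void)
{
    rollback_to_savepoint(2, 1);    /* undo records 2 and 1; record 0's change stays */
    printf("%d %d %d\n", tuples[0], tuples[1], tuples[2]);  /* prints: 11 20 30 */
    return 0;
}

(In real undo-based storage the records would live in undo logs and be identified by undo record pointers rather than array indexes; the sketch only shows the shape of replaying undo back to a savepoint without a subtransaction XID.)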