Hello,

We (David and I) recently observed that a Zedstore table can become considerably bloated when we load data into it with concurrent COPYs. We also found that concurrent insert performance was less than desirable. What follows is a detailed analysis of the extent of the problem, the cause of the problem, and how we fixed it. This has input from much of our team: Alex, Ashwin, David, Heikki, Melanie, Taylor and myself.
An example of the bloat that we observed:

TPC-DS scale = 270:

Table       heap   zed (serial)   zed (16 parallel COPYs)
web_sales   39G    19G            39G

We found that the bloat is caused by inefficient page splits resulting from out-of-tid-order inserts into full (or nearly full) attribute tree leaf pages. The degree of under-utilization was significant: attribute tree leaves produced by a serial data load held 6-8x more datums than the attribute tree leaves produced by a parallel load with 16 sessions.

Consider the scenario below.

Assumptions:
1. Two concurrent COPY commands are executing (sessions S1 and S2).
2. The table has only one (fixed-length, for the sake of argument) attribute 'a'.
3. For attribute 'a', a full attribute tree leaf page can accommodate 1500 datums.

TID allocations:
S1: 1-1000
S2: 1001-2000, 2001-3000

Order of operations:

1. S2 writes datums for tids 1001-2000 and 2001-3000. The resulting leaves are:

L1: lokey = 1     hikey = 2500       firsttid = 1001  lasttid = 2500
L2: lokey = 2501  hikey = MaxZSTid   firsttid = 2501  lasttid = 3000

2. S1 now writes datums for tids 1-1000. We have to split L1 into L1' and L1'':

L1':  lokey = 1     hikey = 1500      firsttid = 1     lasttid = 1500
L1'': lokey = 1501  hikey = 2000      firsttid = 1501  lasttid = 2000   [under-utilized page]
L2:   lokey = 2501  hikey = MaxZSTid  firsttid = 2501  lasttid = 3000

Note: The lokeys/hikeys reflect the ranges of what CAN be inserted, whereas firsttid and lasttid reflect what actually HAS been inserted.

L1'' is an under-utilized page that is not going to be filled again, because it inherits the tight hikey from L1. In this example, the space wastage in L1'' is 66%, but it could very easily be close to 100%, especially under concurrent workloads that mix single and multi-inserts, or even unequally sized multi-inserts.

Solution (kudos to Ashwin!):

For every multi-insert (and only multi-inserts, not singleton inserts), allocate N times more tids than requested. Each session keeps the extra tids in a buffer, and subsequent multi-inserts consume these buffered tids first. If at any time a tid allocation request cannot be met by the remaining buffered tids, a new batch of N times the number of tids requested is allocated again (a minimal sketch of this scheme follows this example).

If we take the same example above and allocate N = 5 times the number of tids upon the first request for 1000 tids:

TID allocations:
S1: 1-5000
S2: 5001-10000

Order of operations:

1. S2 writes datums for tids 5001-6000 and 6001-7000. The resulting leaves are:

L1: lokey = 1     hikey = 6500       firsttid = 5001  lasttid = 6500
L2: lokey = 6501  hikey = MaxZSTid   firsttid = 6501  lasttid = 7000

2. S1 writes datums for tids 1-1000. L1 is split into L1' and L1'':

L1':  lokey = 1     hikey = 5500      firsttid = 1     lasttid = 1000
L1'': lokey = 5501  hikey = 6500      firsttid = 5501  lasttid = 6500   [under-utilized page]
L2:   lokey = 6501  hikey = MaxZSTid  firsttid = 6501  lasttid = 7000

Subsequent inserts by S1 will land on L1', whose hikey isn't restrictive. However, we do still end up with the inefficient page L1''. With a high enough value of N, we reduce the frequency of such pages.

We could further reduce this wastage by incorporating a special left split: since L1 was already full, we don't change its contents at all - we simply update its lokey (so L1 becomes L1'') and fork off a new leaf to its left (L1'). This would look like:

L1':  lokey = 1     hikey = 5000  firsttid = 1     lasttid = 1000
L1'': lokey = 5001  hikey = 6500  firsttid = 5001  lasttid = 6500

We found that, with a high enough value of N, the left split did not provide significant additional space benefits.
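To make the buffering scheme concrete, here is a minimal C sketch of a per-session tid buffer, assuming a multi-insert asks for a contiguous range of tids. It is illustrative only: the zstid typedef, the static TidBuffer, and the helper tid_tree_reserve_range() are stand-ins rather than the actual zedstore API; MULTI_INSERT_TID_RESERVATION_FACTOR plays the role of N.

#include <stdint.h>

typedef uint64_t zstid;

/* "N": how many times more tids than requested we reserve per multi-insert. */
#define MULTI_INSERT_TID_RESERVATION_FACTOR 100

/*
 * Per-session buffer of tids already reserved in the tid tree but not yet
 * consumed. Tids start at 1 in the examples above, so 0 can mean "empty".
 */
typedef struct TidBuffer
{
	zstid		next_tid;	/* first unused buffered tid */
	zstid		last_tid;	/* last buffered tid (inclusive), 0 = empty */
} TidBuffer;

static TidBuffer tid_buffer = {0, 0};

/*
 * Stand-in for the call that actually reserves a contiguous range of ntids
 * tids in the tid tree and returns the first tid of that range. This is the
 * contended operation we want to perform as rarely as possible.
 */
extern zstid tid_tree_reserve_range(uint64_t ntids);

/*
 * Hand out the first tid of a contiguous range of 'ntids' tids for a
 * multi-insert, consuming buffered tids when possible and reserving
 * N * ntids new ones when the buffer cannot satisfy the request.
 */
static zstid
multi_insert_get_tid_range(uint64_t ntids)
{
	zstid		first;
	uint64_t	remaining;

	if (tid_buffer.last_tid == 0 || tid_buffer.next_tid > tid_buffer.last_tid)
		remaining = 0;
	else
		remaining = tid_buffer.last_tid - tid_buffer.next_tid + 1;

	if (remaining < ntids)
	{
		/*
		 * Any leftover buffered tids are abandoned here; like the tids left
		 * over at the end of the load, they would have to be killed later.
		 */
		uint64_t	reserve = ntids * MULTI_INSERT_TID_RESERVATION_FACTOR;

		tid_buffer.next_tid = tid_tree_reserve_range(reserve);
		tid_buffer.last_tid = tid_buffer.next_tid + reserve - 1;
	}

	first = tid_buffer.next_tid;
	tid_buffer.next_tid += ntids;
	return first;
}

In this sketch a refill always reserves N times the size of the current request, matching the description above; the leftover tids each session never consumes are what the "cost of killing the extra unused tids" discussion below accounts for.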
Thus, we decided to only incorporate N.

Results [TPC-DS scale = 50, 16 concurrent COPYs]:

Table             zed    N=10   N=100   N=1000   heap   zed (serial)
catalog_sales     15G    9.1G   7.7G    7.5G     15G    8.0G
catalog_returns   1.5G   0.9G   0.7G    0.7G     1.2G   0.8G
store_returns     2.1G   1.2G   1.1G    1.1G     1.9G   1.2G
store_sales       17G    11G    10.1G   10.1G    21G    10G

Load time:
N=10     30min
N=100    10min
N=1000   7min
zed      100min
heap     8min

'zed' refers to the zedstore branch without our fix.

We see that with N = 10, we get much closer to what we get with serial inserts. For N = 100, we even beat the serial insert. We can attribute the differences in runtime to the fact that by lowering the number of tid range requests, we reduce contention on the tid tree, which is a bottleneck for concurrent loads. A significant win!

How N relates to the other parameters in play:

Let S be the number of concurrent sessions.
Let T be the average number of rows that a session wants to write, in t-sized multi-insert batches.
Let A be the number of attributes.

Number of times a session multi-inserts into the tid tree without buffered allocation = T/t
Number of times a session multi-inserts into the tid tree with buffered allocation    = T/(Nt)
Total number of multi-inserts into the tid tree: Mt = ST/(Nt)

Also, the total number of adverse insert cases where we could get bloat is proportional to Mt, so:

bloat ∝ Mt
run-time of a parallel data load ∝ Mt * A

So the guidance would be to increase N as S or T increases (t will be relatively constant for a given table - it is constrained by the size of a row and the copy buffer), and also when the table is significantly wide. We can see that it is difficult to provide a default for N; it really should be a GUC. Also, SINGLE_INSERT_TID_RESERVATION_THRESHOLD and SINGLE_INSERT_TID_RESERVATION_SIZE should be turned into GUCs. In our implementation, we treat MULTI_INSERT_TID_RESERVATION_FACTOR = N. We leave the GUC implementation for later.

Cost of killing the extra unused tids not consumed by multi-inserts:

The maximum number of tids that can be wasted (W) is capped at (tN - 1) * S. This is the worst case, where the last tid allocation request of every session used only 1 tid out of the tN tids it received and buffered. So the average case is roughly (tN/2) * S wasted tids. The number of times the tid tree has to be accessed to delete these (tN/2) * S tids is S (one per session). So, taking tid wastage into account, the average number of accesses to the tid tree = Mt + S = ST/(Nt) + S. This additional cost of S accesses is small, and thus the cost of tid killing is not really significant.

Regards,
Soumyadeep & David