On Wed, Sep 11, 2019 at 5:38 AM Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
> I reviewed them and everything looks good except the idea of not
> splitting dead posting tuples.
> According to comments to scan->ignore_killed_tuples in genam.c:107,
> it may lead to incorrect tuple order on a replica.
> I don't sure, if it leads to any real problem, though, or it will be
> resolved by subsequent visibility checks.
Fair enough, but I didn't do it that way because it's compelling on its
own -- it isn't. I did it because it seemed like the best way to handle
posting list splits in a version of the patch where LP_DEAD bits can be
set on posting list tuples. I think that we have 3 high level options
here:

1. We don't support kill_prior_tuple/LP_DEAD bit setting with posting
lists at all. This is clearly the easiest approach.

2. We do what I did in v11 of the patch -- we make it so that
_bt_insertonpg() and _bt_split() never have to deal with LP_DEAD posting
lists that they must split in passing.

3. We add additional code to _bt_insertonpg() and _bt_split() to deal
with the rare case where they must split an LP_DEAD posting list,
probably by unsetting the bit or something like that. Obviously it would
be wrong to leave the LP_DEAD bit set for the newly inserted heap
tuple's TID that must go in a posting list that had its LP_DEAD bit set
-- that would make it dead to index scans even after its xact
successfully committed.

I think that you already agree that we want to have the kill_prior_tuple
optimization with posting lists, so #1 isn't really an option. That just
leaves #2 and #3. Since posting list splits are already assumed to be
quite rare, it seemed far simpler to take the conservative approach of
forcing clean-up that removes LP_DEAD bits, so that _bt_insertonpg() and
_bt_split() don't have to think about it. Obviously I think it's
important that we make as few changes as possible to _bt_insertonpg()
and _bt_split() in general.

I don't understand what you mean about visibility checks. There is
nothing truly special about the way in which _bt_findinsertloc() will
sometimes have to kill LP_DEAD items so that _bt_insertonpg() and
_bt_split() don't have to think about LP_DEAD posting lists. As far as
recovery is concerned, it is just another XLOG_BTREE_DELETE record, like
any other.
Note that there is a second call to _bt_binsrch_insert() within
_bt_findinsertloc() when it has to generate a new XLOG_BTREE_DELETE
record (by calling _bt_dedup_one_page(), which calls
_bt_delitems_delete() in a way that isn't dependent on the
BTP_HAS_GARBAGE status bit being set).

> Anyway, it's worth to add more comments in
> _bt_killitems() explaining why it's safe.

There is no question that the little snippet of code I added to
_bt_killitems() in v11 is still too complicated. We also have to
consider cases where the array overflows because the scan direction was
changed (see the kill_prior_tuple comment block in btgettuple()). Yeah,
it's messy.

> Attached is v12, which contains WAL optimizations for posting split
> and page deduplication.

Cool.

> * xl_btree_split record doesn't contain posting tuple anymore, instead
> it keeps 'in_posting offset' and repeats the logic of _bt_insertonpg()
> as you proposed upthread.

That looks good.

> * I introduced new xlog record XLOG_BTREE_DEDUP_PAGE, which contains
> info about groups of tuples deduplicated into posting tuples. In
> principle, it is possible to fit it into some existing record, but I
> preferred to keep things clear.

I definitely think that inventing a new WAL record was the right thing
to do.

> I haven't measured how these changes affect WAL size yet.
> Do you have any suggestions on how to automate testing of new WAL
> records? Is there any suitable place in regression tests?

I don't know about the regression tests (I doubt that there is a natural
place for such a test), but I came up with a rough test case. I more or
less copied the approach that you took with the index build WAL
reduction patches, though I also figured out a way of subtracting heapam
WAL overhead to get a real figure. I attach the test case -- note that
you'll need to use the "land" database with this. (This test case might
need to be improved, but it's a good start.)
> * I also noticed that _bt_dedup_one_page() can be optimized to return
> early when none tuples were deduplicated. I wonder if we can introduce
> inner statistic to tune deduplication? That is returning to the idea
> of BT_COMPRESS_THRESHOLD, which can help to avoid extra work for pages
> that have very few duplicates or pages that are already full of
> posting lists.

I think that the BT_COMPRESS_THRESHOLD idea is closely related to making
_bt_dedup_one_page() behave incrementally.

On my machine, v12 of the patch actually uses slightly more WAL than v11
did with the nbtree_wal_test.sql test case -- it's 6510 MB of nbtree WAL
in v12 vs. 6502 MB in v11 (note that v11 benefits from WAL compression,
so if I turned that off v12 would probably win by a small amount). Both
numbers are wildly excessive, though. The master branch figure is only
2011 MB, which is only about 1.8x the size of the index on the master
branch. And this is for a test case that makes the index 6.5x smaller,
so the gap between total index size and total WAL volume is huge here --
the volume of WAL is nearly 40x greater than the index size!

You are right to wonder what the result would be if we put
BT_COMPRESS_THRESHOLD back in. It would probably significantly reduce
the volume of WAL, because _bt_dedup_one_page() would no longer
"thrash". However, I strongly suspect that that wouldn't be good enough
at reducing the WAL volume down to something acceptable. That will
require an approach to WAL-logging that is much more logical than
physical. The nbtree_wal_test.sql test case involves a case where page
splits mostly don't WAL-log things that were previously WAL-logged by
simple inserts, because nbtsplitloc.c has us split in a right-heavy
fashion when there are lots of duplicates.
In other words, the _bt_split() optimization to WAL volume naturally
works very well with the test case, or really any case with lots of
duplicates, so the "write amplification" to the total volume of WAL is
relatively small on the master branch.

I think that the new WAL record has to be created once per posting list
that is generated, not once per page that is deduplicated -- that's the
only way that I can see that avoids a huge increase in total WAL volume.
Even if we assume that I am wrong about there being value in making
deduplication incremental, it is still necessary to make the WAL-logging
behave incrementally. Otherwise you end up needlessly rewriting things
that didn't actually change way too often. That's definitely not okay.
Why worry about bringing 40x down to 20x, or even 10x? It needs to be
comparable to the master branch.

> To be honest, I don't believe that incremental deduplication can
> really improve something, because no matter how many items were
> compressed we still rewrite all items from the original page to the
> new one, so, why not do our best. What do we save by this incremental
> approach?

The point of being incremental is not to save work in cases where a page
split is inevitable anyway. Rather, the idea is that we can be even more
lazy, and avoid doing work that will never be needed -- maybe delaying
page splits actually means preventing them entirely. Or, we can spread
out the work over time, so that the amount of WAL per checkpoint is
smoother than what we would get with a batch approach. My mental model
of page splits is that there are sometimes many of them on the same page
again and again in a very short time period, but more often the chances
of any individual page being split are low. Even the rightmost page of a
serial PK index isn't truly an exception, because a new rightmost page
isn't "the same page" as the original rightmost page -- it is its new
right sibling.
Since we're going to have to optimize the WAL logging anyway, it will be
relatively easy to experiment with incremental deduplication within
_bt_dedup_one_page(). The WAL logging is the hard part, so let's focus
on that rather than worrying too much about whether or not incrementally
doing all the work (not just the WAL logging) makes sense. It's still
too early to be sure about whether or not that's a good idea.

--
Peter Geoghegan