On Thu, Aug 22, 2019 at 12:54 AM Andres Freund <and...@anarazel.de> wrote:
> But why? It makes a *lot* more sense to have it in the beginning. I
> don't think bulk-fetch really requires it to be in the end - we can
> still process records forward on a page-by-page basis.
There are two separate needs here: to be able to go forward, and to be able to go backward. We have the length at the end of each record not because we're stupid, but so that we can back up. If we have another way of backing up, then the thing to do is not to move that length to the beginning of the record but to remove it entirely as unnecessary wastage. We can also think about how to improve forward traversal. Considering each problem separately:

For forward traversal, we could simplify things somewhat by having only 3 decoding stages instead of N decoding stages. We really only need (1) a stage for accumulating bytes until we have uur_info, then (2) a stage for accumulating bytes until we know the payload and tuple lengths, and then (3) a stage for accumulating bytes until we have the whole record. We have a lot more stages than that right now, but I don't think we really need them for anything. Originally we had them so that we could do incremental decoding to find the transaction header in the record, but now that the transaction header is at a fixed offset, I think the multiplicity of stages is just baggage.

We could simplify things further by deciding that the first two bytes of the record contain the record size. That would increase the size of the record by 2 bytes, but we could (mostly) claw those bytes back by not storing the sizes of both uur_payload and uur_tuple. The size of the other one would be computed by subtraction: take the total record size, subtract the size of whichever of those two things we store, subtract the mandatory and optional headers that are present, and the rest must be the other value. That would still add 2 bytes for records that contain neither a payload nor a tuple, but that would probably be OK given that (a) a lot of records wouldn't be affected, (b) the code would get simpler, and (c) something like this seems necessary anyway given that we want to make the record format more generic.
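To make the subtraction idea concrete, here is a minimal sketch. The struct and names (DemoUndoRecord, total_len, header_len, etc.) are invented for illustration and are not the actual uur_* layout; the point is only that one of the two variable-length parts never needs its length stored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical layout: a 2-byte total length leads the record, and only
 * the payload length is stored explicitly; the tuple length is derived.
 */
typedef struct
{
    uint16_t total_len;   /* whole record, including headers */
    uint16_t payload_len; /* stored explicitly */
    /* mandatory/optional headers, payload bytes, tuple bytes follow */
} DemoUndoRecord;

/*
 * Derive the tuple length by subtraction: total size, minus the headers
 * that are present, minus the explicitly stored payload length.
 */
static uint16_t
demo_tuple_len(const DemoUndoRecord *rec, uint16_t header_len)
{
    return (uint16_t) (rec->total_len - header_len - rec->payload_len);
}
```

A record with neither payload nor tuple still pays the 2-byte total_len, which is the cost mentioned above.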
With this approach, instead of 3 stages we only need 2: (1) accumulating bytes until we have the 2-byte length word, and (2) accumulating bytes until we have the whole record.

For backward traversal, as I see it, there are basically two options. One is to do what we're doing right now, and store the record length at the end of the record. (That might mean that a record both begins and ends with its own length, which is not a crazy design.) The other is to do what I think you are proposing here: locate the beginning of the first record on the page, presumably based on some information stored in the page header, and then work forward through the page to figure out where all the records start. Then process them in reverse order. That saves 2 bytes per record. It's a little more expensive in terms of CPU cycles, especially if you only need some of the records on the page but not all of them, but that's probably not too bad.

I think I'm basically agreeing with what you are proposing, but I think it's important to spell out the underlying concerns, because otherwise I'm afraid we might think we have a meeting of the minds when we don't really.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
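[For concreteness, the forward-scan-then-reverse approach could be sketched as below. The page layout and all names here (DemoPage, first_record_off, collect_record_offsets) are invented for illustration, assuming each record leads with its 2-byte total length as proposed above.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_RECORDS_PER_PAGE 256

/*
 * Hypothetical page: the page header tells us where the first complete
 * record begins, and every record starts with its 2-byte total length.
 */
typedef struct
{
    uint16_t first_record_off; /* from the page header */
    uint16_t page_len;         /* bytes of usable data on the page */
    const uint8_t *data;
} DemoPage;

/*
 * Walk forward through the page collecting the start offset of each
 * record.  The caller can then visit offs[n-1], offs[n-2], ... to
 * process the records in reverse order, with no length stored at the
 * end of any record.
 */
static int
collect_record_offsets(const DemoPage *pg, uint16_t offs[MAX_RECORDS_PER_PAGE])
{
    int n = 0;
    uint16_t off = pg->first_record_off;

    while (off + 2 <= pg->page_len && n < MAX_RECORDS_PER_PAGE)
    {
        uint16_t len;

        memcpy(&len, pg->data + off, sizeof(len));
        if (len == 0 || off + len > pg->page_len)
            break;              /* no further complete record */
        offs[n++] = off;
        off += len;
    }
    return n;
}
```

This also shows the CPU-cost trade-off mentioned above: even if only the last record is wanted, the scan still touches every record on the page.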