Hi,

On 2019-02-13 18:40:05 +1300, Thomas Munro wrote:
> Thanks!  And sorry for not replying sooner -- I got distracted by
> FOSDEM (and the associated 20 thousand miles of travel).  On that trip
> I had a chance to discuss this patch with Andres Freund in person, and
> he opined that it might be better for the fsync request queue to work
> in terms of pathnames.  Instead of the approach in this patch, where a
> backend sends an fsync request for { relfilenode, segno } inside
> mdwrite(), and then the checkpointer processes the request by calling
> smgrdimmedsyncrel(), he speculated that it'd be better to have
> mdwrite() send an fsync request for a pathname, and then the
> checkpointer would just open that file by name and fsync() it.  That
> is, the checkpointer wouldn't call back into smgr.
>
> One of the advantages of that approach is that there are probably
> other files that need to be fsync'd for each checkpoint that could
> benefit from being offloaded to the checkpointer.  Another is that you
> break the strange cycle mentioned above.
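
To make the pathname-based idea concrete, here is a minimal sketch, not
the patch and not PostgreSQL code. The names request_fsync() and
process_fsync_requests(), the fixed-size array, and the error handling
are all assumptions made for illustration. The only point it shows is
that the processing side needs nothing but a path: it opens the file by
name and calls fsync(), without calling back into smgr.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_PENDING  64     /* hypothetical queue capacity */
#define MAX_PATH_LEN 256    /* hypothetical path length limit */

/* Hypothetical pending-request table, keyed purely by pathname. */
static char pending[MAX_PENDING][MAX_PATH_LEN];
static int  npending = 0;

/*
 * Writer side (what something like mdwrite() would call).
 * Returns 0 if the path is queued, -1 if it does not fit.
 */
int
request_fsync(const char *path)
{
    size_t len = strlen(path);

    if (len >= MAX_PATH_LEN)
        return -1;
    for (int i = 0; i < npending; i++)
        if (strcmp(pending[i], path) == 0)
            return 0;           /* already queued, nothing to do */
    if (npending >= MAX_PENDING)
        return -1;
    memcpy(pending[npending++], path, len + 1);
    return 0;
}

/*
 * Checkpointer side: open each file by name and fsync() it.
 * Returns the number of files synced, or -1 on the first failure
 * (in which case the queue is left untouched). Real code would need
 * a policy for files that were unlinked in the meantime.
 */
int
process_fsync_requests(void)
{
    int synced = 0;

    for (int i = 0; i < npending; i++)
    {
        int fd = open(pending[i], O_RDWR);

        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        close(fd);
        synced++;
    }
    npending = 0;
    return synced;
}
```

Because the key is an opaque path, the same queue could in principle
carry requests for files that are not relation segments at all.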

The other issue is that I think your approach moves the segmentation
logic basically out of md into smgr. I think that's wrong. We shouldn't
presume that every type of storage is going to have segmentation that's
representable in a uniform way imo.

> Another consideration if we do that is that the existing scheme has a
> kind of hierarchy that allows fsync requests to be cancelled in bulk
> when you drop relations and databases.  That is, the checkpointer
> knows about the internal hierarchy of tablespace, db, rel, seg.  If we
> get rid of that and have just paths, it seems like a bad idea to teach
> the checkpointer about the internal structure of the paths (even
> though we know they contain the same elements encoded somehow).  You'd
> have to send an explicit cancel for every key; that is, if you're
> dropping a relation, you need to generate a cancel message for every
> segment, and if you're dropping a database, you need to generate a
> cancel message for every segment of every relation.

I can't see that being a problem - compared to the overhead of dropping
a relation, that doesn't seem to be a meaningfully large cost?

Greetings,

Andres Freund
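
For reference, a rough sketch of what "one cancel per segment" could
look like with path keys. This is an illustration under assumptions, not
code from the patch: cancel_fsync() and cancel_relation_fsyncs() are
hypothetical names, and the pending set is a plain array. The segment
naming (bare path for segment 0, "path.N" for later segments) follows
the convention md.c uses for relation files.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PENDING  64     /* hypothetical queue capacity */
#define MAX_PATH_LEN 256    /* hypothetical path length limit */

/* Hypothetical pending-request table, keyed purely by pathname. */
static char pending[MAX_PENDING][MAX_PATH_LEN];
static int  npending = 0;

/* Remove one path from the pending set. Returns 1 if it was present. */
static int
cancel_fsync(const char *path)
{
    for (int i = 0; i < npending; i++)
    {
        if (strcmp(pending[i], path) == 0)
        {
            /* Fill the hole with the last entry; order is irrelevant. */
            npending--;
            if (i != npending)
                memcpy(pending[i], pending[npending], MAX_PATH_LEN);
            return 1;
        }
    }
    return 0;
}

/*
 * Dropping a relation: issue one cancel per segment. The caller, not
 * the checkpointer, knows how segment paths are built.
 * Returns the number of requests that were actually cancelled.
 */
int
cancel_relation_fsyncs(const char *relpath, int nsegments)
{
    char segpath[MAX_PATH_LEN];
    int  cancelled = 0;

    for (int segno = 0; segno < nsegments; segno++)
    {
        if (segno == 0)
            snprintf(segpath, sizeof(segpath), "%s", relpath);
        else
            snprintf(segpath, sizeof(segpath), "%s.%d", relpath, segno);
        cancelled += cancel_fsync(segpath);
    }
    return cancelled;
}
```

The cost is linear in the number of segments, which is the overhead
being weighed against the rest of the work of dropping a relation.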