On Wed, Mar 19, 2025 at 1:47 PM Rafael Thofehrn Castro <rafaelt...@gmail.com> wrote: > Sending a new version as rebase was required.
Reading this thread, it seems to me that there has been a good deal of discussion about things like the name of the feature and what the UI ought to be, but not much discussion of whether the feature is actually safe, and not much detailed review of the code. I'm willing to bet that everybody wants some version of this feature if we can convince ourselves that it's not going to do horrible things like cause server crashes, and that even people who don't get their first choice in terms of how the feature is named or how the GUCs work will still be pretty happy to have it overall. However, if it breaks stuff and there's no easy way to fix the breakage, that's going to be a problem. In broad strokes, the danger here is that doing stuff in the middle of query execution that we currently only do at the end of query execution will turn out to be problematic. The biggest problem, I think, is whether it's safe to do all of the things that EXPLAIN does while we're at some random point in query execution. It looks to me like the wrap/unwrap stuff is more-or-less consistent with previous discussions of how a feature of this kind should work, though I don't recall the previous discussion and I think the patch should contain some comments about why it works the way that it works. I do notice that WrapExecProcNodeWithExplain does not walk the ps->initPlan list, which I think is an oversight. Without having the prior discussion near to hand, I *think* that the reason we wanted to do this wrap/unwrap stuff is to make it so that the progressive EXPLAIN code could only execute when entering a new plan node rather than at any random point inside of that plan node, and that does seem a lot safer than the alternative. For example, I think it means that we won't start trying to do progressive EXPLAIN while already holding some random LWLock, which is very good. However, this still means we could do a bunch of catalog access (and thus potentially receive invalidations) at places where that can't happen today. I'm not sure that's a problem -- the same thing could happen at any place in the executor where we evaluate a user-supplied expression, and there are many such places -- but it's certainly worth a few senior people thinking real carefully about it and trying to imagine whether there's any scenario in which it might break something that works today. One way in which this proposal seems safer than previous proposals is that previous proposals have involved session A poking session B and trying to get session B to emit an EXPLAIN on the fly with no prior setup. That would be very useful, but I think it's more difficult and more risky than this proposal, where all the configuration happens in the session that is going to emit the EXPLAIN output. It knows from the beginning that it's going to maybe be doing this, and so it can do whatever setup it likes to accomplish that goal. So I think this design looks pretty good from that point of view. I don't understand how this would be safe against interrupts or errors. If a running query is interrupted, what would cause ProgressiveExplainCleanup() to be called? If the answer is "nothing," why isn't that unsafe? ExecProcNodeOriginal looks like it's basically the same thing as ExecProcNodeReal, except that it's for a different purpose. But you would be hard-pressed to guess which one is used for what purpose based on the field names or the comments. Maybe it's also worth worrying about whether this is a scalable design. Can we find a way to use the existing fields here instead of having to add a new one? The documentation for the progressive_explain = { off | explain | analyze } option seems like it should go into more detail about how the "explain" and "analyze" values are different. I'm not 100% sure I know the answer, and I'm not the least-experienced person who will ever read this documentation. WrapMultiExecProcNodesWithExplain seems like a poor choice of name. It invites confusion with MultiExecProcNode, to which it is unrelated. I just went to some trouble to start breaking up the monolith that is explain.c, so I'm not that happy about seeing this patch dump another 800+ lines of source code into that file. Probably we should have a new source file for some or this, or maybe even more than one. The changes to explain.h add three new data types. Two of them begin with an uppercase letter and one with a lowercase letter. That seems inconsistent. I also don't think that the declaration of char plan[] is per PostgreSQL coding style. I believe we always write char plan[FLEXIBLE_ARRAY_MEMBER]. Also, maybe it should be named something other than plan, since it's really a string-ified explain-y kind of thing, not literally a Plan. Also, can we please not have structure members with single letter names? "h" and "p" are going to be completely ungreppable, and I like grepping. It looks very strange to me that ProgressiveExplainPrint() seems to have a theoretical possibility of generating the output and then throwing it away if we end up with entry == NULL. I guess maybe that case is not supposed to happen because ProgressiveExplainInit() is supposed to create the entry, but then why isn't this an elog(ERROR) or something instead of a no-op? It seems like when we replace a longer entry with a shorter one, we forget that it was originally longer. Hence, if the length of a progressive EXPLAIN is alternately 2922 characters and 2923 characters, we'll reallocate on every other progressive EXPLAIN instead of at most once. I'll try to look at this some more tomorrow. It seems like a very interesting patch, but time is very short for this release and it doesn't look to me like we have all the kinks sorted out here just yet. -- Robert Haas EDB: http://www.enterprisedb.com