Re: post-freeze damage control

David Steele Sun, 14 Apr 2024 20:18:14 -0700

On 4/13/24 21:02, Tomas Vondra wrote:

On 4/13/24 01:23, David Steele wrote:

Even for the summarizer, though, I do worry about the complexity of
maintaining it over time. It seems like it would be very easy to
introduce a bug and have it go unnoticed until it causes problems in the
field. A lot of testing was done outside of the test suite for this
feature and I'm not sure if we can rely on that focus with every release.


I'm not sure there's a simpler way to implement this. I haven't really
worked on that part (not until the CoW changes a couple weeks ago), but
I think Robert was very conscious of the complexity.

I don't expect this code to change very often, but I agree it's
not great to rely on testing outside the regular regression test suite.
But I'm not sure how much more we can do, really - for example my
testing was very much "randomized stress testing" with a lot of data and
long runs, looking for unexpected stuff. That's not something we could
do in the usual regression tests, I think.

But if you have suggestions how to extend the testing ...

Doing stress testing in the regular test suite is obviously a problem
due to runtime, but it would still be great to see tests for issues that
were found during external stress testing.

For example, the issue you and Jakub found was fixed in 55a5ee30 but
there is no accompanying test and no existing test was broken by the
change.

For me an incremental approach would be to introduce the WAL summarizer
first. There are already plenty of projects that do page-level
incremental (WAL-G, pg_probackup, pgBackRest) and could help shake out
the bugs. Then introduce the client tools later when they are more
robust. Or, release the client tools now but mark them as experimental
or something so people know that changes are coming and they don't get
blindsided by that in the next release. Or, at the very least, make the
caveats very clear so users can make an informed choice.


I don't think introducing just the summarizer, without any client tools,
would really work. How would we even test the summarizer, for example?
If the only users of that code are external tools, we'd do only some
very rudimentary tests. But the more complex tests would happen in the
external tools, which means it wouldn't be covered by cfbot, buildfarm
and so on. Considering the external tools are likely a bit behind, it's
not clear to me how I would do the stress testing, for example.

IMHO we should aim to have in-tree clients when possible, even if some
external tools can do more advanced stuff etc.

This, however, reminds me of my question whether the summarizer provides the right
interface(s) for the external tools. One option is to do pg_basebackup
and then parse the incremental files, but is that suitable for the
external tools, or should there be a more convenient way?

Running a pg_basebackup to get the incremental changes would not be at
all satisfactory. Luckily there are the
pg_wal_summary_contents()/pg_available_wal_summaries() functions, which
seem to provide the required information. I have not played with them
much but I think they will do the trick.

They are pretty awkward to work with since they are essentially
time-series data but what you'd really want, I think, is the ability to
get page changes for a particular relfileid/segment.
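
To make the relfileid/segment view concrete, here is a minimal sketch of
how an external tool might fold the summary rows into per-segment block
lists. It is illustrative only and rests on several assumptions: the
query joins the two functions with LATERAL, the helper name
changed_blocks_by_segment is hypothetical, segments are assumed to be the
default 131072 blocks (1GB at 8kB per block), and rows flagged
is_limit_block are simply skipped, which a real tool could not do since
they describe truncation or creation of the relation fork.

```python
from collections import defaultdict

# Assumption: default build with 8kB blocks and 1GB relation segments.
BLOCKS_PER_SEGMENT = 131072

# Assumed query shape; the rows it returns would be fed to the helper below.
SUMMARY_QUERY = """
SELECT c.reltablespace, c.reldatabase, c.relfilenode,
       c.relforknumber, c.relblocknumber, c.is_limit_block
FROM pg_available_wal_summaries() s,
     LATERAL pg_wal_summary_contents(s.tli, s.start_lsn, s.end_lsn) c
"""


def changed_blocks_by_segment(rows, blocks_per_segment=BLOCKS_PER_SEGMENT):
    """Group modified blocks by (tablespace, database, relfilenode, fork, segment).

    Hypothetical helper: `rows` is an iterable of 6-tuples in the column
    order of SUMMARY_QUERY. Returns a dict mapping each key to a sorted
    list of block offsets within that segment.
    """
    changed = defaultdict(set)
    for tblspc, db, relfilenode, fork, block, is_limit in rows:
        if is_limit:
            # Simplification: limit blocks mark truncation/creation and
            # need separate handling in a real backup tool.
            continue
        segment = block // blocks_per_segment
        key = (tblspc, db, relfilenode, fork, segment)
        # A set collapses the same block appearing in several summaries.
        changed[key].add(block % blocks_per_segment)
    return {key: sorted(blocks) for key, blocks in changed.items()}
```

With something like this, the time-series nature of the summaries goes
away: the same block reported by several summary files collapses to one
entry, and a tool can look up a single relfilenode/segment directly.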


Regards,
-David
