Re: pg_basebackup misses to report checksum error

Ashwin Agrawal Thu, 07 May 2020 10:16:32 -0700

On Wed, May 6, 2020 at 3:02 PM Robert Haas <robertmh...@gmail.com> wrote:

> On Wed, May 6, 2020 at 5:48 PM Ashwin Agrawal <aagra...@pivotal.io> wrote:
> > If pg_basebackup is not able to read BLCKSZ content from file, then it
> > just emits a warning "could not verify checksum in file "____" block
> > X: read buffer size X and page size 8192 differ" currently but misses
> > to error with "checksum error occurred". Only if it can read 8192 and
> > checksum mismatch happens will it error in the end.
>
> I don't think it's a good idea to conflate "hey, we can't checksum
> this because the size is strange" with "hey, the checksum didn't
> match". Suppose the a file has 1000 full blocks and a partial block.
> All 1000 blocks have good checksums. With your change, ISTM that we'd
> first emit a warning saying that the checksum couldn't be verified,
> and then we'd emit a second warning saying that there was 1 checksum
> verification failure, which would also be reported to the stats
> system. I don't think that's what we want.

I feel the intent of reporting "total checksum verification failure" is to
report corruption. Which way is the secondary piece of the puzzle. Not
being able to read checksum itself to verify is also corruption and is
checksum verification failure I think. WARNINGs will provide fine grained
clarity on what type of checksum verification failure it is, so I am not
sure we really need fine grained clarity in "total numbers" to
differentiate these two types.

Not reporting anything to the stats system and at end reporting there is
checksum failure would be more weird, right? Or will say
ERRCODE_DATA_CORRUPTED with some other message and not checksum
verification failure.

There might be an argument
> for making this code trigger...
>
>         ereport(ERROR,
>                 (errcode(ERRCODE_DATA_CORRUPTED),
>                  errmsg("checksum verification failure during base
> backup")));
>
> ...but I wouldn't for that reason inflate the number of blocks that
> are reported as having failures.
>

When checksum verification is turned on and the issue is detected, I
strongly feel ERROR must be triggered as silently reporting success doesn't
seem right.
I can introduce one more variable just to capture and track files with such
cases. But will we report them separately to stats? How? Also, do we want
to have separate WARNING for the total number of files in this category?
Those all seem slight complications but if wind is blowing in that
direction, I am ready to fly that way.

Re: pg_basebackup misses to report checksum error

Reply via email to