Re: Design of pg_stat_subscription_workers vs pgstats

Masahiko Sawada Wed, 02 Feb 2022 20:34:05 -0800

On Wed, Feb 2, 2022 at 4:36 PM David G. Johnston
<[email protected]> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <[email protected]> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <[email protected]> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <[email protected]> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <[email protected]> 
>> >> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important.  Just make sure it is 
>> > only non-null when the worker state is "syncing failed" (or whatever term 
>> > we use).
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
> I look at the catalog and am informed that the worker has stopped because of 
> an error.  I'd rather simply read the error message right then instead of 
> having to go look at the log file.  And if I am going to take an action in 
> order to overcome the error I would have to know what that error is; so the 
> error message is not something I can ignore.  The error is an attribute of 
> system state, and the catalog stores the current state of the (workers) 
> system.
>
> I already explained that the concept of err_cnt is not useful.  The fact that 
> you include it here makes me think you are still thinking that this all 
> somehow is meant to keep track of history.  It is not.  The workers are state 
> machines and "error" is one of the states - with relevant attributes to 
> display to the user, and system, while in that state.  The state machine 
> reporting does not care about historical states nor does it report on them.  
> There is some uncertainty if we continue with the automatic re-launch; which, 
> now that I write this, I can see where what you call err_cnt is effectively a 
> count of how many times the worker re-launched without the underlying problem 
> being resolved and thus encountered the same error.  If we persist with the 
> re-launch behavior then maybe err_cnt should be left in place - with the 
> description for it basically being the ah-ha! comment I just made. In a world 
> where we do not typically re-launch and simply re-try without being informed 
> there is a change - such a count remains of minimal value.
>
> I don't really understand the confusion here though - this error data already 
> exists in the pg_stat_subscription_workers stat collector view - the fact 
> that I want to keep it around (just changing the reset behavior) - doesn't 
> seem like it should be controversial.  I, thinking as a user, really don't 
> care about all of these implementation details.  Whether it is a pg_stat_* 
> view (collector or shmem IPC) or a pg_* catalog is immaterial to me.  The 
> behavior I observe is what matters.  As a developer I don't want to use the 
> statistics collector because these are not statistics and the collector is 
> unreliable.  I don't know enough about the relevant differences between 
> shared memory IPC and catalog tables to decide between them.  But catalog 
> tables seem like a lower bar to meet and seem like they can implement the 
> user-facing requirements as I envision them.


I agree that important information such as the error-XID, which
ALTER SUBSCRIPTION SKIP depends on, needs to be stored reliably, and
that a system catalog is a reasonable place for it. But it's still
unclear to me why all of the error information currently shown in the
pg_stat_subscription_workers view (the error-XID, error message,
relation OID, action, etc.) needs to be stored in the catalog. The
information other than the error-XID doesn't need the same level of
reliability as the error-XID. I think we can keep error-XID/LSN in the
pg_subscription catalog and the other error information in the
pg_stat_subscription_workers view. After checking the current status
of logical replication via error-XID/LSN, users can look at
pg_stat_subscription_workers for details.
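
To make the proposed split concrete, here is a rough Python model of
the behavior (not PostgreSQL code, just a sketch). It assumes the
catalog keeps only the error-XID plus a skip request, and that the
worker clears both after applying or skipping the first non-empty
transaction. All names (SubscriptionRow, alter_subscription_skip,
apply_transaction) are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubscriptionRow:
    """Hypothetical stand-in for the durable per-subscription catalog row."""
    error_xid: Optional[int] = None  # written by the worker when apply fails
    skip_xid: Optional[int] = None   # written by ALTER SUBSCRIPTION ... SKIP


@dataclass
class Transaction:
    xid: int
    changes: List[str] = field(default_factory=list)
    fails: bool = False  # simulates e.g. a unique-key conflict on apply


class ApplyError(Exception):
    pass


def alter_subscription_skip(row: SubscriptionRow) -> None:
    """Model of the user command: ask the worker to skip the failed xact."""
    row.skip_xid = row.error_xid


def apply_transaction(row: SubscriptionRow, txn: Transaction,
                      applied: List[int]) -> None:
    """Model of the apply worker handling one remote transaction."""
    if not txn.changes:
        return  # empty transactions do not clear the stored error state
    if row.skip_xid is not None and row.skip_xid == txn.xid:
        # Skip the transaction in question, then clear the catalog state.
        row.error_xid = None
        row.skip_xid = None
        return
    if txn.fails:
        # Only the XID is persisted; message, relation OID, etc. would be
        # reported through the stats view instead (assumption of this sketch).
        row.error_xid = txn.xid
        raise ApplyError(f"could not apply transaction {txn.xid}")
    applied.append(txn.xid)
    row.error_xid = None
    row.skip_xid = None
```

In this model a failed transaction leaves its XID in the row, the skip
command copies it into the skip request, and the next attempt on that
transaction is skipped and clears both fields.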

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
