On Wed, Feb 2, 2022 at 4:36 PM David G. Johnston <david.g.johns...@gmail.com> wrote:
>
> On Tue, Feb 1, 2022 at 11:55 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> On Wed, Feb 2, 2022 at 9:41 AM David G. Johnston
>> <david.g.johns...@gmail.com> wrote:
>> >
>> > On Tue, Feb 1, 2022 at 8:07 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>> >>
>> >> On Tue, Feb 1, 2022 at 11:47 AM Masahiko Sawada <sawada.m...@gmail.com>
>> >> wrote:
>> >>
>> >> >
>> >> > I see that it's better to use a better IPC for ALTER SUBSCRIPTION SKIP
>> >> > feature to pass error-XID or error-LSN information to the worker
>> >> > whereas I'm also not sure of the advantages in storing all error
>> >> > information in a system catalog. Since what we need to do for this
>> >> > purpose is only error-XID/LSN, we can store only error-XID/LSN in the
>> >> > catalog? That is, the worker stores error-XID/LSN in the catalog on an
>> >> > error, and ALTER SUBSCRIPTION SKIP command enables the worker to skip
>> >> > the transaction in question. The worker clears the error-XID/LSN after
>> >> > successfully applying or skipping the first non-empty transaction.
>> >> >
>> >>
>> >> Where do you propose to store this information?
>> >
>> >
>> > pg_subscription_worker
>> >
>> > The error message and context is very important. Just make sure it is
>> > only non-null when the worker state is "syncing failed" (or whatever term
>> > we use).
>> >
>> >
>>
>> Sure, but is this the reason you want to store all the error info in
>> the system catalog? I agree that providing more error info could be
>> useful and also possibly the previously failed (apply) xacts info as
>> well but I am not able to see why you want to have that sort of info
>> in the catalog. I could see storing info like err_lsn/err_xid that can
>> allow to proceed to apply worker automatically or to slow down the
>> launch of errored apply worker but not all sort of other error info
>> (like err_cnt, err_code, err_message, err_time, etc.). I want to know
>> why you are insisting to make all the error info persistent via the
>> system catalog?
>
>
> I look at the catalog and am informed that the worker has stopped because of
> an error. I'd rather simply read the error message right then instead of
> having to go look at the log file. And if I am going to take an action in
> order to overcome the error I would have to know what that error is; so the
> error message is not something I can ignore. The error is an attribute of
> system state, and the catalog stores the current state of the (workers)
> system.
>
> I already explained that the concept of err_cnt is not useful. The fact that
> you include it here makes me think you are still thinking that this all
> somehow is meant to keep track of history. It is not. The workers are state
> machines and "error" is one of the states - with relevant attributes to
> display to the user, and system, while in that state. The state machine
> reporting does not care about historical states nor does it report on them.
> There is some uncertainty if we continue with the automatic re-launch; which,
> now that I write this, I can see where what you call err_cnt is effectively a
> count of how many times the worker re-launched without the underlying problem
> being resolved and thus encountered the same error. If we persist with the
> re-launch behavior then maybe err_cnt should be left in place - with the
> description for it basically being the ah-ha! comment I just made. In a world
> where we do not typically re-launch and simply re-try without being informed
> there is a change - such a count remains of minimal value.
>
> I don't really understand the confusion here though - this error data already
> exists in the pg_stat_subscription_workers stat collector view - the fact
> that I want to keep it around (just changing the reset behavior) - doesn't
> seem like it should be controversial. I, thinking as a user, really don't
> care about all of these implementation details. Whether it is a pg_stat_*
> view (collector or shmem IPC) or a pg_* catalog is immaterial to me. The
> behavior I observe is what matters. As a developer I don't want to use the
> statistics collector because these are not statistics and the collector is
> unreliable. I don't know enough about the relevant differences between
> shared memory IPC and catalog tables to decide between them. But catalog
> tables seem like a lower bar to meet and seem like they can implement the
> user-facing requirements as I envision them.
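
To make the lifecycle quoted above concrete (store the error-XID on an error, let ALTER SUBSCRIPTION SKIP act on it, and clear it after the first non-empty transaction is applied or skipped), here is a rough sketch. It is a toy Python model, not PostgreSQL code: `SubscriptionState`, `ApplyWorker`, `alter_subscription_skip` and the `skip_xid` field are all hypothetical names made up for illustration, and the real command syntax and storage location are exactly what is still being discussed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubscriptionState:
    """Hypothetical stand-in for the durable, per-subscription catalog data."""
    error_xid: Optional[int] = None  # XID of the transaction that failed to apply
    skip_xid: Optional[int] = None   # XID the user asked the worker to skip (assumed field)


class ApplyWorker:
    """Toy apply worker that only tracks which XIDs it applied."""

    def __init__(self, state: SubscriptionState) -> None:
        self.state = state
        self.applied = []

    def apply_transaction(self, xid: int, changes: list, fails: bool = False) -> str:
        """Return 'empty', 'skipped', 'error' or 'applied'."""
        if not changes:
            # Empty transactions leave the recorded error information alone.
            return "empty"
        if self.state.skip_xid == xid:
            # The user asked to skip this transaction: drop it and clear state.
            self._clear()
            return "skipped"
        if fails:
            # On an error, remember which transaction failed.
            self.state.error_xid = xid
            return "error"
        self.applied.append(xid)
        # First non-empty transaction applied successfully: clear state.
        self._clear()
        return "applied"

    def _clear(self) -> None:
        self.state.error_xid = None
        self.state.skip_xid = None


def alter_subscription_skip(state: SubscriptionState) -> None:
    """Model of the user command: mark the recorded failing XID to be skipped."""
    if state.error_xid is None:
        raise ValueError("no failed transaction is recorded")
    state.skip_xid = state.error_xid
```

In this model only `error_xid` and `skip_xid` have to survive a restart; the error message, relation OID and so on are deliberately left out of the state.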

I see that important information such as the error-XID, which can be used for ALTER SUBSCRIPTION SKIP, needs to be stored in a reliable way, and using system catalogs is a reasonable way to do that. But it's still unclear to me why all the error information that is currently shown in the pg_stat_subscription_workers view, including the error-XID, the error message, relation OID, action, etc., needs to be stored in the catalog. The information other than the error-XID doesn't need to be as reliable as the error-XID. I think we can have the error-XID/LSN in the pg_subscription catalog and keep the other error information in the pg_stat_subscription_workers view. After the user checks the current status of logical replication by checking the error-XID/LSN, they can check pg_stat_subscription_workers for details.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/