On 4/18/22 13:04, Amit Kapila wrote: > On Thu, Apr 14, 2022 at 9:09 AM Amit Kapila <amit.kapil...@gmail.com> wrote: >> >> On Thu, Apr 14, 2022 at 8:32 AM Masahiko Sawada <sawada.m...@gmail.com> >> wrote: >>>> >>>> The other part of the puzzle is the below check in the code: >>>> /* >>>> * If we reached the sync worker limit per subscription, just exit >>>> * silently as we might get here because of an otherwise harmless race >>>> * condition. >>>> */ >>>> if (nsyncworkers >= max_sync_workers_per_subscription) >>>> >>>> It is not clear to me why this check is there, if this wouldn't be >>>> there, the user would have got either a WARNING to increase the >>>> max_logical_replication_workers or the apply worker would have been >>>> restarted. Do you have any idea about this? >>> >>> Yeah, I'm also puzzled with this check. It seems that this function >>> doesn't work well when the apply worker is not running and some >>> tablesync workers are running. I initially thought that the apply >>> worker calls to this function as many as tables that needs to be >>> synced, but it checks the max_sync_workers_per_subscription limit >>> before calling to logicalrep_worker_launch(). So I'm not really sure >>> we need this check. >>> >> >> I just hope that the original author Petr J. responds to this point. I >> have added him to this email. This will help us to find the best >> solution for this problem. >> > > I did some more investigation for this code. It is added by commit [1] > and the patch that led to this commit is first time posted on -hackers > in email [2]. Now, neither the commit message nor the patch (comments) > gives much idea as to why this part of code is added but I think there > is some hint in the email [2]. In particular, read the paragraph in > the email [2] that has the lines: ".... and limiting sync workers per > subscription theoretically wasn't either (although I don't think it > could happen in practice).". > > It seems that this check has been added to theoretically limit the > sync workers even though that can't happen because apply worker > ensures that before trying to launch the sync worker. Does this theory > make sense to me? If so, I think we can change the check as: "if > (OidIsValid(relid) && nsyncworkers >= > max_sync_workers_per_subscription)" in launcher.c. This will serve the > purpose of the original code and will solve the issue being discussed > here. I think we can even backpatch this. What do you think? >
Sounds reasonable to me. It's unfortunate there's no explanation of what exactly is the commit message fixing (and why), but I doubt anyone will remember the details after 5 years. +1 to backpatching, I consider this to be a bug regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company