002_types.pl hanging on particular environment

Amit Kapila Tue, 19 Sep 2017 07:31:28 -0700

On Tue, Sep 19, 2017 at 3:34 PM, Petr Jelinek
<[email protected]> wrote:
> n 18/09/17 18:42, Tom Lane wrote:
>> Amit Kapila <[email protected]> writes:
>>> On Mon, Sep 18, 2017 at 7:46 PM, Tom Lane <[email protected]> wrote:
>>>> The subscriber log includes
>>>> 2017-09-18 08:43:08.240 UTC [15672] WARNING:  out of background worker 
>>>> slots
>>>> Maybe that's harmless, but I'm suspicious that it's a smoking gun.
>>
>>> I have noticed while fixing the issue you are talking that this path
>>> is also susceptible to such problems.  In
>>> WaitForReplicationWorkerAttach(), it relies on
>>> GetBackgroundWorkerPid() to know the status of the worker which won't
>>> give the correct status in case of fork failure.  The patch I have
>>> posted has a fix for the issue,
>>
>> Your GetBackgroundWorkerPid fix looks good as far as it goes, but
>> I feel that WaitForReplicationWorkerAttach's problems go way deeper
>> than that --- in fact, that function should not exist at all IMO.
>>
>> The way that launcher.c is set up, it's relying on either the calling
>> process or the called process to free the logicalrep slot when done.
>> The called process might never exist at all, so the second half of
>> that is obviously unreliable; but WaitForReplicationWorkerAttach
>> can hardly be relied on either, seeing it's got a big fat exit-on-
>> SIGINT in it.  I thought about putting a PG_TRY, or more likely
>> PG_ENSURE_ERROR_CLEANUP, into it, but that doesn't fix the basic
>> problem: you can't assume that the worker isn't ever going to start,
>> just because you got a signal that means you shouldn't wait anymore.
>>
>> Now, this rickety business might be fine if it were only about the
>> states of the caller and called processes, but we've got long-lived
>> shared data involved (ie the worker slots); failing to clean it up
>> is not an acceptable outcome.
>>
>
> We'll clean it up eventually if somebody requests creation of new
> logical replication worker and that slot is needed. See the "garbage
> collection" part in logicalrep_worker_launch(). I know it's not ideal
> solution, but it's the best one I could think of given the bellow.
>
>> So, frankly, I think we would be best off losing the "logical rep
>> worker slot" business altogether, and making do with just bgworker
>> slots.  The alternative is getting the postmaster involved in cleaning
>> up lrep slots as well as bgworker slots, and I'm going to resist
>> any such idea strongly.  That way madness lies, or at least an
>> unreliable postmaster --- if we need lrep slots, what other forty-seven
>> data structures are we going to need the postmaster to clean up
>> in the future?
>>
>> I haven't studied the code enough to know why it grew lrep worker
>> slots in the first place.  Maybe someone could explain?
>>
>
> I am not quite sure I understand this question, we need to store
> additional info about workers in shared memory so we need slots for that.
>
> If you are asking why they are not identified by the
> BackgroundWorkerHandle, then it's because it's private struct and can't
> be shared with other processes so there is no way to link the logical
> worker info with bgworker directly.
>


Not sure, but can't we identify the actual worker slot if we just have
pid of background worker?  IIUC, you need access to
BackgroundWorkerHandle by other processes, so that you can stop them
like in case of drop subscription command.  If so, then, might be
storing pid can serve the purpose.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] src/test/subscription/t/002_types.pl hanging on particular environment

Reply via email to