On 14/04/17 12:18, Masahiko Sawada wrote:
> On Fri, Apr 14, 2017 at 7:09 AM, Petr Jelinek
> <petr.jeli...@2ndquadrant.com> wrote:
>> On 13/04/17 12:23, Masahiko Sawada wrote:
>>> On Thu, Apr 13, 2017 at 11:53 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>>> wrote:
>>>> On Wed, Apr 12, 2017 at 11:46 PM, Peter Eisentraut
>>>> <peter.eisentr...@2ndquadrant.com> wrote:
>>>>> On 4/12/17 00:48, Masahiko Sawada wrote:
>>>>>> On Wed, Apr 12, 2017 at 1:28 PM, Peter Eisentraut
>>>>>>> Perhaps instead of a global last_start_time, we store a per relation
>>>>>>> last_start_time in SubscriptionRelState?
>>>>>>
>>>>>> I was thinking the same. But a problem is that the list of
>>>>>> SubscriptionRelState is refreshed whenever the syncing table state
>>>>>> becomes invalid (table_state_valid = false). I guess we need to
>>>>>> improve this logic, including GetSubscriptionNotReadyRelations().
>>>>>
>>>>> The table states are invalidated on a syscache callback from
>>>>> pg_subscription_rel, which happens roughly speaking when a table
>>>>> finishes the initial sync.  So if we're worried about failing tablesync
>>>>> workers relaunching too quickly, this would only be a problem if a
>>>>> tablesync of another table finishes right in that restart window.  That
>>>>> doesn't seem a terrible issue to me.
>>>>>
>>>>
>>>> I think the table states are invalidated whenever the table sync
>>>> worker starts, because the table sync worker updates its status in
>>>> pg_subscription_rel and commits it before starting the actual copy.
>>>> So we cannot rely on that. I thought we could store last_start_time
>>>> in pg_subscription_rel, but that might be overkill. I'm now thinking
>>>> of changing GetSubscriptionNotReadyRelations so that last_start_time
>>>> in SubscriptionRelState is carried over to the new list.
>>>>
>>>
>>> Attached is the latest patch. It actually wasn't necessary to change
>>> GetSubscriptionNotReadyRelations; I just changed the logic that
>>> refreshes the syncing table state list.
>>> Please give me feedback.
>>>
>>
>> Hmm, this might work. Although I was actually wondering if we could
>> store the last start timestamp in the worker shared memory and do some
>> magic with that (i.e., not clearing subid and relid, and then doing rate
>> limiting based on the subid+relid+timestamp stored in shmem). That would
>> then work the same way for the main apply workers as well. It would have
>> the disadvantage that if some tables were consistently failing, no other
>> tables could get synchronized, as the number of workers is limited.
> 
> Hmm, I guess it's not a good design for a table sync worker and an
> apply worker for a relation to take sole possession of a worker slot
> until they succeed. They would get in each other's way.
> 

Not sure what you mean by an apply worker for a relation; I meant the
main subscription apply worker, and the "per relation apply worker" is
the same thing as the tablesync worker.

Thinking about the possible failure scenarios more, I guess you are
right. If there is a "global" event affecting the publisher/subscriber
connection (i.e., the network goes down or something), it will affect the
main apply worker as well, so it won't launch anything. If there is a
table-specific issue, it will affect only that table.
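
So per-table rate limiting along the lines you describe seems right. Just
to make sure I read the approach correctly, here is roughly how I picture
the carry-over when the list of not-ready relations is rebuilt. This is
completely untested, and the struct and function names are mine for
illustration, not taken from your patch:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "catalog/pg_subscription_rel.h"
#include "nodes/pg_list.h"
#include "utils/timestamp.h"

/* Illustration only: per-relation sync state kept by the apply worker,
 * including when we last launched a tablesync worker for the relation. */
typedef struct SyncTableState
{
	Oid			relid;
	char		state;
	XLogRecPtr	lsn;
	TimestampTz	last_start_time;	/* 0 if never launched */
} SyncTableState;

/*
 * Rebuild the list of not-yet-synced relations, carrying last_start_time
 * over from the previous list so that a failed tablesync worker is not
 * relaunched before wal_retrieve_retry_interval has passed.
 */
static List *
refresh_table_states(List *old_states, Oid subid)
{
	List	   *new_states = NIL;
	ListCell   *lc;

	foreach(lc, GetSubscriptionNotReadyRelations(subid))
	{
		SubscriptionRelState *rstate = (SubscriptionRelState *) lfirst(lc);
		SyncTableState *st = palloc0(sizeof(SyncTableState));
		ListCell   *olc;

		st->relid = rstate->relid;
		st->state = rstate->state;
		st->lsn = rstate->lsn;

		/* keep the previous launch time for this relation, if any */
		foreach(olc, old_states)
		{
			SyncTableState *old = (SyncTableState *) lfirst(olc);

			if (old->relid == st->relid)
			{
				st->last_start_time = old->last_start_time;
				break;
			}
		}

		new_states = lappend(new_states, st);
	}

	return new_states;
}

A sync worker for a relation would then only be started once
TimestampDifferenceExceeds(st->last_start_time, now,
wal_retrieve_retry_interval) holds for that entry.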

The corner cases with doing it per table, where we would launch
tablesync needlessly, are pretty much just a) the connection limit on the
publisher being reached, and b) the slot limit on the publisher being
reached; that's probably okay. We might be able to catch these two
specifically and do some global limiting based on them (something like
your original patch, except the start-time global variable would only be
set on those specific errors and stored in shmem), but that does not need
to be solved immediately.
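
To be concrete about that shmem idea, this is roughly what I mean, again
only as an untested sketch; the TablesyncRateLimit struct and the two
helper functions are invented here for illustration, and locking around
the shared field is omitted:

#include "postgres.h"

#include "access/xlog.h"		/* wal_retrieve_retry_interval */
#include "utils/timestamp.h"

/* Invented for illustration: a single timestamp in shared memory, set
 * only when a tablesync worker fails because the publisher ran out of
 * connections or replication slots. */
typedef struct TablesyncRateLimit
{
	TimestampTz	last_resource_failure;
} TablesyncRateLimit;

static TablesyncRateLimit *RateLimit;	/* would point into shmem */

/* Called by the failing worker when it hits one of those specific errors. */
static void
report_publisher_resource_failure(void)
{
	RateLimit->last_resource_failure = GetCurrentTimestamp();
}

/*
 * Called before launching another tablesync worker; returns false while
 * we are still inside the back-off window after a publisher-side
 * resource failure.
 */
static bool
tablesync_launch_allowed(void)
{
	return TimestampDifferenceExceeds(RateLimit->last_resource_failure,
									  GetCurrentTimestamp(),
									  wal_retrieve_retry_interval);
}

Since last_resource_failure starts out as 0, launching stays allowed
until the first such failure is reported, which seems like the behavior
we want.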

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


