On Tue, Jan 20, 2015 at 12:57 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Tue, Jan 20, 2015 at 1:56 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>> On 2015-01-19 17:16:11 +0900, Michael Paquier wrote:
>>> On Mon, Jan 19, 2015 at 4:10 PM, Michael Paquier
>>> <michael.paqu...@gmail.com> wrote:
>>> > On Sat, Jan 17, 2015 at 2:44 AM, Andres Freund <and...@2ndquadrant.com> wrote:
>>> >> Not this patch's fault, but I'm getting a bit tired of seeing the above
>>> >> open-coded. How about adding a function that does the sleeping based on a
>>> >> timestamptz and a ms interval?
>>> > You mean in plugins, right? I don't recall seeing similar patterns in
>>> > other code paths of the backend. But I think that we can use something
>>> > like that in timestamp.c then, because we need to leverage that between
>>> > two timestamps, the last failure and now():
>>> > TimestampSleepDifference(start_time, stop_time, interval_ms);
>>> > Perhaps you have something else in mind?
>>> >
>>> > Attached is an updated patch.
>>
>>> Actually I came up with something better than the last patch by using a
>>> boolean flag as the return value of TimestampSleepDifference, and by
>>> using TimestampDifferenceExceeds directly inside it.
>>
>>> Subject: [PATCH] Add wal_availability_check_interval to control WAL
>>> fetching on failure
>>
>> I don't think that name is very good, and it isn't very accurate either.
>> How about wal_retrieve_retry_interval? Not very nice, but I think it's
>> still better than the above.
> Fine for me.
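[For illustration, the helper being discussed might look roughly like the standalone sketch below. This is hypothetical code, not the actual patch: plain `long` microsecond values stand in for TimestampTz, and the caller is expected to do the actual `pg_usleep()`, mirroring the boolean-flag variant Michael describes.]

```c
#include <stdbool.h>

/*
 * Hypothetical standalone approximation of TimestampSleepDifference().
 * Given a start and stop timestamp (microseconds since some epoch) and a
 * retry interval in milliseconds, compute how many microseconds are still
 * left to wait.  Returns false (and sets *remaining_usec to 0) when the
 * interval has already elapsed, so the caller can skip sleeping entirely.
 */
static bool
sleep_difference_usec(long start_usec, long stop_usec, int interval_msec,
                      long *remaining_usec)
{
    long        elapsed = stop_usec - start_usec;
    long        target = (long) interval_msec * 1000L;

    if (elapsed >= target)
    {
        *remaining_usec = 0;
        return false;           /* interval already exceeded, no sleep */
    }
    *remaining_usec = target - elapsed;
    return true;                /* caller should sleep *remaining_usec */
}
```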
>
>>> +  <varlistentry id="wal-availability-check-interval" xreflabel="wal_availability_check_interval">
>>> +   <term><varname>wal_availability_check_interval</varname> (<type>integer</type>)
>>> +   <indexterm>
>>> +    <primary><varname>wal_availability_check_interval</> recovery parameter</primary>
>>> +   </indexterm>
>>> +   </term>
>>> +   <listitem>
>>> +    <para>
>>> +     This parameter specifies the amount of time to wait when
>>> +     WAL is not available for a node in recovery. Default value is
>>> +     <literal>5s</>.
>>> +    </para>
>>> +    <para>
>>> +     A node in recovery will wait for this amount of time if
>>> +     <varname>restore_command</> returns a nonzero exit status code when
>>> +     fetching new WAL segment files from the archive, or when a WAL
>>> +     receiver is not able to fetch a WAL record when using streaming
>>> +     replication.
>>> +    </para>
>>> +   </listitem>
>>> +  </varlistentry>
>>> +
>>>   </variablelist>
>>
>> Walreceiver doesn't wait that amount, but rather for however long the
>> connection is intact. And restore_command may or may not retry.
> I changed this paragraph as follows:
> +     This parameter specifies the amount of time to wait when a failure
> +     occurred when reading WAL from a source (be it via streaming
> +     replication, local <filename>pg_xlog</> or WAL archive) for a node
> +     in standby mode, or when WAL is expected from a source. Default
> +     value is <literal>5s</>.
> Pondering more about this patch, I am getting the feeling that it is
> not that much of a win to explain in detail in the docs each failure
> mode depending on the source from which WAL is taken. But it is
> essential to mention that the wait is done only for a node using
> standby mode.
>
> Btw, I noticed two extra things that I think should be done for consistency:
> - We should use wal_retrieve_retry_interval as well when waiting for
>   WAL to arrive after checking for failures:
>
>                 /*
> -                * Wait for more WAL to arrive. Time out after 5 seconds,
> -                * like when polling the archive, to react to a trigger
> -                * file promptly.
> +                * Wait for more WAL to arrive. Time out after
> +                * wal_retrieve_retry_interval ms, like when polling the
> +                * archive, to react to a trigger file promptly.
>                  */
>                 WaitLatch(&XLogCtl->recoveryWakeupLatch,
>                           WL_LATCH_SET | WL_TIMEOUT,
> -                         5000L);
> +                         wal_retrieve_retry_interval * 1L);
> -
>
> Updated patch attached.
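[For readers unfamiliar with the WaitLatch() idiom in the hunk above: the point is that a latch wait both honours the timeout and can be woken early. A rough, self-contained approximation of that behaviour using POSIX poll() on a pipe (which is close to how latches are implemented internally) might look like this. The function name is invented for illustration; this is not backend code.]

```c
#include <poll.h>

/*
 * Hypothetical, self-contained approximation of the
 * WaitLatch(..., WL_LATCH_SET | WL_TIMEOUT, ...) pattern: block until
 * either the "latch" (here: the read end of a pipe) becomes readable or
 * the timeout expires.  Returns 1 if woken by the latch, 0 on timeout.
 *
 * Unlike a plain pg_usleep(), poll() both respects the timeout and is
 * interrupted by wakeups, which is the property the diff relies on to
 * react promptly to a trigger file.
 */
static int
wait_latch_approx(int latch_fd, int timeout_msec)
{
    struct pollfd pfd;

    pfd.fd = latch_fd;
    pfd.events = POLLIN;

    return poll(&pfd, 1, timeout_msec) > 0 ? 1 : 0;
}
```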
+	TimestampDifference(start_time, stop_time, &secs, &microsecs);
+	pg_usleep(interval_msec * 1000L - (1000000L * secs + 1L * microsecs));

What if a large interval is set and a signal arrives during the sleep?
I'm afraid that a signal cannot terminate the sleep on some platforms.
This problem exists even now because pg_usleep is used, but the sleep
time is just 5 seconds, so it's not so bad. The patch, however, allows
a user to set a large sleep time. Shouldn't we use WaitLatch, or split
the pg_usleep like recoveryPausesHere() does?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
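[Editorial note: the "split the pg_usleep" alternative Fujii mentions, in the style of recoveryPausesHere(), can be sketched in standalone form as below. This is hypothetical illustration code, not the backend implementation: plain usleep() stands in for pg_usleep(), and the wakeup flag stands in for whatever condition the real loop would re-check between slices.]

```c
#include <signal.h>
#include <unistd.h>

/* Flag a signal handler (or other interrupt path) would set. */
static volatile sig_atomic_t wakeup_requested = 0;

/*
 * Hypothetical sketch of a chunked sleep, recoveryPausesHere()-style:
 * sleep in short 100 ms slices and re-check the interrupt flag between
 * them, so even a very large configured interval stays responsive.
 * Returns the number of milliseconds actually slept.
 */
static long
interruptible_sleep_msec(long total_msec)
{
    long        slept = 0;

    while (slept < total_msec && !wakeup_requested)
    {
        long        slice = (total_msec - slept) < 100 ?
            (total_msec - slept) : 100;

        usleep(slice * 1000);   /* stand-in for pg_usleep() */
        slept += slice;
    }
    return slept;
}
```

The trade-off versus WaitLatch is latency: a chunked sleep reacts within one slice (here up to 100 ms), while a latch wait wakes immediately; but the chunked loop needs no latch infrastructure.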