Hi Alan,

On Thu, Mar 19, 2015 at 10:56:35AM +0000, Alan Fitton wrote:
> Hi Willy,
> 
> Thank you for your reply and your work on HAProxy. I will add some
> instrumentation and hopefully be able to demonstrate your theory. I agree,
> it's the one I had arrived at too :) It seemed unlikely at first since the
> signals are masked inside __signal_process_queue, but it's still possible if
> timing goes against you.

I don't think we're sensitive to timing here. I've done this to try to
reproduce the issue:

diff --git a/src/signal.c b/src/signal.c
index e9301ed..241feac 100644
--- a/src/signal.c
+++ b/src/signal.c
@@ -37,6 +37,7 @@ int signal_pending = 0; /* non-zero if t least one signal remains unprocessed */
  */
 void signal_handler(int sig)
 {
+       fprintf(stderr, "received signal %d\n", sig);
        if (sig < 0 || sig >= MAX_SIGNAL) {
                /* unhandled signal */
                signal(sig, SIG_IGN);
@@ -77,6 +78,7 @@ void __signal_process_queue()
         * handler. That allows real signal handlers to redistribute signals
         * to tasks subscribed to signal zero.
         */
+       fprintf(stderr, "enter loop: signal_queue_len=%d, count[%d]=%d\n", signal_queue_len, SIGUSR1, signal_state[SIGUSR1].count);
        for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
                sig  = signal_queue[cur_pos];
                desc = &signal_state[sig];
@@ -90,7 +92,9 @@ void __signal_process_queue()
                        }
                        desc->count = 0;
                }
+               sleep(1);
        }
+       fprintf(stderr, "leave loop: signal_queue_len=%d, count[%d]=%d\n", signal_queue_len, SIGUSR1, signal_state[SIGUSR1].count);
        signal_queue_len = 0;
 
        /* restore signal delivery */

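For reference, here is a minimal sketch of the block/drain/unblock pattern
this code is meant to implement (simplified and untested, not the exact
haproxy code; the names merely mirror the ones above):

#include <signal.h>
#include <stdio.h>

#define MAX_SIGNAL 256

/* simplified stand-ins for the real haproxy structures; blocked_sig is
 * assumed to be filled at startup, e.g. with sigfillset(&blocked_sig).
 */
static int signal_queue[MAX_SIGNAL];    /* signals queued by the handler */
static int signal_queue_len;            /* number of queued entries */
static sigset_t blocked_sig;            /* signals blocked while draining */

/* rough equivalent of __signal_process_queue(): drain the queue with the
 * relevant signals blocked, then restore delivery so that anything which
 * arrived in the meantime is queued again and handled on the next pass.
 */
static void process_signal_queue(void)
{
        sigset_t old_sig;
        int cur_pos;

        /* block delivery while we walk the queue */
        sigprocmask(SIG_SETMASK, &blocked_sig, &old_sig);

        for (cur_pos = 0; cur_pos < signal_queue_len; cur_pos++) {
                int sig = signal_queue[cur_pos];
                fprintf(stderr, "processing queued signal %d\n", sig);
                /* ... call the per-signal handlers here ... */
        }
        signal_queue_len = 0;

        /* restore signal delivery */
        sigprocmask(SIG_SETMASK, &old_sig, NULL);
}

The sleep(1) in the patch simply makes this drain phase long enough for the
external kills to land while the mask is still in place.
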
Then I'm doing this:

$ killall -QUIT haproxy; killall -1 haproxy; kill -USR1 $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy) $(pidof haproxy);

(SIGQUIT sends a large response, and SIGHUP talks as well). The sleep(1)
is there to ensure that signals get delivered and queued before we
re-enable signals. I'm clearly seeing my messages being queued:

received signal 3
enter loop: signal_queue_len=1, count[10]=0
Dumping pools usage. Use SIGQUIT to flush them.
  - Pool pipe (32 bytes) : 5 allocated (160 bytes), 5 used, 4 users [SHARED]
  - Pool hlua_com (48 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (112 bytes) : 2 allocated (224 bytes), 2 used, 1 users [SHARED]
  - Pool uniqueid (128 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool connection (336 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool hdr_idx (416 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool requri (1024 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool session (1072 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool buffer (8064 bytes) : 3 allocated (24192 bytes), 0 used, 1 users [SHARED]
Total: 10 pools, 24576 bytes allocated, 384 used.
leave loop: signal_queue_len=1, count[10]=0
received signal 10
received signal 1
enter loop: signal_queue_len=2, count[10]=1
[WARNING] 077/155017 (7008) : Stopping frontend foo in 0 ms.
received signal 0
[WARNING] 077/155017 (7008) : SIGHUP received, dumping servers states.
[WARNING] 077/155017 (7008) : SIGHUP: Proxy foo has no servers. Conn: act(FE+BE): 0+0, 0 pend (0 unass), tot(FE+BE): 0+0.
leave loop: signal_queue_len=3, count[10]=0
[WARNING] 077/155020 (7008) : Proxy foo stopped (FE: 0 conns, BE: 0 conns).

As you can see, the signals are properly queued and delivered. My suspicion
is that the signals are not properly blocked on your platform. If you play
with the patch above, you should be able to observe whether the blocking
works or not. If it turns out that it doesn't, then we'll have to work on a
workaround.
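
For example, one way to check the blocking on your side would be a tiny
helper like this (an untested sketch; the function name is mine and does
not exist in haproxy), called from inside __signal_process_queue(), e.g.
next to the "enter loop" trace in the patch above:

#include <signal.h>
#include <stdio.h>

/* Hypothetical debugging helper: reports whether <sig> is blocked at the
 * point of the call. Inside the processing loop it should always print 1;
 * if it prints 0 on your platform, the mask is not effective there, which
 * would be consistent with the behaviour you are seeing.
 */
static void report_sig_blocked(int sig)
{
        sigset_t cur;

        if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
                fprintf(stderr, "signal %d blocked here: %d\n",
                        sig, sigismember(&cur, sig));
}

Calling it with SIGUSR1 (or whichever signal your reloads use) from inside
the loop should quickly tell whether this is the right track.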

> We have a large number of microservices and are using automation where app
> servers get registered and deregistered from an apache zookeeper (for
> autoscaling). HAProxy configuration is generated based off changes to
> zookeeper and reloaded. We don't have reloads in "loops" as such; they are
> usually done infrequently. But during a large release of applications I
> think it's quite possible that haproxy will be sent a few reload commands in
> a very short space of time.

I tried to do that as well but unfortunately couldn't trigger the bug :-/

> Maybe I will try to queue these up a little on the automation side.

No, there is no reason to; we need to fix the bug if it is one, or to work
around the issue if the problem is in your kernel, for example.

Best regards,
Willy

