Hi Pieter,

On Fri, Apr 13, 2018 at 12:05:26AM +0200, PiBa-NL wrote:
> Hi Willy,
> 
> Okay did some more digging..
> 
> before and after the 'offending' commit the EV_SET calls are as follows in
> attached screenshot, these seem to be the main cause of some things going
> wrong.
> (I think also without using http-tunnel there are some things that don't fully
> work right, but I haven't investigated that to get something reproducible
> though, and possibly caused by the same kqueue issue..) The current issue is
> 100% reproducible so far in a (for me) easy way.

OK that's nice (except for you of course).

> I made some changes to haproxy's output, and captured and compared the
> output.
> The 'START State' is written and also the eo and en variables are added to
> the output.
> the READ ADD/DEL WRITE ADD/DEL output is written with fprintf statements
> just before the actual EV_SET call..
> Left side 'WORKS', right side doesn't..
> 
> As you can see there are some different calls to the EV_SET with the
> different eo and en states..

I'm suspecting we could have something wrong with the polled_mask, maybe
sometimes it's removed too early somewhere, preventing the delete(write)
from being performed, which would explain why it loops.

> I could 'fix' the behavior after the offending commit by adding back a check
> for the eo in the second screenshot.. (At least for this 1 particular page
> request..)
> 
> But in current master that 'eo' variable no longer exists. I'm kinda at the
> end of my abilities here.. :)

Indeed, it took us years to get rid of this duplicate information! And as
you can see, doing so apparently revealed that some corner cases were still
relying on it. By the way, you must really not try to debug an
old version but stick to the latest fixes. It's too easy to be caught by
another issue that was since fixed. The fact that you could put "eo" there
in your test indicates to me that the test was made on a commit prior to
its removal. I just want to be sure you don't waste your time.

> I hope at least some of it makes sense to you?

In part, yes! I have to sit down in front of it and scratch my head, but
my context switches are rather hard, so I'm very happy that you provide
such detailed information!

I'm seeing two things that could be of interest to test:
  - remove the two "if (fdtab[fd].polled_mask & tid_bit)" conditions
    to delete the events. It will slightly inflate the list of events
    but not that much. If it fixes the problem it means that the
    polled_mask is sometimes wrong. Please do that with the updated
    master.

  - switch to poll() just to see if you get the same behavior, so that we
    can figure out whether only the kqueue code triggers the issue. poll()
    doesn't rely on polled_mask at all.
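
To make the first test more concrete, here is a simplified, standalone model
of the decision being discussed. It is only a sketch and not the actual
kqueue poller code: the flag names, the plan_changes() function and its
"guard" parameter are all made up for illustration. It assumes the update
logic adds a filter when the new state wants it, and otherwise deletes it,
with the delete optionally conditioned on this thread's polled_mask bit.

```c
/* Hypothetical flags for this sketch; not HAProxy's real definitions. */
enum { WANT_R = 1, WANT_W = 2 };                 /* new desired state ("en") */
enum { OP_ADD_R = 1, OP_DEL_R = 2,               /* changes that would be    */
       OP_ADD_W = 4, OP_DEL_W = 8 };             /* queued via EV_SET        */

/* Return the set of changes that would be queued for one fd.
 *   en     : new desired state (WANT_R / WANT_W bits)
 *   polled : non-zero if this thread's bit is set in polled_mask
 *   guard  : non-zero to keep the "if (polled_mask & tid_bit)" condition
 *            in front of the deletes, zero to drop it
 */
unsigned plan_changes(unsigned en, int polled, int guard)
{
    unsigned ops = 0;

    if (en & WANT_R)
        ops |= OP_ADD_R;
    else if (!guard || polled)
        ops |= OP_DEL_R;

    if (en & WANT_W)
        ops |= OP_ADD_W;
    else if (!guard || polled)
        ops |= OP_DEL_W;

    return ops;
}
```

In this model, plan_changes(WANT_R, 0, 1) yields only OP_ADD_R: if the polled
bit was cleared too early, the write filter is never deleted. With the guard
dropped, plan_changes(WANT_R, 0, 0) yields OP_ADD_R | OP_DEL_W, at the cost of
a few extra delete entries in the change list.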

Many thanks for your tests.
Willy
