On 09.01.2019 23:27, Ben Pfaff wrote:
> On Wed, Jan 09, 2019 at 08:28:54PM +0300, Ilya Maximets wrote:
>> On 27.12.2018 20:36, Ben Pfaff wrote:
>>> On Wed, Dec 26, 2018 at 06:23:56PM +0300, Ilya Maximets wrote:
>>>> On some systems, when the remote is not responding, a socket can
>>>> remain in the SYN_SENT state for a really long time, waiting for the
>>>> connection without reporting an error. As a result, a vconn connection
>>>> hangs for a few minutes while trying to reach a remote that is DOWN.
>>>>
>>>> For example, this situation is emulated by the "refuse-connection" vconn
>>>> testcase. It leads to test failures because the Alarm signal arrives
>>>> much sooner than ETIMEDOUT from the socket:
>>>>
>>>>   ./vconn.at:21: ovstest test-vconn refuse-connection tcp
>>>>   Alarm clock
>>>>   stderr:
>>>>   |socket_util|INFO|0:127.0.0.1: listening on port 63812
>>>>   |poll_loop|DBG|wakeup due to 0-ms timeout
>>>>   |poll_loop|DBG|wakeup due to 10155-ms timeout
>>>>   |fatal_signal|WARN|terminating with signal 14 (Alarm clock)
>>>>   ./vconn.at:21: exit code was 142, expected 0
>>>>   vconn.at:21: 535. tcp vconn - refuse connection (vconn.at:21): FAILED
>>>>
>>>> This patch allows specifying a timeout value for vconn blocking
>>>> connections. If the connection takes longer than that, the socket is
>>>> closed with the ETIMEDOUT error code. A negative value can be used to
>>>> wait indefinitely.
>>>>
>>>> Signed-off-by: Ilya Maximets <[email protected]>
>>>
>>> Same comments as patch 2.
>>>
>>> Are the timeouts only useful for the test cases?  I wonder whether just
>>> calling alarm(10); at the beginning of the test programs would be just
>>> as helpful.  On the other hand, it would make using a debugger on those
>>> programs harder.
>>
>> I guess we have alarms in all the test programs.
>> The issue here is that some test apps like 'test_refuse_connection' treat
>> a connection failure as a success. But on some systems, connections to a
>> wrong port hang for a really long time and the alarm kills the test
>> application. In this case we can't say for sure whether the test failed,
>> i.e. whether it was the expected connection failure or some other random
>> issue that forced the application to hang.
>>
>> The stream connection tests are even worse, because they try to
>> sequentially establish a connection to one of 3 different remotes, only one
>> of which is correct. Such a test will never try the correct one if the
>> blocking connection to a wrong port hangs for a few minutes. It will simply
>> be killed by the alarm.
> 
> It should be possible to tell what caused the test program to exit by
> testing the exit status.  When a program exits due to a signal, a
> Bourne-compatible shell sets $? to 128 plus the signal number.  Usually,
> it's good enough just to know that the process died with an unusual exit
> status, but you can get the particular signal name back with "kill -l
> $?", e.g. on Linux "kill -l 142" prints "ALRM".  This behavior is
> specified by POSIX so it should be portable.

Yes, we can detect that the app was killed by the alarm, but we can't say
whether it was the expected hang while connecting to the wrong port or just
an overly long execution due to a random environment issue or a bug.
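
To make that concrete, here is a minimal sketch of what the parent side can
and cannot learn. It is only an illustration, not code from the OVS tree:
run_hung_child() and shell_status_for_signal() are hypothetical names, and the
child simply blocks in pause() to stand in for a test program stuck in a
blocking connect.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: forks a child that arms alarm(seconds) and then
 * blocks forever, the way a hung test program would.  Returns the number of
 * the signal that killed the child, 0 if it exited normally, or -1 on
 * error. */
static int
run_hung_child(unsigned int seconds)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: make sure SIGALRM has its default (fatal) action. */
        signal(SIGALRM, SIG_DFL);
        alarm(seconds);
        for (;;) {
            pause();            /* Stand-in for the hanging connect. */
        }
    }

    int status;
    if (waitpid(pid, &status, 0) != pid) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Hypothetical helper: the value a Bourne-compatible shell reports in $?
 * for a process killed by signal 'signo'. */
static int
shell_status_for_signal(int signo)
{
    return 128 + signo;
}
```

Calling run_hung_child(1) returns SIGALRM, and shell_status_for_signal() maps
that to the 142 seen in the test log on Linux, where SIGALRM is 14. Neither
value says which call the child was blocked in when the alarm fired, which is
exactly the information the test would need.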

Let's look at the "multiple remotes" test cases. Their workflow is as follows:

  1. alarm(10)
  2. Initialize idl with multiple remotes.
  3. RPC: Try to connect to WRONG_PORT_1. Fail expected.
  4. RPC: Try to connect to right port.
  5. Perform some ovsdb transactions.
  6. Check result.

Step 3 (the connection to WRONG_PORT_1) always hangs in the CirrusCI
environment and the app dies there by alarm.
We can't treat this as success because we didn't check anything useful.
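
For illustration, the general technique of bounding a connection attempt can
be sketched with a non-blocking connect() followed by poll(). This is not the
code from the patch and not the vconn API: connect_with_timeout() is a
hypothetical standalone helper, IPv4 only, that just mirrors the semantics
described in the commit message (ETIMEDOUT on expiry, negative timeout waits
indefinitely).

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: tries to connect to ip:port over TCP and waits at
 * most 'timeout_ms' milliseconds (negative means wait indefinitely).
 * Returns 0 on success or a positive errno value; ETIMEDOUT if the timeout
 * expired first.  The socket is always closed before returning. */
static int
connect_with_timeout(const char *ip, unsigned short port, int timeout_ms)
{
    struct sockaddr_in sin;
    memset(&sin, 0, sizeof sin);
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &sin.sin_addr) != 1) {
        return EINVAL;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return errno;
    }

    int error = 0;
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        error = errno;
    } else if (connect(fd, (struct sockaddr *) &sin, sizeof sin) < 0) {
        if (errno != EINPROGRESS) {
            error = errno;      /* Immediate failure, e.g. ECONNREFUSED. */
        } else {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int n;

            /* For simplicity, an interrupted poll() restarts with the full
             * timeout instead of the remaining time. */
            do {
                n = poll(&pfd, 1, timeout_ms);
            } while (n < 0 && errno == EINTR);

            if (n == 0) {
                error = ETIMEDOUT;      /* Still in SYN_SENT: give up. */
            } else if (n < 0) {
                error = errno;
            } else {
                /* The attempt finished; fetch its actual result. */
                socklen_t len = sizeof error;
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &len) < 0) {
                    error = errno;
                }
            }
        }
    }

    close(fd);
    return error;
}
```

With something like this, a test could call
connect_with_timeout("127.0.0.1", WRONG_PORT_1, 1000), accept either
ECONNREFUSED or ETIMEDOUT as the expected failure, and still have most of its
alarm(10) budget left for the remaining steps.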