Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-23 Thread Brent Eagles

Salvatore Orlando wrote:

Before starting this post, I confess I did not read all of this thread with
the required level of attention, so I apologise for any repetition.

I just wanted to point out that floating IPs in neutron are created
asynchronously when using the l3 agent, and I think this is clear to
everybody. So when the create floating IP call returns, this does not mean
the floating IP has actually been wired, i.e. the IP configured on the
qg-interface and the SNAT/DNAT rules added.

Unfortunately, neutron lacks a concept of operational status for a floating
IP which would tell clients, including nova (which acts as a client wrt the
neutron API), when a floating IP is ready to be used. I started work in this
direction, but it has been suspended for a week now. If anybody wants to
take over and deems this a reasonable thing to do, it will be great.


Unless somebody picks it up before I get back from the break, I'd like to 
discuss this further with you.



I think neutron tests checking connectivity might return more meaningful
failure data if they gathered the status of the various components which
might impact connectivity.
These are:
- The floating IP
- The router internal interface
- The VIF port
- The DHCP agent


I agree wholeheartedly. In fact, I think that if we are going to rely on 
timeouts for pass/fail, we need to do more for post-mortem details.



Collecting info about the latter is very important but a bit trickier. I
discussed with Sean and Maru that, for a starter, it would be great to grep
the console log to check whether the instance obtained an IP.
Other things to consider would be:
- adding an operational status to a subnet, which would express whether the
DHCP agent is in sync with that subnet (this information won't make sense
for subnets with dhcp disabled)
- working on a 'debug' administrative API which could return, for instance,
for each DHCP agent the list of configured networks and leases.


Interesting!


Regarding timeouts, I think it's fair for tempest to define a timeout and
ask that everything from VM boot to Floating IP wiring completes within
that timeout.

Regards,
Salvatore


I would agree. It would be impossible to have reasonable automated 
testing otherwise.


Cheers,

Brent



Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-23 Thread Yair Fried


- Original Message -
 From: Brent Eagles beag...@redhat.com
 To: OpenStack Development Mailing List (not for usage questions) 
 openstack-dev@lists.openstack.org
 Sent: Monday, December 23, 2013 10:48:50 PM
 Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the 
 FloatingIPChecker control point
 
 Salvatore Orlando wrote:
  Before starting this post, I confess I did not read all of this thread
  with the required level of attention, so I apologise for any repetition.
 
  I just wanted to point out that floating IPs in neutron are created
  asynchronously when using the l3 agent, and I think this is clear to
  everybody. So when the create floating IP call returns, this does not
  mean the floating IP has actually been wired, i.e. the IP configured on
  the qg-interface and the SNAT/DNAT rules added.
 
  Unfortunately, neutron lacks a concept of operational status for a
  floating IP which would tell clients, including nova (which acts as a
  client wrt the neutron API), when a floating IP is ready to be used. I
  started work in this direction, but it has been suspended for a week
  now. If anybody wants to take over and deems this a reasonable thing to
  do, it will be great.
 
 Unless somebody picks it up before I get back from the break, I'd like to
 discuss this further with you.
 
  I think neutron tests checking connectivity might return more meaningful
  failure data if they gathered the status of the various components which
  might impact connectivity.
  These are:
  - The floating IP
  - The router internal interface
  - The VIF port
  - The DHCP agent
I wrote something addressing at least some of these points: 
https://review.openstack.org/#/c/55146/
 
 I agree wholeheartedly. In fact, I think that if we are going to rely on
 timeouts for pass/fail, we need to do more for post-mortem details.
 
  Collecting info about the latter is very important but a bit trickier.
  I discussed with Sean and Maru that, for a starter, it would be great
  to grep the console log to check whether the instance obtained an IP.
  Other things to consider would be:
  - adding an operational status to a subnet, which would express whether
  the DHCP agent is in sync with that subnet (this information won't make
  sense for subnets with dhcp disabled)
  - working on a 'debug' administrative API which could return, for
  instance, for each DHCP agent the list of configured networks and
  leases.
 
 Interesting!
 
  Regarding timeouts, I think it's fair for tempest to define a timeout
  and ask that everything from VM boot to Floating IP wiring completes
  within that timeout.
 
  Regards,
  Salvatore
 
 I would agree. It would be impossible to have reasonable automated
 testing otherwise.
 
 Cheers,
 
 Brent
 


Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Yair Fried
Hi Guys, 
I ran into this issue trying to incorporate this test into the 
cross_tenant_connectivity scenario (launching 2 VMs in different tenants). 
What I saw is that in the gate it fails half the time (the original test 
passes without issues) and ONLY on the 2nd VM (the first FLIP propagates 
fine). 
https://bugs.launchpad.net/nova/+bug/1262529 

I don't see this in: 
1. my local RHOS-Havana setup 
2. the cross_tenant_connectivity scenario without the control point (test 
passes without issues) 
3. test_network_basic_ops runs in the gate 

So here's my somewhat less experienced opinion: 
1. this happens due to stress (more than a single FLIP/VM) 
2. (as Brent said) the timeout intervals between polls are too short 
3. the FLIP is usually reachable long before it is seen in the nova DB (also 
from manual experience), so blocking the test until it reaches the nova DB 
doesn't make sense to me. If we could do this in a different thread, then 
maybe; but using a pass/fail criterion to test for a timing issue seems 
wrong, especially since, as I understand it, the issue is not IF it reaches 
the nova DB, only WHEN. 

I would like to, at least, move this check from its place as a blocker to later 
in the test. Before this is done, I would like to know if anyone else has seen 
the same problems Brent describes prior to this patch being merged. 

Regarding Jay's scenario suggestion, I think this should not be a part of 
network_basic_ops, but rather a separate stress scenario creating multiple VMs 
and testing for FLIP associations and propagation time. 

Regards, Yair 
(I also added my comments inline.) 

- Original Message -

From: Jay Pipes jaypi...@gmail.com 
To: openstack-dev@lists.openstack.org 
Sent: Thursday, December 19, 2013 5:54:29 AM 
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the 
FloatingIPChecker control point 

On 12/18/2013 10:21 PM, Brent Eagles wrote: 
 Hi, 
 
 Yair and I were discussing a change that I initiated and was 
 incorporated into the test_network_basic_ops test. It was intended as a 
 configuration control point for floating IP address assignments before 
 actually testing connectivity. The question we were discussing was 
 whether this check was a valid pass/fail criteria for tests like 
 test_network_basic_ops. 
 
 The initial motivation for the change was that test_network_basic_ops 
 had a less than 50/50 chance of passing in my local environment for 
 whatever reason. After looking at the test, it seemed ridiculous that it 
 should be failing. The problem is that more often than not the data that 
 was available in the logs all pointed to it being set up correctly but 
 the ping test for connectivity was timing out. From the logs it wasn't 
 clear whether the test was failing because neutron did not do the right 
 thing, did not do it fast enough, or because something else was happening. 
 Of course, if I paused the test for a short bit between setup and the 
 checks to manually verify everything, the checks always passed. So it's a 
 timing issue, right? 
 

Did anyone else experience this issue, locally or in the gate? 

 Two things: adding more timeout to a check is as appealing to me as 
 gargling glass, AND I was less annoyed that the test was failing than I 
 was that it wasn't clear from reading the logs what had gone wrong. I 
 tried to find an additional intermediate control point that would split 
 failure modes into two categories: neutron is too slow in setting things 
 up, and neutron failed to set things up correctly. Granted, it still adds 
 timeout to the test, but with a control point based on settling, if it 
 passed and the next check failed, there would be a good chance that 
 neutron actually screwed up what it was trying to do. 
 
 Waiting on the query to nova for the floating IP information seemed a 
 relatively reasonable, if imperfect, settling criterion before attempting 
 to connect to the VM. Testing to see if the floating IP 
 assignment gets to the nova instance details is a valid test and, 
 AFAICT, missing from the current tests. However, Yair has the reasonable 
 point that connectivity is often available long before the floating IP 
 appears in the nova results and that it could be considered invalid to 
 use non-network specific criteria as pass/fail for this test. 

But, Tempest is all about functional integration testing. Using a call 
to Nova's server details to determine whether a dependent call to 
Neutron succeeded (setting up the floating IP) is exactly what I think 
Tempest is all about. It's validating that the integration between Nova 
and Neutron is working as expected. 

So, I actually think the assertion on the floating IP address appearing 
(after some timeout/timeout-backoff) is entirely appropriate. 

Blocking the connectivity check until the DB is updated doesn't make sense 
to me, since we know the FLIP is reachable well before the nova DB is 
updated (this is seen also in manual mode, not just

Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Sean Dague
On 12/18/2013 10:54 PM, Jay Pipes wrote:
 On 12/18/2013 10:21 PM, Brent Eagles wrote:
 Hi,

 Yair and I were discussing a change that I initiated and was
 incorporated into the test_network_basic_ops test. It was intended as a
 configuration control point for floating IP address assignments before
 actually testing connectivity. The question we were discussing was
 whether this check was a valid pass/fail criteria for tests like
 test_network_basic_ops.

 The initial motivation for the change was that test_network_basic_ops
 had a less than 50/50 chance of passing in my local environment for
 whatever reason. After looking at the test, it seemed ridiculous that it
 should be failing. The problem is that more often than not the data that
 was available in the logs all pointed to it being set up correctly but
 the ping test for connectivity was timing out. From the logs it wasn't
 clear whether the test was failing because neutron did not do the right
 thing, did not do it fast enough, or because something else was happening.
 Of course, if I paused the test for a short bit between setup and the
 checks to manually verify everything, the checks always passed. So it's a
 timing issue, right?

 Two things: adding more timeout to a check is as appealing to me as
 gargling glass, AND I was less annoyed that the test was failing than I
 was that it wasn't clear from reading the logs what had gone wrong. I
 tried to find an additional intermediate control point that would split
 failure modes into two categories: neutron is too slow in setting things
 up, and neutron failed to set things up correctly. Granted, it still adds
 timeout to the test, but with a control point based on settling, if it
 passed and the next check failed, there would be a good chance that
 neutron actually screwed up what it was trying to do.

 Waiting on the query to nova for the floating IP information seemed a
 relatively reasonable, if imperfect, settling criterion before attempting
 to connect to the VM. Testing to see if the floating IP
 assignment gets to the nova instance details is a valid test and,
 AFAICT, missing from the current tests. However, Yair has the reasonable
 point that connectivity is often available long before the floating IP
 appears in the nova results and that it could be considered invalid to
 use non-network specific criteria as pass/fail for this test.
 
 But, Tempest is all about functional integration testing. Using a call
 to Nova's server details to determine whether a dependent call to
 Neutron succeeded (setting up the floating IP) is exactly what I think
 Tempest is all about. It's validating that the integration between Nova
 and Neutron is working as expected.
 
 So, I actually think the assertion on the floating IP address appearing
 (after some timeout/timeout-backoff) is entirely appropriate.
 
 In general, the validity of checking for the presence of a floating IP
 in the server details is a matter of interpretation. I think it is a
 given that it must be tested somewhere and that if it causes a test to
 fail then it is as valid a failure as a ping failing. Certainly I have
 seen scenarios where an IP appears but doesn't actually work, and others
 where the IP doesn't appear (ever, not just after a really long while) but
 magically works. Both are bugs. Which is more appropriate to tests like
 test_network_basic_ops?
 
 I believe both assertions should be part of the test cases, but since
 the latter condition (good ping connectivity, but no floater ever
 appears attached to the instance) necessarily depends on the first
 failure (floating IP does not appear in the server details after a
 timeout), perhaps one way to handle it would be:
 
 a) create server instance
 b) assign floating ip
 c) query server details looking for floater in a timeout-backoff loop
 c1) floater does appear
  c1-a) assert ping connectivity
 c2) floater does not appear
  c2-a) check ping connectivity. if ping connectivity succeeds, use a
 call to testtools.TestCase.addDetail() to provide some interesting
 feedback
  c2-b) raise assertion that floater did not appear in the server details
 
 Currently, the polling interval for the checks in the gate needs to be
 tuned. They borrow other polling configuration, and I can see that is
 ill-advised. The check currently polls at an interval of a second; if
 the intent is to wait for the entire system to settle down before
 proceeding, then polling nova that quickly is too often. It simply
 increases the load while we are waiting for a loaded system to settle.
 For example, in the course of a three-minute timeout, the floating IP
 check polled nova for server details 180 times.
 
 Agreed completely.

We should just add an exponential backoff to the waiting. That should
decrease load over time. I'd be +2 to such a patch.
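
For illustration, a minimal sketch of such a backoff loop (the helper name
and defaults here are hypothetical, not the actual tempest code):

    import time

    def wait_for(condition, timeout, initial_interval=1, max_interval=30):
        """Poll condition() until it returns True or the timeout expires,
        doubling the sleep between attempts up to max_interval."""
        deadline = time.time() + timeout
        interval = initial_interval
        while time.time() < deadline:
            if condition():
                return True
            time.sleep(interval)
            interval = min(interval * 2, max_interval)
        return False

With a 1s initial interval capped at 30s, a three-minute timeout costs
roughly 10 polls instead of 180.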

That being said, I'm not sure why 1 request/sec is considered load that
would break the system. That doesn't seem a 

Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Sean Dague
On 12/19/2013 03:31 AM, Yair Fried wrote:
 Hi Guys,
 I ran into this issue trying to incorporate this test into the
 cross_tenant_connectivity scenario (launching 2 VMs in different
 tenants). What I saw is that in the gate it fails half the time (the
 original test passes without issues) and ONLY on the 2nd VM (the first
 FLIP propagates fine).
 https://bugs.launchpad.net/nova/+bug/1262529
 
 I don't see this in:
 1. my local RHOS-Havana setup
 2. the cross_tenant_connectivity scenario without the control point
 (test passes without issues)
 3. test_network_basic_ops runs in the gate
 
 So here's my somewhat less experienced opinion:
 1. this happens due to stress (more than a single FLIP/VM)
 2. (as Brent said) the timeout intervals between polls are too short
 3. the FLIP is usually reachable long before it is seen in the nova DB
 (also from manual experience), so blocking the test until it reaches the
 nova DB doesn't make sense to me. If we could do this in a different
 thread, then maybe; but using a pass/fail criterion to test for a timing
 issue seems wrong, especially since, as I understand it, the issue is not
 IF it reaches the nova DB, only WHEN.
 
 I would like to, at least, move this check from its place as a blocker
 to later in the test. Before this is done, I would like to know if
 anyone else has seen the same problems Brent describes prior to this
 patch being merged.

 Regarding Jay's scenario suggestion, I think this should not be a part
 of network_basic_ops, but rather a separate stress scenario creating
 multiple VMs and testing for FLIP associations and propagation time.

+1 there is no need to overload that one scenario. A dedicated one would
be fine.

-Sean

-- 
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net





Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Yair Fried
I would also like to point out that, since Brent used compute.build_timeout
as the timeout value:
***it takes more time to update the FLIP in the nova DB than for a VM to
build***

Yair

- Original Message -
From: Sean Dague s...@dague.net
To: OpenStack Development Mailing List (not for usage questions) 
openstack-dev@lists.openstack.org
Sent: Thursday, December 19, 2013 12:42:56 PM
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the 
FloatingIPChecker control point

On 12/19/2013 03:31 AM, Yair Fried wrote:
 Hi Guys,
 I ran into this issue trying to incorporate this test into the
 cross_tenant_connectivity scenario (launching 2 VMs in different
 tenants). What I saw is that in the gate it fails half the time (the
 original test passes without issues) and ONLY on the 2nd VM (the first
 FLIP propagates fine).
 https://bugs.launchpad.net/nova/+bug/1262529
 
 I don't see this in:
 1. my local RHOS-Havana setup
 2. the cross_tenant_connectivity scenario without the control point
 (test passes without issues)
 3. test_network_basic_ops runs in the gate
 
 So here's my somewhat less experienced opinion:
 1. this happens due to stress (more than a single FLIP/VM)
 2. (as Brent said) the timeout intervals between polls are too short
 3. the FLIP is usually reachable long before it is seen in the nova DB
 (also from manual experience), so blocking the test until it reaches the
 nova DB doesn't make sense to me. If we could do this in a different
 thread, then maybe; but using a pass/fail criterion to test for a timing
 issue seems wrong, especially since, as I understand it, the issue is not
 IF it reaches the nova DB, only WHEN.
 
 I would like to, at least, move this check from its place as a blocker
 to later in the test. Before this is done, I would like to know if
 anyone else has seen the same problems Brent describes prior to this
 patch being merged.

 Regarding Jay's scenario suggestion, I think this should not be a part
 of network_basic_ops, but rather a separate stress scenario creating
 multiple VMs and testing for FLIP associations and propagation time.

+1 there is no need to overload that one scenario. A dedicated one would
be fine.

-Sean

-- 
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net




Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Brent Eagles

Hi,

Yair Fried wrote:

I would also like to point out that, since Brent used compute.build_timeout
as the timeout value:
***it takes more time to update the FLIP in the nova DB than for a VM to
build***

Yair


Agreed. I think that's an extremely important highlight of this 
discussion. Propagation of the floating IP is definitely buggy. In the 
small sample of logs (2) that I checked, the floating IP assignment 
propagated in around 10 seconds for test_network_basic_ops, but in the 
cross tenant connectivity test it took somewhere around 1 minute for the 
first assignment and something over 3 minutes (otherwise known as 
simply-too-long-to-find-out). Even if querying once a second were 
excessive - which I do not feel strongly enough about to say is anything 
other than a *possible* contributing factor - it should not take that long.


Cheers,

Brent



Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Frittoli, Andrea (Cloud Services)
My 2 cents:

In the test the floating IP is created via the neutron API and later checked
via the nova API.

So the test is relying on (or trying to verify?) the network cache refresh
mechanism in nova. 
This is something that we should test, but in a test dedicated to this.

The primary objective of test_network_basic_ops is to verify the network
plumbing and end-to-end connectivity, so it should be decoupled from things
like network cache refresh.

If the floating IP is associated via the neutron API, only the neutron API
will report the association in a timely manner. 
If instead the floating IP is created via the nova API, this will update the
network cache automatically, not relying on the cache refresh mechanism, so
both the neutron and nova APIs will report the association in a timely
manner (this did not work some weeks ago, so it is something tempest tests
should catch).
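
To make the two views concrete, here is a sketch of checking the
association on each side (assuming python-novaclient and
python-neutronclient; client construction is elided and the helper names
are illustrative):

    def fip_associated_in_neutron(neutron, fip_id):
        # Neutron's own view: the floating IP resource records the port
        # it is associated with as soon as the association call returns.
        fip = neutron.show_floatingip(fip_id)['floatingip']
        return fip['port_id'] is not None

    def fip_visible_in_nova(nova, server_id, fip_address):
        # Nova's view: the address only shows up in the server details
        # once nova's network info cache has been refreshed.
        server = nova.servers.get(server_id)
        return any(ip['addr'] == fip_address and
                   ip.get('OS-EXT-IPS:type') == 'floating'
                   for ips in server.addresses.values()
                   for ip in ips)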

andrea

-Original Message-
From: Brent Eagles [mailto:beag...@redhat.com] 
Sent: 19 December 2013 14:53
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the
FloatingIPChecker control point

Hi,

Yair Fried wrote:
 I would also like to point out that, since Brent used 
 compute.build_timeout as the timeout value: ***it takes more time to 
 update the FLIP in the nova DB than for a VM to build***

 Yair

Agreed. I think that's an extremely important highlight of this discussion.
Propagation of the floating IP is definitely buggy. In the small sample of
logs (2) that I checked, the floating IP assignment propagated in around 10
seconds for test_network_basic_ops, but in the cross tenant connectivity
test it took somewhere around 1 minute for the first assignment and
something over 3 minutes (otherwise known as simply-too-long-to-find-out).
Even if querying once a second were excessive - which I do not feel strongly
enough about to say is anything other than a *possible* contributing factor
- it should not take that long.

Cheers,

Brent



Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-19 Thread Salvatore Orlando
Before starting this post, I confess I did not read all of this thread with
the required level of attention, so I apologise for any repetition.

I just wanted to point out that floating IPs in neutron are created
asynchronously when using the l3 agent, and I think this is clear to
everybody. So when the create floating IP call returns, this does not mean
the floating IP has actually been wired, i.e. the IP configured on the
qg-interface and the SNAT/DNAT rules added.

Unfortunately, neutron lacks a concept of operational status for a floating
IP which would tell clients, including nova (which acts as a client wrt the
neutron API), when a floating IP is ready to be used. I started work in this
direction, but it has been suspended for a week now. If anybody wants to
take over and deems this a reasonable thing to do, it will be great.
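
To sketch what that could look like for a client (the 'status' field and
'ACTIVE' value below are assumptions about such a future API, not the
current one):

    import time

    def wait_for_fip_wired(neutron, fip_id, timeout=60, interval=2):
        # Hypothetical: poll an operational status on the floating IP
        # until the l3 agent reports it as wired.
        deadline = time.time() + timeout
        while time.time() < deadline:
            fip = neutron.show_floatingip(fip_id)['floatingip']
            if fip.get('status') == 'ACTIVE':
                return True
            time.sleep(interval)
        return False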

I think neutron tests checking connectivity might return more meaningful
failure data if they gathered the status of the various components which
might impact connectivity.
These are:
- The floating IP
- The router internal interface
- The VIF port
- The DHCP agent

Collecting info about the latter is very important but a bit trickier. I
discussed with Sean and Maru that, for a starter, it would be great to grep
the console log to check whether the instance obtained an IP (a rough
sketch of this kind of diagnostics gathering follows the list below).
Other things to consider would be:
- adding an operational status to a subnet, which would express whether the
DHCP agent is in sync with that subnet (this information won't make sense
for subnets with dhcp disabled)
- working on a 'debug' administrative API which could return, for instance,
for each DHCP agent the list of configured networks and leases.
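
For illustration, a minimal sketch of gathering the component statuses
listed earlier plus the console-log grep (the helper name is hypothetical,
client setup is assumed, and the DHCP-lease pattern depends on the guest
image):

    import re

    def gather_connectivity_diagnostics(neutron, nova, fip_id, router_id,
                                        vif_port_id, server):
        report = {
            'floating_ip': neutron.show_floatingip(fip_id)['floatingip'],
            'router_ports': neutron.list_ports(device_id=router_id)['ports'],
            'vif_port': neutron.show_port(vif_port_id)['port'],
            'dhcp_agents': neutron.list_agents(
                agent_type='DHCP agent')['agents'],
        }
        # Grep the console log for evidence that the instance obtained
        # an IP via DHCP.
        console = server.get_console_output()
        report['dhcp_lease_seen'] = bool(
            re.search(r'obtained|bound to|lease of', console, re.IGNORECASE))
        return report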

Regarding timeouts, I think it's fair for tempest to define a timeout and
ask that everything from VM boot to Floating IP wiring completes within
that timeout.

Regards,
Salvatore


On 19 December 2013 16:15, Frittoli, Andrea (Cloud Services) 
fritt...@hp.com wrote:

 My 2 cents:

 In the test the floating IP is created via the neutron API and later
 checked via the nova API.

 So the test is relying on (or trying to verify?) the network cache
 refresh mechanism in nova.
 This is something that we should test, but in a test dedicated to this.

 The primary objective of test_network_basic_ops is to verify the network
 plumbing and end-to-end connectivity, so it should be decoupled from
 things like network cache refresh.

 If the floating IP is associated via the neutron API, only the neutron
 API will report the association in a timely manner.
 If instead the floating IP is created via the nova API, this will update
 the network cache automatically, not relying on the cache refresh
 mechanism, so both the neutron and nova APIs will report the association
 in a timely manner (this did not work some weeks ago, so it is something
 tempest tests should catch).

 andrea

 -Original Message-
 From: Brent Eagles [mailto:beag...@redhat.com]
 Sent: 19 December 2013 14:53
 To: OpenStack Development Mailing List (not for usage questions)
 Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the
 FloatingIPChecker control point

 Hi,

 Yair Fried wrote:
  I would also like to point out that, since Brent used
  compute.build_timeout as the timeout value: ***it takes more time to
  update the FLIP in the nova DB than for a VM to build***
 
  Yair

 Agreed. I think that's an extremely important highlight of this
 discussion. Propagation of the floating IP is definitely buggy. In the
 small sample of logs (2) that I checked, the floating IP assignment
 propagated in around 10 seconds for test_network_basic_ops, but in the
 cross tenant connectivity test it took somewhere around 1 minute for the
 first assignment and something over 3 minutes (otherwise known as
 simply-too-long-to-find-out). Even if querying once a second were
 excessive - which I do not feel strongly enough about to say is anything
 other than a *possible* contributing factor - it should not take that long.

 Cheers,

 Brent



[openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-18 Thread Brent Eagles

Hi,

Yair and I were discussing a change that I initiated and was 
incorporated into the test_network_basic_ops test. It was intended as a 
configuration control point for floating IP address assignments before 
actually testing connectivity. The question we were discussing was 
whether this check was a valid pass/fail criteria for tests like 
test_network_basic_ops.


The initial motivation for the change was that test_network_basic_ops 
had a less than 50/50 chance of passing in my local environment for 
whatever reason. After looking at the test, it seemed ridiculous that it 
should be failing. The problem is that more often than not the data that 
was available in the logs all pointed to it being set up correctly, but 
the ping test for connectivity was timing out. From the logs it wasn't 
clear whether the test was failing because neutron did not do the right 
thing, did not do it fast enough, or because something else was happening. 
Of course, if I paused the test for a short bit between setup and the 
checks to manually verify everything, the checks always passed. So it's a 
timing issue, right?


Two things: adding more timeout to a check is as appealing to me as 
gargling glass, AND I was less annoyed that the test was failing than I 
was that it wasn't clear from reading the logs what had gone wrong. I 
tried to find an additional intermediate control point that would split 
failure modes into two categories: neutron is too slow in setting things 
up, and neutron failed to set things up correctly. Granted, it still adds 
timeout to the test, but with a control point based on settling, if it 
passed and the next check failed, there would be a good chance that 
neutron actually screwed up what it was trying to do.


Waiting on the query to nova for the floating IP information seemed a 
relatively reasonable, if imperfect, settling criterion before attempting 
to connect to the VM. Testing to see if the floating IP assignment gets to 
the nova instance details is a valid test and, AFAICT, missing from the 
current tests. However, Yair has the reasonable point that connectivity is 
often available long before the floating IP appears in the nova results, 
and that it could be considered invalid to use non-network-specific 
criteria as pass/fail for this test.
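
A minimal sketch of that settling check (the names are hypothetical, not
the actual FloatingIPChecker code): poll nova's server details until the
floating IP shows up, then proceed to the connectivity check.

    import time

    def floating_ip_settled(nova, server_id, fip_address, timeout,
                            interval=1):
        # Settling criterion: the floating IP appears in the nova server
        # details within the timeout.
        deadline = time.time() + timeout
        while time.time() < deadline:
            server = nova.servers.get(server_id)
            for ips in server.addresses.values():
                if any(ip['addr'] == fip_address for ip in ips):
                    return True
            time.sleep(interval)
        return False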


In general, the validity of checking for the presence of a floating IP 
in the server details is a matter of interpretation. I think it is a 
given that it must be tested somewhere, and that if it causes a test to 
fail then it is as valid a failure as a ping failing. Certainly I have 
seen scenarios where an IP appears but doesn't actually work, and others 
where the IP doesn't appear (ever, not just after a really long while) but 
magically works. Both are bugs. Which is more appropriate to tests like 
test_network_basic_ops?


Currently, the polling interval for the checks in the gate needs to be 
tuned. They borrow other polling configuration, and I can see that is 
ill-advised. The check currently polls at an interval of a second; if 
the intent is to wait for the entire system to settle down before 
proceeding, then polling nova that quickly is too often. It simply 
increases the load while we are waiting for a loaded system to settle. For 
example, in the course of a three-minute timeout, the floating IP check 
polled nova for server details 180 times.


All this aside, it is granted that checking for the floating IP in the 
nova instance details is imperfect in itself. There is nothing that 
assures that the presence of that information indicates that the 
networking backend has done its work.


Comments, suggestions, queries, foam bricks?

Cheers,

Brent



Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

2013-12-18 Thread Jay Pipes

On 12/18/2013 10:21 PM, Brent Eagles wrote:

Hi,

Yair and I were discussing a change that I initiated and was
incorporated into the test_network_basic_ops test. It was intended as a
configuration control point for floating IP address assignments before
actually testing connectivity. The question we were discussing was
whether this check was a valid pass/fail criteria for tests like
test_network_basic_ops.

The initial motivation for the change was that test_network_basic_ops
had a less than 50/50 chance of passing in my local environment for
whatever reason. After looking at the test, it seemed ridiculous that it
should be failing. The problem is that more often than not the data that
was available in the logs all pointed to it being set up correctly, but
the ping test for connectivity was timing out. From the logs it wasn't
clear whether the test was failing because neutron did not do the right
thing, did not do it fast enough, or because something else was happening.
Of course, if I paused the test for a short bit between setup and the
checks to manually verify everything, the checks always passed. So it's a
timing issue, right?

Two things: adding more timeout to a check is as appealing to me as
gargling glass, AND I was less annoyed that the test was failing than I
was that it wasn't clear from reading the logs what had gone wrong. I
tried to find an additional intermediate control point that would split
failure modes into two categories: neutron is too slow in setting things
up, and neutron failed to set things up correctly. Granted, it still adds
timeout to the test, but with a control point based on settling, if it
passed and the next check failed, there would be a good chance that
neutron actually screwed up what it was trying to do.

Waiting on the query to nova for the floating IP information seemed a
relatively reasonable, if imperfect, settling criterion before attempting
to connect to the VM. Testing to see if the floating IP assignment gets to
the nova instance details is a valid test and, AFAICT, missing from the
current tests. However, Yair has the reasonable point that connectivity is
often available long before the floating IP appears in the nova results,
and that it could be considered invalid to use non-network-specific
criteria as pass/fail for this test.


But, Tempest is all about functional integration testing. Using a call 
to Nova's server details to determine whether a dependent call to 
Neutron succeeded (setting up the floating IP) is exactly what I think 
Tempest is all about. It's validating that the integration between Nova 
and Neutron is working as expected.


So, I actually think the assertion on the floating IP address appearing 
(after some timeout/timeout-backoff) is entirely appropriate.



In general, the validity of checking for the presence of a floating IP
in the server details is a matter of interpretation. I think it is a
given that it must be tested somewhere, and that if it causes a test to
fail then it is as valid a failure as a ping failing. Certainly I have
seen scenarios where an IP appears but doesn't actually work, and others
where the IP doesn't appear (ever, not just after a really long while) but
magically works. Both are bugs. Which is more appropriate to tests like
test_network_basic_ops?


I believe both assertions should be part of the test cases, but since 
the latter condition (good ping connectivity, but no floater ever 
appears attached to the instance) necessarily depends on the first 
failure (floating IP does not appear in the server details after a 
timeout), perhaps one way to handle it would be the following (a rough 
sketch in code follows the list):


a) create server instance
b) assign floating ip
c) query server details looking for floater in a timeout-backoff loop
c1) floater does appear
 c1-a) assert ping connectivity
c2) floater does not appear
 c2-a) check ping connectivity. if ping connectivity succeeds, use a 
call to testtools.TestCase.addDetail() to provide some interesting 
feedback

 c2-b) raise assertion that floater did not appear in the server details
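
A rough sketch of that flow inside a testtools-based test, where
floating_ip_settled and can_ping are hypothetical helpers:

    from testtools.content import text_content

    def check_floating_ip_then_connectivity(self, server, fip, timeout):
        if floating_ip_settled(self.nova, server.id, fip, timeout):
            # c1) floater appeared: assert ping connectivity.
            self.assertTrue(can_ping(fip))
        else:
            # c2) floater never appeared: record whether ping worked
            # anyway, then raise the assertion on the missing floater.
            if can_ping(fip):
                self.addDetail('floating-ip', text_content(
                    'ping to %s succeeded although it never appeared '
                    'in the server details' % fip))
            self.fail('floating IP %s did not appear in the server '
                      'details within %ss' % (fip, timeout))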


Currently, the polling interval for the checks in the gate needs to be
tuned. They borrow other polling configuration, and I can see that is
ill-advised. The check currently polls at an interval of a second; if
the intent is to wait for the entire system to settle down before
proceeding, then polling nova that quickly is too often. It simply
increases the load while we are waiting for a loaded system to settle. For
example, in the course of a three-minute timeout, the floating IP check
polled nova for server details 180 times.


Agreed completely.

Best,
-jay


All this aside, it is granted that checking for the floating IP in the
nova instance details is imperfect in itself. There is nothing that
assures that the presence of that information indicates that the
networking backend has done its work.

Comments, suggestions, queries, foam bricks?

Cheers,

Brent