Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
Salvatore Orlando wrote:
> Before starting this post I confess I did not read this whole thread with the required level of attention, so I apologise for any repetition.
>
> I just wanted to point out that floating IPs in neutron are created asynchronously when using the l3 agent, and I think this is clear to everybody. So when the "create floating IP" call returns, this does not mean the floating IP has actually been wired, i.e. the IP configured on the qg-interface and the SNAT/DNAT rules added.
>
> Unfortunately, neutron lacks a concept of operational status for a floating IP which would tell clients, including nova (which acts as a client wrt the neutron API), when a floating IP is ready to be used. I started work in this direction, but it has been suspended for a week now. If anybody wants to take over and deems this a reasonable thing to do, that would be great.

Unless somebody picks it up before I get back from the break, I'd like to discuss this further with you.

> I think neutron tests checking connectivity might return more meaningful failure data if they gathered the status of the various components which might impact connectivity. These are:
> - The floating IP
> - The router internal interface
> - The VIF port
> - The DHCP agent

I agree wholeheartedly. In fact, I think if we are going to rely on timeouts for pass/fail we need to do more for post-mortem details.

> Collecting info about the latter is very important but a bit trickier. I discussed with Sean and Maru that it would be great, for a starter, to grep the console log to check whether the instance obtained an IP. Other things to consider would be:
> - adding an operational status to a subnet, which would express whether the DHCP agent is in sync with that subnet (this information won't make sense for subnets with dhcp disabled)
> - working on a 'debug' administrative API which could return, for instance, for each DHCP agent the list of configured networks and leases.

Interesting!

> Regarding timeouts, I think it's fair for tempest to define a timeout and ask that everything from VM boot to floating IP wiring completes within that timeout.
>
> Regards,
> Salvatore

I would agree. It would be impossible to have reasonable automated testing otherwise.

Cheers,

Brent
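To make the idea concrete, here is a minimal sketch of such a status-gathering step, written against the python-neutronclient v2 API; the helper itself is hypothetical, not existing tempest or neutron code:

    # Hypothetical post-mortem helper: gather the state of everything that
    # can impact connectivity, so a ping timeout carries useful diagnostics.
    # 'neutron' is assumed to be a neutronclient.v2_0.client.Client.
    def gather_connectivity_diagnostics(neutron, server_id, floatingip_id):
        details = {}

        # Floating IP: does neutron still report the association we made?
        # Note there is no operational status field yet, as discussed above,
        # so the wiring state itself remains invisible here.
        fip = neutron.show_floatingip(floatingip_id)['floatingip']
        details['floating_ip'] = {'fixed_ip': fip.get('fixed_ip_address'),
                                  'port_id': fip.get('port_id')}

        # VIF port(s) of the instance: status should be ACTIVE once wired.
        ports = neutron.list_ports(device_id=server_id)['ports']
        details['vif_ports'] = [(p['id'], p['status']) for p in ports]

        # Router internal interfaces (a broad filter, fine for post-mortem).
        r_ports = neutron.list_ports(
            device_owner='network:router_interface')['ports']
        details['router_interfaces'] = [(p['id'], p['status'])
                                        for p in r_ports]
        return details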
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
- Original Message -
From: Brent Eagles beag...@redhat.com
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Sent: Monday, December 23, 2013 10:48:50 PM
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

> Salvatore Orlando wrote:
> > I think neutron tests checking connectivity might return more meaningful failure data if they gathered the status of the various components which might impact connectivity. These are:
> > - The floating IP
> > - The router internal interface
> > - The VIF port
> > - The DHCP agent

I wrote something addressing at least some of these points:

https://review.openstack.org/#/c/55146/

> [...]
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
Hi Guys,

I ran into this issue trying to incorporate this test into the cross_tenant_connectivity scenario: launching 2 VMs in different tenants. What I saw is that in the gate it fails half the time (the original test passes without issues) and ONLY on the 2nd VM (the first FLIP propagates fine): https://bugs.launchpad.net/nova/+bug/1262529

I don't see this in:
1. my local RHOS-Havana setup
2. the cross_tenant_connectivity scenario without the control point (the test passes without issues)
3. test_network_basic_ops runs in the gate

So here's my somewhat less experienced opinion:
1. this happens due to stress (more than a single FLIP/VM)
2. (as Brent said) the timeout intervals between polling are too short
3. the FLIP is usually reachable long before it is seen in the nova DB (also from manual experience), so blocking the test until it reaches the nova DB doesn't make sense to me. If we could do this in a different thread, then maybe; but using a pass/fail criterion to test for a timing issue seems wrong, especially since, as I understand it, the issue is not IF it reaches the nova DB, only WHEN.

I would like to, at least, move this check from its place as a blocker to later in the test. Before this is done, I would like to know if anyone else has seen the same problems Brent describes prior to this patch being merged.

Regarding Jay's scenario suggestion, I think this should not be part of network_basic_ops, but rather a separate stress scenario creating multiple VMs and testing for FLIP associations and propagation time (a rough sketch of such a scenario follows after this message).

Regards,
Yair

(Also added my comments inline)

- Original Message -
From: Jay Pipes jaypi...@gmail.com
To: openstack-dev@lists.openstack.org
Sent: Thursday, December 19, 2013 5:54:29 AM
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

> On 12/18/2013 10:21 PM, Brent Eagles wrote:
> > [...] Of course if I paused the test for a short bit between setup and the checks to manually verify everything, the checks always passed. So it's a timing issue right?

Did anyone else experience this issue, locally or in the gate?

> [...] So, I actually think the assertion on the floating IP address appearing (after some timeout/timeout-backoff) is entirely appropriate.

Blocking the connectivity check until the DB is updated doesn't make sense to me, since we know the FLIP is reachable well before the nova DB is updated (this is seen also in manual mode, not just in the gate).
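As a rough illustration of the dedicated stress scenario suggested above, here is a sketch that treats FLIP propagation time as measured data rather than a hard pass/fail blocker; 'nova' is assumed to be a python-novaclient client and the helper name is hypothetical:

    import time

    # Hypothetical sketch: given several servers with freshly associated
    # floating IPs, record how long each association takes to appear in
    # nova's server details. None means it never propagated in time.
    def measure_flip_propagation(nova, expected, timeout=180):
        # expected: {server_id: floating_ip_address}
        start = time.time()
        results = {}
        pending = dict(expected)
        while pending and time.time() - start < timeout:
            for server_id, fip in list(pending.items()):
                server = nova.servers.get(server_id)
                addrs = [ip['addr'] for net in server.addresses.values()
                         for ip in net]
                if fip in addrs:
                    results[server_id] = time.time() - start
                    del pending[server_id]
            time.sleep(5)
        results.update((server_id, None) for server_id in pending)
        return results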
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
On 12/18/2013 10:54 PM, Jay Pipes wrote:
> [...] So, I actually think the assertion on the floating IP address appearing (after some timeout/timeout-backoff) is entirely appropriate. [...]
>
> > For example, in the course of a three-minute timeout, the floating IP check polled nova for server details 180 times.
>
> Agreed completely.

We should just add an exponential backoff to the waiting. That should decrease the load over time. I'd be +2 to such a patch.

That being said, I'm not sure why 1 request/sec is considered load that would break the system. That doesn't seem a particularly high rate.
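For illustration, a minimal exponential-backoff wait of the kind suggested here (a sketch, not an existing tempest helper). Starting at 1 second and doubling up to a 32-second cap, a three-minute timeout costs roughly ten polls instead of 180:

    import time

    # Sketch of an exponential-backoff poll loop. check() returns True once
    # the condition holds, e.g. the floating IP showing up in nova's server
    # details.
    def wait_for(check, timeout=180, interval=1, cap=32):
        deadline = time.time() + timeout
        while time.time() < deadline:
            if check():
                return True
            time.sleep(min(interval, max(0.0, deadline - time.time())))
            interval = min(interval * 2, cap)  # 1, 2, 4, ... capped at 32
        return check()  # one last look before giving up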
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
On 12/19/2013 03:31 AM, Yair Fried wrote:
> [...] Regarding Jay's scenario suggestion, I think this should not be part of network_basic_ops, but rather a separate stress scenario creating multiple VMs and testing for FLIP associations and propagation time.

+1. There is no need to overload that one scenario. A dedicated one would be fine.

-Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
I would also like to point out that, since Brent used compute.build_timeout as the timeout value and the check still times out:

***It takes more time to update the FLIP in the nova DB than for a VM to build***

Yair

- Original Message -
From: Sean Dague s...@dague.net
To: OpenStack Development Mailing List (not for usage questions) openstack-dev@lists.openstack.org
Sent: Thursday, December 19, 2013 12:42:56 PM
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

> [...] +1. There is no need to overload that one scenario. A dedicated one would be fine.
>
> -Sean
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
Hi,

Yair Fried wrote:
> I would also like to point out that, since Brent used compute.build_timeout as the timeout value and the check still times out:
> ***It takes more time to update the FLIP in the nova DB than for a VM to build***
> Yair

Agreed. I think that's an extremely important highlight of this discussion. Propagation of the floating IP is definitely bugged.

In the small sample of logs (2) that I checked, the floating IP assignment propagated in around 10 seconds for test_network_basic_ops, but in the cross tenant connectivity test it took somewhere around 1 minute for the first assignment and something over 3 minutes for the second (otherwise known as simply-too-long-to-find-out). Even if querying once a second were excessive - which I do not feel strongly enough about to say is anything other than a *possible* contributing factor - it should not take that long.

Cheers,

Brent
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
My 2 cents:

In the test the floating IP is created via the neutron API and later checked via the nova API, so the test is relying on (or trying to verify?) the network cache refresh mechanism in nova. This is something that we should test, but in a test dedicated to it. The primary objective of test_network_basic_ops is to verify the network plumbing and end-to-end connectivity, so it should be decoupled from things like network cache refresh.

If the floating IP is associated via the neutron API, only the neutron API will report the association in a timely manner. If instead the floating IP is created via the nova API, this will update the network cache directly, not relying on the cache refresh mechanism, so both the neutron and nova APIs will report the association in a timely manner (this did not work some weeks ago, so it is something tempest tests should catch).

andrea

-Original Message-
From: Brent Eagles [mailto:beag...@redhat.com]
Sent: 19 December 2013 14:53
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point

> [...] Even if querying once a second were excessive - which I do not feel strongly enough about to say is anything other than a *possible* contributing factor - it should not take that long.
>
> Cheers,
> Brent
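To make the two paths concrete, here is a sketch using the python-neutronclient and python-novaclient calls of that era (treat the exact signatures as assumptions):

    # 'neutron' is a neutronclient.v2_0.client.Client, 'server' a novaclient
    # server object, 'fip' a neutron floating IP dict, and 'vif_port_id' the
    # instance's neutron port - all assumed to exist already.

    def associate_via_neutron(neutron, fip, vif_port_id):
        # Only the neutron API reflects this promptly; nova's server details
        # lag until the network info cache refreshes.
        neutron.update_floatingip(
            fip['id'], {'floatingip': {'port_id': vif_port_id}})

    def associate_via_nova(server, fip):
        # Nova proxies the association to neutron and updates its own
        # network cache in the same operation, so both APIs report it
        # promptly.
        server.add_floating_ip(fip['floating_ip_address'])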
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
Before starting this post I confess I did not read this whole thread with the required level of attention, so I apologise for any repetition.

I just wanted to point out that floating IPs in neutron are created asynchronously when using the l3 agent, and I think this is clear to everybody. So when the "create floating IP" call returns, this does not mean the floating IP has actually been wired, i.e. the IP configured on the qg-interface and the SNAT/DNAT rules added.

Unfortunately, neutron lacks a concept of operational status for a floating IP which would tell clients, including nova (which acts as a client wrt the neutron API), when a floating IP is ready to be used. I started work in this direction, but it has been suspended for a week now. If anybody wants to take over and deems this a reasonable thing to do, that would be great.

I think neutron tests checking connectivity might return more meaningful failure data if they gathered the status of the various components which might impact connectivity. These are:
- The floating IP
- The router internal interface
- The VIF port
- The DHCP agent

Collecting info about the latter is very important but a bit trickier. I discussed with Sean and Maru that it would be great, for a starter, to grep the console log to check whether the instance obtained an IP. Other things to consider would be:
- adding an operational status to a subnet, which would express whether the DHCP agent is in sync with that subnet (this information won't make sense for subnets with dhcp disabled)
- working on a 'debug' administrative API which could return, for instance, for each DHCP agent the list of configured networks and leases.

Regarding timeouts, I think it's fair for tempest to define a timeout and ask that everything from VM boot to floating IP wiring completes within that timeout.

Regards,
Salvatore

On 19 December 2013 16:15, Frittoli, Andrea (Cloud Services) fritt...@hp.com wrote:
> My 2 cents: In the test the floating IP is created via the neutron API and later checked via the nova API, so the test is relying on (or trying to verify?) the network cache refresh mechanism in nova. [...]
>
> > [...] Even if querying once a second were excessive - which I do not feel strongly enough about to say is anything other than a *possible* contributing factor - it should not take that long.
> >
> > Cheers,
> > Brent
[openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
Hi,

Yair and I were discussing a change that I initiated and that was incorporated into the test_network_basic_ops test. It was intended as a configuration control point for floating IP address assignments before actually testing connectivity. The question we were discussing was whether this check was a valid pass/fail criterion for tests like test_network_basic_ops.

The initial motivation for the change was that test_network_basic_ops had a less than 50/50 chance of passing in my local environment, for whatever reason. After looking at the test, it seemed ridiculous that it should be failing. The problem was that more often than not the data available in the logs all pointed to everything being set up correctly, but the ping test for connectivity was timing out. From the logs it wasn't clear whether the test was failing because neutron did not do the right thing, did not do it fast enough, or because something else was happening. Of course, if I paused the test for a short bit between setup and the checks to manually verify everything, the checks always passed. So it's a timing issue, right?

Two things: adding more timeout to a check is as appealing to me as gargling glass, AND I was less annoyed that the test was failing than that it wasn't clear from reading the logs what had gone wrong. I tried to find an additional intermediate control point that would split the failure modes into two categories: "neutron is too slow in setting things up" and "neutron failed to set things up correctly". Granted, it still adds a timeout to the test, but if I could find a control point based on settling, then if it passed there would be a good chance that if the next check failed it was because neutron actually screwed up what it was trying to do. Waiting until the query on nova for the floating IP information succeeds seemed a relatively reasonable, if imperfect, settling criterion before attempting to connect to the VM.

Testing to see if the floating IP assignment gets to the nova instance details is a valid test and, AFAICT, missing from the current tests. However, Yair has the reasonable point that connectivity is often available long before the floating IP appears in the nova results, and that it could be considered invalid to use non-network-specific criteria as pass/fail for this test.

In general, the validity of checking for the presence of a floating IP in the server details is a matter of interpretation. I think it is a given that it must be tested somewhere, and that if it causes a test to fail then it is as valid a failure as a ping failing. Certainly I have seen scenarios where an IP appears but doesn't actually work, and others where the IP doesn't appear (ever, not just in a really long while) but magically works. Both are bugs. Which is more appropriate to tests like test_network_basic_ops?

The polling interval for the checks currently in the gate should be tuned. They borrow another check's polling configuration, and I can see that this is ill-advised. The check currently polls at an interval of one second, and if the intent is to wait for the entire system to settle down before proceeding, then polling nova that quickly is too often. It simply increases the load while we are waiting for a loaded system to settle. For example, in the course of a three-minute timeout, the floating IP check polled nova for server details 180 times.

All this aside, it is granted that checking for the floating IP in the nova instance details is imperfect in itself. There is nothing that assures that the presence of that information indicates that the networking backend has done its work.

Comments, suggestions, queries, foam bricks?

Cheers,

Brent
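A minimal sketch of the settling control point described above - poll nova's server details until the floating IP appears, then move on to the connectivity check. The helper is illustrative, not the actual FloatingIPChecker code; 'nova' is assumed to be a python-novaclient client:

    import time

    # Illustrative settle check: wait until the floating IP shows up in
    # nova's server details before pinging, so a later ping failure points
    # at real wiring breakage rather than mere slowness.
    def wait_for_floating_ip(nova, server_id, floating_ip,
                             timeout=180, interval=1):
        deadline = time.time() + timeout
        while time.time() < deadline:
            server = nova.servers.get(server_id)
            addrs = [ip['addr'] for net in server.addresses.values()
                     for ip in net]
            if floating_ip in addrs:
                return  # settled: neutron's work propagated to nova
            time.sleep(interval)
        raise AssertionError('floating IP %s never appeared in server '
                             'details' % floating_ip)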
Re: [openstack-dev] [neutron][qa] test_network_basic_ops and the FloatingIPChecker control point
On 12/18/2013 10:21 PM, Brent Eagles wrote:
> [...] Testing to see if the floating IP assignment gets to the nova instance details is a valid test and, AFAICT, missing from the current tests. However, Yair has the reasonable point that connectivity is often available long before the floating IP appears in the nova results, and that it could be considered invalid to use non-network-specific criteria as pass/fail for this test.

But Tempest is all about functional integration testing. Using a call to Nova's server details to determine whether a dependent call to Neutron succeeded (setting up the floating IP) is exactly what I think Tempest is all about: it's validating that the integration between Nova and Neutron is working as expected. So I actually think the assertion on the floating IP address appearing (after some timeout/timeout-backoff) is entirely appropriate.

> In general, the validity of checking for the presence of a floating IP in the server details is a matter of interpretation. [...] Certainly I have seen scenarios where an IP appears but doesn't actually work, and others where the IP doesn't appear (ever, not just in a really long while) but magically works. Both are bugs. Which is more appropriate to tests like test_network_basic_ops?

I believe both assertions should be part of the test cases. But since the latter condition (good ping connectivity, but no floater ever appears attached to the instance) necessarily depends on the first failure (floating IP does not appear in the server details after a timeout), perhaps one way to handle this would be:

a) create server instance
b) assign floating ip
c) query server details looking for the floater in a timeout-backoff loop
   c1) floater does appear
       c1-a) assert ping connectivity
   c2) floater does not appear
       c2-a) check ping connectivity; if ping connectivity succeeds, use a call to testtools.TestCase.addDetail() to provide some interesting feedback
       c2-b) raise an assertion that the floater did not appear in the server details

(A sketch of this flow follows after this message.)

> [...] For example, in the course of a three-minute timeout, the floating IP check polled nova for server details 180 times.

Agreed completely.

Best,
-jay

> All this aside, it is granted that checking for the floating IP in the nova instance details is imperfect in itself. There is nothing that assures that the presence of that information indicates that the networking backend has done its work.
>
> Comments, suggestions, queries, foam bricks?
>
> Cheers,
> Brent
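A sketch of the flow Jay outlines, assuming hypothetical tempest-style helpers (create_server, assign_floating_ip, get_server_addresses, ping); testtools.TestCase.addDetail and testtools.content.text_content are the only real APIs used:

    import time
    import testtools
    from testtools.content import text_content

    class TestNetworkBasicOps(testtools.TestCase):

        def test_floating_ip_then_connectivity(self):
            # a) + b): hypothetical helpers standing in for tempest's setup
            server = self.create_server()
            fip = self.assign_floating_ip(server)

            # c) query server details in a timeout-backoff loop
            interval, deadline = 1, time.time() + 180
            floater_seen = False
            while time.time() < deadline:
                if fip in self.get_server_addresses(server):
                    floater_seen = True
                    break
                time.sleep(interval)
                interval = min(interval * 2, 32)

            if floater_seen:
                # c1-a) floater appeared: connectivity must now work
                self.assertTrue(self.ping(fip))
            else:
                # c2-a) record the oddity if ping works anyway...
                if self.ping(fip):
                    self.addDetail('floating-ip', text_content(
                        'pingable but never appeared in server details'))
                # c2-b) ...and fail on the missing floater either way
                self.fail('floating IP never appeared in server details')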