[jira] [Updated] (DISPATCH-2081) Fallback test fail - router not detecting drained link

Charles E. Rolke (Jira) Fri, 30 Apr 2021 06:54:07 -0700


     [ 
https://issues.apache.org/jira/browse/DISPATCH-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Charles E. Rolke updated DISPATCH-2081:
---------------------------------------
    Fix Version/s: Backlog

> Fallback test fail - router not detecting drained link
> ------------------------------------------------------
>
>                 Key: DISPATCH-2081
>                 URL: https://issues.apache.org/jira/browse/DISPATCH-2081
>             Project: Qpid Dispatch
>          Issue Type: Bug
>          Components: Routing Engine
>    Affects Versions: 1.15.0
>         Environment: h4.  
>            Reporter: Charles E. Rolke
>            Priority: Major
>             Fix For: Backlog
>
>
> h3. History
> The fallback dest test, particularly the SwitchoverTest subclasses, has had 
> a long history of persistent, intermittent failures. See DISPATCH-1361 and 
> DISPATCH-1786. CI tests running on Ubuntu xenial fail more frequently than 
> on any other platform.
> h3. Recreating the failure
> The only way to get any clue at all is to get access to the router logs after 
> a test failure. On the CI systems this is not an option.
> A reproducer was created that usually fails before 1000 switchover tests 
> have run. It is an Ubuntu xenial docker image that is run with *--cpus=0.8*. 
> Throttling the CPU on an already slow platform appears to get the internal 
> scheduling just right for the failure. Then loop on 
> *ctest -VV -R fallback_dest*. After the test finally fails, get the log 
> files out of the docker image.
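The loop described above could be driven by a small script. This is a minimal sketch, not the reproducer actually used: the function name `loop_until_failure` and the cap of 1000 runs are assumptions made for illustration, and the only detail taken from the report is the `ctest -VV -R fallback_dest` command line.

```python
import subprocess


def loop_until_failure(command, max_runs=1000, run=subprocess.run):
    """Run `command` repeatedly.

    Returns the 1-based number of the first run that exits non-zero,
    or None if every run up to `max_runs` passes. `max_runs` is an
    arbitrary cap chosen for this sketch.
    """
    for attempt in range(1, max_runs + 1):
        result = run(command)
        if result.returncode != 0:
            return attempt
    return None
```

Inside the throttled container this would be called as `loop_until_failure(["ctest", "-VV", "-R", "fallback_dest"])`. Stopping at the first failure matters because the router logs of the failing run are the ones to collect.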
> h3. Analyzing the logs
> h4. Get the Scraper web page
> Run command
> {{  scraper -f I*.log E*.log > fallback_dest.html}}
> Then view the resulting web page.
> h4.  Navigating the web page
> Nice web page. Now what? The tests are designed to help you a little here. 
> The failing case was test_35. This test uses router address *dest.35* for 
> link sources and targets, which makes the test pretty easy to isolate in the 
> >1,000,000 lines of web page. The address first appears in lists of 
> addresses and then shows up for real in an attach launched by the self test.
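The same isolation can be done on the raw log files before opening the web page. The helper below is a hypothetical sketch (`lines_mentioning` is not part of Scraper); it assumes only that the address appears literally in the log text, and it rejects a trailing digit so that *dest.35* does not also match an address such as dest.350.

```python
import re


def lines_mentioning(address, lines):
    """Return (line_number, text) pairs for lines that contain `address`.

    A match followed directly by another digit is ignored, so "dest.35"
    does not match "dest.350". Prefixed forms such as "M0dest.35" still match.
    """
    pattern = re.compile(re.escape(address) + r"(?!\d)")
    return [
        (number, text.rstrip("\n"))
        for number, text in enumerate(lines, start=1)
        if pattern.search(text)
    ]
```

An open log file can be passed directly as `lines`, since iterating over a file yields its lines.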
> h3. What happened?
>  * This test sets up a sender to INTA, a primary receiver to INTB and a 
> fallback receiver in EA1.
>  * Surprisingly the fallback receiver connects before the primary receiver 
> despite the order in the test source code. Not to worry.
>  * Then the test sends 300 messages that are received and accepted by the 
> primary receiver.
>  * The primary receiver closes
>  * The sender starts sending 300 messages to the fallback receiver
>  * These messages go into INTA and get forwarded to INTB. INTB has no 
> destination for them so they are released.
>  * When the sender gets the released status it sends more.
>  * Pretty soon the sender has sent 1,700 messages.
>  * Somewhere along the way INTB deletes address M0dest.35
>  * Eventually router INTA sends a DRAIN to the sender.
>  * The test sender sends enough messages to consume the remaining credit.
>  * Then all message traffic stops.
>  * The test sits there for a minute and then times out.
> h3. What went wrong
> It looks like the router started a drain cycle with the sender but the sender 
> never sent a FLOW back with drain=true.
> Proton python does not spontaneously send flow with drain=true. It is up to 
> the application, in this case the fallback_dest self test code, to do that. 
> Furthermore, if the application has consumed all the credit then proton will 
> not send a flow with drain=true even if sender.drained() is called. Proton 
> python sends the flow only if the drained function consumed any credits 
> outside of message flow.
> If the router is waiting for a flow then with this test setup it will never 
> come.
> Note: Knowing now that the issue is drain related, the web page helps find the 
> drain. In the Table of Contents click on the link for Noteworthy Log Lines. 
> There was one 'Flow with drain set' entry. Clicking on the lozenge shows the 
> line number link. Clicking on that link takes you to the flow performative 
> for the router issuing it.
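The stall can be illustrated with a toy model. This is a sketch of the behavior described above, not proton code: `ToySender` and `drain_reply_seen` are hypothetical names, and the only rule modeled is that a flow with drain=true goes out when `drained()` actually gives back some credit.

```python
class ToySender:
    """Toy stand-in for a sending link. Hypothetical; this is not the proton API."""

    def __init__(self, credit):
        self.credit = credit  # credit granted by the router
        self.flows = []       # flow frames this sender has emitted

    def send(self):
        """Send one message, consuming one credit."""
        if self.credit == 0:
            raise RuntimeError("no credit left")
        self.credit -= 1

    def drained(self):
        """Give back unused credit. A flow is emitted only if credit was given back."""
        returned = self.credit
        self.credit = 0
        if returned > 0:
            self.flows.append({"drain": True, "credit": 0})
        return returned


def drain_reply_seen(sender):
    """True if the router would see a flow with drain=true from this sender."""
    return any(flow["drain"] for flow in sender.flows)


# Case 1: the application answers the drain while credit remains.
early = ToySender(credit=5)
early.send()
early.drained()

# Case 2: the application spends all of its credit first, as the failing test did.
late = ToySender(credit=5)
for _ in range(5):
    late.send()
late.drained()

print(drain_reply_seen(early))  # True
print(drain_reply_seen(late))   # False: the router would keep waiting
```

In the second case the credit is already zero when `drained()` runs, so no flow is produced and a router waiting for one never sees it.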
> h3. What's the fix?
>  # Track client credits and when the credit drops to zero then let that 
> satisfy the ongoing drain cycle. Do this even without receiving the flow with 
> drain=true.
>  #  Don't send a drain to begin with. Come up with another way of dealing 
> with the client's stream of messages internally that does not involve a drain.
>  # The test client could be gimmicked to detect when it has consumed all but 
> one credit. Then it could call drained() so proton python could consume the 
> last credit via a drain cycle and send the AMQP flow with drain=true. This 
> may work to get the test to pass but it won't help people in the real world 
> who use the proton python client.
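Option 1 can be sketched as router-side bookkeeping. This is a toy model under stated assumptions, not qdrouterd code: the class `DrainTracker` and its method names are invented for illustration, and it assumes the router knows how much credit it has issued on the link.

```python
class DrainTracker:
    """Toy per-link drain bookkeeping. Hypothetical; not the router's implementation."""

    def __init__(self, credit):
        self.credit = credit    # credit issued to the client and not yet used
        self.draining = False   # True while a drain cycle is outstanding

    def start_drain(self):
        """Begin a drain cycle; it completes at once if no credit is outstanding."""
        self.draining = True
        self._finish_if_exhausted()

    def on_transfer(self):
        """The client sent a message, consuming one credit."""
        if self.credit > 0:
            self.credit -= 1
        self._finish_if_exhausted()

    def on_flow(self, drain):
        """The client sent a flow; drain=true ends the cycle in the usual way."""
        if drain:
            self.credit = 0
            self.draining = False

    def _finish_if_exhausted(self):
        # Option 1: zero remaining credit satisfies the drain cycle,
        # even though no flow with drain=true was received.
        if self.draining and self.credit == 0:
            self.draining = False


tracker = DrainTracker(credit=3)
tracker.start_drain()
for _ in range(3):
    tracker.on_transfer()
print(tracker.draining)  # False: the cycle ended without a flow from the client
```

With this rule the drain cycle ends whether the client answers with a flow or simply uses up its credit, which is the case the failing test hits.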
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

