[
https://issues.apache.org/jira/browse/DISPATCH-2081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles E. Rolke updated DISPATCH-2081:
---------------------------------------
Fix Version/s: Backlog
> Fallback test fail - router not detecting drained link
> ------------------------------------------------------
>
> Key: DISPATCH-2081
> URL: https://issues.apache.org/jira/browse/DISPATCH-2081
> Project: Qpid Dispatch
> Issue Type: Bug
> Components: Routing Engine
> Affects Versions: 1.15.0
> Environment: h4.
> Reporter: Charles E. Rolke
> Priority: Major
> Fix For: Backlog
>
>
> h3. History
> The fallback dest test, particularly the SwitchoverTest subclasses, has had
> a long history of persistent, intermittent failures. See DISPATCH-1361 and
> DISPATCH-1786. CI tests running on Ubuntu xenial fail more frequently than
> on any other platform.
> h3. Recreating the failure
> The only way to get any clue at all is to get access to the router logs after
> a test failure. On the CI systems this is not an option.
> A reproducer was created that usually fails before 1000 switchover tests
> have run. This is an Ubuntu xenial docker image that is run with
> *--cpus=0.8*. This piles slowness upon slowness, which is what it takes to
> get the internal scheduling just right. Then loop on
> *ctest -VV -R fallback_dest*. After the test finally fails, get the log
> files out of the docker image.
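The loop described above could be scripted roughly as follows. This is a minimal sketch, not the actual reproducer: the helper name `run_until_failure` and its `max_runs` limit are hypothetical, and it assumes it is started from the build directory inside the CPU-limited container.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs=1000):
    """Run cmd repeatedly and report when it first fails.

    Returns the 1-based run number of the first non-zero exit status,
    or None if all max_runs runs succeed.
    """
    for run_number in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            # Stop immediately so the router logs of the failing run survive.
            print("failure on run", run_number, file=sys.stderr)
            return run_number
    return None
```

Calling `run_until_failure(["ctest", "-VV", "-R", "fallback_dest"])` would then loop on the ctest command quoted in the description and stop at the first failing run, leaving the log files in place for collection.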
> h3. Analyzing the logs
> h4. Get the Scraper web page
> Run command
> {{ scraper -f I*.log E*.log > fallback_dest.html}}
> Then view the resulting web page.
> h4. Navigating the web page
> Nice web page. Now what? The tests are designed to help you a little here.
> The failing case was test_35. This test uses router address *dest.35* for
> link sources and targets, making the test pretty easy to isolate in the
> more than 1,000,000 lines of web page. The address appears early on in lists
> of addresses and then shows up for real in an attach launched by the self test.
> h3. What happened?
> * This test sets up a sender to INTA, a primary receiver to INTB and a
> fallback receiver in EA1.
> * Surprisingly, the fallback receiver connects before the primary receiver
> despite the order in the test source code. Not to worry.
> * Then the test sends 300 messages that are received and accepted by the
> primary receiver.
> * The primary receiver closes
> * The sender starts sending 300 messages to the fallback receiver
> * These messages go into INTA and get forwarded to INTB. INTB has no
> destination for them so they are released.
> * When the sender gets the released status it sends more.
> * Pretty soon the sender has sent 1,700 messages.
> * Somewhere along the way INTB deletes address M0dest.35
> * Eventually router INTA sends a DRAIN to the sender.
> * The test sender sends enough messages to consume the remaining credit.
> * Then all message traffic stops.
> * The test sits there for a minute and then times out.
> h3. What went wrong
> It looks like the router started a drain cycle with the sender but the sender
> never sent a FLOW back with drain=true.
> Proton python does not spontaneously send flow with drain=true. It is up to
> the application, in this case the fallback_dest self test code, to do that.
> Furthermore, if the application has consumed all the credit then proton will
> not send a flow with drain=true even if sender.drained() is called. Proton
> python sends the flow only if the drained function consumed any credits
> outside of message flow.
> If the router is waiting for a flow then with this test setup it will never
> come.
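The credit accounting described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Proton code: the class `ToySender` and both scenario functions are hypothetical and only mimic the behaviour the description attributes to Proton python, namely that a flow with drain=true goes out only when the drained call gives up at least one unused credit.

```python
class ToySender:
    """Toy model of a sender's credit accounting (hypothetical, not Proton)."""

    def __init__(self):
        self.credit = 0
        self.drain = False
        self.flows_sent = []  # one entry per FLOW the sender would emit

    def on_flow(self, credit, drain):
        """The peer (the router) grants credit and may request a drain."""
        self.credit = credit
        self.drain = drain

    def send(self):
        """Send one message; returns False when no credit is left."""
        if self.credit == 0:
            return False
        self.credit -= 1
        return True

    def drained(self):
        """Give up unused credit; emit FLOW(drain=true) only if any was left."""
        if not self.drain:
            return 0
        consumed = self.credit
        self.credit = 0
        if consumed > 0:
            self.flows_sent.append({"drain": True})
        return consumed


def failing_scenario(credit=10):
    """Sender uses up all credit with messages before calling drained()."""
    sender = ToySender()
    sender.on_flow(credit=credit, drain=True)
    while sender.send():
        pass
    sender.drained()  # nothing left to give up, so no flow is emitted
    return sender.flows_sent


def draining_scenario(credit=10, messages=4):
    """Sender calls drained() while some credit is still unused."""
    sender = ToySender()
    sender.on_flow(credit=credit, drain=True)
    for _ in range(messages):
        sender.send()
    sender.drained()
    return sender.flows_sent
```

In this model `failing_scenario()` leaves `flows_sent` empty, which corresponds to the router waiting for a flow that never arrives, while `draining_scenario()` records exactly one flow with drain set.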
> Note: Knowing now that the issue is drain related, the web page helps find the
> drain. In the Table of Contents click on the link for Noteworthy Log Lines.
> There was one 'Flow with drain set' entry. Clicking on the lozenge shows the
> line number link. Clicking on that link takes you to the flow performative
> for the router issuing it.
> h3. What's the fix?
> # Track client credits and when the credit drops to zero then let that
> satisfy the ongoing drain cycle. Do this even without receiving the flow with
> drain=true.
> # Don't send a drain to begin with. Come up with another way of dealing
> with the client's stream of messages internally that does not involve a drain.
> # The test client could be gimmicked to detect when it has consumed all but
> one credit. Then it could call drained() so proton python could consume the
> last credit via a drain cycle and send the AMQP flow with drain=true. This
> may work to get the test to pass but it won't help people in the real world
> who use the proton python client.
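The first option in the list above might look something like the following bookkeeping on the router side. This is only a sketch of the idea: `DrainTracker` and its method names are hypothetical and do not correspond to anything in the router source, and real link credit handling involves more state than is shown here.

```python
class DrainTracker:
    """Hypothetical per-link bookkeeping for an ongoing drain cycle."""

    def __init__(self, credit):
        self.credit = credit          # credit the client still holds
        self.drain_pending = False    # True while a drain cycle is open

    def start_drain(self):
        """The router asks the client to drain its credit."""
        self.drain_pending = True
        self._complete_if_exhausted()

    def on_transfer(self):
        """The client sent one message, consuming one credit."""
        if self.credit > 0:
            self.credit -= 1
        self._complete_if_exhausted()

    def on_flow(self, drain):
        """The client answered with a FLOW; drain=true ends the cycle."""
        if drain:
            self.credit = 0
            self.drain_pending = False

    def _complete_if_exhausted(self):
        # Option 1: zero remaining credit satisfies the drain cycle even
        # though no FLOW with drain=true was ever received.
        if self.drain_pending and self.credit == 0:
            self.drain_pending = False
```

With this accounting the drain cycle ends either when the client replies with a flow carrying drain=true or when its transfers use up the last credit, so the stall described in the analysis would not depend on the client's behaviour.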
>