Charles E. Rolke created DISPATCH-2081:
------------------------------------------
Summary: Fallback test fail - router not detecting drained link
Key: DISPATCH-2081
URL: https://issues.apache.org/jira/browse/DISPATCH-2081
Project: Qpid Dispatch
Issue Type: Bug
Components: Routing Engine
Affects Versions: 1.15.0
Environment: h4.
Reporter: Charles E. Rolke
h3. History
The fallback dest tests, particularly the SwitchoverTest subclasses, have a
long history of persistent, intermittent failures. See DISPATCH-1361 and
DISPATCH-1786. CI tests running on Ubuntu xenial fail more frequently than on
any other platform.
h3. Recreating the failure
The only way to get any clue at all is to get access to the router logs after a
test failure. On the CI systems this is not an option.
A reproducer was created that usually fails before 1000 switchover tests have
run. It is an Ubuntu xenial docker image that is run with *--cpus=0.8*. The CPU
limit piles slowness on top of an already slow platform, which is what it takes
to get the internal scheduling just right for the failure. Then loop on
*ctest -VV -R fallback_dest*. After the test finally fails, copy the log files
out of the docker container.
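The loop itself can be sketched as below. This is a minimal illustration, not the script that was actually used: the helper name {{run_until_failure}} and the {{max_runs}} limit are made up for this example, and it assumes it is started from the build directory inside the container.

```python
import subprocess


def run_until_failure(cmd, max_runs=1000):
    """Run cmd repeatedly.

    Returns the 1-based number of the first failing run, or None if
    all max_runs runs exit with status 0.
    """
    for run in range(1, max_runs + 1):
        # Test output is discarded here; the router log files written by
        # the test are what is needed for the analysis.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return run
    return None
```

Calling {{run_until_failure(["ctest", "-VV", "-R", "fallback_dest"])}} stops at the first failing run so the logs from that run are not overwritten by a later one.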
h3. Analyzing the logs
h4. Get the Scraper web page
Run the command:
{{ scraper -f I*.log E*.log > fallback_dest.html}}
Then view the resulting web page.
h4. Navigating the web page
Nice web page. Now what? The tests are designed to help you a little here. The
failing case was test_35. This test uses router address *dest.35* for link
sources and targets, which makes the test pretty easy to isolate in the
>1,000,000 lines of web page. The address appears early on in lists of
addresses and then shows up for real in an attach launched by the self test.
h3. What happened?
* This test sets up a sender to INTA, a primary receiver to INTB and a
fallback receiver in EA1.
* Surprisingly, the fallback receiver connects before the primary receiver
despite the order in the test source code. Not to worry.
* Then the test sends 300 messages that are received and accepted by the
primary receiver.
* The primary receiver closes.
* The sender starts sending 300 messages to the fallback receiver.
* These messages go into INTA and get forwarded to INTB. INTB has no
destination for them so they are released.
* When the sender gets the released status it sends more.
* Pretty soon the sender has sent 1,700 messages.
* Somewhere along the way INTB deletes address M0dest.35
* Eventually router INTA sends a DRAIN to the sender.
* The test sender sends enough messages to consume the remaining credit.
* Then all message traffic stops.
* The test sits there for a minute and then times out.
h3. What went wrong
It looks like the router started a drain cycle with the sender but the sender
never sent a FLOW back with drain=true.
Proton python does not spontaneously send flow with drain=true. It is up to the
application, in this case the fallback_dest self test code, to do that.
Furthermore, if the application has consumed all the credit then proton will
not send a flow with drain=true even if sender.drained() is called. Proton
python sends the flow only if the drained function consumed any credits outside
of message flow.
If the router is waiting for a flow then with this test setup it will never
come.
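The behavior described above can be illustrated with a toy model. This is not Proton code and has not been checked against the Proton source; the class {{ToySender}} and the function {{run_scenario}} are hypothetical names that only encode the rule stated in this report: a flow with drain=true goes out only when the drained call discards at least one unused credit.

```python
class ToySender:
    """Simplified model of the sending client as described in this report."""

    def __init__(self):
        self.credit = 0
        self.drain_requested = False
        self.drain_flows_sent = 0  # flows with drain=true sent to the router

    def on_flow(self, credit, drain=False):
        # The router grants credit and may ask for a drain.
        self.credit = credit
        self.drain_requested = drain

    def send(self):
        # Sending a message consumes one credit, if any is left.
        if self.credit == 0:
            return False
        self.credit -= 1
        return True

    def drained(self):
        # Discard unused credit. A flow is sent only if something was
        # actually discarded, which is the rule described in the report.
        discarded = self.credit
        self.credit = 0
        if self.drain_requested and discarded > 0:
            self.drain_flows_sent += 1
        return discarded


def run_scenario(credit, messages_to_send):
    """Return how many drain flows the toy sender emits."""
    sender = ToySender()
    sender.on_flow(credit=credit, drain=True)
    for _ in range(messages_to_send):
        sender.send()
    sender.drained()
    return sender.drain_flows_sent
```

In this model {{run_scenario(10, 10)}} emits no flow because the messages used up all the credit, which corresponds to the failing test, while {{run_scenario(10, 9)}} emits one flow because a credit was left for the drain to discard.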
Note: Knowing now that the issue is drain-related, the web page helps find the
drain. In the Table of Contents click on the link for Noteworthy Log Lines.
There was one 'Flow with drain set' entry. Clicking on the lozenge shows the
line number link. Clicking on that link takes you to the flow performative for
the router issuing it.
h3. What's the fix?
# Track the client's credit and, when it drops to zero, let that satisfy the
ongoing drain cycle. Do this even without receiving the flow with
drain=true.
# Don't send a drain to begin with. Come up with another way of dealing with
the client's stream of messages internally that does not involve a drain.
# The test client could be gimmicked to detect when it has consumed all but
one credit. Then it could call drained() so proton python could consume the
last credit via a drain cycle and send the AMQP flow with drain=true. This may
work to get the test to pass but it won't help people in the real world who use
the proton python client.
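The first option can be sketched with a toy model of the router side. This is an illustration only and not the router's implementation: {{ToyRouterLink}} and its methods are hypothetical names, and the model assumes the router can count the credit it has issued against the transfers it has received.

```python
class ToyRouterLink:
    """Toy router-side view of an incoming link from a client sender."""

    def __init__(self):
        self.credit_outstanding = 0  # credit issued but not yet consumed
        self.draining = False

    def issue_credit(self, count):
        self.credit_outstanding += count

    def start_drain(self):
        # With nothing outstanding there is nothing to wait for.
        self.draining = self.credit_outstanding > 0

    def on_transfer(self):
        # Each received message consumes one credit.
        if self.credit_outstanding > 0:
            self.credit_outstanding -= 1
        self._finish_drain_if_exhausted()

    def on_flow(self, drain):
        # The normal path: the client answers with a flow with drain=true.
        if drain:
            self.credit_outstanding = 0
        self._finish_drain_if_exhausted()

    def _finish_drain_if_exhausted(self):
        # Option 1: zero outstanding credit completes the drain cycle,
        # whether or not a flow with drain=true ever arrived.
        if self.draining and self.credit_outstanding == 0:
            self.draining = False
```

With this rule, a link that issues five credits, starts a drain and then receives five transfers leaves the drain cycle on the fifth transfer instead of waiting for a flow that never comes.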