Ethan Li created STORM-3763:
-------------------------------

             Summary: Backpressure message ignored by the receiver caused the 
topology to not progress
                 Key: STORM-3763
                 URL: https://issues.apache.org/jira/browse/STORM-3763
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 2.1.0, 2.0.0, 2.2.0
            Reporter: Ethan Li


We have noticed a case where topology is stuck due to the mis-interpretation of 
backpressure messge:

At beginning, the topology ran fine but a downstream component had 
backpressure, so it sent backpressure signal to its upstream component, and the 
upstream component paused sending data to the downstream bolt.
Then the downstream component restarted (due to any reason, for example, killed 
by supervisor due to heartbeat timeout). When it came back up, 
when a downstream component restarts, it sends backpressure message to the 
upstream bolt. However, the upstream component didn't know how to interpret the 
backpressure message so it logs the below error and ignores the message.  Then 
the downstream component will not receive any data from the upstream bolt, so 
it won't have any backpressure (since no data is sent to it), hence it won't 
send any backpressure update message to the upstream bolt.  This leads to a 
dead situation that the upstream component thinks the downstream has 
backpressure so it paused sending data to it, while the downstream doesn't have 
backpressure but can't receive any data from upstream. 

{code:java}
2021-01-28 19:41:37.175 o.a.s.m.n.SaslStormClientHandler client-worker-1 
[ERROR] Unexpected message from server: 
{worker=4c38160a-3c66-4eff-8572-2d0c493bd6c1, bpStatusId=254, bpTasks=[], 
nonBpTasks=[546, 790, 863]}
{code}

Let's look at the code:

When the connection between the downstream (server) and upstream (client) is 
established,
server invokes

https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerHandler.java#L39-L41

https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/daemon/worker/WorkerState.java#L237

which sends backpressure messages to the client.

This is because in this pipeline, "StormServerHandler" is the only one with 
that implemented "channelActive()" method.

https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerPipelineFactory.java#L56

However, the Client side expects authentication messages.

https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormClientHandler.java#L70-L75

so the client can't interpret the backpressure message at the beginning, hence 
"unexpected message".

This can be supported with an example. I have a wordcount topology running. At 
the startup, the client tries to connect to the server. Once connected, it 
sends a "SASL_TOKEN_MESSAGE_REQUEST".

client log

{code:java}
021-01-29 19:03:21.355 o.a.s.m.n.SaslStormClientHandler client-worker-1 [DEBUG] 
SASL credentials for storm topology wc is 
-8603731884381183101:-9091319821854384981
2021-01-29 19:03:21.359 o.a.s.m.n.Client client-worker-1 [DEBUG] successfully 
connected to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702, [id: 
0x29da2e9c, L:/10.215.73.209:45870 - 
R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] [attempt 12]
2021-01-29 19:03:21.359 o.a.s.m.n.SaslStormClientHandler client-worker-1 [INFO] 
Connection established from /10.215.73.209:45870 to 
openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702
...
2021-01-29 19:03:21.362 o.a.s.m.n.SaslStormClientHandler client-worker-1 
[DEBUG] Creating saslNettyClient now for channel: [id: 0x29da2e9c, 
L:/10.215.73.209:45870 - 
R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702]
2021-01-29 19:03:21.363 o.a.s.m.n.SaslNettyClient client-worker-1 [DEBUG] 
SaslNettyClient: Creating SASL DIGEST-MD5 client to authenticate to server
2021-01-29 19:03:21.368 o.a.s.m.n.SaslStormClientHandler client-worker-1 
[DEBUG] Sending SASL_TOKEN_MESSAGE_REQUEST

...

2021-01-29 19:03:21.632 o.a.s.m.n.SaslStormClientHandler client-worker-1 
[DEBUG] send/recv time (ms): 277
2021-01-29 19:03:21.633 o.a.s.m.n.SaslStormClientHandler client-worker-1 
[ERROR] Unexpected message from server: 
{worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], 
nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12,
13]}
{code}

But the server sends the backpressure message first, before it deals with the 
SASL_TOKEN_MESSAGE_REQUEST message

server log

{code:java}
2021-01-29 19:03:21.473 o.a.s.m.n.SaslStormServerHandler 
Netty-server-localhost-6702-worker-1 [DEBUG] SASL credentials for storm 
topology wc is -8603731884381183101:-9091319821854384981
2021-01-29 19:03:21.482 o.a.s.u.Utils main [DEBUG] Using storm.yaml from 
resources
2021-01-29 19:03:21.490 o.a.s.d.w.WorkerState 
Netty-server-localhost-6702-worker-1 [INFO] Sending BackPressure status to new 
client. BPStatus: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, 
bpTasks=[], nonBpTasks=[17,
1, 18, 3, 4, 22, 7, 8, 9, 12, 13]}
2021-01-29 19:03:21.510 o.a.s.m.n.SaslStormClientHandler client-worker-1 
[DEBUG] SASL credentials for storm topology wc is 
-8603731884381183101:-9091319821854384981
2021-01-29 19:03:21.572 o.a.s.s.i.n.u.Recycler 
Netty-server-localhost-6702-worker-1 [DEBUG] 
-Dio.netty.recycler.maxCapacityPerThread: 4096
2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler 
Netty-server-localhost-6702-worker-1 [DEBUG] 
-Dio.netty.recycler.maxSharedCapacityFactor: 2
2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler 
Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.linkCapacity: 
16
2021-01-29 19:03:21.574 o.a.s.s.i.n.u.Recycler 
Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.ratio: 8
2021-01-29 19:03:21.575 o.a.s.v.ConfigValidation main [WARN] 
topology.backpressure.enable is a deprecated config please see class 
org.apache.storm.Config.TOPOLOGY_BACKPRESSURE_ENABLE for more information.
2021-01-29 19:03:21.593 o.a.s.s.i.n.b.AbstractByteBuf 
Netty-server-localhost-6702-worker-1 [DEBUG] 
-Dorg.apache.storm.shade.io.netty.buffer.checkAccessible: true
2021-01-29 19:03:21.594 o.a.s.s.i.n.b.AbstractByteBuf 
Netty-server-localhost-6702-worker-1 [DEBUG] 
-Dorg.apache.storm.shade.io.netty.buffer.checkBounds: true
2021-01-29 19:03:21.594 o.a.s.s.i.n.u.ResourceLeakDetectorFactory 
Netty-server-localhost-6702-worker-1 [DEBUG] Loaded default 
ResourceLeakDetector: 
org.apache.storm.shade.io.netty.util.ResourceLeakDetector@524a134b


....

2021-01-29 19:03:21.695 o.a.s.m.n.SaslStormServerHandler 
Netty-server-localhost-6702-worker-1 [DEBUG] No saslNettyServer for [id: 
0x6fa65bc5, L:/10.215.73.209:6702 - R:/10.215.73.209:45870] yet; creating now, 
with topology token: wc
2021-01-29 19:03:21.697 o.a.s.m.n.SaslNettyServer 
Netty-server-localhost-6702-worker-1 [DEBUG] SaslNettyServer: Topology token 
is: wc with authmethod DIGEST-MD5
2021-01-29 19:03:21.698 o.a.s.m.n.SaslNettyServer 
Netty-server-localhost-6702-worker-1 [DEBUG] SaslDigestCallback: Creating 
SaslDigestCallback handler with topology token: wc
{code}

https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormServerHandler.java#L46-L52

This is a bug likely introduced in STORM-2306 
(https://github.com/apache/storm/pull/2502). This willl happen on every 
topology when "storm.messaging.netty.authentication" is set true. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to