[ 
https://issues.apache.org/jira/browse/STORM-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Li resolved STORM-3763.
-----------------------------
    Fix Version/s: 2.2.1
                   2.3.0
       Resolution: Fixed

Merged to master: 4a1eea700766da2f175ac7eaba6064f0d7f0ff03
Cherry-picked to 2.2.x-branch: 7543a13e021570b6b4d7e583e00823c0b3106a3b

> Backpressure message ignored by the receiver caused the topology to not 
> progress
> --------------------------------------------------------------------------------
>
>                 Key: STORM-3763
>                 URL: https://issues.apache.org/jira/browse/STORM-3763
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 2.1.0, 2.2.0
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>            Priority: Major
>             Fix For: 2.3.0, 2.2.1
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We have noticed a case where topology is stuck due to the mis-interpretation 
> of backpressure messge:
> At beginning, the topology ran fine but a downstream component had 
> backpressure, so it sent backpressure signal to its upstream component, and 
> the upstream component paused sending data to the downstream bolt.
> Then the downstream component restarted (due to any reason, for example, 
> killed by supervisor due to heartbeat timeout). When it came back up, it 
> sends backpressure message to the upstream bolt. However, the upstream 
> component didn't know how to interpret the backpressure message so it logs 
> the below error and ignores the message.  
> {code:java}
> 2021-01-28 19:41:37.175 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [ERROR] Unexpected message from server: 
> {worker=4c38160a-3c66-4eff-8572-2d0c493bd6c1, bpStatusId=254, bpTasks=[], 
> nonBpTasks=[546, 790, 863]}
> {code}
> Then the downstream component will not receive any data from the upstream 
> component, so it won't have any backpressure (since no data is sent to it), 
> hence it won't send any backpressure update message to the upstream 
> component.  This leads to a dead situation that the upstream component thinks 
> the downstream has backpressure so it paused sending data to it, while the 
> downstream doesn't have backpressure but can't receive any data from 
> upstream. The topology is stuck because of it.
> Let's look at the code:
> When the connection between the downstream (server) and upstream (client) is 
> established,
> server invokes
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerHandler.java#L39-L41
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/daemon/worker/WorkerState.java#L237
> which sends backpressure messages to the client.
> This is because in this pipeline, "StormServerHandler" is the only one with 
> that implemented "channelActive()" method.
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerPipelineFactory.java#L56
> However, the Client side expects authentication messages.
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormClientHandler.java#L70-L75
> so the client can't interpret the backpressure message at the beginning, 
> hence "unexpected message".
> This can be supported with an example. I have a wordcount topology running. 
> At the startup, the client tries to connect to the server. Once connected, it 
> sends a "SASL_TOKEN_MESSAGE_REQUEST".
> client log
> {code:java}
> 021-01-29 19:03:21.355 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [DEBUG] SASL credentials for storm topology wc is 
> -8603731884381183101:-9091319821854384981
> 2021-01-29 19:03:21.359 o.a.s.m.n.Client client-worker-1 [DEBUG] successfully 
> connected to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702, [id: 
> 0x29da2e9c, L:/10.215.73.209:45870 - 
> R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] [attempt 12]
> 2021-01-29 19:03:21.359 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [INFO] Connection established from /10.215.73.209:45870 to 
> openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702
> ...
> 2021-01-29 19:03:21.362 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [DEBUG] Creating saslNettyClient now for channel: [id: 0x29da2e9c, 
> L:/10.215.73.209:45870 - 
> R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702]
> 2021-01-29 19:03:21.363 o.a.s.m.n.SaslNettyClient client-worker-1 [DEBUG] 
> SaslNettyClient: Creating SASL DIGEST-MD5 client to authenticate to server
> 2021-01-29 19:03:21.368 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [DEBUG] Sending SASL_TOKEN_MESSAGE_REQUEST
> ...
> 2021-01-29 19:03:21.632 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [DEBUG] send/recv time (ms): 277
> 2021-01-29 19:03:21.633 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [ERROR] Unexpected message from server: 
> {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[], 
> nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12,
> 13]}
> {code}
> But the server sends the backpressure message first, before it deals with the 
> SASL_TOKEN_MESSAGE_REQUEST message
> server log
> {code:java}
> 2021-01-29 19:03:21.473 o.a.s.m.n.SaslStormServerHandler 
> Netty-server-localhost-6702-worker-1 [DEBUG] SASL credentials for storm 
> topology wc is -8603731884381183101:-9091319821854384981
> 2021-01-29 19:03:21.482 o.a.s.u.Utils main [DEBUG] Using storm.yaml from 
> resources
> 2021-01-29 19:03:21.490 o.a.s.d.w.WorkerState 
> Netty-server-localhost-6702-worker-1 [INFO] Sending BackPressure status to 
> new client. BPStatus: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, 
> bpStatusId=1, bpTasks=[], nonBpTasks=[17,
> 1, 18, 3, 4, 22, 7, 8, 9, 12, 13]}
> 2021-01-29 19:03:21.510 o.a.s.m.n.SaslStormClientHandler client-worker-1 
> [DEBUG] SASL credentials for storm topology wc is 
> -8603731884381183101:-9091319821854384981
> 2021-01-29 19:03:21.572 o.a.s.s.i.n.u.Recycler 
> Netty-server-localhost-6702-worker-1 [DEBUG] 
> -Dio.netty.recycler.maxCapacityPerThread: 4096
> 2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler 
> Netty-server-localhost-6702-worker-1 [DEBUG] 
> -Dio.netty.recycler.maxSharedCapacityFactor: 2
> 2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler 
> Netty-server-localhost-6702-worker-1 [DEBUG] 
> -Dio.netty.recycler.linkCapacity: 16
> 2021-01-29 19:03:21.574 o.a.s.s.i.n.u.Recycler 
> Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.ratio: 8
> 2021-01-29 19:03:21.575 o.a.s.v.ConfigValidation main [WARN] 
> topology.backpressure.enable is a deprecated config please see class 
> org.apache.storm.Config.TOPOLOGY_BACKPRESSURE_ENABLE for more information.
> 2021-01-29 19:03:21.593 o.a.s.s.i.n.b.AbstractByteBuf 
> Netty-server-localhost-6702-worker-1 [DEBUG] 
> -Dorg.apache.storm.shade.io.netty.buffer.checkAccessible: true
> 2021-01-29 19:03:21.594 o.a.s.s.i.n.b.AbstractByteBuf 
> Netty-server-localhost-6702-worker-1 [DEBUG] 
> -Dorg.apache.storm.shade.io.netty.buffer.checkBounds: true
> 2021-01-29 19:03:21.594 o.a.s.s.i.n.u.ResourceLeakDetectorFactory 
> Netty-server-localhost-6702-worker-1 [DEBUG] Loaded default 
> ResourceLeakDetector: 
> org.apache.storm.shade.io.netty.util.ResourceLeakDetector@524a134b
> ....
> 2021-01-29 19:03:21.695 o.a.s.m.n.SaslStormServerHandler 
> Netty-server-localhost-6702-worker-1 [DEBUG] No saslNettyServer for [id: 
> 0x6fa65bc5, L:/10.215.73.209:6702 - R:/10.215.73.209:45870] yet; creating 
> now, with topology token: wc
> 2021-01-29 19:03:21.697 o.a.s.m.n.SaslNettyServer 
> Netty-server-localhost-6702-worker-1 [DEBUG] SaslNettyServer: Topology token 
> is: wc with authmethod DIGEST-MD5
> 2021-01-29 19:03:21.698 o.a.s.m.n.SaslNettyServer 
> Netty-server-localhost-6702-worker-1 [DEBUG] SaslDigestCallback: Creating 
> SaslDigestCallback handler with topology token: wc
> {code}
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormServerHandler.java#L46-L52
> This is a bug likely introduced in STORM-2306 
> (https://github.com/apache/storm/pull/2502). This willl happen on every 
> topology when "storm.messaging.netty.authentication" is set true (It is false 
> by default)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to