[
https://issues.apache.org/jira/browse/STORM-3763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Li reassigned STORM-3763:
-------------------------------
Assignee: Ethan Li
> Backpressure message ignored by the receiver caused the topology to not
> progress
> --------------------------------------------------------------------------------
>
> Key: STORM-3763
> URL: https://issues.apache.org/jira/browse/STORM-3763
> Project: Apache Storm
> Issue Type: Bug
> Affects Versions: 2.0.0, 2.1.0, 2.2.0
> Reporter: Ethan Li
> Assignee: Ethan Li
> Priority: Major
>
> We have noticed a case where topology is stuck due to the mis-interpretation
> of backpressure messge:
> At beginning, the topology ran fine but a downstream component had
> backpressure, so it sent backpressure signal to its upstream component, and
> the upstream component paused sending data to the downstream bolt.
> Then the downstream component restarted (due to any reason, for example,
> killed by supervisor due to heartbeat timeout). When it came back up, it
> sends backpressure message to the upstream bolt. However, the upstream
> component didn't know how to interpret the backpressure message so it logs
> the below error and ignores the message.
> {code:java}
> 2021-01-28 19:41:37.175 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [ERROR] Unexpected message from server:
> {worker=4c38160a-3c66-4eff-8572-2d0c493bd6c1, bpStatusId=254, bpTasks=[],
> nonBpTasks=[546, 790, 863]}
> {code}
> Then the downstream component will not receive any data from the upstream
> component, so it won't have any backpressure (since no data is sent to it),
> hence it won't send any backpressure update message to the upstream
> component. This leads to a dead situation that the upstream component thinks
> the downstream has backpressure so it paused sending data to it, while the
> downstream doesn't have backpressure but can't receive any data from
> upstream. The topology is stuck because of it.
> Let's look at the code:
> When the connection between the downstream (server) and upstream (client) is
> established,
> server invokes
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerHandler.java#L39-L41
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/daemon/worker/WorkerState.java#L237
> which sends backpressure messages to the client.
> This is because in this pipeline, "StormServerHandler" is the only one with
> that implemented "channelActive()" method.
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/StormServerPipelineFactory.java#L56
> However, the Client side expects authentication messages.
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormClientHandler.java#L70-L75
> so the client can't interpret the backpressure message at the beginning,
> hence "unexpected message".
> This can be supported with an example. I have a wordcount topology running.
> At the startup, the client tries to connect to the server. Once connected, it
> sends a "SASL_TOKEN_MESSAGE_REQUEST".
> client log
> {code:java}
> 021-01-29 19:03:21.355 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [DEBUG] SASL credentials for storm topology wc is
> -8603731884381183101:-9091319821854384981
> 2021-01-29 19:03:21.359 o.a.s.m.n.Client client-worker-1 [DEBUG] successfully
> connected to openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702, [id:
> 0x29da2e9c, L:/10.215.73.209:45870 -
> R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702] [attempt 12]
> 2021-01-29 19:03:21.359 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [INFO] Connection established from /10.215.73.209:45870 to
> openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702
> ...
> 2021-01-29 19:03:21.362 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [DEBUG] Creating saslNettyClient now for channel: [id: 0x29da2e9c,
> L:/10.215.73.209:45870 -
> R:openstorm14blue-n3.blue.ygrid.yahoo.com/10.215.73.209:6702]
> 2021-01-29 19:03:21.363 o.a.s.m.n.SaslNettyClient client-worker-1 [DEBUG]
> SaslNettyClient: Creating SASL DIGEST-MD5 client to authenticate to server
> 2021-01-29 19:03:21.368 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [DEBUG] Sending SASL_TOKEN_MESSAGE_REQUEST
> ...
> 2021-01-29 19:03:21.632 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [DEBUG] send/recv time (ms): 277
> 2021-01-29 19:03:21.633 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [ERROR] Unexpected message from server:
> {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516, bpStatusId=1, bpTasks=[],
> nonBpTasks=[17, 1, 18, 3, 4, 22, 7, 8, 9, 12,
> 13]}
> {code}
> But the server sends the backpressure message first, before it deals with the
> SASL_TOKEN_MESSAGE_REQUEST message
> server log
> {code:java}
> 2021-01-29 19:03:21.473 o.a.s.m.n.SaslStormServerHandler
> Netty-server-localhost-6702-worker-1 [DEBUG] SASL credentials for storm
> topology wc is -8603731884381183101:-9091319821854384981
> 2021-01-29 19:03:21.482 o.a.s.u.Utils main [DEBUG] Using storm.yaml from
> resources
> 2021-01-29 19:03:21.490 o.a.s.d.w.WorkerState
> Netty-server-localhost-6702-worker-1 [INFO] Sending BackPressure status to
> new client. BPStatus: {worker=cdf6f963-678c-45a4-91d2-e1067a9a8516,
> bpStatusId=1, bpTasks=[], nonBpTasks=[17,
> 1, 18, 3, 4, 22, 7, 8, 9, 12, 13]}
> 2021-01-29 19:03:21.510 o.a.s.m.n.SaslStormClientHandler client-worker-1
> [DEBUG] SASL credentials for storm topology wc is
> -8603731884381183101:-9091319821854384981
> 2021-01-29 19:03:21.572 o.a.s.s.i.n.u.Recycler
> Netty-server-localhost-6702-worker-1 [DEBUG]
> -Dio.netty.recycler.maxCapacityPerThread: 4096
> 2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler
> Netty-server-localhost-6702-worker-1 [DEBUG]
> -Dio.netty.recycler.maxSharedCapacityFactor: 2
> 2021-01-29 19:03:21.573 o.a.s.s.i.n.u.Recycler
> Netty-server-localhost-6702-worker-1 [DEBUG]
> -Dio.netty.recycler.linkCapacity: 16
> 2021-01-29 19:03:21.574 o.a.s.s.i.n.u.Recycler
> Netty-server-localhost-6702-worker-1 [DEBUG] -Dio.netty.recycler.ratio: 8
> 2021-01-29 19:03:21.575 o.a.s.v.ConfigValidation main [WARN]
> topology.backpressure.enable is a deprecated config please see class
> org.apache.storm.Config.TOPOLOGY_BACKPRESSURE_ENABLE for more information.
> 2021-01-29 19:03:21.593 o.a.s.s.i.n.b.AbstractByteBuf
> Netty-server-localhost-6702-worker-1 [DEBUG]
> -Dorg.apache.storm.shade.io.netty.buffer.checkAccessible: true
> 2021-01-29 19:03:21.594 o.a.s.s.i.n.b.AbstractByteBuf
> Netty-server-localhost-6702-worker-1 [DEBUG]
> -Dorg.apache.storm.shade.io.netty.buffer.checkBounds: true
> 2021-01-29 19:03:21.594 o.a.s.s.i.n.u.ResourceLeakDetectorFactory
> Netty-server-localhost-6702-worker-1 [DEBUG] Loaded default
> ResourceLeakDetector:
> org.apache.storm.shade.io.netty.util.ResourceLeakDetector@524a134b
> ....
> 2021-01-29 19:03:21.695 o.a.s.m.n.SaslStormServerHandler
> Netty-server-localhost-6702-worker-1 [DEBUG] No saslNettyServer for [id:
> 0x6fa65bc5, L:/10.215.73.209:6702 - R:/10.215.73.209:45870] yet; creating
> now, with topology token: wc
> 2021-01-29 19:03:21.697 o.a.s.m.n.SaslNettyServer
> Netty-server-localhost-6702-worker-1 [DEBUG] SaslNettyServer: Topology token
> is: wc with authmethod DIGEST-MD5
> 2021-01-29 19:03:21.698 o.a.s.m.n.SaslNettyServer
> Netty-server-localhost-6702-worker-1 [DEBUG] SaslDigestCallback: Creating
> SaslDigestCallback handler with topology token: wc
> {code}
> https://github.com/apache/storm/blob/2.2.x-branch/storm-client/src/jvm/org/apache/storm/messaging/netty/SaslStormServerHandler.java#L46-L52
> This is a bug likely introduced in STORM-2306
> (https://github.com/apache/storm/pull/2502). This willl happen on every
> topology when "storm.messaging.netty.authentication" is set true.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)