[
https://issues.apache.org/jira/browse/RATIS-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Attila Doroszlai updated RATIS-1072:
------------------------------------
Fix Version/s: 2.0.0
(was: 1.1.0)
> Should not shutdown and re-create channel/stub in GrpcServerProtocolClient
> when StreamObserver::onError() is called.
> --------------------------------------------------------------------------------------------------------------------
>
> Key: RATIS-1072
> URL: https://issues.apache.org/jira/browse/RATIS-1072
> Project: Ratis
> Issue Type: Improvement
> Components: gRPC
> Reporter: Glen Geng
> Assignee: Glen Geng
> Priority: Major
> Labels: MultiRaft, dense-storage
> Fix For: 2.0.0
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> [link title|http://example.com/]
> When InstallSnapshotResponseHandler::onError() or
> AppendLogResponseHandler::onError() is called, it will:
> # call rpcService.getProxies().resetProxy(getFollowerId()).
> # set appendLogRequestObserver to be null.
> The recovery steps for this BiDi streaming call is actually:
> # call ManagedChannel.shutdown()
> # re-create the channel from NettyChannelBuilder
> # re-create the async stub from the channel.
> # re-create a StreamObserver<OurRequest> from the async stub.
>
> *Above steps is un-necessary and dangerous.*
>
> *It is un-necessary:*
> according to the [reply from
> grpc|[https://github.com/grpc/grpc-java/issues/7442]]]
> {quote}Yes, you don't need to recreate the channel (and stub) for starting a
> new RPC call. The overhead for creating a channel is big.
> Receiving {{StreamObserver.onError()}} only indicates that specific RPC has
> been terminated. The connection should be still there. So you can definitely
> start new RPCs on the existing connection.
> {quote}
>
> *It is dangerous:*
> In multi-raft configuraiton, if two leaders sit in DN1, their followers sit
> in DN2, their steaming call will share the managed channel in the same
> GrpcServerProtocolClient that targeting from DN1 to DN2.
> If leader1 calls StreamObserver::onError(), it will shut down and re-create
> the underlying channel. In the same time, leader2's streaming call will be
> affected, and call StreamObserver::onError() to shut down and re-create the
> channel as well. The consequences is, the channel will be continually
> shutdown and re-create.
>
> *trace from DN consisting of the leaders:*
> We can see that, the channel targetting 172.16.0.69 is continually shutdown
> and re-create in every 3s.
> {quote}2020-09-18 19:47:04,025 [grpc-default-executor-93] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134495,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:07,025 [grpc-default-executor-115] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134499,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:10,026 [grpc-default-executor-106] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134503,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:13,026 [grpc-default-executor-115] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134507,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:16,027 [grpc-default-executor-106] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134511,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:19,027 [grpc-default-executor-114] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134532,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:22,027 [grpc-default-executor-105] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134540,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:25,028 [grpc-default-executor-104] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134544,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:28,028 [grpc-default-executor-117] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134548,
> target=172.16.90.69:9858}}.
> 2020-09-18 19:47:31,028 [grpc-default-executor-114] WARN
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down
> connection:
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134552,
> target=172.16.90.69:9858}}.
>
> {quote}
> *trace from DN consisting of the followers:*
> Streams of different raft-group is continually affected. The (term, index)
> indicate that they belong to different raft-group, (t:9, i:8713), (t:7,
> i:12312),(t:3, i:213868),(t:10, i:12504), meanwhile raft peer id
> df173506-6978-47e5-8aef-034cafd221a6 means they are from leaders on the same
> DN.
> {quote}2020-09-18 19:51:49,498 [grpc-default-executor-113] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#320-t9,
> previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries:
> size=1, first=(t:9, i:8714), METADATAENTRY(c:8713):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:52,499 [grpc-default-executor-104] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#608-t7,
> previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries:
> size=1, first=(t:7, i:12313), METADATAENTRY(c:12312):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:52,499 [grpc-default-executor-109] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#1250591-t3,
> previous=(t:3, i:213868), leaderCommit=213869, initializing? false, entries:
> size=1, first=(t:3, i:213869), METADATAENTRY(c:213865):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:52,499 [grpc-default-executor-112] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#601-t10,
> previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries:
> size=1, first=(t:10, i:12505), METADATAENTRY(c:12504):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:55,499 [grpc-default-executor-109] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#785-t10,
> previous=(t:10, i:6660), leaderCommit=6661, initializing? false, entries:
> size=1, first=(t:10, i:6661), METADATAENTRY(c:6660):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:55,499 [grpc-default-executor-104] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#701-t6,
> previous=(t:6, i:9305), leaderCommit=9306, initializing? false, entries:
> size=1, first=(t:6, i:9306), METADATAENTRY(c:9305):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:58,499 [grpc-default-executor-114] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#611-t7,
> previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries:
> size=1, first=(t:7, i:12313), METADATAENTRY(c:12312):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:58,499 [grpc-default-executor-112] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#604-t10,
> previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries:
> size=1, first=(t:10, i:12505), METADATAENTRY(c:12504):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
> 2020-09-18 19:51:58,499 [grpc-default-executor-113] WARN
> org.apache.ratis.grpc.server.GrpcServerProtocolService:
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest:
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#324-t9,
> previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries:
> size=1, first=(t:9, i:8714), METADATAENTRY(c:8713):
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED:
> cancelled before receiving half close
>
> {quote}
> *The fix is simple*
> We can reuse the the channel and stub, just need {{step 4}} to re-create the
> {{StreamObserver.}}
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)