[ 
https://issues.apache.org/jira/browse/RATIS-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Doroszlai updated RATIS-1072:
------------------------------------
    Fix Version/s: 2.0.0
                       (was: 1.1.0)

> Should not shutdown and re-create channel/stub in GrpcServerProtocolClient 
> when StreamObserver::onError() is called.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: RATIS-1072
>                 URL: https://issues.apache.org/jira/browse/RATIS-1072
>             Project: Ratis
>          Issue Type: Improvement
>          Components: gRPC
>            Reporter: Glen Geng
>            Assignee: Glen Geng
>            Priority: Major
>              Labels: MultiRaft, dense-storage
>             Fix For: 2.0.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> [link title|http://example.com/] 
> When InstallSnapshotResponseHandler::onError() or 
> AppendLogResponseHandler::onError() is called, it will:
>  # call rpcService.getProxies().resetProxy(getFollowerId()).
>  # set appendLogRequestObserver to be null.
> The recovery steps for this BiDi streaming call is actually: 
>  # call ManagedChannel.shutdown()
>  # re-create the channel from NettyChannelBuilder
>  # re-create the async stub from the channel.
>  # re-create a StreamObserver<OurRequest> from the async stub.
>  
> *Above steps is un-necessary and dangerous.*
>  
> *It is un-necessary:*
> according to the [reply from 
> grpc|[https://github.com/grpc/grpc-java/issues/7442]]]
> {quote}Yes, you don't need to recreate the channel (and stub) for starting a 
> new RPC call. The overhead for creating a channel is big.
> Receiving {{StreamObserver.onError()}} only indicates that specific RPC has 
> been terminated. The connection should be still there. So you can definitely 
> start new RPCs on the existing connection.
> {quote}
>  
> *It is dangerous:*
> In multi-raft configuraiton, if two leaders sit in DN1, their followers sit 
> in DN2, their steaming call will share the managed channel in the same 
> GrpcServerProtocolClient that targeting from DN1 to DN2.
> If leader1 calls StreamObserver::onError(), it will shut down and re-create 
> the underlying channel. In the same time, leader2's streaming call will be 
> affected, and call StreamObserver::onError() to shut down and re-create the 
> channel as well. The consequences is, the channel will be continually 
> shutdown and re-create.
>  
> *trace from DN consisting of the leaders:*
> We can see that, the channel targetting 172.16.0.69 is continually shutdown 
> and re-create in every 3s.
> {quote}2020-09-18 19:47:04,025 [grpc-default-executor-93] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134495, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:07,025 [grpc-default-executor-115] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134499, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:10,026 [grpc-default-executor-106] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134503, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:13,026 [grpc-default-executor-115] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134507, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:16,027 [grpc-default-executor-106] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134511, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:19,027 [grpc-default-executor-114] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134532, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:22,027 [grpc-default-executor-105] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134540, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:25,028 [grpc-default-executor-104] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134544, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:28,028 [grpc-default-executor-117] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134548, 
> target=172.16.90.69:9858}}. 
>  2020-09-18 19:47:31,028 [grpc-default-executor-114] WARN 
> org.apache.ratis.grpc.GrpcUtil: Timed out gracefully shutting down 
> connection: 
> ManagedChannelOrphanWrapper\{delegate=ManagedChannelImpl{logId=134552, 
> target=172.16.90.69:9858}}. 
>   
> {quote}
> *trace from DN consisting of the followers:*
> Streams of different raft-group is continually affected. The (term, index) 
> indicate that they belong to different raft-group, (t:9, i:8713), (t:7, 
> i:12312),(t:3, i:213868),(t:10, i:12504), meanwhile raft peer id 
> df173506-6978-47e5-8aef-034cafd221a6 means they are from leaders on the same 
> DN.
> {quote}2020-09-18 19:51:49,498 [grpc-default-executor-113] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#320-t9,
>  previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: 
> size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:52,499 [grpc-default-executor-104] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#608-t7,
>  previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: 
> size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:52,499 [grpc-default-executor-109] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#1250591-t3,
>  previous=(t:3, i:213868), leaderCommit=213869, initializing? false, entries: 
> size=1, first=(t:3, i:213869), METADATAENTRY(c:213865): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:52,499 [grpc-default-executor-112] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#601-t10,
>  previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: 
> size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:55,499 [grpc-default-executor-109] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#785-t10,
>  previous=(t:10, i:6660), leaderCommit=6661, initializing? false, entries: 
> size=1, first=(t:10, i:6661), METADATAENTRY(c:6660): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:55,499 [grpc-default-executor-104] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#701-t6,
>  previous=(t:6, i:9305), leaderCommit=9306, initializing? false, entries: 
> size=1, first=(t:6, i:9306), METADATAENTRY(c:9305): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:58,499 [grpc-default-executor-114] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#611-t7,
>  previous=(t:7, i:12312), leaderCommit=12313, initializing? false, entries: 
> size=1, first=(t:7, i:12313), METADATAENTRY(c:12312): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:58,499 [grpc-default-executor-112] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#604-t10,
>  previous=(t:10, i:12504), leaderCommit=12505, initializing? false, entries: 
> size=1, first=(t:10, i:12505), METADATAENTRY(c:12504): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>  2020-09-18 19:51:58,499 [grpc-default-executor-113] WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: 
> 591f38ce-0b11-4003-8528-7a5802f89ea8: installSnapshot onError, lastRequest: 
> df173506-6978-47e5-8aef-034cafd221a6->591f38ce-0b11-4003-8528-7a5802f89ea8#324-t9,
>  previous=(t:9, i:8713), leaderCommit=8714, initializing? false, entries: 
> size=1, first=(t:9, i:8714), METADATAENTRY(c:8713): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
>   
> {quote}
> *The fix is simple* 
> We can reuse the the channel and stub, just need {{step 4}} to re-create the 
> {{StreamObserver.}}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to