[ 
https://issues.apache.org/jira/browse/IOTDB-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gaofei Cao reassigned IOTDB-4553:
---------------------------------

    Assignee: Gaofei Cao  (was: Song Ziyang)

> [remove datanode ] SchemaRegion migration failed
> ------------------------------------------------
>
>                 Key: IOTDB-4553
>                 URL: https://issues.apache.org/jira/browse/IOTDB-4553
>             Project: Apache IoTDB
>          Issue Type: Bug
>          Components: mpp-cluster
>    Affects Versions: 0.14.0-SNAPSHOT
>            Reporter: 刘珍
>            Assignee: Gaofei Cao
>            Priority: Major
>         Attachments: image-2022-09-28-18-03-13-622.png, 
> ip39_datanode_logs.tar.gz, ip40_datanode_logs.tar.gz, remove_datanode.conf
>
>
> master_0928_e5cc456
> SchemaRegion : ratis
> DataRegion : multiLeader
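> Both use 3 replicas. Start 3 ConfigNodes and 3 DataNodes (3C3D) first, write data with the benchmark (bm), add one DataNode (ip40), then scale in (remove) ip39.
> After ip39 was removed successfully, {color:#DE350B}*SchemaRegion migration failed*{color}
>  !image-2022-09-28-18-03-13-622.png! 
> Error on the ip40 DataNode: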
> 2022-09-28 17:37:55,449 [pool-21-IoTDB-DataNodeInternalRPC-Processor-3] ERROR 
> o.a.i.d.s.t.i.DataNodeInternalRPCServiceImpl:1002 - CreateNewRegionPeer 
> error, peers: [Peer{groupId=SchemaRegion[0], 
> endpoint=TEndPoint(ip:172.20.70.37, port:50010)}, 
> Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.38, 
> port:50010)}, Peer{groupId=SchemaRegion[0], 
> endpoint=TEndPoint(ip:172.20.70.39, port:50010)}, 
> Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.40, 
> port:50010)}], regionId: SchemaRegion[0], errorMessage
> org.apache.iotdb.consensus.exception.RatisRequestFailedException: Ratis 
> request failed
>         at 
> org.apache.iotdb.consensus.ratis.RatisConsensus.createPeer(RatisConsensus.java:332)
>         at 
> org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.createNewRegionPeer(DataNodeInternalRPCServiceImpl.java:999)
>         at 
> org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.createNewRegionPeer(DataNodeInternalRPCServiceImpl.java:838)
>         at 
> org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createNewRegionPeer.getResult(IDataNodeRPCService.java:3237)
>         at 
> org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createNewRegionPeer.getResult(IDataNodeRPCService.java:3217)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
>         at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
> exception
>         at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:234)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:181)
>         at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:98)
>         at 
> org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:132)
>         at 
> org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:98)
>         at 
> org.apache.ratis.client.impl.GroupManagementImpl.add(GroupManagementImpl.java:51)
>         at 
> org.apache.iotdb.consensus.ratis.RatisConsensus.createPeer(RatisConsensus.java:327)
>         ... 10 common frames omitted
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
> UNAVAILABLE: io exception
>         at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
>         at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
>         at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
>         at 
> org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:507)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$5(GrpcClientProtocolClient.java:183)
>         at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:232)
>         ... 16 common frames omitted
> Caused by: 
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  finishConnect(..) failed: Connection refused: /172.20.70.40:50010
> Caused by: java.net.ConnectException: finishConnect(..) failed: Connection 
> refused
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
>         at 
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:748)
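
The innermost cause in the trace above is a refused TCP connection to 172.20.70.40:50010, the port that the ConfigNode log in this report lists as the schemaRegionConsensusEndPoint of the new DataNode. A plain TCP probe is one way to check whether anything is listening on that endpoint. The following is only a minimal sketch and not part of IoTDB; the function name `can_connect` and the default timeout are assumptions made for illustration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling `can_connect("172.20.70.40", 50010)` from another node would show whether the port is reachable at that moment. It cannot tell why the port was closed at 17:37:55.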
> Errors on the ConfigNode leader (ip34):
> 2022-09-28 17:37:54,827 [ProcExecWorker-9] ERROR 
> o.a.i.c.p.i.RegionMigrateProcedure:137 - Meets error in region migrate state, 
> please do the rollback operation yourself manually according to the error 
> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710, 
> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40, 
> port:50010), regionId: SchemaRegion[2], resp: 
> ConsensusGenericResponse{success=false} 
> exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException: 
> Ratis request failed)
> 2022-09-28 17:37:54,829 [ProcExecWorker-9] ERROR 
> o.a.i.c.p.i.RegionMigrateProcedure:145 - Failed state is not support 
> rollback, filed state ADD_REGION_PEER, originalDataNode: 
> TDataNodeLocation(dataNodeId:3, clientRpcEndPoint:TEndPoint(ip:172.20.70.39, 
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.39, port:9003), 
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.39, port:8777), 
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:40010), 
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:50010))
> 2022-09-28 17:37:54,996 [ProcExecWorker-12] ERROR 
> o.a.i.c.p.e.DataNodeRemoveHandler:384 - Send action createNewRegionPeer, 
> regionId: TConsensusGroupId(type:SchemaRegion, id:0), dataNode: 
> TDataNodeLocation(dataNodeId:6, clientRpcEndPoint:TEndPoint(ip:172.20.70.40, 
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.40, port:9003), 
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.40, port:8777), 
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:40010), 
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:50010)), 
> result: TSStatus(code:915, message:Ratis request failed)
> 2022-09-28 17:44:18,431 [ProcExecWorker-12] ERROR 
> o.a.i.c.p.i.RegionMigrateProcedure:137 - Meets error in region migrate state, 
> please do the rollback operation yourself manually according to the error 
> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710, 
> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40, 
> port:50010), regionId: SchemaRegion[0], resp: 
> ConsensusGenericResponse{success=false} 
> exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException: 
> Ratis request failed)
> 2022-09-28 17:44:18,432 [ProcExecWorker-12] ERROR 
> o.a.i.c.p.i.RegionMigrateProcedure:145 - Failed state is not support 
> rollback, filed state ADD_REGION_PEER, originalDataNode: 
> TDataNodeLocation(dataNodeId:3, clientRpcEndPoint:TEndPoint(ip:172.20.70.39, 
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.39, port:9003), 
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.39, port:8777), 
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:40010), 
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:50010))
> 2022-09-28 17:44:18,626 [ProcExecWorker-11] ERROR 
> o.a.i.c.p.e.DataNodeRemoveHandler:384 - Send action createNewRegionPeer, 
> regionId: TConsensusGroupId(type:SchemaRegion, id:1), dataNode: 
> TDataNodeLocation(dataNodeId:6, clientRpcEndPoint:TEndPoint(ip:172.20.70.40, 
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.40, port:9003), 
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.40, port:8777), 
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:40010), 
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:50010)), 
> result: TSStatus(code:915, message:Ratis request failed)
> 2022-09-28 17:51:08,406 [ProcExecWorker-11] ERROR 
> o.a.i.c.p.i.RegionMigrateProcedure:137 - Meets error in region migrate state, 
> please do the rollback operation yourself manually according to the error 
> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710, 
> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40, 
> port:50010), regionId: SchemaRegion[1], resp: 
> ConsensusGenericResponse{success=false} 
> exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException: 
> Ratis request failed)
> 2022-09-28 17:51:08,406 [ProcExecWorker-11] ERROR 
> o.a.i.c.p.i.RegionMigrateProcedure:145 - Failed state is not support 
> rollback, filed state ADD_REGION_PEER, originalDataNode: 
> TDataNodeLocation(dataNodeId:3, clientRpcEndPoint:TEndPoint(ip:172.20.70.39, 
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.39, port:9003), 
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.39, port:8777), 
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:40010), 
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:50010))
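
The ConfigNode log above repeats the same failure for SchemaRegion[2], SchemaRegion[0], and SchemaRegion[1], each time in state ADD_REGION_PEER against peer 172.20.70.40:50010. When the log is longer, a small script can pull out those fields. This is a hedged sketch that assumes the message layout is exactly the one quoted in this report; the names `MIGRATE_ERROR`, `unquote`, and `failed_migrations` are hypothetical helpers, not IoTDB tooling.

```python
import re
from typing import Dict, List

# Matches the "Meets error in region migrate state" message as quoted in this report.
MIGRATE_ERROR = re.compile(
    r"error state: (?P<state>\w+), migrateResult: TSStatus\(code:(?P<code>\d+),"
    r".*?peerId: TEndPoint\(ip:(?P<ip>[\d.]+), port:(?P<port>\d+)\),"
    r" regionId: (?P<region>\w+\[\d+\])"
)


def unquote(text: str) -> str:
    """Drop leading '>' quote markers and join wrapped lines with single spaces."""
    lines = [re.sub(r"^>\s?", "", line) for line in text.splitlines()]
    return " ".join(" ".join(lines).split())


def failed_migrations(text: str) -> List[Dict[str, str]]:
    """Return one dict (state, code, ip, port, region) per migrate error found."""
    return [m.groupdict() for m in MIGRATE_ERROR.finditer(unquote(text))]
```

Running `failed_migrations` over the quoted log text returns one entry per failed region, which makes it easy to confirm that every failure targets the same peer endpoint.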
> Test environment
> 1. Private cloud, 172.20.70.34..40, 8 CPUs, 32 GB
> 34, 35, 36 are ConfigNodes
> 37..40 are DataNodes
> The benchmark runs on ip21
> 2. Cluster configuration parameters
> ConfigNode
> MAX_HEAP_SIZE="8G"
> MAX_DIRECT_MEMORY_SIZE="4G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> time_partition_interval_for_routing=86400000
> schema_replication_factor=3
>  schema_replication_factor=3
> DataNode
> MAX_HEAP_SIZE="20G"
> MAX_DIRECT_MEMORY_SIZE="6G"
>  wal_buffer_size_in_byte=1048576
>  enable_timed_flush_seq_memtable=true
> seq_memtable_flush_interval_in_ms=3600000
> seq_memtable_flush_check_interval_in_ms=600000
> enable_timed_flush_unseq_memtable=true
> unseq_memtable_flush_interval_in_ms=3600000
>  unseq_memtable_flush_check_interval_in_ms=600000
> query_timeout_threshold=36000000
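> Start the 3 ConfigNodes first: 34, 35, 36
> Then start the 3 DataNodes: 37, 38, 39
> 2. For the bm configuration, see the attachment
> 3. Start the DataNode on ip40
> 4. After bm has run for about 30 minutes, scale in (remove) ip39
> 5. Check the scale-in result
> See the attachments for logs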
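
The ConfigNode parameters quoted in the report list `schema_replication_factor=3` twice. The report does not say whether the second line was meant to be a different key. A quick check for repeated keys in a properties-style fragment could look like the sketch below; `duplicate_keys` is a hypothetical helper written for this note, and it only handles simple `key=value` lines.

```python
from collections import Counter
from typing import List


def duplicate_keys(fragment: str) -> List[str]:
    """Return keys that occur more than once in key=value lines."""
    keys = []
    for line in fragment.splitlines():
        # Remove email quote markers and surrounding whitespace.
        line = line.lstrip("> ").strip()
        if "=" in line and not line.startswith("#"):
            keys.append(line.split("=", 1)[0].strip())
    counts = Counter(keys)
    return [key for key, n in counts.items() if n > 1]
```

Applied to the ConfigNode block of this report, the function would return `schema_replication_factor`, which is a prompt to compare the listing against the actual configuration file on the nodes.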



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
