[
https://issues.apache.org/jira/browse/IOTDB-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gaofei Cao reassigned IOTDB-4553:
---------------------------------
Assignee: Gaofei Cao (was: Song Ziyang)
> [remove datanode ] SchemaRegion migration failed
> ------------------------------------------------
>
> Key: IOTDB-4553
> URL: https://issues.apache.org/jira/browse/IOTDB-4553
> Project: Apache IoTDB
> Issue Type: Bug
> Components: mpp-cluster
> Affects Versions: 0.14.0-SNAPSHOT
> Reporter: 刘珍
> Assignee: Gaofei Cao
> Priority: Major
> Attachments: image-2022-09-28-18-03-13-622.png,
> ip39_datanode_logs.tar.gz, ip40_datanode_logs.tar.gz, remove_datanode.conf
>
>
> master_0928_e5cc456
> SchemaRegion : ratis
> DataRegion : multiLeader
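> Both use 3 replicas. Start 3C3D (3 ConfigNodes, 3 DataNodes), write data with the benchmark (bm), add one DataNode (ip40), then remove ip39 from the cluster.
> After the removal of ip39 reported success, {color:#DE350B}*SchemaRegion migration failed*{color}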
> !image-2022-09-28-18-03-13-622.png!
> DataNode error on ip40:
> 2022-09-28 17:37:55,449 [pool-21-IoTDB-DataNodeInternalRPC-Processor-3] ERROR
> o.a.i.d.s.t.i.DataNodeInternalRPCServiceImpl:1002 - CreateNewRegionPeer
> error, peers: [Peer{groupId=SchemaRegion[0],
> endpoint=TEndPoint(ip:172.20.70.37, port:50010)},
> Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.38,
> port:50010)}, Peer{groupId=SchemaRegion[0],
> endpoint=TEndPoint(ip:172.20.70.39, port:50010)},
> Peer{groupId=SchemaRegion[0], endpoint=TEndPoint(ip:172.20.70.40,
> port:50010)}], regionId: SchemaRegion[0], errorMessage
> org.apache.iotdb.consensus.exception.RatisRequestFailedException: Ratis
> request failed
> at
> org.apache.iotdb.consensus.ratis.RatisConsensus.createPeer(RatisConsensus.java:332)
> at
> org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.createNewRegionPeer(DataNodeInternalRPCServiceImpl.java:999)
> at
> org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.createNewRegionPeer(DataNodeInternalRPCServiceImpl.java:838)
> at
> org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createNewRegionPeer.getResult(IDataNodeRPCService.java:3237)
> at
> org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$createNewRegionPeer.getResult(IDataNodeRPCService.java:3217)
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
> at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
> exception
> at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:234)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:181)
> at
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:98)
> at
> org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:132)
> at
> org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:98)
> at
> org.apache.ratis.client.impl.GroupManagementImpl.add(GroupManagementImpl.java:51)
> at
> org.apache.iotdb.consensus.ratis.RatisConsensus.createPeer(RatisConsensus.java:327)
> ... 10 common frames omitted
> Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
> UNAVAILABLE: io exception
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
> at
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
> at
> org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:507)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$5(GrpcClientProtocolClient.java:183)
> at
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:232)
> ... 16 common frames omitted
> Caused by:
> org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> finishConnect(..) failed: Connection refused: /172.20.70.40:50010
> Caused by: java.net.ConnectException: finishConnect(..) failed: Connection
> refused
> at
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
> at
> org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
> at
> org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:320)
> at
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
> at
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
> at
> org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
> at
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
> at
> org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
> at
> org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
> at
> org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
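
The innermost cause in the trace above is "Connection refused: /172.20.70.40:50010", which means nothing was accepting TCP connections on that port when the groupAdd request was sent. A minimal sketch for checking this by hand is below. PortProbe and isReachable are hypothetical names for illustration only and are not part of IoTDB or Ratis; the default host and port are simply copied from the log.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper, not part of IoTDB: checks whether a TCP port accepts connections. */
final class PortProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers "Connection refused" (nothing listening) as well as connect timeouts.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults are taken from the log above; pass host and port as arguments to override.
        String host = args.length > 0 ? args[0] : "172.20.70.40";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 50010;
        System.out.println(host + ":" + port + " reachable = " + isReachable(host, port, 3000));
    }
}
```

Running it from another DataNode (for example `java PortProbe.java 172.20.70.40 50010`) only shows whether the port is open at that moment; it says nothing about why the Ratis server on ip40 was not listening when the request arrived.
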
> Errors on the ConfigNode leader (ip34):
> 2022-09-28 17:37:54,827 [ProcExecWorker-9] ERROR
> o.a.i.c.p.i.RegionMigrateProcedure:137 - Meets error in region migrate state,
> please do the rollback operation yourself manually according to the error
> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710,
> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40,
> port:50010), regionId: SchemaRegion[2], resp:
> ConsensusGenericResponse{success=false}
> exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException:
> Ratis request failed)
> 2022-09-28 17:37:54,829 [ProcExecWorker-9] ERROR
> o.a.i.c.p.i.RegionMigrateProcedure:145 - Failed state is not support
> rollback, filed state ADD_REGION_PEER, originalDataNode:
> TDataNodeLocation(dataNodeId:3, clientRpcEndPoint:TEndPoint(ip:172.20.70.39,
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.39, port:9003),
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.39, port:8777),
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:40010),
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:50010))
> 2022-09-28 17:37:54,996 [ProcExecWorker-12] ERROR
> o.a.i.c.p.e.DataNodeRemoveHandler:384 - Send action createNewRegionPeer,
> regionId: TConsensusGroupId(type:SchemaRegion, id:0), dataNode:
> TDataNodeLocation(dataNodeId:6, clientRpcEndPoint:TEndPoint(ip:172.20.70.40,
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.40, port:9003),
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.40, port:8777),
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:40010),
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:50010)),
> result: TSStatus(code:915, message:Ratis request failed)
> 2022-09-28 17:44:18,431 [ProcExecWorker-12] ERROR
> o.a.i.c.p.i.RegionMigrateProcedure:137 - Meets error in region migrate state,
> please do the rollback operation yourself manually according to the error
> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710,
> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40,
> port:50010), regionId: SchemaRegion[0], resp:
> ConsensusGenericResponse{success=false}
> exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException:
> Ratis request failed)
> 2022-09-28 17:44:18,432 [ProcExecWorker-12] ERROR
> o.a.i.c.p.i.RegionMigrateProcedure:145 - Failed state is not support
> rollback, filed state ADD_REGION_PEER, originalDataNode:
> TDataNodeLocation(dataNodeId:3, clientRpcEndPoint:TEndPoint(ip:172.20.70.39,
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.39, port:9003),
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.39, port:8777),
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:40010),
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:50010))
> 2022-09-28 17:44:18,626 [ProcExecWorker-11] ERROR
> o.a.i.c.p.e.DataNodeRemoveHandler:384 - Send action createNewRegionPeer,
> regionId: TConsensusGroupId(type:SchemaRegion, id:1), dataNode:
> TDataNodeLocation(dataNodeId:6, clientRpcEndPoint:TEndPoint(ip:172.20.70.40,
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.40, port:9003),
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.40, port:8777),
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:40010),
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.40, port:50010)),
> result: TSStatus(code:915, message:Ratis request failed)
> 2022-09-28 17:51:08,406 [ProcExecWorker-11] ERROR
> o.a.i.c.p.i.RegionMigrateProcedure:137 - Meets error in region migrate state,
> please do the rollback operation yourself manually according to the error
> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710,
> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40,
> port:50010), regionId: SchemaRegion[1], resp:
> ConsensusGenericResponse{success=false}
> exception=org.apache.iotdb.consensus.exception.RatisRequestFailedException:
> Ratis request failed)
> 2022-09-28 17:51:08,406 [ProcExecWorker-11] ERROR
> o.a.i.c.p.i.RegionMigrateProcedure:145 - Failed state is not support
> rollback, filed state ADD_REGION_PEER, originalDataNode:
> TDataNodeLocation(dataNodeId:3, clientRpcEndPoint:TEndPoint(ip:172.20.70.39,
> port:6667), internalEndPoint:TEndPoint(ip:172.20.70.39, port:9003),
> mPPDataExchangeEndPoint:TEndPoint(ip:172.20.70.39, port:8777),
> dataRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:40010),
> schemaRegionConsensusEndPoint:TEndPoint(ip:172.20.70.39, port:50010))
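
The ConfigNode log above reports the same failure for SchemaRegion[2], SchemaRegion[0] and SchemaRegion[1], each in state ADD_REGION_PEER. When going through the attached logs, a small scan like the sketch below can list the failed state and region per message. MigrateErrorScan and its regular expression are assumptions made for illustration, based only on the message layout visible in this report; they are not an IoTDB tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical log scan, assuming the RegionMigrateProcedure message layout quoted above. */
final class MigrateErrorScan {

    // Captures the failed state and the region id of one "Meets error in region migrate state" message.
    private static final Pattern FAILURE =
        Pattern.compile("error state: (\\w+),.*?regionId: (\\w+\\[\\d+\\])");

    /** Returns one "STATE Region[id]" entry per failure message found in the text. */
    static List<String> failedRegions(String logText) {
        // Drop mail-quote prefixes ("> ") and join hard-wrapped lines into one line.
        String joined = logText.replaceAll("(?m)^> ?", "").replaceAll("\\R", " ");
        List<String> result = new ArrayList<>();
        Matcher m = FAILURE.matcher(joined);
        while (m.find()) {
            result.add(m.group(1) + " " + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample text copied from the log in this report.
        String sample =
            "> message!!! error state: ADD_REGION_PEER, migrateResult: TSStatus(code:710,\n"
          + "> message:Add peer for region error, peerId: TEndPoint(ip:172.20.70.40,\n"
          + "> port:50010), regionId: SchemaRegion[2], resp:\n";
        failedRegions(sample).forEach(System.out::println);
    }
}
```
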
> Test environment
> 1. Private cloud, 172.20.70.34..40, 8 CPUs, 32 GB
> 34, 35, 36 are ConfigNodes
> 37..40 are DataNodes
> The benchmark runs on ip21
> 2. Cluster configuration parameters
> ConfigNode
> MAX_HEAP_SIZE="8G"
> MAX_DIRECT_MEMORY_SIZE="4G"
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.multileader.MultiLeaderConsensus
> time_partition_interval_for_routing=86400000
> schema_replication_factor=3
> data_replication_factor=3
> DataNode
> MAX_HEAP_SIZE="20G"
> MAX_DIRECT_MEMORY_SIZE="6G"
> wal_buffer_size_in_byte=1048576
> enable_timed_flush_seq_memtable=true
> seq_memtable_flush_interval_in_ms=3600000
> seq_memtable_flush_check_interval_in_ms=600000
> enable_timed_flush_unseq_memtable=true
> unseq_memtable_flush_interval_in_ms=3600000
> unseq_memtable_flush_check_interval_in_ms=600000
> query_timeout_threshold=36000000
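> Start the 3 ConfigNodes first: 34, 35, 36
> Then start the 3 DataNodes: 37, 38, 39
> 2. For the bm configuration, see the attachment
> 3. Start the DataNode on ip40
> 4. After bm has run for about 30 minutes, remove ip39 from the cluster
> 5. Check the removal result
> See the attachments for the logs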