[
https://issues.apache.org/jira/browse/IOTDB-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
刘珍 reopened IOTDB-3862:
-----------------------
依赖最新ratis打包,基于master ba7bd5d0fd1a97807549bdb579c2557a3fbabbd9
问题还存在:
1. 192.168.130.3/4/5 3 副本3节点
bm在ip2 启动运行
2. 创建元数据成功,开始写入数据,扩容datanode ip1
3. ip5上分布有2个dataregion的leader(共3个dataregion)
缩容ip5,ip5的error log,{color:#DE350B}缩容后还会往ip5发查询语句?{color}
2022-09-13 16:44:33,412
[192.168.130.5_40010@group-000100000005-SegmentedRaftLogWorker-flush-thread1]
ERROR o.a.r.s.r.s.BufferedWriteChannel:115 - Failed to flush channel
java.nio.channels.ClosedChannelException: null
at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:110)
at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:379)
at
org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.fileChannelForce(BufferedWriteChannel.java:113)
at
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-09-13 16:44:57,375 [20220913_084457_65466_5.1.0-142] ERROR
o.a.i.d.m.p.s.FragmentInstanceDispatcherImpl:166 - Execute FragmentInstance
failed: The consensus group SchemaRegion[1] doesn't exist
2022-09-13 16:44:57,597 [pool-22-IoTDB-ClientRPC-Processor-37] WARN
o.a.i.d.u.ErrorHandlingUtils:62 -{color:#DE350B} Status code:
INTERNAL_SERVER_ERROR(500), operation: "SELECT last s_50 FROM
root.test.g2_0.d_8". executeStatement failed {color}
2022-09-13 16:45:30,439 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR
o.a.i.d.s.t.i.DataNodeInternalRPCServiceImpl:628 - {color:#DE350B}*change
region DataRegion[3] leader failed
org.apache.iotdb.consensus.exception.RatisRequestFailedException: Ratis request
failed*{color}
at
org.apache.iotdb.consensus.ratis.RatisConsensus.transferLeader(RatisConsensus.java:526)
at
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.transferLeader(DataNodeInternalRPCServiceImpl.java:619)
at
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.changeRegionLeader(DataNodeInternalRPCServiceImpl.java:612)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$changeRegionLeader.getResult(IDataNodeRPCService.java:2643)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$changeRegionLeader.getResult(IDataNodeRPCService.java:2623)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException:
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception
at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:92)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:233)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient.setConfiguration(GrpcClientProtocolClient.java:199)
at
org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:102)
at
org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:139)
at
org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:104)
at
org.apache.ratis.client.impl.AdminImpl.setConfiguration(AdminImpl.java:46)
at
org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:51)
at
org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:40)
at
org.apache.iotdb.consensus.ratis.RatisConsensus.transferLeader(RatisConsensus.java:516)
... 10 common frames omitted
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException:
UNAVAILABLE: io exception
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
at
org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.setConfiguration(AdminProtocolServiceGrpc.java:490)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$setConfiguration$6(GrpcClientProtocolClient.java:201)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:231)
... 18 common frames omitted
Caused by:
org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
finishConnect(..) failed: Connection refused: /192.168.130.3:40010
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection
refused
at
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
at
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
at
org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:477)
at
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:385)
at
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
at
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at
org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
2022-09-13 16:45:47,956 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR
o.a.i.c.s.ThriftService:168 - IoTDB: {color:#DE350B}*close MPP Data exchange
manager failed because:
java.lang.InterruptedException: null*{color}
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1302)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at
org.apache.iotdb.commons.service.ThriftService.stopService(ThriftService.java:163)
at
org.apache.iotdb.commons.service.ThriftService.stop(ThriftService.java:74)
at
org.apache.iotdb.db.mpp.execution.exchange.MPPDataExchangeService.stop(MPPDataExchangeService.java:130)
at
org.apache.iotdb.commons.service.IService.waitAndStop(IService.java:36)
at
org.apache.iotdb.commons.service.RegisterManager.deregisterAll(RegisterManager.java:60)
at org.apache.iotdb.db.service.DataNode.deactivate(DataNode.java:454)
at org.apache.iotdb.db.service.DataNode.stop(DataNode.java:409)
at
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.stopDataNode(DataNodeInternalRPCServiceImpl.java:859)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2793)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2773)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-09-13 16:45:47,999 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR
o.a.i.d.r.AbstractPoolManager:49 - Interrupted while waiting flush sub task
thread pool to exit.
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2067)
at
java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
at
org.apache.iotdb.db.rescon.AbstractPoolManager.close(AbstractPoolManager.java:44)
at
org.apache.iotdb.db.rescon.AbstractPoolManager.stop(AbstractPoolManager.java:96)
at
org.apache.iotdb.db.engine.flush.pool.FlushSubTaskPoolManager.stop(FlushSubTaskPoolManager.java:63)
at
org.apache.iotdb.db.engine.flush.FlushManager.stop(FlushManager.java:63)
at
org.apache.iotdb.commons.service.IService.waitAndStop(IService.java:36)
at
org.apache.iotdb.commons.service.RegisterManager.deregisterAll(RegisterManager.java:60)
at org.apache.iotdb.db.service.DataNode.deactivate(DataNode.java:454)
at org.apache.iotdb.db.service.DataNode.stop(DataNode.java:409)
at
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.stopDataNode(DataNodeInternalRPCServiceImpl.java:859)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2793)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2773)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2022-09-13 16:45:48,026 [pool-18-IoTDB-DataNodeInternalRPC-Processor-94] ERROR
o.a.ratis.grpc.GrpcUtil:219 - Unexpected exception while waiting for channel
termination
java.lang.InterruptedException: null
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at
org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelImpl.awaitTermination(ManagedChannelImpl.java:905)
at
org.apache.ratis.thirdparty.io.grpc.internal.ForwardingManagedChannel.awaitTermination(ForwardingManagedChannel.java:57)
at
org.apache.ratis.grpc.GrpcUtil.shutdownManagedChannel(GrpcUtil.java:215)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient.close(GrpcClientProtocolClient.java:172)
at org.apache.ratis.util.PeerProxyMap.closeProxy(PeerProxyMap.java:175)
at
org.apache.ratis.util.PeerProxyMap.lambda$null$3(PeerProxyMap.java:169)
at java.util.Optional.ifPresent(Optional.java:159)
at
org.apache.ratis.util.PeerProxyMap.lambda$close$4(PeerProxyMap.java:169)
at
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at
java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3566)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at
java.util.concurrent.ForkJoinPool.helpComplete(ForkJoinPool.java:1881)
at
java.util.concurrent.ForkJoinPool.externalHelpComplete(ForkJoinPool.java:2478)
at
java.util.concurrent.ForkJoinTask.externalAwaitDone(ForkJoinTask.java:324)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:405)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
at
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583)
at org.apache.ratis.util.PeerProxyMap.close(PeerProxyMap.java:168)
at
org.apache.ratis.client.impl.RaftClientRpcWithProxy.close(RaftClientRpcWithProxy.java:53)
at
org.apache.ratis.client.impl.RaftClientImpl.close(RaftClientImpl.java:406)
at
org.apache.iotdb.consensus.ratis.RatisClient.close(RatisClient.java:62)
at
org.apache.iotdb.consensus.ratis.RatisClient$Factory.destroyObject(RatisClient.java:91)
at
org.apache.iotdb.consensus.ratis.RatisClient$Factory.destroyObject(RatisClient.java:74)
at
org.apache.commons.pool2.KeyedPooledObjectFactory.destroyObject(KeyedPooledObjectFactory.java:132)
at
org.apache.commons.pool2.impl.GenericKeyedObjectPool.destroy(GenericKeyedObjectPool.java:875)
at
org.apache.commons.pool2.impl.GenericKeyedObjectPool.clear(GenericKeyedObjectPool.java:607)
at
java.util.concurrent.ConcurrentHashMap$KeySetView.forEach(ConcurrentHashMap.java:4649)
at
org.apache.commons.pool2.impl.GenericKeyedObjectPool.clear(GenericKeyedObjectPool.java:582)
at
org.apache.commons.pool2.impl.GenericKeyedObjectPool.close(GenericKeyedObjectPool.java:692)
at
org.apache.iotdb.commons.client.ClientManager.close(ClientManager.java:103)
at
org.apache.iotdb.consensus.ratis.RatisConsensus.stop(RatisConsensus.java:153)
at org.apache.iotdb.db.service.DataNode.stop(DataNode.java:413)
at
org.apache.iotdb.db.service.thrift.impl.DataNodeInternalRPCServiceImpl.stopDataNode(DataNodeInternalRPCServiceImpl.java:859)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2793)
at
org.apache.iotdb.mpp.rpc.thrift.IDataNodeRPCService$Processor$stopDataNode.getResult(IDataNodeRPCService.java:2773)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:38)
at
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:248)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
> [shrink + Region-Migrate ] [pool-16-IoTDB-Region-Migrate-Pool-1] ERROR
> o.a.i.d.s.RegionMigrateService$RegionMigrateTask:349 - add new peer
> TEndPoint(ip:a.b.c.d, port:40010) for region DataRegion[x] failed
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: IOTDB-3862
> URL: https://issues.apache.org/jira/browse/IOTDB-3862
> Project: Apache IoTDB
> Issue Type: Bug
> Components: mpp-cluster
> Affects Versions: 0.14.0-SNAPSHOT
> Reporter: 刘珍
> Assignee: Song Ziyang
> Priority: Major
> Attachments: expand-ip1_datanode_logs.tar.gz, ip3.sh,
> ip3_config.properties, ip4.sh, ip4_config.properties, ip5.sh,
> ip5_config.properties, shrink-ip3_confignode_log_all.tar.gz,
> shrink-ip3_datanode_log_all.tar.gz
>
>
> master_0718_967cde6
> RatisConsensus ,3 副本3C3D
> 先启动3C3D集群(ip3/4/5),扩容1个datanode(ip1), 在ip5执行缩容ip3的操作。
> ip3的datanode有error日志:
> 2022-07-18 {color:red}*14:00:00*{color},830
> [pool-16-IoTDB-Region-Migrate-Pool-1] ERROR
> o.a.i.d.s.RegionMigrateService$RegionMigrateTask:349 - add new peer
> TEndPoint(ip:192.168.130.1, port:40010){color:red}* for region DataRegion[5]
> failed*{color}
> 扩容节点ip1 有warn
> 2022-07-18 {color:red}*13:57:06*{color},554 [grpc-default-executor-1] WARN
> o.a.ratis.util.LogUtils:122 - 192.168.130.1_40010:
> {color:red}*installSnapshot onError,*{color} lastRequest:
> 192.168.130.4_40010->192.168.130.1_40010#20-t1,previous=(t:1,
> i:1790),leaderCommit=1893,initializing? false,entries: size=103, first=(t:1,
> i:1791), STATEMACHINELOGENTRY, 942@client-A8FC5FC601AC:
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: client
> cancelled
> 缩容失败,可以连缩容datanode执行操作。
> 复现流程:
> 1. 启动3C3D 192.168.130.3/4/5 16C32G * 3
> schema_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> data_region_consensus_protocol_class=org.apache.iotdb.consensus.ratis.RatisConsensus
> schema_replication_factor=3
> data_replication_factor=3
> MAX_HEAP_SIZE="16G"
> 2. 启动3个benchmark分别连3个datanode
> benchmark在192.168.130.2
> /home/benchmark/bm_0620_7ec96c1
> 配置文件见附件 , 启动命令
> nohup sh -x ip3.sh > 3.log &
> nohup sh -x ip4.sh > 4.log &
> nohup sh -x ip5.sh > 5.log &
> 运行约10分钟,ip5执行缩容操作,缩容ip为192.168.130.3 :
> 2022-07-18 *13:56:40*,902 [main] INFO o.a.i.db.service.DataNode:201 - Remove
> result TDataNodeRemoveResp(status:TSStatus(code:200, message:Server accept
> the request))
> 3. iotdb的日志见附件
--
This message was sent by Atlassian Jira
(v8.20.10#820010)