[
https://issues.apache.org/jira/browse/IOTDB-3452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17557783#comment-17557783
]
QiangShaowei commented on IOTDB-3452:
-------------------------------------
昨天测试情况,发现在addPeer时失败
1. ConfigNode 2022-06-22 16:19:45发起region迁移请求,2022-06-22 16:20:54收到迁移失败,原因是add
peer 失败。
{color:red}2022-06-22 16:19:45,001{color} [Exec-RemoveDataNode-Thread] DEBUG
o.a.i.c.m.DataNodeRemoveManager:449 - send region
TConsensusGroupId(type:DataRegion, id:11) migrate action to
TDataNodeLocation(dataNodeId:2, externalEndPoint:TEndPoint(ip:192.168.42.196,
port:6697), internalEndPoint:TEndPoint(ip:192.168.42.196, port:9009),
dataBlockManagerEndPoint:TEndPoint(ip:192.168.42.196, port:8797),
dataRegionConsensusEndPoint:TEndPoint(ip:192.168.42.196, port:40090),
schemaRegionConsensusEndPoint:TEndPoint(ip:192.168.42.196, port:50090)), wait
it finished
{color:red}2022-06-22 16:20:54,065
[{color}pool-2-IoTDB-ConfigNodeRPC-Client-16] DEBUG
o.a.i.c.m.DataNodeRemoveManager:498 - accept region
TConsensusGroupId(type:DataRegion, id:11) migrate result, result:
TRegionMigrateResultReportReq(regionId:TConsensusGroupId(type:DataRegion,
id:11), migrateResult:TSStatus(code:911, message:add peer:
TEndPoint(ip:192.168.42.16, port:40090) for region: DataRegion[11] failed.
exception: Ratis request failed),
failedNodeAndReason:{TDataNodeLocation(dataNodeId:1,
externalEndPoint:TEndPoint(ip:192.168.42.16, port:6697),
internalEndPoint:TEndPoint(ip:192.168.42.16, port:9009),
dataBlockManagerEndPoint:TEndPoint(ip:192.168.42.16, port:8797),
dataRegionConsensusEndPoint:TEndPoint(ip:192.168.42.16, port:40090),
schemaRegionConsensusEndPoint:TEndPoint(ip:192.168.42.16,
port:50090))={color:red}AddPeerFailed{color}})
2.要缩容的节点196上,执行region 11 的 addPeer时的jstack如下,会执行setConfiguration方法
"pool-20-IoTDB-Region-Migrate-Pool-1" prio=5 os_prio=0 tid=0x0000fffe2403f800
nid=0x59f9 waiting on condition [0x0000fffd6b9f6000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at java.lang.Thread.sleep(Thread.java:342)
at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
at org.apache.ratis.util.TimeDuration.sleep(TimeDuration.java:325)
at org.apache.ratis.util.TimeDuration.sleep(TimeDuration.java:310)
at
org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:121)
at org.apache.ratis.client.impl.AdminImpl.setConfiguration(AdminImpl.java:47)
at org.apache.ratis.client.api.AdminApi.setConfiguration(AdminApi.java:36)
at
org.apache.iotdb.consensus.ratis.RatisConsensus.sendReconfiguration(RatisConsensus.java:621)
at
org.apache.iotdb.consensus.ratis.RatisConsensus.addPeer(RatisConsensus.java:348)
at
org.apache.iotdb.db.service.RegionMigrateService$RegionMigrateTask.addPeer(RegionMigrateService.java:259)
at
org.apache.iotdb.db.service.RegionMigrateService$RegionMigrateTask.run(RegionMigrateService.java:160)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
缩容196节点16:20:54报错日志如下
2022-06-22 16:20:54,050 [grpc-default-worker-ELG-3-12] DEBUG
o.a.r.t.i.n.u.i.l.AbstractInternalLogger:214 - [id: 0x262750fe,
L:/192.168.42.196:39172 - R:192.168.42.204/192.168.42.204:40090] INBOUND
HEADERS: streamId=3 headers=GrpcHttp2ResponseHeaders[:status: 200,
content-type: application/grpc, exception-type:
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException,
exception-object-bin:
rO0ABXNyAERvcmcuYXBhY2hlLnJhdGlzLnByb3RvY29sLmV4Y2VwdGlvbnMuUmVjb25maWd1cmF0aW9uVGltZW91dEV4Y2VwdGlvbmw90jhgLZ4SAgAAeHIAMm9yZy5hcGFjaGUucmF0aXMucHJvdG9jb2wuZXhjZXB0aW9ucy5SYWZ0RXhjZXB0aW9uAAAAAAAAAAECAAB4cgATamF2YS5pby5JT0V4Y2VwdGlvbmyAc2RlJfCrAgAAeHIAE2phdmEubGFuZy5FeGNlcHRpb27Q/R8+GjscxAIAAHhyABNqYXZhLmxhbmcuVGhyb3dhYmxl1cY1Jzl3uMsDAARMAAVjYXVzZXQAFUxqYXZhL2xhbmcvVGhyb3dhYmxlO0wADWRldGFpbE1lc3NhZ2V0ABJMamF2YS9sYW5nL1N0cmluZztbAApzdGFja1RyYWNldAAeW0xqYXZhL2xhbmcvU3RhY2tUcmFjZUVsZW1lbnQ7TAAUc3VwcHJlc3NlZEV4Y2VwdGlvbnN0ABBMamF2YS91dGlsL0xpc3Q7eHBxAH4ACXQBbDE5Mi4xNjguNDIuMjA0XzQwMDkwQGdyb3VwLTAwMDEwMDAwMDAwQi1Db25maWd1cmF0aW9uU3RhZ2luZ1N0YXRlOiBGYWlsIHRvIHNldCBjb25maWd1cmF0aW9uIFsxOTIuMTY4LjQyLjE2XzQwMDkwfHJwYzoxOTIuMTY4LjQyLjE2OjQwMDkwfGFkbWluOnxjbGllbnQ6fGRhdGFTdHJlYW06fHByaW9yaXR5OjAsIDE5Mi4xNjguNDIuMTk2XzQwMDkwfHJwYzoxOTIuMTY4LjQyLjE5Njo0MDA5MHxhZG1pbjp8Y2xpZW50OnxkYXRhU3RyZWFtOnxwcmlvcml0eTowLCAxOTIuMTY4LjQyLjIwNF80MDA5MHxycGM6MTkyLjE2OC40Mi4yMDQ6NDAwOTB8YWRtaW46fGNsaWVudDp8ZGF0YVN0cmVhbTp8cHJpb3JpdHk6MV0gZHVlIHRvIE5PUFJPR1JFU1N1cgAeW0xqYXZhLmxhbmcuU3RhY2tUcmFjZUVsZW1lbnQ7AkYqPDz9IjkCAAB4cAAAAARzcgAbamF2YS5sYW5nLlN0YWNrVHJhY2VFbGVtZW50YQnFmiY23YUCAARJAApsaW5lTnVtYmVyTAAOZGVjbGFyaW5nQ2xhc3NxAH4ABkwACGZpbGVOYW1lcQB+AAZMAAptZXRob2ROYW1lcQB+AAZ4cAAABEJ0AEZvcmcuYXBhY2hlLnJhdGlzLnNlcnZlci5pbXBsLkxlYWRlclN0YXRlSW1wbCRDb25maWd1cmF0aW9uU3RhZ2luZ1N0YXRldAAUTGVhZGVyU3RhdGVJbXBsLmphdmF0AARmYWlsc3EAfgANAAACwHQALG9yZy5hcGFjaGUucmF0aXMuc2VydmVyLmltcGwuTGVhZGVyU3RhdGVJbXBscQB+ABB0AAxjaGVja1N0YWdpbmdzcQB+AA0AAABfcQB+ABNxAH4AEHQACmFjY2VzcyQ1MDBzcQB+AA0AAAJ8dAA7b3JnLmFwYWNoZS5yYXRpcy5zZXJ2ZXIuaW1wbC5MZWFkZXJTdGF0ZUltcGwkRXZlbnRQcm9jZXNzb3JxAH4AEHQAA3J1bnNyACZqYXZhLnV0aWwuQ29sbGVjdGlvbnMkVW5tb2RpZmlhYmxlTGlzdPwPJTG17I4QAgABTAAEbGlzdHEAfgAIeHIALGphdmEudXRpbC5Db2xsZWN0aW9ucyRVbm1vZGlmaWFibGVDb2xsZWN0aW9uGUIAgMte9x4CAAFMAAFjdAAWTGphdmEvdXRpbC9Db2xsZWN0aW9uO3hwc3IAE2phdmEudXRpbC5BcnJheUxpc3R4gdIdmcdhnQMAAUkABHNpemV4cAAAAAB3BAAAAAB4cQB+AB94,
grpc-status: 13, grpc-message:
192.168.42.204_40090@group-00010000000B-ConfigurationStagingState:
{color:red}Fail to set configuration
{color}[192.168.42.16_40090|rpc:192.168.42.16:40090|admin:|client:|dataStream:|priority:0,
192.168.42.196_40090|rpc:192.168.42.196:40090|admin:|client:|dataStream:|priority:0,
192.168.42.204_40090|rpc:192.168.42.204:40090|admin:|client:|dataStream:|priority:1]
due to NOPROGRESS] padding=0 endStream=true
2022-06-22 16:20:54,054 [pool-20-IoTDB-Region-Migrate-Pool-1] DEBUG
o.a.r.c.i.RaftClientImpl:355 - client-7BC1E2CE1983: suggested new leader: null.
Failed
SetConfigurationRequest:client-7BC1E2CE1983->192.168.42.204_40090@group-00010000000B,
cid=19, seq=0, RW, null,
peers:[192.168.42.16_40090|rpc:192.168.42.16:40090|priority:0,
192.168.42.196_40090|rpc:192.168.42.196:40090|admin:|client:|dataStream:|priority:0,
192.168.42.204_40090|rpc:192.168.42.204:40090|admin:|client:|dataStream:|priority:1]
with {}
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException:
192.168.42.204_40090@group-00010000000B-ConfigurationStagingState: Fail to set
configuration
[192.168.42.16_40090|rpc:192.168.42.16:40090|admin:|client:|dataStream:|priority:0,
192.168.42.196_40090|rpc:192.168.42.196:40090|admin:|client:|dataStream:|priority:0,
192.168.42.204_40090|rpc:192.168.42.204:40090|admin:|client:|dataStream:|priority:1]
due to NOPROGRESS
at
org.apache.ratis.server.impl.LeaderStateImpl$ConfigurationStagingState.fail(LeaderStateImpl.java:1090)
at
org.apache.ratis.server.impl.LeaderStateImpl.checkStaging(LeaderStateImpl.java:704)
at
org.apache.ratis.server.impl.LeaderStateImpl.access$500(LeaderStateImpl.java:95)
at
org.apache.ratis.server.impl.LeaderStateImpl$EventProcessor.run(LeaderStateImpl.java:636)
3. 看204节点日志16:20:54
2022-06-22 16:20:54,044
[192.168.42.204_40090@group-00010000000B-LeaderStateImpl] DEBUG
o.a.r.s.i.LeaderStateImpl$ConfigurationStagingState:1085 -
192.168.42.204_40090@group-00010000000B-ConfigurationStagingState: Fail to set
configuration
[192.168.42.16_40090|rpc:192.168.42.16:40090|admin:|client:|dataStream:|priority:0,
192.168.42.196_40090|rpc:192.168.42.196:40090|admin:|client:|dataStream:|priority:0,
192.168.42.204_40090|rpc:192.168.42.204:40090|admin:|client:|dataStream:|priority:1]
due to NOPROGRESS
2022-06-22 16:20:54,044
[192.168.42.204_40090@group-00010000000B-LeaderStateImpl] DEBUG
o.a.r.m.RatisMetrics:46 - Unregistering Metrics Registry :
ratis_grpc.log_appender.192.168.42.204_40090@group-00010000000B
2022-06-22 16:20:54,044
[192.168.42.204_40090@group-00010000000B-LeaderStateImpl] DEBUG
o.a.r.u.LifeCycle$State:116 -
192.168.42.204_40090@group-00010000000B->192.168.42.16_40090-GrpcLogAppender-LogAppenderDaemon:
RUNNING -> CLOSING
2022-06-22 16:20:54,045
[192.168.42.204_40090@group-00010000000B->192.168.42.16_40090-GrpcLogAppender-LogAppenderDaemon]
DEBUG o.a.r.u.LifeCycle$State:116 -
192.168.42.204_40090@group-00010000000B->192.168.42.16_40090-GrpcLogAppender-LogAppenderDaemon:
CLOSING -> CLOSED
2022-06-22 16:20:54,047 [grpc-default-worker-ELG-3-13] DEBUG
o.a.r.t.i.n.u.i.l.AbstractInternalLogger:214 - [id: 0x38e2240f,
L:/192.168.42.204:40090 - R:/192.168.42.196:39172] OUTBOUND HEADERS: streamId=3
headers=GrpcHttp2OutboundHeaders[:status: 200, content-type: application/grpc,
exception-type:
org.apache.ratis.protocol.exceptions.ReconfigurationTimeoutException,
exception-object-bin:
rO0ABXNyAERvcmcuYXBhY2hlLnJhdGlzLnByb3RvY29sLmV4Y2VwdGlvbnMuUmVjb25maWd1cmF0aW9uVGltZW91dEV4Y2VwdGlvbmw90jhgLZ4SAgAAeHIAMm9yZy5hcGFjaGUucmF0aXMucHJvdG9jb2wuZXhjZXB0aW9ucy5SYWZ0RXhjZXB0aW9uAAAAAAAAAAECAAB4cgATamF2YS5pby5JT0V4Y2VwdGlvbmyAc2RlJfCrAgAAeHIAE2phdmEubGFuZy5FeGNlcHRpb27Q/R8+GjscxAIAAHhyABNqYXZhLmxhbmcuVGhyb3dhYmxl1cY1Jzl3uMsDAARMAAVjYXVzZXQAFUxqYXZhL2xhbmcvVGhyb3dhYmxlO0wADWRldGFpbE1lc3NhZ2V0ABJMamF2YS9sYW5nL1N0cmluZztbAApzdGFja1RyYWNldAAeW0xqYXZhL2xhbmcvU3RhY2tUcmFjZUVsZW1lbnQ7TAAUc3VwcHJlc3NlZEV4Y2VwdGlvbnN0ABBMamF2YS91dGlsL0xpc3Q7eHBxAH4ACXQBbDE5Mi4xNjguNDIuMjA0XzQwMDkwQGdyb3VwLTAwMDEwMDAwMDAwQi1Db25maWd1cmF0aW9uU3RhZ2luZ1N0YXRlOiBGYWlsIHRvIHNldCBjb25maWd1cmF0aW9uIFsxOTIuMTY4LjQyLjE2XzQwMDkwfHJwYzoxOTIuMTY4LjQyLjE2OjQwMDkwfGFkbWluOnxjbGllbnQ6fGRhdGFTdHJlYW06fHByaW9yaXR5OjAsIDE5Mi4xNjguNDIuMTk2XzQwMDkwfHJwYzoxOTIuMTY4LjQyLjE5Njo0MDA5MHxhZG1pbjp8Y2xpZW50OnxkYXRhU3RyZWFtOnxwcmlvcml0eTowLCAxOTIuMTY4LjQyLjIwNF80MDA5MHxycGM6MTkyLjE2OC40Mi4yMDQ6NDAwOTB8YWRtaW46fGNsaWVudDp8ZGF0YVN0cmVhbTp8cHJpb3JpdHk6MV0gZHVlIHRvIE5PUFJPR1JFU1N1cgAeW0xqYXZhLmxhbmcuU3RhY2tUcmFjZUVsZW1lbnQ7AkYqPDz9IjkCAAB4cAAAAARzcgAbamF2YS5sYW5nLlN0YWNrVHJhY2VFbGVtZW50YQnFmiY23YUCAARJAApsaW5lTnVtYmVyTAAOZGVjbGFyaW5nQ2xhc3NxAH4ABkwACGZpbGVOYW1lcQB+AAZMAAptZXRob2ROYW1lcQB+AAZ4cAAABEJ0AEZvcmcuYXBhY2hlLnJhdGlzLnNlcnZlci5pbXBsLkxlYWRlclN0YXRlSW1wbCRDb25maWd1cmF0aW9uU3RhZ2luZ1N0YXRldAAUTGVhZGVyU3RhdGVJbXBsLmphdmF0AARmYWlsc3EAfgANAAACwHQALG9yZy5hcGFjaGUucmF0aXMuc2VydmVyLmltcGwuTGVhZGVyU3RhdGVJbXBscQB+ABB0AAxjaGVja1N0YWdpbmdzcQB+AA0AAABfcQB+ABNxAH4AEHQACmFjY2VzcyQ1MDBzcQB+AA0AAAJ8dAA7b3JnLmFwYWNoZS5yYXRpcy5zZXJ2ZXIuaW1wbC5MZWFkZXJTdGF0ZUltcGwkRXZlbnRQcm9jZXNzb3JxAH4AEHQAA3J1bnNyACZqYXZhLnV0aWwuQ29sbGVjdGlvbnMkVW5tb2RpZmlhYmxlTGlzdPwPJTG17I4QAgABTAAEbGlzdHEAfgAIeHIALGphdmEudXRpbC5Db2xsZWN0aW9ucyRVbm1vZGlmaWFibGVDb2xsZWN0aW9uGUIAgMte9x4CAAFMAAFjdAAWTGphdmEvdXRpbC9Db2xsZWN0aW9uO3hwc3IAE2phdmEudXRpbC5BcnJheUxpc3R4gdIdmcdhnQMAAUkABHNpemV4cAAAAAB3BAAAAAB4cQB+AB94,
grpc-status: 13, grpc-message:
192.168.42.204_40090@group-00010000000B-ConfigurationStagingState: Fail to set
configuration
[192.168.42.16_40090|rpc:192.168.42.16:40090|admin:|client:|dataStream:|priority:0,
192.168.42.196_40090|rpc:192.168.42.196:40090|admin:|client:|dataStream:|priority:0,
192.168.42.204_40090|rpc:192.168.42.204:40090|admin:|client:|dataStream:|priority:1]
due to NOPROGRESS] padding=0 endStream=true
*然后过了约一分钟,报之前的Broken pipe。 由于add peer在16:20:54就失败了,所以不是后面的Broken pipe直接导致的
*2022-06-22 16:21:36,824 [ForkJoinPool.commonPool-worker-9] DEBUG
o.a.r.g.s.GrpcServerProtocolService$ServerRequestStreamObserver:102 -
192.168.42.204_40090: reply
192.168.42.196_40090<-192.168.42.204_40090#250:OK-t1,SUCCESS,nextIndex=1,followerCommit=0
2022-06-22 16:21:36,824 [grpc-default-worker-ELG-3-3] DEBUG
o.a.r.t.i.n.u.i.l.AbstractInternalLogger:214 - [id: 0x9d090570,
L:/192.168.42.204:40090 - R:/192.168.42.196:38776] OUTBOUND DATA: streamId=13
padding=0 endStream=false length=93
bytes=00000000580a450a143139322e3136382e34322e3139365f343030393012143139322e3136382e34322e3230345f34303039301a120a10474747474747474747...
2022-06-22 16:21:36,857 [grpc-default-worker-ELG-3-2] DEBUG
o.a.r.t.i.n.u.i.l.AbstractInternalLogger:214 - [id: 0x154a1b7a,
L:/192.168.42.204:52562 - R:192.168.42.16/192.168.42.16:50090] OUTBOUND
GO_AWAY: lastStreamId=2147483647 errorCode=2 length=48
bytes=7265616441646472657373282e2e29206661696c65643a20436f6e6e656374696f6e2072657365742062792070656572
2022-06-22 16:21:36,859 [grpc-default-worker-ELG-3-2] DEBUG
o.a.r.t.i.n.h.c.h.Http2ConnectionHandler:928 - [id: 0x154a1b7a,
L:/192.168.42.204:52562 ! R:192.168.42.16/192.168.42.16:50090] Sending GOAWAY
failed: lastStreamId '2147483647', errorCode '2', debugData 'readAddress(..)
failed: Connection reset by peer'. Forcing shutdown of the connection.
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors$NativeIoException:
writevAddresses(..) failed: Broken pipe
> DataNode缩容过程中,Region迁移时,ratis server退出
> --------------------------------------
>
> Key: IOTDB-3452
> URL: https://issues.apache.org/jira/browse/IOTDB-3452
> Project: Apache IoTDB
> Issue Type: Test
> Reporter: QiangShaowei
> Assignee: Song Ziyang
> Priority: Major
> Attachments: 128_log_all.rar, 128_log_debug.rar, 16$196日志.rar,
> 74_log_all.rar, 74_log_debug.rar, 91_log_all.rar, 91_log_debug.rar,
> config&204日志.rar, confignode--log_debug.rar, confignode_log_all.rar
>
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)