gaodayue opened a new issue #1168:
URL: https://github.com/apache/incubator-brpc/issues/1168
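**Describe the bug**
After a downstream node X fails and restarts, occasionally one upstream node Y in the cluster can never reconnect to X, while the other upstream nodes re-establish their connections to X after health checking. For example:
1) After the downstream node 10.26.44.32 restarted at 09:02:17 because of a failure, one upstream node Y never re-established its connection to 10.26.44.32 and kept logging "Not connected to 10.26.44.32:8060 yet":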
```
W0716 09:02:17.695824 142210 input_messenger.cpp:212] Fail to read from
Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0xa352000): Connection
reset by peer [104]
W0716 09:02:17.702852 141955 data_stream_sender.cpp:138] failed to send brpc
batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=896
fd=4228 addr=10.26.44.32:8060:55309} (0x0xa352000): Connection reset by peer
[R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not
connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to
10.26.44.32:8060 yet, server_id=896
.... (similar lines omitted) ....
W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to send brpc
batch, error=Host is down, error_text=[E112]Not connected to 10.26.44.32:8060
yet, server_id=896 [R1][E112]Not connected to 10.26.44.32:8060 yet,
server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
[R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
```
2) netstat shows no TCP connection between Y and 10.26.44.32.
3) Y's /connections page shows the Socket in the Broken state, with the following details:
```
$ curl http://localhost:8060/connections | grep 10.26.44.32:8060
Broken |10.26.44.32:8060 |55309|- |- |-
|- |- |- |- |- |- |- |- |-
|896
$ curl http://localhost:8060/sockets/896
# This is a broken Socket
version=1
shared_part={
ref_count=1
socket_pool=null
creator_socket=896
in_size=316616369
in_num_messages=12120483
out_size=114960271066
out_num_messages=12120511
}
nref=1
nevent=1
fd=4228
tos=0
reset_fd_to_now=485975008182us
remote_side=10.26.44.32:8060
local_side=10.22.180.15:55309
on_et_events=0x1bc5dd0
user=(brpc::InputMessenger*)0x5c0ab40
this_id=896
preferred_index=1 (baidu_std)
hc_count=0
avg_input_msg_size=26
read_buf=0
last_read_to_now=960432766us
last_write_to_now=960412394us
overcrowded=0
id_wait_list={}
parsing_context=0
pipeline_q=0
hc_interval_s=3
ninprocess=1
auth_flag_error=0
auth_id=177098681547473
auth_context=0
logoff_flag=0
recycle_flag=1
agent_socket_id=(none)
cid=0
write_head=0
ssl_state=SSL_OFF
tcpi={
state=7
ca_state=0
retransmits=0
probes=0
backoff=0
options=7
snd_wscale=7
rcv_wscale=7
rto=205000
ato=40000
snd_mss=1448
rcv_mss=736
unacked=0
sacked=0
lost=0
retrans=0
fackets=0
last_data_sent=960413
last_ack_sent=0
last_data_recv=960433
last_ack_recv=960413
pmtu=1500
rcv_ssthresh=52260
rtt=2750
rttvar=3000
snd_ssthresh=18
snd_cwnd=18
advmss=1448
reordering=3
}
```
4) Y's log contains no "Checking Socket" or "Revived Socket" lines (health checking is enabled in the cluster with health_check_interval = 3, and the other upstream nodes do log Checking and Revived lines).
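To compare an affected node with a healthy one, the log check from points 1) and 4) could be scripted. The following is only a rough sketch: the function name `peers_without_health_check` is hypothetical, it matches the "Not connected to ... yet", "Checking Socket" and "Revived Socket" strings quoted above, and it assumes that each log entry sits on a single line and that the health-check lines contain the peer address.

```python
import re

# Strings taken from the log excerpts in this report.
NOT_CONNECTED = re.compile(r"Not connected to (\S+) yet")
HC_MARKERS = ("Checking Socket", "Revived Socket")


def peers_without_health_check(log_lines):
    """Return peers that produced "Not connected" errors but for which
    no health-check line was ever logged.

    Assumption: a health-check line mentions the peer address (ip:port).
    """
    failing = set()
    hc_lines = []
    for line in log_lines:
        failing.update(NOT_CONNECTED.findall(line))
        if any(marker in line for marker in HC_MARKERS):
            hc_lines.append(line)
    return sorted(
        peer for peer in failing
        if not any(peer in hc_line for hc_line in hc_lines)
    )


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py < upstream_node.log
    for peer in peers_without_health_check(sys.stdin):
        print("no health check seen for", peer)
```

On node Y this would be expected to list 10.26.44.32:8060, while on the other upstream nodes the list should be empty once they log their Checking and Revived lines.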
**To Reproduce**
Occurs with low probability in production; currently the only way to recover is to restart the upstream node.
**Versions**
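Because the problem is rare and only shows up in production, a small check against the /sockets/<id> output could help spot an affected node before restarting it. This is a minimal sketch, not part of brpc: the helper names `parse_socket_dump` and `looks_stuck` are hypothetical, the field names are the ones visible in the dump above, and reading "broken, hc_interval_s > 0, hc_count = 0" as "health check never ran" is an assumption based on this report.

```python
import re


def parse_socket_dump(text):
    """Parse the flat key=value lines of a /sockets/<id> dump into a dict.

    Nested section openers such as "shared_part={" are skipped; for a
    repeated key, the first occurrence wins.
    """
    fields = {"broken": "This is a broken Socket" in text}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+)=(\S.*)$", line)
        if match and not match.group(2).endswith("{"):
            fields.setdefault(match.group(1), match.group(2).strip())
    return fields


def looks_stuck(fields):
    """Heuristic (assumption): the socket is broken and health checking is
    configured, yet no health check has been counted."""
    return (
        fields["broken"]
        and int(fields.get("hc_interval_s", 0)) > 0
        and int(fields.get("hc_count", 0)) == 0
    )


if __name__ == "__main__":
    import sys

    # Usage: curl -s http://localhost:8060/sockets/896 | python check_socket.py
    dump = parse_socket_dump(sys.stdin.read())
    print("stuck" if looks_stuck(dump) else "ok")
```

A socket that has only just broken may also match briefly, so the check is more meaningful when it is repeated after several health-check intervals.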
OS: CentOS Linux release 7.1.1503 (Core)
Compiler: gcc (GCC) 7.2.0
brpc: 0.9.5