To be more precise,

the log messages appear once the hang is finished.

I have looked at the stats for 10 different hangs, and the duration is always around 
15 minutes.

Maybe related to:

ms tcp read timeout
Description:    If a client or daemon makes a request to another Ceph daemon 
and does not drop an unused connection, the ms tcp read timeout defines the 
connection as idle after the specified number of seconds.
Type:   Unsigned 64-bit Integer
Required:       No
Default:        900 (15 minutes).

?
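If that timeout is indeed the cause, one way to check it (just a guess on my side, I 
haven't tested it) could be to lower the value on the osd nodes and see if the hang 
duration follows, with something like this in ceph.conf:

[osd]
ms tcp read timeout = 60

and then restart the osds.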

I also found a similar bug report involving a firewall:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-October/013841.html


----- Original Message -----
From: "aderumier" <[email protected]>
To: "ceph-users" <[email protected]>
Sent: Thursday, 8 November 2018 18:16:20
Subject: [ceph-users] cephfs kernel, hang with libceph: osdx X.X.X.X socket 
closed (con state OPEN)

Hi, 

we are currently testing cephfs with the kernel module (4.17 and 4.18) instead of 
fuse (which worked fine), 

and we are seeing hangs where iowait jumps like crazy for around 20 min. 

The client is a qemu 2.12 VM with a virtio-net interface. 


In the client logs, we are seeing this kind of message: 

[jeu. nov. 8 12:20:18 2018] libceph: osd14 x.x.x.x:6801 socket closed (con state OPEN) 
[jeu. nov. 8 12:42:03 2018] libceph: osd9 x.x.x.x:6821 socket closed (con state OPEN) 


and in the osd logs: 

osd14: 
2018-11-08 12:20:25.247 7f31ffac8700 0 -- x.x.x.x:6801/1745 >> x.x.x.x:0/3678871522 conn(0x558c430ec300 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1) 

osd9: 
2018-11-08 12:42:09.820 7f7ca970e700 0 -- x.x.x.x:6821/1739 >> x.x.x.x:0/3678871522 conn(0x564fcbec5100 :6821 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg accept replacing existing (lossy) channel (new one lossy=1) 


The cluster is ceph 13.2.1. 

Note that we have a physical firewall between the client and the servers; I'm not 
sure yet whether the session could be dropped by it. (I haven't found any related 
logs on the firewall.) 
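If the firewall does drop idle sessions, one thing I could check on the client 
(just an idea, not verified, and I'm not sure the kernel client relies on TCP 
keepalive at all) is the kernel keepalive timers, since the default keepalive 
time is much longer than typical firewall idle timeouts:

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes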

Any idea? I would like to know whether it's a network bug or a ceph bug (I'm not 
sure how to interpret the osd logs). 

Regards, 

Alexandre 



client ceph.conf 
---------------- 
[client] 
fuse_disable_pagecache = true 
client_reconnect_stale = true 


_______________________________________________ 
ceph-users mailing list 
[email protected] 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
