Hi Ilya,

While trying to reproduce the issue I've found that:
- it is relatively easy to reproduce 5-6 minute hangs just by killing the 
active MDS process (triggering failover) while writing a lot of data. An 
unacceptable timeout, but not the case described in 
http://tracker.ceph.com/issues/15255
- it is hard to reproduce the endless hang (I've spent an hour without success)
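
For the record, the failover-hang reproduction boils down to two steps. A
hedged sketch as shell functions (the mount point, file name, and the way the
MDS process is located are my assumptions, adjust to your setup):

```shell
# Repro sketch for the 5-6 minute hang: sustained writes through the
# kernel CephFS client while the active MDS is killed.
# /mnt/cephfs and the dd parameters are assumptions, not exact values.

start_writer() {
    # sustained sequential write through the kernel client
    dd if=/dev/zero of=/mnt/cephfs/ddtest bs=1M count=100000 &
    WRITER_PID=$!
}

kill_active_mds() {
    # run on the node hosting the active MDS; SIGKILL forces a failover
    kill -9 "$(pidof ceph-mds)"
}

# call start_writer on the client, then kill_active_mds on the MDS node
# and watch how long dd stalls before writes resume
```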

One thing I've noticed while analysing the logs is that the "endless hang" was 
always accompanied by the following messages:
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 
session lost, hunting for new mon
Jul 20 15:31:57 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 
session established
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon1 10.50.67.26:6789 
session lost, hunting for new mon
Jul 20 15:32:27 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session established
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session lost, hunting for new mon
Jul 20 15:32:57 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 
session established
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon0 10.50.67.25:6789 
session lost, hunting for new mon
Jul 20 15:33:28 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session established
Jul 20 15:33:58 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session lost, hunting for new mon
Jul 20 15:34:29 mn-ceph-nfs-gw-01 kernel: libceph: mon2 10.50.67.27:6789 
session established
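
The losses above come at suspiciously regular intervals; a quick check of the
gaps between consecutive "session lost" timestamps (taken verbatim from the
excerpt above) confirms the ~30-second pattern:

```shell
# Timestamps of the "session lost" lines from the log excerpt above;
# print the gap in seconds between consecutive losses.
printf '%s\n' 15:31:57 15:32:27 15:32:57 15:33:28 15:33:58 |
awk -F: '{ t = $1*3600 + $2*60 + $3; if (NR > 1) print t - prev; prev = t }'
# prints 30, 30, 31, 30 - the mon session is lost roughly every 30 seconds
```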

Bug http://tracker.ceph.com/issues/17664 describes this behaviour; it was 
fixed in releases starting with v11.1.0 (I'm using 10.2.7). So the lost 
session somehow triggers client disconnection and fencing (as described at 
http://docs.ceph.com/docs/master/cephfs/troubleshooting/#disconnected-remounted-fs).
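
If fencing is indeed what happens, the client's address should show up in the 
OSD blacklist. A sketch of the checks I'd run (the mount point and mon address 
are assumptions from my setup):

```shell
# Confirm the client was fenced, and recover the mount afterwards.
# /mnt/cephfs and 10.50.67.25 are assumptions, not exact values.

check_fenced() {
    # a fenced client's addr:nonce should appear in this list
    ceph osd blacklist ls
}

recover_mount() {
    # a blacklisted kernel mount has to be forcibly unmounted and remounted
    umount -f /mnt/cephfs
    mount -t ceph 10.50.67.25:6789:/ /mnt/cephfs -o name=admin
}
```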

Do you still think this should be posted to http://tracker.ceph.com/issues/15255?

> On 20 July 2017, at 17:02, Ilya Dryomov <idryo...@gmail.com> wrote:
> 
> On Thu, Jul 20, 2017 at 3:23 PM, Дмитрий Глушенок <gl...@jet.msk.su> wrote:
>> Looks like I have similar issue as described in this bug:
>> http://tracker.ceph.com/issues/15255
>> Writer (dd in my case) can be restarted and then writing continues, but
>> until restart dd looks like hanged on write.
>> 
>> On 20 July 2017, at 16:12, Дмитрий Глушенок <gl...@jet.msk.su> wrote:
>> 
>> Hi,
>> 
>> Repeated the test using kernel 4.12.0. An OSD node crash seems to be handled
>> fine now, but an MDS crash still leads to hung writes to CephFS. This time it
>> was enough just to crash the first MDS - failover didn't happen. At the same
>> time a FUSE client was running on another host - no problems with it.
> 
> Could you please post the exact steps for reproducing with 4.12 to that
> ticket?  It sounds like something that should be prioritized.
> 
> Thanks,
> 
>                Ilya

--
Dmitry Glushenok
Jet Infosystems

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
