Hello Gregory, in the meantime, I managed to break it further :(
I tried getting rid of the active+remapped pgs and ended up with some
undersized ones instead.. not sure whether this is related..
anyway, here's the status:
ceph -s
    cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
     health HEALTH_WARN
            3 pgs degraded
            2 pgs stale
            3 pgs stuck degraded
            1 pgs stuck inactive
            2 pgs stuck stale
            242 pgs stuck unclean
            3 pgs stuck undersized
            3 pgs undersized
            recovery 65/3374343 objects degraded (0.002%)
            recovery 186187/3374343 objects misplaced (5.518%)
            mds0: Behind on trimming (155/30)
     monmap e3: 3 mons at {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
            election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
     mdsmap e342: 1/1/1 up {0=remrprv1c=up:active}, 2 up:standby
     osdmap e4385: 21 osds: 21 up, 21 in; 238 remapped pgs
      pgmap v18679192: 1856 pgs, 7 pools, 4223 GB data, 1103 kobjects
            12947 GB used, 22591 GB / 35538 GB avail
            65/3374343 objects degraded (0.002%)
            186187/3374343 objects misplaced (5.518%)
                1612 active+clean
                 238 active+remapped
                   3 active+undersized+degraded
                   2 stale+active+clean
                   1 creating
  client io 0 B/s rd, 40830 B/s wr, 17 op/s
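If it would help to narrow this down, I can post the output of something like
the following (the <pgid> below is just a placeholder for one of the unclean
pgs, and I'm not certain dump_stuck accepts all of these states on 0.94.5):

# overall detail plus the lists of stuck pgs
ceph health detail
ceph pg dump_stuck unclean
ceph pg dump_stuck undersized
# acting/up sets and recovery state of one problematic pg
ceph pg <pgid> query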
> What's the full output of "ceph -s"? Have you looked at the MDS admin
> socket at all — what state does it say it's in?
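I did look at the admin socket, dump_ops_in_flight output is below. For the
state itself, these are the other socket commands I'd try (going from memory,
I'm not sure they are all present on hammer's MDS socket; "help" lists what
actually is):

# list the commands this particular admin socket supports
ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok help
# connected client sessions (to see who holds the caps/locks)
ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok session ls
# internal counters, including the mds log segments behind the
# "Behind on trimming (155/30)" warning
ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok perf dump

Anyway, here's dump_ops_in_flight: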
[root@remrprv1c ceph]# ceph --admin-daemon
/var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight
{
    "ops": [
        {
            "description": "client_request(client.3052096:83 getattr Fs #10000000288 2016-02-03 10:10:46.361591 RETRY=1)",
            "initiated_at": "2016-02-03 10:23:25.791790",
            "age": 3963.093615,
            "duration": 9.519091,
            "type_data": [
                "failed to rdlock, waiting",
                "client.3052096:83",
                "client_request",
                {
                    "client": "client.3052096",
                    "tid": 83
                },
                [
                    {
                        "time": "2016-02-03 10:23:25.791790",
                        "event": "initiated"
                    },
                    {
                        "time": "2016-02-03 10:23:35.310881",
                        "event": "failed to rdlock, waiting"
                    }
                ]
            ]
        }
    ],
    "num_ops": 1
}
seems there's some lock stuck here..
Killing the stuck client (it's postgres trying to access a cephfs file)
doesn't help..
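The next thing I'm tempted to try is evicting the session on the MDS side
instead of just killing the process, roughly like this (only a guess, I don't
know whether hammer's MDS socket has session evict at all, and the exact
syntax may differ; "help" should show it):

# find the session of client.3052096 from the dump above
ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok session ls
# then try to evict it so the MDS drops its caps/locks
ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok session evict 3052096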
> -Greg
>
> >
> > My question here is:
> >
> > 1) is there some known issue with hammer 0.94.5 or kernel 4.1.15
> > which could lead to cephfs hangs?
> >
> > 2) what can I do to debug what is the cause of this hang?
> >
> > 3) is there a way to recover this without hard resetting the
> > node with the hung cephfs mount?
> >
> > If I can provide more information, please let me know
> >
> > I'd really appreciate any help
> >
> > with best regards
> >
> > nik
> >
> >
> >
> >
> > --
> > -------------------------------------
> > Ing. Nikola CIPRICH
> > LinuxBox.cz, s.r.o.
> > 28.rijna 168, 709 00 Ostrava
> >
> > tel.: +420 591 166 214
> > fax: +420 596 621 273
> > mobil: +420 777 093 799
> > www.linuxbox.cz
> >
> > mobil servis: +420 737 238 656
> > email servis: [email protected]
> > -------------------------------------
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava
tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: [email protected]
-------------------------------------
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
