On Wed, Feb 3, 2016 at 2:32 AM, Nikola Ciprich
<[email protected]> wrote:
> Hello Gregory,
>
> in the meantime, I managed to break it further :(
>
> I tried getting rid of the active+remapped PGs and got some undersized
> ones instead.. not sure whether this can be related..
>
> anyways here's the status:
>
> ceph -s
> cluster ff21618e-5aea-4cfe-83b6-a0d2d5b4052a
> health HEALTH_WARN
> 3 pgs degraded
> 2 pgs stale
> 3 pgs stuck degraded
> 1 pgs stuck inactive
> 2 pgs stuck stale
> 242 pgs stuck unclean
> 3 pgs stuck undersized
> 3 pgs undersized
> recovery 65/3374343 objects degraded (0.002%)
> recovery 186187/3374343 objects misplaced (5.518%)
> mds0: Behind on trimming (155/30)
> monmap e3: 3 mons at
> {remrprv1a=10.0.0.1:6789/0,remrprv1b=10.0.0.2:6789/0,remrprv1c=10.0.0.3:6789/0}
> election epoch 522, quorum 0,1,2 remrprv1a,remrprv1b,remrprv1c
> mdsmap e342: 1/1/1 up {0=remrprv1c=up:active}, 2 up:standby
> osdmap e4385: 21 osds: 21 up, 21 in; 238 remapped pgs
> pgmap v18679192: 1856 pgs, 7 pools, 4223 GB data, 1103 kobjects
> 12947 GB used, 22591 GB / 35538 GB avail
> 65/3374343 objects degraded (0.002%)
> 186187/3374343 objects misplaced (5.518%)
> 1612 active+clean
> 238 active+remapped
> 3 active+undersized+degraded
> 2 stale+active+clean
> 1 creating
> client io 0 B/s rd, 40830 B/s wr, 17 op/s
Yeah, these inactive PGs are basically guaranteed to be the cause of
the problem. There are lots of threads about getting PGs healthy
again; you should dig around the archives and the documentation
troubleshooting page(s). :)
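As a starting point, the usual PG troubleshooting commands are along
the lines of (substitute a real pgid from the listing):

  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg dump_stuck unclean
  ceph pg <pgid> query

The query output for a stuck PG normally includes a recovery_state
section that says what it is waiting on.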
-Greg
>
>
>> What's the full output of "ceph -s"? Have you looked at the MDS admin
>> socket at all — what state does it say it's in?
>
> [root@remrprv1c ceph]# ceph --admin-daemon
> /var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight
> {
> "ops": [
> {
> "description": "client_request(client.3052096:83 getattr Fs
> #10000000288 2016-02-03 10:10:46.361591 RETRY=1)",
> "initiated_at": "2016-02-03 10:23:25.791790",
> "age": 3963.093615,
> "duration": 9.519091,
> "type_data": [
> "failed to rdlock, waiting",
> "client.3052096:83",
> "client_request",
> {
> "client": "client.3052096",
> "tid": 83
> },
> [
> {
> "time": "2016-02-03 10:23:25.791790",
> "event": "initiated"
> },
> {
> "time": "2016-02-03 10:23:35.310881",
> "event": "failed to rdlock, waiting"
> }
> ]
> ]
> }
> ],
> "num_ops": 1
> }
>
> seems there's some lock stuck here..
>
> Killing the stuck client (it's postgres trying to access a cephfs
> file) doesn't help..
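That "failed to rdlock, waiting" usually means the MDS is waiting for
some other client to release caps on that inode (or for its own IO,
which loops back to the stuck PGs). Assuming your MDS build has them,
admin socket commands along the lines of

  ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok session ls
  ceph --admin-daemon /var/run/ceph/ceph-mds.remrprv1c.asok dump_ops_in_flight

will show which client sessions are still attached and which requests
are blocked. Note that killing the userspace process doesn't drop the
kernel client's session or caps, which is likely why killing postgres
made no difference.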
>
>
>> -Greg
>>
>> >
>> > My question here is:
>> >
>> > 1) is there some known issue with hammer 0.94.5 or kernel 4.1.15
>> > which could lead to cephfs hangs?
>> >
>> > 2) what can I do to debug the cause of this hang?
>> >
>> > 3) is there a way to recover from this without hard-resetting the
>> > node with the hung cephfs mount?
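Re (2): on the client side, assuming debugfs is mounted, the kernel
client exposes its in-flight requests, e.g.

  cat /sys/kernel/debug/ceph/*/mdsc
  cat /sys/kernel/debug/ceph/*/osdc

which shows whether the hang is waiting on the MDS or on specific OSD
requests (plus dmesg for any libceph/ceph error messages).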
>> >
>> > If I can provide more information, please let me know
>> >
>> > I'd really appreciate any help
>> >
>> > with best regards
>> >
>> > nik
>> >
>> >
>> >
>> >
>>
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.: +420 591 166 214
> fax: +420 596 621 273
> mobil: +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: [email protected]
> -------------------------------------
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com