On Thu, Apr 17, 2014 at 4:10 PM, Georg Höllrigl <[email protected]> wrote:
> Whatever happened - it fixed itself!?
>
> When restarting, I got ~165k log messages like:
> 2014-04-17 07:30:14.856421 7fc50b991700 0 log [WRN] : ino 10000f24fe0
> 2014-04-17 07:30:14.856422 7fc50b991700 0 log [WRN] : ino 10000f24fe1
> 2014-04-17 07:30:14.856423 7fc50b991700 0 log [WRN] : ino 10000f24fe2
> 2014-04-17 07:30:14.856424 7fc50b991700 0 log [WRN] : ino 10000f24fe3
> 2014-04-17 07:30:14.856427 7fc50b991700 0 log [WRN] : ino 10000f24fe4
> 2014-04-17 07:30:14.856428 7fc50b991700 0 log [WRN] : ino 10000f24fe5
>
> And the clients recovered!?
>
> I would be really interested to know what happened!
>
> Georg
>
>
> On 17.04.2014 09:45, Georg Höllrigl wrote:
>>
>> Hello Greg,
>>
>> I've searched, but I don't see any backtraces... I've tried to get some
>> more info out of the logs. I really hope there is something interesting
>> in them:
>>
>> It all started two days ago with an authentication error:
>>
>> 2014-04-14 21:08:55.929396 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
>> 2014-04-14 21:09:07.989547 7fd93b62e700 1 mds.0.0 replay_done (as standby)
>> 2014-04-14 21:09:08.989647 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
>> 2014-04-14 21:09:10.633786 7fd93b62e700 1 mds.0.0 replay_done (as standby)
>> 2014-04-14 21:09:11.633886 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
>> 2014-04-14 21:09:17.995105 7fd93f644700 0 mds.0.0 handle_mds_beacon no longer laggy
>> 2014-04-14 21:09:39.798798 7fd93f644700 0 monclient: hunting for new mon
>> 2014-04-14 21:09:39.955078 7fd93f644700 1 mds.-1.-1 handle_mds_map i (10.0.1.107:6800/16503) dne in the mdsmap, respawning myself
>> 2014-04-14 21:09:39.955094 7fd93f644700 1 mds.-1.-1 respawn
>> 2014-04-14 21:09:39.955106 7fd93f644700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
>> 2014-04-14 21:09:39.955109 7fd93f644700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
>> 2014-04-14 21:09:39.955110 7fd93f644700 1 mds.-1.-1 1: '-i'
>> 2014-04-14 21:09:39.955112 7fd93f644700 1 mds.-1.-1 2: 'ceph-m-02'
>> 2014-04-14 21:09:39.955113 7fd93f644700 1 mds.-1.-1 3: '--pid-file'
>> 2014-04-14 21:09:39.955114 7fd93f644700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-m-02.pid'
>> 2014-04-14 21:09:39.955116 7fd93f644700 1 mds.-1.-1 5: '-c'
>> 2014-04-14 21:09:39.955117 7fd93f644700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
>> 2014-04-14 21:09:39.979138 7fd93f644700 1 mds.-1.-1 cwd /
>> 2014-04-14 19:09:40.922683 7f8ba9973780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 16505
>> 2014-04-14 19:09:40.975024 7f8ba9973780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-14 19:09:40.975070 7f8ba9973780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>>
>> That was fixed by restarting the mds (+ the whole server).
>>
>> 2014-04-15 07:07:15.948650 7f9fdec0d780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 506
>> 2014-04-15 07:07:15.954386 7f9fdec0d780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:07:15.954422 7f9fdec0d780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:15:49.177861 7fe8a1d60780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 26401
>> 2014-04-15 07:15:49.184027 7fe8a1d60780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:15:49.184046 7fe8a1d60780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
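"failed to authenticate: (1) Operation not permitted" usually means the key the daemon presents doesn't match what the monitors have registered for it. Before restarting the whole server it can be worth comparing the two - a minimal check, assuming the default keyring location for this mds id:

# the key the monitors have for this mds
ceph auth get mds.ceph-m-02
# the key the daemon actually presents
cat /var/lib/ceph/mds/ceph-ceph-m-02/keyring

If they differ, rewriting the local file from the monitors' copy (ceph auth get mds.ceph-m-02 -o /var/lib/ceph/mds/ceph-ceph-m-02/keyring) should clear the error without a reboot.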
>> 2014-04-15 07:17:32.598031 7fab123e6780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 30531
>> 2014-04-15 07:17:32.604560 7fab123e6780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:17:32.604592 7fab123e6780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:21:56.099203 7fd37b951780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11335
>> 2014-04-15 07:21:56.105229 7fd37b951780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:21:56.105254 7fd37b951780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:22:09.345800 7f23392ef780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11461
>> 2014-04-15 07:22:09.390001 7f23392ef780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:22:09.391087 7f23392ef780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:28:01.762191 7fab6d14b780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 28263
>> 2014-04-15 07:28:01.779485 7fab6d14b780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:28:01.779507 7fab6d14b780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:35:49.065110 7fe4f6b0d780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 1233
>> 2014-04-15 07:35:52.191856 7fe4f6b05700 0 -- 10.0.1.107:6800/1233 >> 10.0.1.108:6789/0 pipe(0x2f9f500 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x2f81580).fault
>> 2014-04-15 07:35:56.352553 7fe4f1b96700 1 mds.-1.0 handle_mds_map standby
>> 2014-04-15 07:35:56.419499 7fe4f1b96700 1 mds.0.0 handle_mds_map i am now mds.7905854.0replaying mds.0.0
>> 2014-04-15 07:35:56.419507 7fe4f1b96700 1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
>> 2014-04-15 07:35:56.419512 7fe4f1b96700 1 mds.0.0 replay_start
>> 2014-04-15 07:35:56.425229 7fe4f1b96700 1 mds.0.0 recovery set is
>> 2014-04-15 07:35:56.425241 7fe4f1b96700 1 mds.0.0 need osdmap epoch 1391, have 3696
>> 2014-04-15 07:35:56.497573 7fe4f1b96700 0 mds.0.cache creating system inode with ino:100
>> 2014-04-15 07:35:56.501393 7fe4f1b96700 0 mds.0.cache creating system inode with ino:1
>> 2014-04-15 07:35:58.294776 7fe4ed67b700 1 mds.0.0 replay_done (as standby)
>> 2014-04-15 07:35:59.294874 7fe4ef990700 1 mds.0.0 standby_replay_restart (as standby)
>>
>> Here everything runs normally...
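The "up:standby --> up:standby-replay" transition above is the standby daemon starting to tail the active MDS's journal. For reference, this is roughly the ceph.conf stanza that enables it (0.72-era option names; the mds id here is just your second daemon as an example):

[mds.ceph-m-02]
    mds standby replay = true
    # with a single active mds, rank 0 is the only rank to follow
    mds standby for rank = 0

That matches the replay_done / standby_replay_restart loop in the log: the standby keeps re-reading the active MDS's journal so it can take over quickly.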
>>
>>
>> 2014-04-15 20:26:19.724564 7fbcca70d700 0 mds.0.3 handle_mds_beacon no longer laggy
>> 2014-04-16 04:31:13.485103 7fbcc55ed700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.81:0/61463 pipe(0xf930780 sd=55 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db6e0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:13.485222 7fbcc46de700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.68:0/4128280551 pipe(0xf6ac780 sd=28 :6800 s=2 pgs=4 cs=1 l=0 c=0xe31d6e0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:14.506269 7fbcc7505700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.42:0/2817517925 pipe(0x1a7c9780 sd=36 :6800 s=2 pgs=2 cs=1 l=0 c=0xf6749a0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:19.852504 7fbcc12aa700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.156:0/4413 pipe(0x128165c80 sd=66 :6800 s=2 pgs=2 cs=1 l=0 c=0x12b5b9dc0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.971665 7fbcc30c8700 0 -- 10.0.1.108:6800/22033 >> 10.0.0.25:0/31945 pipe(0xf930500 sd=56 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db420).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.971680 7fbcc1bb3700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.39:0/13485 pipe(0xf6aca00 sd=29 :6800 s=2 pgs=2 cs=1 l=0 c=0xe31d2c0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.971714 7fbcc63f4700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.40:0/12395 pipe(0x31fe000 sd=46 :6800 s=2 pgs=2 cs=1 l=0 c=0x726b160).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.979886 7fbcc2cc4700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.152:0/14639 pipe(0x1a7c9000 sd=64 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db580).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.984613 7fbcc37cf700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.66:0/114165773 pipe(0x16624000 sd=48 :6800 s=2 pgs=4 cs=1 l=0 c=0xf674c60).fault with nothing to send, going to standby
>> 2014-04-16 04:32:41.689287 7fbcca70d700 1 mds.-1.-1 handle_mds_map i (10.0.1.108:6800/22033) dne in the mdsmap, respawning myself
>> 2014-04-16 04:32:41.689294 7fbcca70d700 1 mds.-1.-1 respawn
>> 2014-04-16 04:32:41.689301 7fbcca70d700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
>> 2014-04-16 04:32:41.689304 7fbcca70d700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
>> 2014-04-16 04:32:41.689306 7fbcca70d700 1 mds.-1.-1 1: '-i'
>> 2014-04-16 04:32:41.689308 7fbcca70d700 1 mds.-1.-1 2: 'ceph-m-03'
>> 2014-04-16 04:32:41.689309 7fbcca70d700 1 mds.-1.-1 3: '--pid-file'
>> 2014-04-16 04:32:41.689310 7fbcca70d700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-m-03.pid'
>> 2014-04-16 04:32:41.689312 7fbcca70d700 1 mds.-1.-1 5: '-c'
>> 2014-04-16 04:32:41.689313 7fbcca70d700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
>> 2014-04-16 04:32:41.742899 7fbcca70d700 1 mds.-1.-1 cwd /
>> 2014-04-16 02:32:43.116287 7ff7375ad780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 22035
>> 2014-04-16 02:32:43.226015 7ff7375ad780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-16 02:32:43.226040 7ff7375ad780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>>
>> Nothing worked - mounting not possible. Restarting of the process not possible.
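The "dne in the mdsmap" line means the monitors dropped the daemon from the map entirely ("does not exist"), which is why it respawns - and with the authentication failures it can never get back in. While it's stuck like this, the monitors' view of the mds map is the first thing to check; these are standard commands, nothing specific to this cluster:

# short summary and full dump of the mds map as the monitors see it
ceph mds stat
ceph mds dump
# shows which mds is considered laggy and why the cluster is degraded
ceph health detail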
>> Logs filled with:
>>
>> 2014-04-16 12:29:46.417441 7f1de27f5700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.108:6789/0 pipe(0xabc80000 sd=250 :44413 s=4 pgs=0 cs=0 l=1 c=0x178870580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.065881 7f1de22f0700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x25b25280 sd=277 :39688 s=4 pgs=0 cs=0 l=1 c=0x159b4c840).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.231550 7f1de05d3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x1049a5280 sd=252 :39919 s=4 pgs=0 cs=0 l=1 c=0x4e53d840).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.359261 7f1de1ae8700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x103d3b500 sd=244 :40070 s=4 pgs=0 cs=0 l=1 c=0x1130f92c0).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.486590 7f1de30fe700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x47889780 sd=245 :40175 s=4 pgs=0 cs=0 l=1 c=0xc6c09420).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.904427 7f1de3603700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x88069000 sd=262 :40554 s=4 pgs=0 cs=0 l=1 c=0x116652580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:48.272279 7f1de12e0700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0xed8f9c80 sd=257 :40929 s=4 pgs=0 cs=0 l=1 c=0xf2c2f160).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:49.822148 7f1de18e6700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x110926780 sd=264 :42222 s=4 pgs=0 cs=0 l=1 c=0x10508e9a0).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:50.769693 7f1ddf5c3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x6dbc6280 sd=243 :43169 s=4 pgs=0 cs=0 l=1 c=0x1358ed580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:51.621605 7f1de05d3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x21154780 sd=243 :43929 s=4 pgs=0 cs=0 l=1 c=0xecb1a580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:51.886405 7f1dea867700 -1 mds.0.13 *** got signal Terminated ***
>> 2014-04-16 12:29:51.886894 7f1dea867700 1 mds.0.13 suicide. wanted down:dne, now up:rejoin
>>
>> Also see lots of these:
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.900657 7f937a5e1700 0 log [ERR] : dir 10000c4639d.10000c4639d object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.913245 7f937a5e1700 0 log [ERR] : dir 10000c4617e.10000c4617e object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.925811 7f937a5e1700 0 log [ERR] : dir 10000c45d08.10000c45d08 object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.941476 7f937a5e1700 0 log [ERR] : dir 10000c45d9e.10000c45d9e object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.956158 7f937a5e1700 0 log [ERR] : dir 10000c461e5.10000c461e5 object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.968524 7f937a5e1700 0 log [ERR] : dir 10000c46608.10000c46608 object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.979229 7f937a5e1700 0 log [ERR] : dir 10000c468b6.10000c468b6 object missing on disk; some files may be lost
>>
>> At the moment, I've only one mds running - but clients (mainly using fuse) can't connect.
>>
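The "object missing on disk; some files may be lost" errors mean the MDS asked the OSDs for a directory object that the metadata pool no longer contains. That can be confirmed from the rados side - this assumes the default metadata pool name "metadata"; the object names start with the inode number from the log line:

# look for metadata objects belonging to one of the inodes from the log
# (this scans the whole object listing, so it can be slow on a large pool)
rados -p metadata ls | grep 10000c4639d

If nothing comes back, that directory object really is gone, and the damage should be limited to the affected directories rather than the whole filesystem.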
Looks like you enabled directory fragmentation, which is buggy in ceph version 0.72.
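If you want to confirm, you can check the setting over the admin socket while an mds is running - a quick check, assuming the default socket path (the option is off by default in 0.72):

# directory fragmentation is controlled by mds_bal_frag
ceph --admin-daemon /var/run/ceph/ceph-mds.ceph-m-02.asok config show | grep mds_bal_frag

Setting "mds bal frag = false" in ceph.conf prevents new directories from being fragmented, but note it does not repair directories that are already fragmented.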
Regards
Yan, Zheng

>>
>> Regards,
>> Georg
>>
>>
>>
>> On 16.04.2014 16:27, Gregory Farnum wrote:
>>>
>>> What's the backtrace from the MDS crash?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Wed, Apr 16, 2014 at 7:11 AM, Georg Höllrigl
>>> <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We're using Ceph MDS with one active and one standby server - a day ago
>>>> one of the mds daemons crashed and I restarted it.
>>>> Tonight it crashed again, and a few hours later the second mds crashed too.
>>>>
>>>> #ceph -v
>>>> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>>>>
>>>> At the moment cephfs is dead, with the following health status:
>>>>
>>>> #ceph -s
>>>>     cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
>>>>      health HEALTH_WARN mds cluster is degraded; mds c is laggy
>>>>      monmap e3: 3 mons at {ceph-m-01=10.0.0.176:6789/0,ceph-m-02=10.0.1.107:6789/0,ceph-m-03=10.0.1.108:6789/0}, election epoch 6274, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03
>>>>      mdsmap e2055: 1/1/1 up {0=ceph-m-03=up:rejoin(laggy or crashed)}
>>>>      osdmap e3752: 39 osds: 39 up, 39 in
>>>>      pgmap v3277576: 8328 pgs, 17 pools, 6461 GB data, 17066 kobjects
>>>>            13066 GB used, 78176 GB / 91243 GB avail
>>>>            8328 active+clean
>>>>   client io 1193 B/s rd, 0 op/s
>>>>
>>>> I couldn't really find any useful information in the log files or in the
>>>> documentation. Any ideas how to get cephfs up and running again?
>>>>
>>>> Here is part of the mds log:
>>>> 2014-04-16 14:07:05.603501 7ff184c64700 1 mds.0.server reconnect gave up on client.7846580 10.0.1.152:0/14639
>>>> 2014-04-16 14:07:05.603525 7ff184c64700 1 mds.0.46 reconnect_done
>>>> 2014-04-16 14:07:05.674990 7ff186d69700 1 mds.0.46 handle_mds_map i am now mds.0.46
>>>> 2014-04-16 14:07:05.674996 7ff186d69700 1 mds.0.46 handle_mds_map state change up:reconnect --> up:rejoin
>>>> 2014-04-16 14:07:05.674998 7ff186d69700 1 mds.0.46 rejoin_start
>>>> 2014-04-16 14:07:22.347521 7ff17f825700 0 -- 10.0.1.107:6815/17325 >> 10.0.1.68:0/4128280551 pipe(0x5e2ac80 sd=930 :6815 s=2 pgs=153 cs=1 l=0 c=0x5e2e160).fault with nothing to send, going to standby
>>>>
>>>> Any ideas, how to solve "laggy or crashed"?
>>>>
>>>> Georg

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
