On Thu, Apr 17, 2014 at 4:10 PM, Georg Höllrigl <[email protected]> wrote:
> Whatever happened - it fixed itself!?
>
> When restarting, I got ~165k log messages like:
> 2014-04-17 07:30:14.856421 7fc50b991700 0 log [WRN] : ino 10000f24fe0
> 2014-04-17 07:30:14.856422 7fc50b991700 0 log [WRN] : ino 10000f24fe1
> 2014-04-17 07:30:14.856423 7fc50b991700 0 log [WRN] : ino 10000f24fe2
> 2014-04-17 07:30:14.856424 7fc50b991700 0 log [WRN] : ino 10000f24fe3
> 2014-04-17 07:30:14.856427 7fc50b991700 0 log [WRN] : ino 10000f24fe4
> 2014-04-17 07:30:14.856428 7fc50b991700 0 log [WRN] : ino 10000f24fe5
>
> And the clients recovered!?
>
> I would be really interested to know what happened!
>
> Georg
>
>
> On 17.04.2014 09:45, Georg Höllrigl wrote:
>>
>> Hello Greg,
>>
>> I've searched, but I don't see any backtraces... I've tried to get some
>> more info out of the logs. I really hope there is something interesting
>> in them:
>>
>> It all started two days ago with an authentication error:
>>
>> 2014-04-14 21:08:55.929396 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
>> 2014-04-14 21:09:07.989547 7fd93b62e700 1 mds.0.0 replay_done (as standby)
>> 2014-04-14 21:09:08.989647 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
>> 2014-04-14 21:09:10.633786 7fd93b62e700 1 mds.0.0 replay_done (as standby)
>> 2014-04-14 21:09:11.633886 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
>> 2014-04-14 21:09:17.995105 7fd93f644700 0 mds.0.0 handle_mds_beacon no longer laggy
>> 2014-04-14 21:09:39.798798 7fd93f644700 0 monclient: hunting for new mon
>> 2014-04-14 21:09:39.955078 7fd93f644700 1 mds.-1.-1 handle_mds_map i (10.0.1.107:6800/16503) dne in the mdsmap, respawning myself
>> 2014-04-14 21:09:39.955094 7fd93f644700 1 mds.-1.-1 respawn
>> 2014-04-14 21:09:39.955106 7fd93f644700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
>> 2014-04-14 21:09:39.955109 7fd93f644700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
>> 2014-04-14 21:09:39.955110 7fd93f644700 1 mds.-1.-1 1: '-i'
>> 2014-04-14 21:09:39.955112 7fd93f644700 1 mds.-1.-1 2: 'ceph-m-02'
>> 2014-04-14 21:09:39.955113 7fd93f644700 1 mds.-1.-1 3: '--pid-file'
>> 2014-04-14 21:09:39.955114 7fd93f644700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-m-02.pid'
>> 2014-04-14 21:09:39.955116 7fd93f644700 1 mds.-1.-1 5: '-c'
>> 2014-04-14 21:09:39.955117 7fd93f644700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
>> 2014-04-14 21:09:39.979138 7fd93f644700 1 mds.-1.-1 cwd /
>> 2014-04-14 19:09:40.922683 7f8ba9973780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 16505
>> 2014-04-14 19:09:40.975024 7f8ba9973780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-14 19:09:40.975070 7f8ba9973780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>>
>> That was fixed by restarting the mds (+ the whole server).
>>
>> 2014-04-15 07:07:15.948650 7f9fdec0d780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 506
>> 2014-04-15 07:07:15.954386 7f9fdec0d780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:07:15.954422 7f9fdec0d780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:15:49.177861 7fe8a1d60780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 26401
>> 2014-04-15 07:15:49.184027 7fe8a1d60780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:15:49.184046 7fe8a1d60780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
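"failed to authenticate: (1) Operation not permitted" usually means the key the daemon presents doesn't match what the monitors have registered for it. Before restarting the whole server it can be worth comparing the two - a minimal check, assuming the default keyring location for this mds id:

# the key the monitors have for this mds
ceph auth get mds.ceph-m-02
# the key the daemon actually presents
cat /var/lib/ceph/mds/ceph-ceph-m-02/keyring

If they differ, rewriting the local file from the monitors' copy (ceph auth get mds.ceph-m-02 -o /var/lib/ceph/mds/ceph-ceph-m-02/keyring) should clear the error without a reboot.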
>> 2014-04-15 07:17:32.598031 7fab123e6780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 30531
>> 2014-04-15 07:17:32.604560 7fab123e6780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:17:32.604592 7fab123e6780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:21:56.099203 7fd37b951780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11335
>> 2014-04-15 07:21:56.105229 7fd37b951780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:21:56.105254 7fd37b951780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:22:09.345800 7f23392ef780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11461
>> 2014-04-15 07:22:09.390001 7f23392ef780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:22:09.391087 7f23392ef780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:28:01.762191 7fab6d14b780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 28263
>> 2014-04-15 07:28:01.779485 7fab6d14b780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-15 07:28:01.779507 7fab6d14b780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>> 2014-04-15 07:35:49.065110 7fe4f6b0d780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 1233
>> 2014-04-15 07:35:52.191856 7fe4f6b05700 0 -- 10.0.1.107:6800/1233 >> 10.0.1.108:6789/0 pipe(0x2f9f500 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x2f81580).fault
>> 2014-04-15 07:35:56.352553 7fe4f1b96700 1 mds.-1.0 handle_mds_map standby
>> 2014-04-15 07:35:56.419499 7fe4f1b96700 1 mds.0.0 handle_mds_map i am now mds.7905854.0replaying mds.0.0
>> 2014-04-15 07:35:56.419507 7fe4f1b96700 1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
>> 2014-04-15 07:35:56.419512 7fe4f1b96700 1 mds.0.0 replay_start
>> 2014-04-15 07:35:56.425229 7fe4f1b96700 1 mds.0.0 recovery set is
>> 2014-04-15 07:35:56.425241 7fe4f1b96700 1 mds.0.0 need osdmap epoch 1391, have 3696
>> 2014-04-15 07:35:56.497573 7fe4f1b96700 0 mds.0.cache creating system inode with ino:100
>> 2014-04-15 07:35:56.501393 7fe4f1b96700 0 mds.0.cache creating system inode with ino:1
>> 2014-04-15 07:35:58.294776 7fe4ed67b700 1 mds.0.0 replay_done (as standby)
>> 2014-04-15 07:35:59.294874 7fe4ef990700 1 mds.0.0 standby_replay_restart (as standby)
>>
>> Here everything runs normally...
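The "up:standby --> up:standby-replay" transition above is the standby daemon starting to tail the active MDS's journal. For reference, this is roughly the ceph.conf stanza that enables it (0.72-era option names; the mds id here is just your second daemon as an example):

[mds.ceph-m-02]
    mds standby replay = true
    # with a single active mds, rank 0 is the only rank to follow
    mds standby for rank = 0

That matches the replay_done / standby_replay_restart loop in the log: the standby keeps re-reading the active MDS's journal so it can take over quickly.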
>>
>>
>> 2014-04-15 20:26:19.724564 7fbcca70d700 0 mds.0.3 handle_mds_beacon no longer laggy
>> 2014-04-16 04:31:13.485103 7fbcc55ed700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.81:0/61463 pipe(0xf930780 sd=55 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db6e0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:13.485222 7fbcc46de700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.68:0/4128280551 pipe(0xf6ac780 sd=28 :6800 s=2 pgs=4 cs=1 l=0 c=0xe31d6e0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:14.506269 7fbcc7505700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.42:0/2817517925 pipe(0x1a7c9780 sd=36 :6800 s=2 pgs=2 cs=1 l=0 c=0xf6749a0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:19.852504 7fbcc12aa700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.156:0/4413 pipe(0x128165c80 sd=66 :6800 s=2 pgs=2 cs=1 l=0 c=0x12b5b9dc0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.971665 7fbcc30c8700 0 -- 10.0.1.108:6800/22033 >> 10.0.0.25:0/31945 pipe(0xf930500 sd=56 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db420).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.971680 7fbcc1bb3700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.39:0/13485 pipe(0xf6aca00 sd=29 :6800 s=2 pgs=2 cs=1 l=0 c=0xe31d2c0).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.971714 7fbcc63f4700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.40:0/12395 pipe(0x31fe000 sd=46 :6800 s=2 pgs=2 cs=1 l=0 c=0x726b160).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.979886 7fbcc2cc4700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.152:0/14639 pipe(0x1a7c9000 sd=64 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db580).fault with nothing to send, going to standby
>> 2014-04-16 04:31:38.984613 7fbcc37cf700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.66:0/114165773 pipe(0x16624000 sd=48 :6800 s=2 pgs=4 cs=1 l=0 c=0xf674c60).fault with nothing to send, going to standby
>> 2014-04-16 04:32:41.689287 7fbcca70d700 1 mds.-1.-1 handle_mds_map i (10.0.1.108:6800/22033) dne in the mdsmap, respawning myself
>> 2014-04-16 04:32:41.689294 7fbcca70d700 1 mds.-1.-1 respawn
>> 2014-04-16 04:32:41.689301 7fbcca70d700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
>> 2014-04-16 04:32:41.689304 7fbcca70d700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
>> 2014-04-16 04:32:41.689306 7fbcca70d700 1 mds.-1.-1 1: '-i'
>> 2014-04-16 04:32:41.689308 7fbcca70d700 1 mds.-1.-1 2: 'ceph-m-03'
>> 2014-04-16 04:32:41.689309 7fbcca70d700 1 mds.-1.-1 3: '--pid-file'
>> 2014-04-16 04:32:41.689310 7fbcca70d700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-m-03.pid'
>> 2014-04-16 04:32:41.689312 7fbcca70d700 1 mds.-1.-1 5: '-c'
>> 2014-04-16 04:32:41.689313 7fbcca70d700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
>> 2014-04-16 04:32:41.742899 7fbcca70d700 1 mds.-1.-1 cwd /
>> 2014-04-16 02:32:43.116287 7ff7375ad780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 22035
>> 2014-04-16 02:32:43.226015 7ff7375ad780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
>> 2014-04-16 02:32:43.226040 7ff7375ad780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
>>
>> Nothing worked - mounting not possible. Restarting of the process not possible.
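The "dne in the mdsmap" line means the monitors dropped the daemon from the map entirely ("does not exist"), which is why it respawns - and with the authentication failures it can never get back in. While it's stuck like this, the monitors' view of the mds map is the first thing to check; these are standard commands, nothing specific to this cluster:

# short summary and full dump of the mds map as the monitors see it
ceph mds stat
ceph mds dump
# shows which mds is considered laggy and why the cluster is degraded
ceph health detail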
>> Logs filled with:
>>
>> 2014-04-16 12:29:46.417441 7f1de27f5700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.108:6789/0 pipe(0xabc80000 sd=250 :44413 s=4 pgs=0 cs=0 l=1 c=0x178870580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.065881 7f1de22f0700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x25b25280 sd=277 :39688 s=4 pgs=0 cs=0 l=1 c=0x159b4c840).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.231550 7f1de05d3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x1049a5280 sd=252 :39919 s=4 pgs=0 cs=0 l=1 c=0x4e53d840).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.359261 7f1de1ae8700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x103d3b500 sd=244 :40070 s=4 pgs=0 cs=0 l=1 c=0x1130f92c0).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.486590 7f1de30fe700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x47889780 sd=245 :40175 s=4 pgs=0 cs=0 l=1 c=0xc6c09420).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:47.904427 7f1de3603700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x88069000 sd=262 :40554 s=4 pgs=0 cs=0 l=1 c=0x116652580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:48.272279 7f1de12e0700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0xed8f9c80 sd=257 :40929 s=4 pgs=0 cs=0 l=1 c=0xf2c2f160).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:49.822148 7f1de18e6700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x110926780 sd=264 :42222 s=4 pgs=0 cs=0 l=1 c=0x10508e9a0).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:50.769693 7f1ddf5c3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x6dbc6280 sd=243 :43169 s=4 pgs=0 cs=0 l=1 c=0x1358ed580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:51.621605 7f1de05d3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x21154780 sd=243 :43929 s=4 pgs=0 cs=0 l=1 c=0xecb1a580).connect got RESETSESSION but no longer connecting
>> 2014-04-16 12:29:51.886405 7f1dea867700 -1 mds.0.13 *** got signal Terminated ***
>> 2014-04-16 12:29:51.886894 7f1dea867700 1 mds.0.13 suicide. wanted down:dne, now up:rejoin
>>
>> Also see lots of these:
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.900657 7f937a5e1700 0 log [ERR] : dir 10000c4639d.10000c4639d object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.913245 7f937a5e1700 0 log [ERR] : dir 10000c4617e.10000c4617e object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.925811 7f937a5e1700 0 log [ERR] : dir 10000c45d08.10000c45d08 object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.941476 7f937a5e1700 0 log [ERR] : dir 10000c45d9e.10000c45d9e object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.956158 7f937a5e1700 0 log [ERR] : dir 10000c461e5.10000c461e5 object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.968524 7f937a5e1700 0 log [ERR] : dir 10000c46608.10000c46608 object missing on disk; some files may be lost
>> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.979229 7f937a5e1700 0 log [ERR] : dir 10000c468b6.10000c468b6 object missing on disk; some files may be lost
>>
>> At the moment, I've only one mds running - but clients (mainly using fuse) can't connect.
>>
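The "object missing on disk; some files may be lost" errors mean the MDS asked the OSDs for a directory object that the metadata pool no longer contains. That can be confirmed from the rados side - this assumes the default metadata pool name "metadata"; the object names start with the inode number from the log line:

# look for metadata objects belonging to one of the inodes from the log
# (this scans the whole object listing, so it can be slow on a large pool)
rados -p metadata ls | grep 10000c4639d

If nothing comes back, that directory object really is gone, and the damage should be limited to the affected directories rather than the whole filesystem.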
Looks like you enabled directory fragmentation, which is buggy in ceph version 0.72.
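If you want to confirm, you can check the setting over the admin socket while an mds is running - a quick check, assuming the default socket path (the option is off by default in 0.72):

# directory fragmentation is controlled by mds_bal_frag
ceph --admin-daemon /var/run/ceph/ceph-mds.ceph-m-02.asok config show | grep mds_bal_frag

Setting "mds bal frag = false" in ceph.conf prevents new directories from being fragmented, but note it does not repair directories that are already fragmented.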
Regards
Yan, Zheng

>>
>> Regards,
>> Georg
>>
>>
>>
>> On 16.04.2014 16:27, Gregory Farnum wrote:
>>>
>>> What's the backtrace from the MDS crash?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Wed, Apr 16, 2014 at 7:11 AM, Georg Höllrigl
>>> <[email protected]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> We're using Ceph MDS with one active and one standby server - a day ago
>>>> one of the mds daemons crashed and I restarted it.
>>>> Tonight it crashed again, and a few hours later the second mds crashed too.
>>>>
>>>> #ceph -v
>>>> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>>>>
>>>> At the moment cephfs is dead, with the following health status:
>>>>
>>>> #ceph -s
>>>>     cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
>>>>      health HEALTH_WARN mds cluster is degraded; mds c is laggy
>>>>      monmap e3: 3 mons at {ceph-m-01=10.0.0.176:6789/0,ceph-m-02=10.0.1.107:6789/0,ceph-m-03=10.0.1.108:6789/0}, election epoch 6274, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03
>>>>      mdsmap e2055: 1/1/1 up {0=ceph-m-03=up:rejoin(laggy or crashed)}
>>>>      osdmap e3752: 39 osds: 39 up, 39 in
>>>>      pgmap v3277576: 8328 pgs, 17 pools, 6461 GB data, 17066 kobjects
>>>>            13066 GB used, 78176 GB / 91243 GB avail
>>>>            8328 active+clean
>>>>   client io 1193 B/s rd, 0 op/s
>>>>
>>>> I couldn't really find any useful information in the log files or in the
>>>> documentation. Any ideas how to get cephfs up and running again?
>>>>
>>>> Here is part of the mds log:
>>>> 2014-04-16 14:07:05.603501 7ff184c64700 1 mds.0.server reconnect gave up on client.7846580 10.0.1.152:0/14639
>>>> 2014-04-16 14:07:05.603525 7ff184c64700 1 mds.0.46 reconnect_done
>>>> 2014-04-16 14:07:05.674990 7ff186d69700 1 mds.0.46 handle_mds_map i am now mds.0.46
>>>> 2014-04-16 14:07:05.674996 7ff186d69700 1 mds.0.46 handle_mds_map state change up:reconnect --> up:rejoin
>>>> 2014-04-16 14:07:05.674998 7ff186d69700 1 mds.0.46 rejoin_start
>>>> 2014-04-16 14:07:22.347521 7ff17f825700 0 -- 10.0.1.107:6815/17325 >> 10.0.1.68:0/4128280551 pipe(0x5e2ac80 sd=930 :6815 s=2 pgs=153 cs=1 l=0 c=0x5e2e160).fault with nothing to send, going to standby
>>>>
>>>> Any ideas, how to solve "laggy or crashed"?
>>>>
>>>> Georg

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
