On Thu, Apr 17, 2014 at 12:45 AM, Georg Höllrigl <[email protected]> wrote:
> Hello Greg,
>
> I've searched - but don't see any backtraces... I've tried to get some more
> info out of the logs. I really hope, there is something interesting in it:
>
> It all started two days ago with an authentication error:
>
> 2014-04-14 21:08:55.929396 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
> 2014-04-14 21:09:07.989547 7fd93b62e700 1 mds.0.0 replay_done (as standby)
> 2014-04-14 21:09:08.989647 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
> 2014-04-14 21:09:10.633786 7fd93b62e700 1 mds.0.0 replay_done (as standby)
> 2014-04-14 21:09:11.633886 7fd93d53f700 1 mds.0.0 standby_replay_restart (as standby)
> 2014-04-14 21:09:17.995105 7fd93f644700 0 mds.0.0 handle_mds_beacon no longer laggy
> 2014-04-14 21:09:39.798798 7fd93f644700 0 monclient: hunting for new mon
> 2014-04-14 21:09:39.955078 7fd93f644700 1 mds.-1.-1 handle_mds_map i (10.0.1.107:6800/16503) dne in the mdsmap, respawning myself
> 2014-04-14 21:09:39.955094 7fd93f644700 1 mds.-1.-1 respawn
> 2014-04-14 21:09:39.955106 7fd93f644700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
> 2014-04-14 21:09:39.955109 7fd93f644700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
> 2014-04-14 21:09:39.955110 7fd93f644700 1 mds.-1.-1 1: '-i'
> 2014-04-14 21:09:39.955112 7fd93f644700 1 mds.-1.-1 2: 'ceph-m-02'
> 2014-04-14 21:09:39.955113 7fd93f644700 1 mds.-1.-1 3: '--pid-file'
> 2014-04-14 21:09:39.955114 7fd93f644700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-m-02.pid'
> 2014-04-14 21:09:39.955116 7fd93f644700 1 mds.-1.-1 5: '-c'
> 2014-04-14 21:09:39.955117 7fd93f644700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
> 2014-04-14 21:09:39.979138 7fd93f644700 1 mds.-1.-1 cwd /
> 2014-04-14 19:09:40.922683 7f8ba9973780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 16505
> 2014-04-14 19:09:40.975024 7f8ba9973780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-14 19:09:40.975070 7f8ba9973780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
Well, as it says, this is a failure to authenticate with the cluster. It could conceivably be a result of a huge clock skew, or maybe the keyring wasn't accessible for some reason...

>
> That was fixed with restarting mds (+ the whole server).
>
> 2014-04-15 07:07:15.948650 7f9fdec0d780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 506
> 2014-04-15 07:07:15.954386 7f9fdec0d780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-15 07:07:15.954422 7f9fdec0d780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
> 2014-04-15 07:15:49.177861 7fe8a1d60780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 26401
> 2014-04-15 07:15:49.184027 7fe8a1d60780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-15 07:15:49.184046 7fe8a1d60780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
> 2014-04-15 07:17:32.598031 7fab123e6780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 30531
> 2014-04-15 07:17:32.604560 7fab123e6780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-15 07:17:32.604592 7fab123e6780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
> 2014-04-15 07:21:56.099203 7fd37b951780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11335
> 2014-04-15 07:21:56.105229 7fd37b951780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-15 07:21:56.105254 7fd37b951780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
> 2014-04-15 07:22:09.345800 7f23392ef780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 11461
> 2014-04-15 07:22:09.390001 7f23392ef780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-15 07:22:09.391087 7f23392ef780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
> 2014-04-15 07:28:01.762191 7fab6d14b780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 28263
> 2014-04-15 07:28:01.779485 7fab6d14b780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-15 07:28:01.779507 7fab6d14b780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
> 2014-04-15 07:35:49.065110 7fe4f6b0d780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 1233
> 2014-04-15 07:35:52.191856 7fe4f6b05700 0 -- 10.0.1.107:6800/1233 >> 10.0.1.108:6789/0 pipe(0x2f9f500 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x2f81580).fault
> 2014-04-15 07:35:56.352553 7fe4f1b96700 1 mds.-1.0 handle_mds_map standby
> 2014-04-15 07:35:56.419499 7fe4f1b96700 1 mds.0.0 handle_mds_map i am now mds.7905854.0replaying mds.0.0
> 2014-04-15 07:35:56.419507 7fe4f1b96700 1 mds.0.0 handle_mds_map state change up:standby --> up:standby-replay
> 2014-04-15 07:35:56.419512 7fe4f1b96700 1 mds.0.0 replay_start
> 2014-04-15 07:35:56.425229 7fe4f1b96700 1 mds.0.0 recovery set is
> 2014-04-15 07:35:56.425241 7fe4f1b96700 1 mds.0.0 need osdmap epoch 1391, have 3696
> 2014-04-15 07:35:56.497573 7fe4f1b96700 0 mds.0.cache creating system inode with ino:100
> 2014-04-15 07:35:56.501393 7fe4f1b96700 0 mds.0.cache creating system inode with ino:1
> 2014-04-15 07:35:58.294776 7fe4ed67b700 1 mds.0.0 replay_done (as standby)
> 2014-04-15 07:35:59.294874 7fe4ef990700 1 mds.0.0 standby_replay_restart (as standby)
>
> Here everything runs normal...

And that's the standby doing what it's supposed to.
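As an aside: when a daemon is stuck in a respawn/auth-failure loop like the one above, it can help to quantify it before digging into keyrings and clocks. A minimal sketch that counts the failures in a log; the sample lines are copied from the excerpts above, and /tmp/mds-sample.log is just a stand-in for the real MDS log path on your host:

```shell
# Stand-in sample log; in practice point this at the real MDS log,
# e.g. /var/log/ceph/ceph-mds.ceph-m-02.log (path is an assumption).
cat > /tmp/mds-sample.log <<'EOF'
2014-04-15 07:07:15.954386 7f9fdec0d780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
2014-04-15 07:07:15.954422 7f9fdec0d780 1 mds.-1.0 suicide. wanted down:dne, now up:boot
2014-04-15 07:15:49.184027 7fe8a1d60780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
EOF

# Count the auth failures, then list when each one happened.
grep -c 'failed to authenticate' /tmp/mds-sample.log
grep 'failed to authenticate' /tmp/mds-sample.log | awk '{print $1, $2}'
```

If the failures cluster right after every restart, the keyring file and the host's NTP sync are the first things I'd check; `ceph auth list` from a working admin node will tell you whether the mds key the daemon is supposed to use still exists in the cluster.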
> 2014-04-15 20:26:19.724564 7fbcca70d700 0 mds.0.3 handle_mds_beacon no longer laggy
> 2014-04-16 04:31:13.485103 7fbcc55ed700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.81:0/61463 pipe(0xf930780 sd=55 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db6e0).fault with nothing to send, going to standby
> 2014-04-16 04:31:13.485222 7fbcc46de700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.68:0/4128280551 pipe(0xf6ac780 sd=28 :6800 s=2 pgs=4 cs=1 l=0 c=0xe31d6e0).fault with nothing to send, going to standby
> 2014-04-16 04:31:14.506269 7fbcc7505700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.42:0/2817517925 pipe(0x1a7c9780 sd=36 :6800 s=2 pgs=2 cs=1 l=0 c=0xf6749a0).fault with nothing to send, going to standby
> 2014-04-16 04:31:19.852504 7fbcc12aa700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.156:0/4413 pipe(0x128165c80 sd=66 :6800 s=2 pgs=2 cs=1 l=0 c=0x12b5b9dc0).fault with nothing to send, going to standby
> 2014-04-16 04:31:38.971665 7fbcc30c8700 0 -- 10.0.1.108:6800/22033 >> 10.0.0.25:0/31945 pipe(0xf930500 sd=56 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db420).fault with nothing to send, going to standby
> 2014-04-16 04:31:38.971680 7fbcc1bb3700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.39:0/13485 pipe(0xf6aca00 sd=29 :6800 s=2 pgs=2 cs=1 l=0 c=0xe31d2c0).fault with nothing to send, going to standby
> 2014-04-16 04:31:38.971714 7fbcc63f4700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.40:0/12395 pipe(0x31fe000 sd=46 :6800 s=2 pgs=2 cs=1 l=0 c=0x726b160).fault with nothing to send, going to standby
> 2014-04-16 04:31:38.979886 7fbcc2cc4700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.152:0/14639 pipe(0x1a7c9000 sd=64 :6800 s=2 pgs=2 cs=1 l=0 c=0x71db580).fault with nothing to send, going to standby
> 2014-04-16 04:31:38.984613 7fbcc37cf700 0 -- 10.0.1.108:6800/22033 >> 10.0.1.66:0/114165773 pipe(0x16624000 sd=48 :6800 s=2 pgs=4 cs=1 l=0 c=0xf674c60).fault with nothing to send, going to standby
> 2014-04-16 04:32:41.689287 7fbcca70d700 1 mds.-1.-1 handle_mds_map i (10.0.1.108:6800/22033) dne in the mdsmap, respawning myself
> 2014-04-16 04:32:41.689294 7fbcca70d700 1 mds.-1.-1 respawn
> 2014-04-16 04:32:41.689301 7fbcca70d700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
> 2014-04-16 04:32:41.689304 7fbcca70d700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
> 2014-04-16 04:32:41.689306 7fbcca70d700 1 mds.-1.-1 1: '-i'
> 2014-04-16 04:32:41.689308 7fbcca70d700 1 mds.-1.-1 2: 'ceph-m-03'
> 2014-04-16 04:32:41.689309 7fbcca70d700 1 mds.-1.-1 3: '--pid-file'
> 2014-04-16 04:32:41.689310 7fbcca70d700 1 mds.-1.-1 4: '/var/run/ceph/mds.ceph-m-03.pid'
> 2014-04-16 04:32:41.689312 7fbcca70d700 1 mds.-1.-1 5: '-c'
> 2014-04-16 04:32:41.689313 7fbcca70d700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
> 2014-04-16 04:32:41.742899 7fbcca70d700 1 mds.-1.-1 cwd /
> 2014-04-16 02:32:43.116287 7ff7375ad780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 22035
> 2014-04-16 02:32:43.226015 7ff7375ad780 -1 mds.-1.0 ERROR: failed to authenticate: (1) Operation not permitted
> 2014-04-16 02:32:43.226040 7ff7375ad780 1 mds.-1.0 suicide. wanted down:dne, now up:boot

Another auth failure.

>
> Nothing worked - mounting not possible. Restarting of the process not possible.
> Logs filled with
>
> 2014-04-16 12:29:46.417441 7f1de27f5700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.108:6789/0 pipe(0xabc80000 sd=250 :44413 s=4 pgs=0 cs=0 l=1 c=0x178870580).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:47.065881 7f1de22f0700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x25b25280 sd=277 :39688 s=4 pgs=0 cs=0 l=1 c=0x159b4c840).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:47.231550 7f1de05d3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x1049a5280 sd=252 :39919 s=4 pgs=0 cs=0 l=1 c=0x4e53d840).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:47.359261 7f1de1ae8700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x103d3b500 sd=244 :40070 s=4 pgs=0 cs=0 l=1 c=0x1130f92c0).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:47.486590 7f1de30fe700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x47889780 sd=245 :40175 s=4 pgs=0 cs=0 l=1 c=0xc6c09420).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:47.904427 7f1de3603700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x88069000 sd=262 :40554 s=4 pgs=0 cs=0 l=1 c=0x116652580).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:48.272279 7f1de12e0700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0xed8f9c80 sd=257 :40929 s=4 pgs=0 cs=0 l=1 c=0xf2c2f160).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:49.822148 7f1de18e6700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x110926780 sd=264 :42222 s=4 pgs=0 cs=0 l=1 c=0x10508e9a0).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:50.769693 7f1ddf5c3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x6dbc6280 sd=243 :43169 s=4 pgs=0 cs=0 l=1 c=0x1358ed580).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:51.621605 7f1de05d3700 0 -- 10.0.1.107:6803/21953 >> 10.0.1.107:6789/0 pipe(0x21154780 sd=243 :43929 s=4 pgs=0 cs=0 l=1 c=0xecb1a580).connect got RESETSESSION but no longer connecting
> 2014-04-16 12:29:51.886405 7f1dea867700 -1 mds.0.13 *** got signal Terminated ***
> 2014-04-16 12:29:51.886894 7f1dea867700 1 mds.0.13 suicide. wanted down:dne, now up:rejoin

And something in the external environment killed it for some reason. The RESETSESSION also looks a little weird, but I can't see any more of the state.

>
> Also see lots of these:
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.900657 7f937a5e1700 0 log [ERR] : dir 10000c4639d.10000c4639d object missing on disk; some files may be lost
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.913245 7f937a5e1700 0 log [ERR] : dir 10000c4617e.10000c4617e object missing on disk; some files may be lost
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.925811 7f937a5e1700 0 log [ERR] : dir 10000c45d08.10000c45d08 object missing on disk; some files may be lost
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.941476 7f937a5e1700 0 log [ERR] : dir 10000c45d9e.10000c45d9e object missing on disk; some files may be lost
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.956158 7f937a5e1700 0 log [ERR] : dir 10000c461e5.10000c461e5 object missing on disk; some files may be lost
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.968524 7f937a5e1700 0 log [ERR] : dir 10000c46608.10000c46608 object missing on disk; some files may be lost
> ceph-mds.ceph-m-02.log.1.gz:2014-04-16 13:22:44.979229 7f937a5e1700 0 log [ERR] : dir 10000c468b6.10000c468b6 object missing on disk; some files may be lost

And that's exactly what it sounds like: the MDS isn't finding objects that are supposed to be in the RADOS cluster. Anyway, glad it fixed itself, but it sounds like you've got some infrastructure issues or something you need to sort out first.
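If anyone wants to follow up on those "object missing on disk" errors: each [ERR] line names the dir object the MDS expected to find in the metadata pool, so the log can be turned into a list of objects to probe. A rough sketch below; the sample lines are copied from the excerpt above, and both the pool name "metadata" and the idea of probing each object with `rados stat` are assumptions to verify against your own cluster before running anything:

```shell
# Stand-in sample of the [ERR] lines; in practice grep them out of the MDS log.
cat > /tmp/mds-err.log <<'EOF'
2014-04-16 13:22:44.900657 7f937a5e1700 0 log [ERR] : dir 10000c4639d.10000c4639d object missing on disk; some files may be lost
2014-04-16 13:22:44.913245 7f937a5e1700 0 log [ERR] : dir 10000c4617e.10000c4617e object missing on disk; some files may be lost
EOF

# Extract each missing dir object name and emit a probe command for it.
# (Actually running the emitted 'rados stat' commands needs a live cluster.)
awk '/object missing on disk/ { for (i = 1; i <= NF; i++) if ($i == "dir") print $(i+1) }' /tmp/mds-err.log |
while read -r obj; do
    echo "rados -p metadata stat $obj"
done
```

That at least tells you whether the objects are really gone from RADOS or just temporarily unreadable when the MDS looked.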
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
