[ceph-users] Second Ceph Berlin MeetUp
Hi,

the second meetup takes place on March 24. For more details please have a look at http://www.meetup.com/Ceph-Berlin/events/163029162/

Regards
--
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin
http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory disclosures per §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Managing Director: Peer Heinlein -- Registered office: Berlin
Re: [ceph-users] OSD Restarts cause excessively high load average and requests are blocked > 32 sec
Hi All,

I left out my OS/kernel version: Ubuntu 12.04.4 LTS w/ kernel 3.10.33-031033-generic (we upgrade our kernels to 3.10 due to Dell drivers).

Here's an example of starting all the OSDs after a reboot:

top - 09:10:51 up 2 min, 1 user, load average: 332.93, 112.28, 39.96
Tasks: 310 total, 1 running, 309 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.3%us, 32.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 17.2%hi, 0.0%si, 0.0%st
Mem: 32917276k total, 6331224k used, 26586052k free, 1332k buffers
Swap: 33496060k total, 0k used, 33496060k free, 1474084k cached

  PID USER PR NI  VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
15875 root 20  0  910m 381m 50m S   60  1.2 0:50.57 ceph-osd
 2996 root 20  0  867m 330m 44m S   59  1.0 0:58.32 ceph-osd
 4502 root 20  0  907m 372m 47m S   58  1.2 0:55.14 ceph-osd
12465 root 20  0  949m 418m 55m S   58  1.3 0:51.79 ceph-osd
 4171 root 20  0  886m 348m 45m S   57  1.1 0:56.17 ceph-osd
 3707 root 20  0  941m 405m 50m S   57  1.3 0:59.68 ceph-osd
 3560 root 20  0  924m 394m 51m S   56  1.2 0:59.37 ceph-osd
 4318 root 20  0  965m 435m 55m S   56  1.4 0:54.80 ceph-osd
 3337 root 20  0  935m 407m 51m S   56  1.3 1:01.96 ceph-osd
 3854 root 20  0  897m 366m 48m S   55  1.1 1:00.55 ceph-osd
 3143 root 20  0 1364m 424m 24m S   16  1.3 1:08.72 ceph-osd
 2509 root 20  0  652m 261m 62m S    2  0.8 0:26.42 ceph-mon
    4 root 20  0     0    0   0 S    0  0.0 0:00.08 kworker/0:0

Regards,
Quenten Grasso

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso
Sent: Tuesday, 18 March 2014 10:19 PM
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] OSD Restarts cause excessively high load average and requests are blocked > 32 sec

Hi All,

I'm trying to troubleshoot a strange issue with my Ceph cluster. We're running Ceph version 0.72.2. All nodes are Dell R515s w/ 6C AMD CPU and 32GB RAM, 12 x 3TB NearlineSAS drives, and 2 x 100GB Intel DC S3700 SSDs for journals. All pools have a replica count of 2 or better, e.g. metadata has a replica count of 3. I have 55 OSDs in the cluster across 5 nodes.

When I restart the OSDs on a single node (any node), the load average of that node shoots up to 230+ and the whole cluster starts blocking IO requests until it settles down, after which it's fine again.

Any ideas on why the load average goes so crazy and starts to block IO?

Snips from my ceph.conf:

[osd]
osd data = /var/ceph/osd.$id
osd journal size = 15000
osd mkfs type = xfs
osd mkfs options xfs = -i size=2048 -f
osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k
osd max backfills = 5
osd recovery max active = 3

[osd.0]
host = pbnerbd01
public addr = 10.100.96.10
cluster addr = 10.100.128.10
osd journal = /dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
devs = /dev/sda4

/end

Thanks,
Quenten
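P.S. One thing I'm planning to try, in case it helps anyone else hitting this (an untested sketch; adjust the restart command to your init system):

    # Keep CRUSH from marking stopped OSDs out, throttle recovery,
    # and restart the OSDs one at a time instead of all at once.
    ceph osd set noout
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
    for id in $(seq 0 11); do            # the 12 OSD ids on this node (example)
        service ceph restart osd.$id     # sysvinit; on upstart: restart ceph-osd id=$id
        sleep 30                         # let peering settle before the next one
    done
    ceph osd unset noout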
Re: [ceph-users] RBD clone for OpenStack Nova ephemeral volumes
On 03/20/2014 02:07 PM, Dmitry Borodaenko wrote:
> The patch series that implemented the clone operation for RBD-backed
> ephemeral volumes in Nova did not make it into Icehouse. We have tried
> our best to help it land, but it was ultimately rejected. Furthermore,
> an additional requirement was imposed to make this patch series
> dependent on full support of Glance API v2 across Nova (due to its
> dependency on direct_url, which was introduced in v2).
>
> You can find the most recent discussion of this patch series in the
> FFE (feature freeze exception) thread on the openstack-dev ML:
> http://lists.openstack.org/pipermail/openstack-dev/2014-March/029127.html
>
> As I explained in that thread, I believe this feature is essential for
> using Ceph as a storage backend for Nova, so I'm going to try to keep
> it alive outside of the OpenStack mainline until it is allowed to land.
> I have created the rbd-ephemeral-clone branch in my nova repo fork on
> GitHub:
> https://github.com/angdraug/nova/tree/rbd-ephemeral-clone
>
> I will keep it rebased over nova master, and will create an
> rbd-ephemeral-clone-stable-icehouse branch to track the same patch
> series over nova stable/icehouse once it's branched. I also plan to
> make sure that this patch series is included in Mirantis OpenStack 5.0,
> which will be based on Icehouse.
>
> If you're interested in this feature, please review and test. Bug
> reports and patches are welcome, as long as their scope is limited to
> this patch series and is not applicable to mainline OpenStack.

Thanks for taking this on, Dmitry! Having rebased those patches many times during Icehouse, I can tell you it's often not trivial.

Do you think the imagehandler-based approach is best for Juno? I'm leaning towards the older way [1] for simplicity of review, and to avoid using Glance's v2 API by default. I doubt that full support for v2 will land very fast in nova, although I'd be happy to be proven wrong.

Josh

[1] https://review.openstack.org/#/c/46879/
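For anyone following the thread who hasn't looked at the feature itself: what the series automates is RBD's copy-on-write clone path, which you can approximate by hand with the rbd CLI (a sketch; the pool and image names below are made up, and the image must be format 2):

    # Hypothetical names; Nova with this series does the equivalent
    # instead of downloading the Glance image and re-importing it.
    rbd snap create images/fedora-20@base     # snapshot the Glance image
    rbd snap protect images/fedora-20@base    # protect the snapshot so clones can depend on it
    rbd clone images/fedora-20@base vms/instance-0001_disk   # COW clone used as the VM disk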
[ceph-users] MDS crash when client goes to sleep
When CephFS is mounted on a client and the client goes to sleep, the MDS segfaults. Has anyone seen this? Below is part of the MDS log. This happened on Emperor and on the recent 0.77 release. I am running Debian Wheezy with a 3.13 kernel from testing. What can I do so the whole system doesn't crash when a client goes to sleep (and it looks like a disconnect may do the same)? Let me know if you need any more info.

Regards,
Hong

   -43> 2014-03-20 20:08:42.463357 7fee3f0cf700  1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2 -- ?+0 0x1ee9f080 con 0x2e56580
   -42> 2014-03-20 20:08:42.463787 7fee411d4700  1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21764 mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2 108+0+0 (266728949 0 0) 0x1ee88dc0 con 0x2e56580
   -41> 2014-03-20 20:08:43.373099 7fee3f0cf700  2 mds.0.cache check_memory_usage total 665384, rss 503156, heap 24656, malloc 463874 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 62380 inodes have caps, 0 caps, 0 caps per inode
   -40> 2014-03-20 20:08:44.494963 7fee3d7c4700  1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept sd=18 192.168.1.101:52026/0
   -39> 2014-03-20 20:08:44.495033 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52026/0)
   -38> 2014-03-20 20:08:44.495565 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept we reset (peer sent cseq 2), sending RESETSESSION
   -37> 2014-03-20 20:08:44.496015 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=4 pgs=0 cs=0 l=0 c=0x1f0e2160).fault 0: Success
   -36> 2014-03-20 20:08:44.496099 7fee411d4700  5 mds.0.35 ms_handle_reset on 192.168.1.101:0/2113152127
   -35> 2014-03-20 20:08:44.496120 7fee411d4700  3 mds.0.35 ms_handle_reset closing connection for session client.6019 192.168.1.101:0/2113152127
   -34> 2014-03-20 20:08:44.496207 7fee411d4700  1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e2160 -- pipe dne
   -33> 2014-03-20 20:08:44.653628 7fee3d7c4700  1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept sd=18 192.168.1.101:52027/0
   -32> 2014-03-20 20:08:44.653677 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52027/0)
   -31> 2014-03-20 20:08:44.925618 7fee411d4700  1 -- 192.168.1.20:6801/17079 <== client.6019 192.168.1.101:0/2113152127 1 client_reconnect(77349 caps) v2 0+0+11032578 (0 0 3293767716) 0x2e92780 con 0x1f0e22c0
   -30> 2014-03-20 20:08:44.925682 7fee411d4700  1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close
   -29> 2014-03-20 20:08:44.925735 7fee411d4700  0 log [INF] : denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 2014-03-20 20:08:44.925679 (allowed interval 45)
   -28> 2014-03-20 20:08:44.925748 7fee411d4700  1 -- 192.168.1.20:6801/17079 --> 192.168.1.101:0/2113152127 -- client_session(close) v1 -- ?+0 0x3ea6540 con 0x1f0e22c0
   -27> 2014-03-20 20:08:44.927727 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).reader couldn't read tag, Success
   -26> 2014-03-20 20:08:44.927797 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).fault 0: Success
   -25> 2014-03-20 20:08:44.927849 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).fault, server, going to standby
   -24> 2014-03-20 20:08:46.372279 7fee401d2700 10 monclient: tick
   -23> 2014-03-20 20:08:46.372339 7fee401d2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-03-20 20:08:16.372333)
   -22> 2014-03-20 20:08:46.372373 7fee401d2700 10 monclient: renew subs? (now: 2014-03-20 20:08:46.372372; renew after: 2014-03-20 20:09:56.370811) -- no
   -21> 2014-03-20 20:08:46.372403 7fee401d2700 10 log_queue is 1 last_log 2 sent 1 num 1 unsent 1 sending 1
   -20> 2014-03-20 20:08:46.372421 7fee401d2700 10 will send 2014-03-20 20:08:44.925741 mds.0 192.168.1.20:6801/17079 2 : [INF] denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 2014-03-20 20:08:44.925679 (allowed interval 45)
   -19> 2014-03-20 20:08:46.372466 7fee401d2700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -18> 2014-03-20 20:08:46.372483 7fee401d2700  1 -- 192.168.1.20:6801/17079 --
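If a more verbose log would help, I can capture one the next time it crashes. My understanding is that the usual ceph.conf settings for that are along these lines (a sketch; I haven't verified the exact levels needed):

    [mds]
    debug mds = 20   ; verbose MDS subsystem logging
    debug ms = 1     ; messenger (network) logging
    ; then restart ceph-mds, put the client to sleep again, and grab
    ; /var/log/ceph/ceph-mds.*.log from around the segfault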
Re: [ceph-users] MDS crash when client goes to sleep
Hi Hong,

May I know what happened to your MDS once it crashed? Was it able to recover from replay? We are also facing this issue and I am interested to know how to reproduce it.

Thanks.
Bazli

From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of hjcho616
Sent: Friday, March 21, 2014 10:29 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] MDS crash when client goes to sleep

When CephFS is mounted on a client and the client goes to sleep, the MDS segfaults. [...]
Re: [ceph-users] MDS crash when client goes to sleep
Did you see any messages in dmesg saying ceph-mds is respawning, or anything like that?

Regards,
Luke

On Mar 21, 2014, at 11:09 AM, hjcho616 <hjcho...@yahoo.com> wrote:

On the client, I was no longer able to access the filesystem. It would hang, which makes sense since the MDS had crashed. I tried running 3 MDS daemons on the same machine. Two crashed and one appears to be hung up(?). ceph health says the MDS is in a degraded state when that happens. I was able to recover by restarting every node.

I currently have three machines: one with the MDS and MON, and two with OSDs. It fails every time my client machine goes to sleep. If you need me to run something, let me know what and how.

Regards,
Hong

From: Mohd Bazli Ab Karim <bazli.abka...@mimos.my>
To: hjcho616 <hjcho...@yahoo.com>; ceph-users@lists.ceph.com
Sent: Thursday, March 20, 2014 9:40 PM
Subject: RE: [ceph-users] MDS crash when client goes to sleep

Hi Hong, May I know what happened to your MDS once it crashed? [...]
Re: [ceph-users] MDS crash when client goes to sleep
Nope, just these segfaults:

[149884.709608] ceph-mds[17366]: segfault at 200 ip 7f09de9d60b8 sp 7f09db461520 error 4 in libgcc_s.so.1[7f09de9c7000+15000]
[211263.265402] ceph-mds[17135]: segfault at 200 ip 7f59eec280b8 sp 7f59eb6b3520 error 4 in libgcc_s.so.1[7f59eec19000+15000]
[214638.927759] ceph-mds[16896]: segfault at 200 ip 7fcb2c89e0b8 sp 7fcb29329520 error 4 in libgcc_s.so.1[7fcb2c88f000+15000]
[289338.461271] ceph-mds[20878]: segfault at 200 ip 7f4b7211c0b8 sp 7f4b6eba7520 error 4 in libgcc_s.so.1[7f4b7210d000+15000]
[373738.961475] ceph-mds[21341]: segfault at 200 ip 7f36c3d480b8 sp 7f36c07d3520 error 4 in libgcc_s.so.1[7f36c3d39000+15000]

Regards,
Hong

From: Luke Jing Yuan <jyl...@mimos.my>
To: hjcho616 <hjcho...@yahoo.com>
Cc: Mohd Bazli Ab Karim <bazli.abka...@mimos.my>; ceph-users@lists.ceph.com
Sent: Thursday, March 20, 2014 10:53 PM
Subject: Re: [ceph-users] MDS crash when client goes to sleep

Did you see any messages in dmesg saying ceph-mds is respawning, or anything like that? [...]
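P.S. If a backtrace would help, I can try to capture a core the next time it happens. I believe the usual procedure on Debian is something like this (a sketch; the ceph-dbg package name and core path are assumptions on my part):

    apt-get install ceph-dbg gdb           # debug symbols for the ceph daemons, plus gdb
    ulimit -c unlimited                    # allow core dumps before restarting ceph-mds
    # ... reproduce the crash, then:
    gdb /usr/bin/ceph-mds /path/to/core    # load the MDS binary with the core file
    # (gdb) thread apply all bt            # print a backtrace for every thread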