[ceph-users] Second Ceph Berlin MeetUp

2014-03-20 Thread Robert Sander
Hi,

the second meetup takes place on March 24.

For more details please have a look at
http://www.meetup.com/Ceph-Berlin/events/163029162/

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory information per §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Managing Director: Peer Heinlein -- Registered office: Berlin



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Restarts cause excessively high load average and requests are blocked > 32 sec

2014-03-20 Thread Quenten Grasso
Hi All,

I left out my OS/kernel version: Ubuntu 12.04.4 LTS with kernel 
3.10.33-031033-generic (we upgrade our kernels to 3.10 because of Dell drivers).

Here's an example of starting all the OSDs after a reboot.

top - 09:10:51 up 2 min,  1 user,  load average: 332.93, 112.28, 39.96
Tasks: 310 total,   1 running, 309 sleeping,   0 stopped,   0 zombie
Cpu(s): 50.3%us, 32.5%sy,  0.0%ni,  0.0%id,  0.0%wa, 17.2%hi,  0.0%si,  0.0%st
Mem:  32917276k total,  6331224k used, 26586052k free, 1332k buffers
Swap: 33496060k total,0k used, 33496060k free,  1474084k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15875 root  20   0  910m 381m  50m S   60  1.2   0:50.57 ceph-osd
2996 root  20   0  867m 330m  44m S   59  1.0   0:58.32 ceph-osd
4502 root  20   0  907m 372m  47m S   58  1.2   0:55.14 ceph-osd
12465 root  20   0  949m 418m  55m S   58  1.3   0:51.79 ceph-osd
4171 root  20   0  886m 348m  45m S   57  1.1   0:56.17 ceph-osd
3707 root  20   0  941m 405m  50m S   57  1.3   0:59.68 ceph-osd
3560 root  20   0  924m 394m  51m S   56  1.2   0:59.37 ceph-osd
4318 root  20   0  965m 435m  55m S   56  1.4   0:54.80 ceph-osd
3337 root  20   0  935m 407m  51m S   56  1.3   1:01.96 ceph-osd
3854 root  20   0  897m 366m  48m S   55  1.1   1:00.55 ceph-osd
3143 root  20   0 1364m 424m  24m S   16  1.3   1:08.72 ceph-osd
 2509 root  20   0  652m 261m  62m S    2  0.8   0:26.42 ceph-mon
    4 root  20   0     0    0    0 S    0  0.0   0:00.08 kworker/0:0

Regards,
Quenten Grasso
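
For reference, a rough sketch of bringing the OSDs on a host up one at a time instead 
of all at once, which sometimes softens a startup spike like the one above (assuming 
the sysvinit-style "service ceph" wrapper and the "osd data = /var/ceph/osd.$id" 
layout from the config quoted below; the 30 second pause is an arbitrary choice):

for id in $(ls -d /var/ceph/osd.* | sed 's#.*/osd\.##'); do
    service ceph start osd.$id    # or "start ceph-osd id=$id" under upstart
    sleep 30
done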

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Quenten Grasso
Sent: Tuesday, 18 March 2014 10:19 PM
To: 'ceph-users@lists.ceph.com'
Subject: [ceph-users] OSD Restarts cause excessively high load average and 
requests are blocked > 32 sec

Hi All,

I'm trying to troubleshoot a strange issue with my Ceph cluster.

We're running Ceph version 0.72.2.
All nodes are Dell R515s w/ a 6-core AMD CPU, 32GB RAM, 12 x 3TB nearline SAS 
drives and 2 x 100GB Intel DC S3700 SSDs for journals.
All pools have a replica count of 2 or better, e.g. the metadata pool uses a replica count of 3.

I have 55 OSDs in the cluster across 5 nodes. When I restart the OSDs on a 
single node (any node), the load average of that node shoots up to 230+ and the 
whole cluster starts blocking IO requests until it settles down and it's fine 
again.

Any ideas on why the load average goes so crazy & starts to block IO?


snips from my ceph.conf
[osd]
osd data = /var/ceph/osd.$id
osd journal size = 15000
osd mkfs type = xfs
osd mkfs options xfs = -i size=2048 -f
osd mount options xfs = 
rw,noexec,nodev,noatime,nodiratime,barrier=0,inode64,logbufs=8,logbsize=256k
osd max backfills = 5
osd recovery max active = 3

[osd.0]
host = pbnerbd01
public addr = 10.100.96.10
cluster addr = 10.100.128.10
osd journal = 
/dev/disk/by-id/scsi-36b8ca3a0eaa2660019deaf8d3a40bec4-part1
devs = /dev/sda4
/end
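
For reference, the backfill and recovery settings above can also be adjusted at 
runtime while a node rejoins; a rough sketch with purely illustrative values:

# see which requests are currently blocked and on which OSDs
ceph health detail
# temporarily throttle backfill/recovery on all OSDs (illustrative values)
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'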

Thanks,
Quenten

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD clone for OpenStack Nova ephemeral volumes

2014-03-20 Thread Josh Durgin

On 03/20/2014 02:07 PM, Dmitry Borodaenko wrote:

The patch series that implemented the clone operation for RBD-backed
ephemeral volumes in Nova did not make it into Icehouse. We have tried
our best to help it land, but it was ultimately rejected. Furthermore,
an additional requirement was imposed to make this patch series
dependent on full support of Glance API v2 across Nova (due to its
dependency on direct_url that was introduced in v2).

You can find the most recent discussion of this patch series in the
FFE (feature freeze exception) thread on openstack-dev ML:
http://lists.openstack.org/pipermail/openstack-dev/2014-March/029127.html

As I explained in that thread, I believe this feature is essential for
using Ceph as a storage backend for Nova, so I'm going to try and keep
it alive outside of OpenStack mainline until it is allowed to land.

I have created rbd-ephemeral-clone branch in my nova repo fork on GitHub:
https://github.com/angdraug/nova/tree/rbd-ephemeral-clone

I will keep it rebased over nova master, and will create an
rbd-ephemeral-clone-stable-icehouse branch to track the same patch series
over nova stable/icehouse once it's branched. I also plan to make sure
that this patch series is included in Mirantis OpenStack 5.0 which
will be based on Icehouse.

If you're interested in this feature, please review and test. Bug
reports and patches are welcome, as long as their scope is limited to
this patch series and not applicable to mainline OpenStack.
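
A rough sketch of pulling the branch in for testing (assuming the GitHub mirror of 
nova as origin and an arbitrary remote name for the fork above):

git clone https://github.com/openstack/nova.git
cd nova
git remote add angdraug https://github.com/angdraug/nova.git
git fetch angdraug
git checkout -b rbd-ephemeral-clone angdraug/rbd-ephemeral-clone
# the branch is kept rebased over master, so picking up updates needs a hard reset:
git fetch angdraug && git reset --hard angdraug/rbd-ephemeral-clone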


Thanks for taking this on, Dmitry! Having rebased those patches many
times during Icehouse, I can tell you it's often not trivial.

Do you think the imagehandler-based approach is best for Juno? I'm
leaning towards the older way [1] for simplicity of review, and to
avoid using glance's v2 api by default. I doubt that full support for
v2 will land very fast in nova, although I'd be happy to be proven
wrong.

Josh

[1] https://review.openstack.org/#/c/46879/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS crash when client goes to sleep

2014-03-20 Thread hjcho616
When CephFS is mounted on a client and the client goes to sleep, the MDS 
segfaults.  Has anyone seen this?  Below is a part of the MDS log.  This happened 
with emperor and with the recent 0.77 release.  I am running Debian Wheezy with a 
3.13 kernel from testing.  What can I do to keep the whole system from crashing when 
a client goes to sleep (and it looks like a disconnect may do the same)?  Let me know 
if you need any more info.
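
In case it helps, a minimal sketch of how I could capture more MDS detail (assuming 
the default log and admin socket paths; the debug levels are illustrative):

# in ceph.conf on the MDS host, then restart the mds:
[mds]
    debug mds = 20
    debug ms = 1

# or raise the levels on the running daemon via its admin socket
# (assumes a single mds socket at the default path):
ceph --admin-daemon /var/run/ceph/ceph-mds.*.asok config set debug_mds 20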

Regards,
Hong

   -43 2014-03-20 20:08:42.463357 7fee3f0cf700  1 -- 192.168.1.20:6801/17079 
--> 192.168.1.20:6789/0 -- mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2 
-- ?+0 0x1ee9f080 con 0x2e56580
   -42 2014-03-20 20:08:42.463787 7fee411d4700  1 -- 192.168.1.20:6801/17079 
<== mon.0 192.168.1.20:6789/0 21764 ==== mdsbeacon(6798/MDS1.2 up:active seq 
21120 v6970) v2 ==== 108+0+0 (266728949 0 0) 0x1ee88dc0 con 0x2e56580
   -41 2014-03-20 20:08:43.373099 7fee3f0cf700  2 mds.0.cache 
check_memory_usage total 665384, rss 503156, heap 24656, malloc 463874 mmap 0, 
baseline 16464, buffers 0, max 1048576, 0 / 62380 inodes have caps, 0 caps, 0 
caps per inode
   -40 2014-03-20 20:08:44.494963 7fee3d7c4700  1 -- 192.168.1.20:6801/17079 
>> :/0 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept sd=18 
192.168.1.101:52026/0
   -39 2014-03-20 20:08:44.495033 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 
c=0x1f0e2160).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 
192.168.1.101:52026/0)
   -38 2014-03-20 20:08:44.495565 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 
c=0x1f0e2160).accept we reset (peer sent cseq 2), sending RESETSESSION
   -37 2014-03-20 20:08:44.496015 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=4 pgs=0 cs=0 l=0 
c=0x1f0e2160).fault 0: Success
   -36 2014-03-20 20:08:44.496099 7fee411d4700  5 mds.0.35 ms_handle_reset on 
192.168.1.101:0/2113152127
   -35 2014-03-20 20:08:44.496120 7fee411d4700  3 mds.0.35 ms_handle_reset 
closing connection for session client.6019 192.168.1.101:0/2113152127
   -34 2014-03-20 20:08:44.496207 7fee411d4700  1 -- 192.168.1.20:6801/17079 
mark_down 0x1f0e2160 -- pipe dne
   -33 2014-03-20 20:08:44.653628 7fee3d7c4700  1 -- 192.168.1.20:6801/17079 
>> :/0 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept sd=18 
192.168.1.101:52027/0
   -32 2014-03-20 20:08:44.653677 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 
c=0x1f0e22c0).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 
192.168.1.101:52027/0)
   -31 2014-03-20 20:08:44.925618 7fee411d4700  1 -- 192.168.1.20:6801/17079 
<== client.6019 192.168.1.101:0/2113152127 1 ==== client_reconnect(77349 caps) 
v2 ==== 0+0+11032578 (0 0 3293767716) 0x2e92780 con 0x1f0e22c0
   -30 2014-03-20 20:08:44.925682 7fee411d4700  1 mds.0.server  no longer in 
reconnect state, ignoring reconnect, sending close
   -29 2014-03-20 20:08:44.925735 7fee411d4700  0 log [INF] : denied reconnect 
attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 
2014-03-20 20:08:44.925679 (allowed interval 45)
   -28 2014-03-20 20:08:44.925748 7fee411d4700  1 -- 192.168.1.20:6801/17079 
--> 192.168.1.101:0/2113152127 -- client_session(close) v1 -- ?+0 0x3ea6540 con 
0x1f0e22c0
   -27 2014-03-20 20:08:44.927727 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 
c=0x1f0e22c0).reader couldn't read tag, Success
   -26 2014-03-20 20:08:44.927797 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 
c=0x1f0e22c0).fault 0: Success
   -25 2014-03-20 20:08:44.927849 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 
>> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 
c=0x1f0e22c0).fault, server, going to standby
   -24 2014-03-20 20:08:46.372279 7fee401d2700 10 monclient: tick
   -23 2014-03-20 20:08:46.372339 7fee401d2700 10 monclient: 
_check_auth_rotating have uptodate secrets (they expire after 2014-03-20 
20:08:16.372333)
   -22 2014-03-20 20:08:46.372373 7fee401d2700 10 monclient: renew subs? (now: 
2014-03-20 20:08:46.372372; renew after: 2014-03-20 20:09:56.370811) -- no
   -21 2014-03-20 20:08:46.372403 7fee401d2700 10  log_queue is 1 last_log 2 
sent 1 num 1 unsent 1 sending 1
   -20 2014-03-20 20:08:46.372421 7fee401d2700 10  will send 2014-03-20 
20:08:44.925741 mds.0 192.168.1.20:6801/17079 2 : [INF] denied reconnect 
attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 
2014-03-20 20:08:44.925679 (allowed interval 45)
   -19 2014-03-20 20:08:46.372466 7fee401d2700 10 monclient: _send_mon_message 
to mon.MDS1 at 192.168.1.20:6789/0
   -18 2014-03-20 20:08:46.372483 7fee401d2700  1 -- 192.168.1.20:6801/17079 
-- 

Re: [ceph-users] MDS crash when client goes to sleep

2014-03-20 Thread Mohd Bazli Ab Karim
Hi Hong,
May I know what has happened to your MDS once it crashed? Was it able to 
recover from replay?
We are also facing this issue and I am interested to know how to reproduce it.

Thanks.
Bazli

From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of hjcho616
Sent: Friday, March 21, 2014 10:29 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] MDS crash when client goes to sleep


Re: [ceph-users] MDS crash when client goes to sleep

2014-03-20 Thread Luke Jing Yuan
Did you see any messages in dmesg about ceph-mds respawning or anything like 
that?

Regards,
Luke

On Mar 21, 2014, at 11:09 AM, hjcho616 hjcho...@yahoo.com wrote:

On the client, I was no longer able to access the filesystem.  It would hang.  
Makes sense, since the MDS had crashed.  I tried running 3 MDS daemons on the same 
machine.  Two crashed and one appears to be hung up(?). ceph health says the MDS is 
in a degraded state when that happens.

I was able to recover by restarting every node.  I currently have three 
machines: one with the MDS and MON, and two with OSDs.

It is failing every time my client machine goes to sleep.  If you need me to run 
something, let me know what and how.

Regards,
Hong



Re: [ceph-users] MDS crash when client goes to sleep

2014-03-20 Thread hjcho616
Nope, just these segfaults.

[149884.709608] ceph-mds[17366]: segfault at 200 ip 7f09de9d60b8 sp 
7f09db461520 error 4 in libgcc_s.so.1[7f09de9c7000+15000]
[211263.265402] ceph-mds[17135]: segfault at 200 ip 7f59eec280b8 sp 
7f59eb6b3520 error 4 in libgcc_s.so.1[7f59eec19000+15000]
[214638.927759] ceph-mds[16896]: segfault at 200 ip 7fcb2c89e0b8 sp 
7fcb29329520 error 4 in libgcc_s.so.1[7fcb2c88f000+15000]
[289338.461271] ceph-mds[20878]: segfault at 200 ip 7f4b7211c0b8 sp 
7f4b6eba7520 error 4 in libgcc_s.so.1[7f4b7210d000+15000]
[373738.961475] ceph-mds[21341]: segfault at 200 ip 7f36c3d480b8 sp 
7f36c07d3520 error 4 in libgcc_s.so.1[7f36c3d39000+15000]
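
A rough sketch of how I could turn one of these into a usable backtrace (assuming 
core dumps are enabled in the environment that launches ceph-mds and matching debug 
symbols are installed, which varies by build; the core path is a placeholder):

ulimit -c unlimited                   # in the environment that starts ceph-mds, then reproduce the crash
gdb /usr/bin/ceph-mds /path/to/core   # core file location is system-specific
(gdb) thread apply all bt             # full backtrace of all threads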

Regards,
Hong


