[ceph-users] Wrong PG information after increasing pg_num
Hello all, I am testing a cluster with mixed OSD types on the same data node (yes, it's the idea from: http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/), and have run into a strange status: ceph -s or ceph pg dump shows incorrect PG information after setting pg_num on a pool which is using a different ruleset to select faster OSDs. Please advise what's wrong and whether I can fix the issue without recreating the pool with the final pg_num directly. Some more detail:

1) update the crushmap to have a different ruleset that selects different OSDs, like this:

rule replicated_ruleset_ssd {
    ruleset 50
    type replicated
    min_size 1
    max_size 10
    step take sdd
    step chooseleaf firstn 0 type host
    step emit
}

2) create a new pool and set crush_ruleset to use this new rule:

$ ceph osd pool create ssd 64 64 replicated replicated_ruleset_ssd
(however after this command it's still using the default ruleset 0)
$ ceph osd pool set ssd crush_ruleset 50

3) it looks good now:

$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1574 flags hashpspool stripe_width 0

$ ceph -s
    cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
     health HEALTH_OK
     monmap e2: 3 mons at {DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0}, election epoch 84, quorum 0,1,2 DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
     osdmap e1578: 21 osds: 15 up, 15 in
      pgmap v560681: 1472 pgs, 5 pools, 285 GB data, 73352 objects
            80151 MB used, 695 GB / 779 GB avail
                1472 active+clean

4) increase pg_num and pgp_num, but the total PG count is still 1472 in ceph -s:

$ ceph osd pool set ssd pg_num 128
set pool 9 pg_num to 128
$ ceph osd pool set ssd pgp_num 128
set pool 9 pgp_num to 128

$ ceph osd dump | grep pool
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 50 flags hashpspool stripe_width 0
pool 8 'xfs' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 1570 flags hashpspool stripe_width 0
pool 9 'ssd' replicated size 3 min_size 2 crush_ruleset 50 object_hash rjenkins pg_num 128 pgp_num 128 last_change 1581 flags hashpspool stripe_width 0

$ ceph -s
    cluster 5f8ae2a8-f143-42d9-b50d-246ac0874569
     health HEALTH_OK
     monmap e2: 3 mons at {DEV-rhel7-vildn1=10.0.2.156:6789/0,DEV-rhel7-vildn2=10.0.2.157:6789/0,DEV-rhel7-vildn3=10.0.2.158:6789/0}, election epoch 84, quorum 0,1,2 DEV-rhel7-vildn1,DEV-rhel7-vildn2,DEV-rhel7-vildn3
     osdmap e1582: 21 osds: 15 up, 15 in
      pgmap v560709: 1472 pgs, 5 pools, 285 GB data, 73352 objects
            80158 MB used, 695 GB / 779 GB avail
                1472 active+clean

5) same problem with pg dump:

$ ceph pg dump | grep '^9\.' | wc
dumped all in format plain
     64    1472   10288

6) it looks like the PGs are created under the /var/lib/ceph/osd/ceph-osd/current folder:

$ ls -ld /var/lib/ceph/osd/ceph-15/current/9.* | wc
     74     666    6133
$ ls -ld /var/lib/ceph/osd/ceph-16/current/9.* | wc
     54     486    4475

6 OSDs for this ruleset = 128 * 3 / 6 ~= 64

Thanks a lot

BR,
Luke Kao
MYCOM-OSI
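For reference, a quick way to cross-check what the monitors think the pool's pg_num is against how many of that pool's PGs actually show up in the PG map is something like the following (a sketch using standard ceph CLI calls; the pool id 9 is taken from the osd dump above, and exact dump formatting varies by release):

$ ceph osd pool get ssd pg_num
$ ceph pg dump pgs_brief 2>/dev/null | awk '/^9\./ {n++} END {print n}'

If the second number stays at 64 while pg_num reports 128, the pgmap the monitors are reporting has not caught up with the pool change.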
[ceph-users] rados-java issue tracking and release
hi, does anyone know who is maintaining rados-java and performing releases to Maven central? In May, there was a release to Maven central [1], but the released version is not based on the latest code base from: https://github.com/ceph/rados-java

I wonder if whoever does the Maven release could tag a version and release the current snapshot.

Besides, I am not sure the rados-java developers will notice issues reported in the Ceph issue tracker. Would it be better if the rados-java project enabled issue tracking on GitHub? thx

[1] http://search.maven.org/#artifactdetails%7Ccom.ceph%7Crados%7C0.1.4%7Cjar

regards, mingfai
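For anyone who wants to check what is actually published, the coordinates behind [1] (com.ceph:rados:0.1.4) can be pulled into the local repository with a plain Maven call along these lines (a sketch; it assumes a maven-dependency-plugin version where the -Dartifact shorthand resolves from Central on its own):

$ mvn dependency:get -Dartifact=com.ceph:rados:0.1.4

Comparing the classes in that jar against the current GitHub HEAD is one way to confirm how far behind the published release is.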
Re: [ceph-users] rados-java issue tracking and release
Hi,

On 14-07-15 11:05, Mingfai wrote: hi, does anyone know who is maintaining rados-java and performing releases to Maven central? In May, there was a release to Maven central [1], but the released version is not based on the latest code base from: https://github.com/ceph/rados-java I wonder if whoever does the Maven release could tag a version and release the current snapshot.

From the CloudStack project Laszlo pushed it to Maven central with my permission, but it seems he used a different source than the one on GitHub. CC'ing him in case he knows which source he used.

Besides, I am not sure the rados-java developers will notice issues reported in the Ceph issue tracker. Would it be better if the rados-java project enabled issue tracking on GitHub? thx

I have to be honest that I simply forgot to look at the outstanding issues. Any help is more than appreciated since I don't have the time to look at them. Always feel free to send in a pull request on GitHub: https://github.com/ceph/rados-java/pulls If it fixes an issue, please add that in the git commit message.

Wido

[1] http://search.maven.org/#artifactdetails%7Ccom.ceph%7Crados%7C0.1.4%7Cjar regards, mingfai
Re: [ceph-users] xattrs vs omap
On Tue, Jul 14, 2015 at 10:53 AM, Jan Schermer j...@schermer.cz wrote: Thank you for your reply. Comments inline. I’m still hoping to get some more input, but there are many people running ceph on ext4, and it sounds like it works pretty good out of the box. Maybe I’m overthinking this, then?

I think so — somebody did a lot of work making sure we were well-tuned on the standard filesystems; I believe it was David.
-Greg

Jan

On 13 Jul 2015, at 21:04, Somnath Roy somnath@sandisk.com wrote: inline

-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: Monday, July 13, 2015 2:32 AM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] xattrs vs omap

Sorry for reviving an old thread, but could I get some input on this, pretty please? ext4 has 256-byte inodes by default (at least according to docs) but the fragment below says: OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512) The default 512b is too much if the inode is just 256b, so shouldn’t that be 256b in case people use the default ext4 inode size? Anyway, is it better to format ext4 with larger inodes (say 2048b) and set filestore_max_inline_xattr_size_other=1536, or leave it at defaults?

[Somnath] Why 1536 ? why not 1024 or any power of 2 ? I am not seeing any harm though, but, curious.

AFAIK there is other information in the inode other than xattrs, also you need to count the xattr labels into this - so if I want to store 1536B of “values” it would cost more, and there still needs to be some space left. (As I understand it, on ext4 xattrs are limited to one block, inode size + something can spill to one different inode - maybe someone knows better).

[Somnath] The xattr size (_) is now more than 256 bytes and it will spill over, so, a bigger inode size will be good. But, I would suggest you do your benchmark before putting it into production.

Good point and I am going to do that, but I’d like to avoid the guesswork. Also, not all patterns are always replicable….

Is filestore_max_inline_xattr_size an absolute limit, or is it filestore_max_inline_xattr_size*filestore_max_inline_xattrs in reality?

[Somnath] The *_size is tracking the xattr size per attribute and *inline_xattrs keeps track of the max number of inline attributes allowed. So, if an xattr size is > *_size, it will go to omap, and also if the total number of xattrs is > *inline_xattrs, it will go to omap. If you are only using rbd, the number of inline xattrs will always be 2 and it will not cross that default max limit.

If I’m reading this correctly then with my setting of filestore_max_inline_xattr_size_other=1536, it could actually consume 3072B (2 xattrs), so I should in reality use 4K inodes…? Does the OSD do the sane thing if for some reason the xattrs do not fit? What are the performance implications of storing the xattrs in leveldb?

[Somnath] Even though I don't have the exact numbers, it has a significant overhead if the xattrs go to leveldb.

And lastly - what size of xattrs should I really expect if all I use is RBD for OpenStack instances? (No radosgw, no cephfs, but heavy on rbd image and pool snapshots). This overhead is quite large

[Somnath] It will be 2 xattrs, the default _ will be a little bigger than 256 bytes and _snapset is small, it depends on the number of snaps/clones, but it is unlikely to cross the 256 bytes range.

I have a few pool snapshots and lots (hundreds) of (nested) snapshots for rbd volumes. Does this come into play somehow?

My plan so far is to format the drives like this: mkfs.ext4 -I 2048 -b 4096 -i 524288 -E stride=32,stripe-width=256 (2048b inode, 4096b block size, one inode per 512k of space) and set filestore_max_inline_xattr_size_other=1536

[Somnath] Not much idea on ext4, sorry..

Does that make sense? Thanks!

Jan

On 02 Jul 2015, at 12:18, Jan Schermer j...@schermer.cz wrote: Does anyone have a known-good set of parameters for ext4? I want to try it as well but I’m a bit worried what happens if I get it wrong. Thanks Jan

On 02 Jul 2015, at 09:40, Nick Fisk n...@fisk.me.uk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: 02 July 2015 02:23 To: Ceph Users Subject: Re: [ceph-users] xattrs vs omap

On Thu, 2 Jul 2015 00:36:18 +0000 Somnath Roy wrote: It is replaced with the following config option..

// Use omap for xattrs for attrs over // filestore_max_inline_xattr_size or
OPTION(filestore_max_inline_xattr_size, OPT_U32, 0) //Override
OPTION(filestore_max_inline_xattr_size_xfs, OPT_U32, 65536)
OPTION(filestore_max_inline_xattr_size_btrfs, OPT_U32, 2048)
OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
// for more than filestore_max_inline_xattrs attrs
OPTION(filestore_max_inline_xattrs, OPT_U32, 0) //Override
OPTION(filestore_max_inline_xattrs_xfs, OPT_U32,
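If Jan's proposed tuning were actually applied, it would presumably end up in ceph.conf as filestore overrides along these lines (a sketch only - the values are the ones discussed above for ext4, not recommended defaults, and the _other variants only affect non-XFS/btrfs filestores):

[osd]
    filestore_max_inline_xattr_size_other = 1536
    filestore_max_inline_xattrs_other = 2

The _other variants are the ones that apply to ext4, per the OPTION list quoted above.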
Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall
Hi,

The output of ceph -s:

    cluster 50961297-815c-4598-8efe-5e08203f9fea
     health HEALTH_OK
     monmap e5: 5 mons at {pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0}, election epoch 258, quorum 0,1,2,3,4 pshn05,pshn06,pshn13,psosctl111,psosctl112
     mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby
     osdmap e21319: 16 osds: 16 up, 16 in
      pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects
            9940 GB used, 10170 GB / 21187 GB avail
                 384 active+clean

I don't use any ceph client (kernel or fuse) on the same nodes that run osd/mon/mds daemons. Yes, I see slow operation warnings from time to time when I'm looking at ceph -w. The number of iops on the servers isn't that high and I think the write-back cache of the RAID controller should be able to help with the journal ops.

Simion Rad.

From: Gregory Farnum [g...@gregs42.com] Sent: Tuesday, July 14, 2015 12:38 To: Simion Rad Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall

On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote: Hi, I'm running a small cephFS cluster (21 TB, 16 OSDs having different sizes between 400G and 3.5 TB) that is used as a file warehouse (both small and big files). Every day there are times when a lot of processes running on the client servers (using either the fuse or the kernel client) become stuck in D state, and when I run a strace on them I see them waiting in a FUTEX_WAIT syscall. The same issue I'm able to see on all OSD daemons. The ceph version I'm running is Firefly 0.80.10 both on clients and on server daemons. I use ext4 as the osd filesystem. Operating system on servers: Ubuntu 14.04 and kernel 3.13. Operating system on clients: Ubuntu 12.04 LTS with the HWE option, kernel 3.13. The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on a Dell PERC H700 RAID controller with 512MB BBU using write-back mode). The servers which the ceph daemons are running on are also hosting KVM VMs (OpenStack Nova). Because of this unfortunate setup the performance is really bad, but at least I shouldn't see as many locking issues (or should I?). The only thing which temporarily improves the performance is restarting every osd. After such a restart I see some processes on client machines resume I/O, but only for a couple of hours, then the whole process must be repeated. I cannot afford to run a setup without RAID because there isn't enough RAM left for a couple of osd daemons.

The ceph.conf settings I use:
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
filestore xattr use omap = true
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 128
osd pool default pgp num = 128
public network = 10.71.13.0/24
cluster network = 10.71.12.0/24

Did someone else experience this kind of behaviour (stuck processes in a FUTEX_WAIT syscall) when running the firefly release on Ubuntu 14.04?

What's the output of ceph -s on your cluster? When your clients get stuck, is the cluster complaining about stuck ops on the OSDs? Are you running kernel clients on the same boxes as your OSDs? If I were to guess I'd imagine that you might just have overloaded your cluster and the FUTEX_WAIT is the clients waiting for writes to get acknowledged, but if restarting the OSDs brings everything back up for a few hours that might not be the case.
-Greg
Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall
On Tue, Jul 14, 2015 at 11:30 AM, Simion Rad simion@yardi.com wrote: Hi , The output of ceph -s : cluster 50961297-815c-4598-8efe-5e08203f9fea health HEALTH_OK monmap e5: 5 mons at {pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0}, election epoch 258, quorum 0,1,2,3,4 pshn05,pshn06,pshn13,psosctl111,psosctl112 mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby osdmap e21319: 16 osds: 16 up, 16 in pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects 9940 GB used, 10170 GB / 21187 GB avail 384 active+clean I don't use any ceph client (kernel or fuse) on the same nodes that run osd/mon/mds daemons. Yes, I see slow operations warnings from time to time when I'm looking at ceph -w. Yeah, I think this is just it — especially if you've got some OSDs which are 9 times larger than others, the load will disproportionately go to them and they probably can't take it. The next time things get stuck you can look at the admin socket on the ceph-fuse machines and dump_ops_in_flight and see if any of them are very old, and which OSDs they're targeted at. (You can get similar information out of the kernel clients by cat'ing the files in /sys/kernel/debug/ceph/*/.) -Greg The number of iops on the servers aren't that high and I think the write-back cache of the RAID controller sould be able to help with the journal ops. Simion Rad. From: Gregory Farnum [g...@gregs42.com] Sent: Tuesday, July 14, 2015 12:38 To: Simion Rad Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote: Hi , I'm running a small cephFS ( 21 TB , 16 OSDs having different sizes between 400G and 3.5 TB ) cluster that is used as a file warehouse (both small and big files). Every day there are times when a lot of processes running on the client servers ( using either fuse of kernel client) become stuck in D state and when I run a strace of them I see them waiting in FUTEX_WAIT syscall. The same issue I'm able to see on all OSD demons. The ceph version I'm running is Firefly 0.80.10 both on clients and on server daemons. I use ext4 as osd filesystem. Operating system on servers : Ubuntu 14.04 and kernel 3.13. Operaing system on clients : Ubuntu 12.04 LTS with HWE option kernel 3.13 The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on RAID controller Dell PERC H700 with 512MB BBU using write-back mode). The servers which the ceph daemons are running on are also hosting KVM VMs ( OpenStack Nova ). Because of this unfortunate setup the performance is really bad, but at least I shouldn't see as many locking issues (or shoud I ? ). The only thing which temporarily improves the performance is restarting every osd. After such a restart I see some processes on client machines resume I/O but only for a couple of hours, then the whole process must be repeated. I cannot afford to run a setup without RAID because there isn't enough RAM left for a couple of osd daemons. 
The ceph.conf settings I use : auth cluster required = cephx auth service required = cephx auth client required = cephx filestore xattr use omap = true osd pool default size = 2 osd pool default min size = 1 osd pool default pg num = 128 osd pool default pgp num = 128 public network = 10.71.13.0/24 cluster network = 10.71.12.0/24 Did someone else experienced this kind of behaviour (stuck processes in FUTEX_WAIT syscall) when running firefly release on Ubuntu 14.04 ? What's the output of ceph -s on your cluster? When your clients get stuck, is the cluster complaining about stuck ops on the OSDs? Are you running kernel clients on the same boxes as your OSDs? If I were to guess I'd imagine that you might just have overloaded your cluster and the FUTEX_WAIT is the clients waiting for writes to get acknowledged, but if restarting the OSDs brings everything back up for a few hours that might not be the case. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
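Assuming default admin socket locations, the checks Greg describes translate into something like the following on a ceph-fuse client and a kernel client respectively (a sketch; the .asok name varies with the client name and pid, and the debugfs files only exist where the kernel client is mounted):

$ ceph --admin-daemon /var/run/ceph/ceph-client.admin.*.asok dump_ops_in_flight
$ cat /sys/kernel/debug/ceph/*/osdc

Very old ops in dump_ops_in_flight, or requests that stay listed in osdc, point at the OSDs that are not keeping up.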
Re: [ceph-users] xattrs vs omap
Instead of guessing I took a look at one of my OSDs. TL;DR: I’m going to bump the inode size to 512 which should fit majority of xattrs, no need to touch filestore parameters. Short news first - I can’t find a file with more than 2 xattrs. (and that’s good) Then I extracted all the xattrs on all the ~100K files, counted their size and counted the occurences. The largest xattrs I have are 705 chars in base64 (so let’s say it’s half), and that particular file has about 512B total in xattr (that’s more than was expected with RBD-only workload, right?) # file: var/lib/ceph/osd/ceph-55//current/4.1ad7_head/rbd134udata.1a785181f15746a.0005a578__head_E5C51AD7__4 117 user.ceph._=0sCwjyBANKACkAAAByYmRfZGF0YS4xYTc4NTE4MWYxNTc0NmEuMDAwMDAwMDAwMDA1YTU3OP7/1xrF5QAABAAFAxQEAP8AAADrEKMAADB2DQAiDaMA AG11DQACAhUI1xSoAQD9CwAMAEAAABAgpFWoa6QVAgIV6xCjAAAwdg0= 347 user.ceph.snapset=0sAgL5AQAAgt8HAAABBgAAAILfBwAAb94HAAC23AcAAEnPBwAA470HAAB4ugcAAAQAAAC1ugcAAOO9BwAAStAHAACC3wcAAAQAAAC1ugcAAAQAAABQFGAUwAowHwAAAJAZ4DggBwAA470HAAAFEA8gDwAAACAFSBQAAABADgAAAJAioAI4JQAAAMgaAABK0AcAAAQAAADgAQAAAOgBeCYAAACAKHAAACkAFwAAgt8HAAAFoAEAAADAAQAAAIAMUA4QBgAAAIAU4ACAFQAAAIAqAAAEtboHAEAAAOO9BwBAAABK0AcAQAAAgt8HAE== 705 (If anyone wants to enlighten me on the contents that would be great - is this expected to grow much?) BUT most of the files have much smaller xattrs, and if I researched it correctly it seems ext4 uses free space in inode (which should be something like inode_size-128-28=free) and if that’s not enough it will allocate one more block. In other words, if I format ext4 with 2048 inode size and 4096 block size, there will be 2048-(128+28)=1892 bytes available in the inode, and 4096 bytes can be allocated from another block. With default format, there will be just 256-(128+28)=100 bytes in the inode + 4096 bytes in another block. In my case, majority of the files have xattr size 200B, which is larger than fits inside one inode, but not really that large, so it should be beneficial to bump the inode size to 512B (that leaves plenty of 356 bytes for xattrs). Jan On 14 Jul 2015, at 12:18, Gregory Farnum g...@gregs42.com wrote: On Tue, Jul 14, 2015 at 10:53 AM, Jan Schermer j...@schermer.cz wrote: Thank you for your reply. Comments inline. I’m still hoping to get some more input, but there are many people running ceph on ext4, and it sounds like it works pretty good out of the box. Maybe I’m overthinking this, then? I think so — somebody did a lot of work making sure we were well-tuned on the standard filesystems; I believe it was David. -Greg Jan On 13 Jul 2015, at 21:04, Somnath Roy somnath@sandisk.com wrote: inline -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan Schermer Sent: Monday, July 13, 2015 2:32 AM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] xattrs vs omap Sorry for reviving an old thread, but could I get some input on this, pretty please? ext4 has 256-byte inodes by default (at least according to docs) but the fragment below says: OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512) The default 512b is too much if the inode is just 256b, so shouldn’t that be 256b in case people use the default ext4 inode size? Anyway, is it better to format ext4 with larger inodes (say 2048b) and set filestore_max_inline_xattr_size_other=1536, or leave it at defaults? [Somnath] Why 1536 ? why not 1024 or any power of 2 ? I am not seeing any harm though, but, curious. 
AFAIK there is other information in the inode other than xattrs, also you need to count the xattra labels into this - so if I want to store 1536B of “values” it would cost more, and there still needs to be some space left. (As I understand it, on ext4 xattrs ale limited to one block, inode size + something can spill to one different inode - maybe someone knows better). [Somnath] The xttr size (_) is now more than 256 bytes and it will spill over, so, bigger inode size will be good. But, I would suggest do your benchmark before putting it into production. Good poin and I am going to do that, but I’d like to avoid the guesswork. Also, not all patterns are always replicable…. Is filestore_max_inline_xattr_size and absolute limit, or is it filestore_max_inline_xattr_size*filestore_max_inline_xattrs in reality? [Somnath] The *_size is tracking the
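For anyone who wants to reproduce this kind of inspection on their own OSDs, the xattrs of a filestore object file can be dumped with getfattr; the 0s-prefixed values above are what its base64 encoding looks like (a sketch; the object path is just an example taken from the dump above):

$ getfattr -d -m 'user.ceph.' -e base64 /var/lib/ceph/osd/ceph-55/current/4.1ad7_head/<object file>

Summing the value lengths per file, as Jan did, gives the per-object xattr footprint to compare against the usable space in an inode.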
Re: [ceph-users] slow requests going up and down
I don't think there were any stale or unclean PGs (when there are, I have seen health detail list them, and it did not in this case). I have since restarted the 2 osds and the health went immediately to HEALTH_OK.

-- Tom

-Original Message- From: Will.Boege [mailto:will.bo...@target.com] Sent: Monday, July 13, 2015 10:19 PM To: Deneau, Tom; ceph-users@lists.ceph.com Subject: Re: [ceph-users] slow requests going up and down

Does the ceph health detail show anything about stale or unclean PGs, or are you just getting the blocked ops messages?

On 7/13/15, 5:38 PM, Deneau, Tom tom.den...@amd.com wrote: I have a cluster where over the weekend something happened and successive calls to ceph health detail show things like below. What does it mean when the number of blocked requests goes up and down like this? Some clients are still running successfully.

-- Tom Deneau, AMD

HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
20 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
18 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
4 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
2 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
27 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
34 ops are blocked > 536871 sec
9 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests
[ceph-users] Confusion in Erasure Code benchmark app
Hi All, I am trying to debug ceph_erasure_code_benchmark_app available in ceph repo. using cauchy_good technique. I am running gdb using following command: src/ceph_erasure_code_benchmark --plugin jerasure_neon --workload encode --iterations 10 --size 1048576 --parameter k=6 --parameter m=2 --parameter directory=src/.libs --parameter packetsize=3072 --parameter technique=cauchy_good My confusion here is why underlying GF(32) function galois_w32_region_xor() is called even if the parameter value of w passed in jerasure_schedule_encode() is 8. According to me since GF(8) is passed in jerasure_schedule_encode() (with parameter w==8) then underlying gf function galois_w8_region_xor() should have been called instead of GF(32) function galois_w32_region_xor Thanks in advance Nitin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] strange issues after upgrading to SL6.6 and latest kernel
Hi,

This reminds me of when a buggy leveldb package slipped into the ceph repos (http://tracker.ceph.com/issues/7792). Which version of leveldb do you have installed?

Cheers, Dan

On Tue, Jul 14, 2015 at 3:39 PM, Barry O'Rourke barry.o'rou...@ed.ac.uk wrote: Hi, I managed to destroy my development cluster yesterday after upgrading it to Scientific Linux and kernel 2.6.32-504.23.4.el6.x86_64. Upon rebooting, the development node hung whilst attempting to start the monitor. It was still in the same state after being left overnight to see if it would time out. I decided to start from scratch to see if I could recreate the issue on a clean install. I've followed both the quick install and manual install guides on the wiki and always see the following error whilst creating the initial monitor. https://gist.github.com/barryorourke/47b0a988d38a817afb5b#file-gistfile1-txt Has anyone seen anything similar? Regards, Barry
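On an RPM-based distribution such as Scientific Linux, the installed leveldb version can be checked with the package manager, e.g.:

$ rpm -q leveldb
$ yum list installed | grep -i leveldb

Comparing that against the version discussed in the tracker issue above should show whether this is the same regression.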
Re: [ceph-users] Confusion in Erasure Code benchmark app
Hi, I've observed the same thing but never spent time to figure that out. It would be nice to know. I don't think it's a bug, just something slightly confusing. Cheers

On 14/07/2015 14:52, Nitin Saxena wrote: Hi All, I am trying to debug ceph_erasure_code_benchmark_app available in ceph repo. using cauchy_good technique. I am running gdb using following command: src/ceph_erasure_code_benchmark --plugin jerasure_neon --workload encode --iterations 10 --size 1048576 --parameter k=6 --parameter m=2 --parameter directory=src/.libs --parameter packetsize=3072 --parameter technique=cauchy_good My confusion here is why underlying GF(32) function galois_w32_region_xor() is called even if the parameter value of w passed in jerasure_schedule_encode() is 8. According to me since GF(8) is passed in jerasure_schedule_encode() (with parameter w==8) then underlying gf function galois_w8_region_xor() should have been called instead of GF(32) function galois_w32_region_xor Thanks in advance Nitin

-- Loïc Dachary, Artisan Logiciel Libre
Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall
I'll consider looking into more detail at the slow OSDs. Thank you, Simion Rad. From: Gregory Farnum [g...@gregs42.com] Sent: Tuesday, July 14, 2015 13:42 To: Simion Rad Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall On Tue, Jul 14, 2015 at 11:30 AM, Simion Rad simion@yardi.com wrote: Hi , The output of ceph -s : cluster 50961297-815c-4598-8efe-5e08203f9fea health HEALTH_OK monmap e5: 5 mons at {pshn05=10.71.13.5:6789/0,pshn06=10.71.13.6:6789/0,pshn13=10.71.13.13:6789/0,psosctl111=10.71.13.111:6789/0,psosctl112=10.71.13.112:6789/0}, election epoch 258, quorum 0,1,2,3,4 pshn05,pshn06,pshn13,psosctl111,psosctl112 mdsmap e173: 1/1/1 up {0=pshn17=up:active}, 4 up:standby osdmap e21319: 16 osds: 16 up, 16 in pgmap v3301189: 384 pgs, 3 pools, 4906 GB data, 3794 kobjects 9940 GB used, 10170 GB / 21187 GB avail 384 active+clean I don't use any ceph client (kernel or fuse) on the same nodes that run osd/mon/mds daemons. Yes, I see slow operations warnings from time to time when I'm looking at ceph -w. Yeah, I think this is just it — especially if you've got some OSDs which are 9 times larger than others, the load will disproportionately go to them and they probably can't take it. The next time things get stuck you can look at the admin socket on the ceph-fuse machines and dump_ops_in_flight and see if any of them are very old, and which OSDs they're targeted at. (You can get similar information out of the kernel clients by cat'ing the files in /sys/kernel/debug/ceph/*/.) -Greg The number of iops on the servers aren't that high and I think the write-back cache of the RAID controller sould be able to help with the journal ops. Simion Rad. From: Gregory Farnum [g...@gregs42.com] Sent: Tuesday, July 14, 2015 12:38 To: Simion Rad Cc: ceph-us...@ceph.com Subject: Re: [ceph-users] ceph daemons stucked in FUTEX_WAIT syscall On Mon, Jul 13, 2015 at 11:00 PM, Simion Rad simion@yardi.com wrote: Hi , I'm running a small cephFS ( 21 TB , 16 OSDs having different sizes between 400G and 3.5 TB ) cluster that is used as a file warehouse (both small and big files). Every day there are times when a lot of processes running on the client servers ( using either fuse of kernel client) become stuck in D state and when I run a strace of them I see them waiting in FUTEX_WAIT syscall. The same issue I'm able to see on all OSD demons. The ceph version I'm running is Firefly 0.80.10 both on clients and on server daemons. I use ext4 as osd filesystem. Operating system on servers : Ubuntu 14.04 and kernel 3.13. Operaing system on clients : Ubuntu 12.04 LTS with HWE option kernel 3.13 The osd daemons are using RAID5 virtual disks (6 x 300 GB 10K RPM disks on RAID controller Dell PERC H700 with 512MB BBU using write-back mode). The servers which the ceph daemons are running on are also hosting KVM VMs ( OpenStack Nova ). Because of this unfortunate setup the performance is really bad, but at least I shouldn't see as many locking issues (or shoud I ? ). The only thing which temporarily improves the performance is restarting every osd. After such a restart I see some processes on client machines resume I/O but only for a couple of hours, then the whole process must be repeated. I cannot afford to run a setup without RAID because there isn't enough RAM left for a couple of osd daemons. 
The ceph.conf settings I use : auth cluster required = cephx auth service required = cephx auth client required = cephx filestore xattr use omap = true osd pool default size = 2 osd pool default min size = 1 osd pool default pg num = 128 osd pool default pgp num = 128 public network = 10.71.13.0/24 cluster network = 10.71.12.0/24 Did someone else experienced this kind of behaviour (stuck processes in FUTEX_WAIT syscall) when running firefly release on Ubuntu 14.04 ? What's the output of ceph -s on your cluster? When your clients get stuck, is the cluster complaining about stuck ops on the OSDs? Are you running kernel clients on the same boxes as your OSDs? If I were to guess I'd imagine that you might just have overloaded your cluster and the FUTEX_WAIT is the clients waiting for writes to get acknowledged, but if restarting the OSDs brings everything back up for a few hours that might not be the case. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests going up and down
In my experience I have seen something like this happen twice:

First time there were unclean PGs because Ceph was down to one replica of a PG. When that happens Ceph blocks IO to the remaining replicas when the number falls below the 'min_size' parameter. That will manifest as blocked ops.

Second time the disk was 'soft-failing' - gaining many bad sectors but SMART still reported the drive as OK. Maybe check OSD.5 and OSD.7 for low level media errors with a tool like MegaCli, or whatever controller management tool comes with your hardware.

At any rate, restarting the problem-child OSD is probably troubleshooting step #1, which you have done.

On 7/14/15, 6:45 AM, Deneau, Tom tom.den...@amd.com wrote: I don't think there were any stale or unclean PGs, (when there are, I have seen health detail list them and it did not in this case). I have since restarted the 2 osds and the health went immediately to HEALTH_OK. -- Tom

-Original Message- From: Will.Boege [mailto:will.bo...@target.com] Sent: Monday, July 13, 2015 10:19 PM To: Deneau, Tom; ceph-users@lists.ceph.com Subject: Re: [ceph-users] slow requests going up and down Does the ceph health detail show anything about stale or unclean PGs, or are you just getting the blocked ops messages?

On 7/13/15, 5:38 PM, Deneau, Tom tom.den...@amd.com wrote: I have a cluster where over the weekend something happened and successive calls to ceph health detail show things like below. What does it mean when the number of blocked requests goes up and down like this? Some clients are still running successfully. -- Tom Deneau, AMD

HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
20 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
18 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
4 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
2 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
27 ops are blocked > 536871 sec
2 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests

HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
34 ops are blocked > 536871 sec
9 ops are blocked > 536871 sec on osd.5
25 ops are blocked > 536871 sec on osd.7
2 osds have slow requests
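As a rough sketch of that kind of low-level check (the device name and the MegaCli binary path are placeholders; on many systems it lives under /opt/MegaRAID/MegaCli/), one could look at SMART attributes directly or at the controller's per-drive error counters:

$ smartctl -a /dev/sdX | grep -i -e reallocated -e pending -e uncorrect
$ MegaCli64 -PDList -aALL | grep -i -e 'media error' -e 'predictive'

A drive that SMART still calls healthy but that shows growing media error or pending-sector counts is a good candidate for the kind of soft failure described above.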
[ceph-users] Cluster reliability
I'm trying to understand the real-world reliability of Ceph, to provide some data to our upper management which may also be valuable to others investigating Ceph. Things I'm trying to understand:

1. How many clusters are in production?
2. How long has the cluster(s) been in production?
3. The size of the cluster(s) (# OSDs and TB of raw disk space).
4. Has there been a data loss event?
5. Has there been a data near-loss event (some manual process was required to recover some or all of the cluster data, but not from backup as that would be considered a loss event)?
6. How much data was involved in the loss/near-loss event?
7. How long to recover the cluster to performing all I/O (not necessarily healthy)?
8. What was the root cause of the loss/near-loss event?

From what I've read on the mailing lists, it seems that most data loss events are around a design decision to have 2 or fewer copies combined with multiple drive issues. Others seem to be related to human error, and I'm only aware of one instance where an upgrade caused a data availability issue.

If you would take a few minutes to send me this information, I can summarize the findings and report back to the list.

Thank you,
- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
[ceph-users] Workaround for RHEL/CentOS 7.1 rbdmap service start warnings?
When starting the rbdmap.service to provide map/unmap of rbd devices across boot/shutdown cycles, /etc/init.d/rbdmap includes /lib/lsb/init-functions. This is not a problem except that the rbdmap script makes calls to the log_daemon_*, log_progress_* and log_action_* functions that are included in the Ubuntu 14.04 distros, but are not in the RHEL 7.1/RHCS 1.3 distro. Are there any recommended workarounds for boot time startup on RHEL/CentOS 7.1 clients?
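One possible workaround (a sketch, not an official fix) is to provide minimal stand-ins for the missing Debian/Ubuntu LSB helpers, for example in a small file sourced by the init script before it calls them:

# minimal stand-ins for the LSB logging helpers missing from RHEL's init-functions
log_daemon_msg() { echo -n "$*"; }
log_progress_msg() { echo -n " $*"; }
log_action_msg() { echo "$*"; }

The function bodies only need to echo their arguments so the rbdmap script can run unchanged; a proper systemd unit for rbdmap would be the longer-term fix.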
[ceph-users] SSL for tracker.ceph.com
Hi,

Currently tracker.ceph.com doesn't have SSL enabled. Every time I log in I'm sending my password over plain text, which I'd rather not. Can we get SSL enabled on tracker.ceph.com?

And while we are at it, can we enable IPv6 as well? :)

-- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
[ceph-users] Performance degradation after upgrade to hammer
Hi All,

I've just upgraded a Ceph cluster from Firefly 0.80.8 (Red Hat Ceph 1.2.3) to Hammer (Red Hat Ceph 1.3) - usage: radosgw with Apache 2.4.19 in MPM prefork mode. I'm experiencing a huge write performance degradation just after the upgrade (Cosbench). Have you already run performance tests comparing Hammer and Firefly? No problem with read performance, which was amazing.

Sent from my iPhone
Re: [ceph-users] Performance degradation after upgrade to hammer
On 07/14/2015 06:42 PM, Florent MONTHEL wrote: Hi All, I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM prefork mode I'm experiencing huge write performance degradation just after upgrade (Cosbench). Do you already run performance tests between Hammer and Firefly ? No problem with read performance that was amazing Hi Florent, Can you talk a little bit about how your write tests are setup? How many concurrent IOs and what size? Also, do you see similar problems with rados bench? We have done some testing and haven't seen significant performance degradation except when switching to civetweb which appears to perform deletes more slowly than what we saw with apache+fcgi. Mark Sent from my iPhone ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
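For an apples-to-apples comparison that takes radosgw and the web server out of the picture, a rados bench run against the same pool on both releases is one option (a sketch; the pool name and concurrency are placeholders):

$ rados bench -p <pool> 60 write -t 16 --no-cleanup
$ rados bench -p <pool> 60 seq -t 16
$ rados -p <pool> cleanup

If rados bench shows comparable numbers on Firefly and Hammer, the regression is more likely in the radosgw/Apache layer than in the OSDs.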
Re: [ceph-users] VM with rbd volume hangs on write during load
On 07/15/2015 01:17 AM, Jeya Ganesh Babu Jegatheesan wrote: Hi, We have a Openstack + Ceph cluster based on Giant release. We use ceph for the VMs volumes including the boot volumes. Under load, we see the write access to the volumes stuck from within the VM. The same would work after a VM reboot. The issue is seen with and without rbd cache. Let me know if this is some known issue and any way to debug further. The ceph cluster itself seems to be clean. We have currently disabled scrub and deep scrub. 'ceph -s' output as below. Are you seeing slow requests in the system? Are any of the disks under the OSDs 100% busy or close to it? Btw, the amount of PGs is rather high. You are at 88, while the formula recommends: num_osd * 100 / 3 = 14k (cluster total) Wido cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d health HEALTH_WARN noscrub,nodeep-scrub flag(s) set monmap e71: 9 mons at {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d osdmap e54600: 425 osds: 425 up, 425 in flags noscrub,nodeep-scrub pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects 402 TB used, 941 TB / 1344 TB avail 37620 active+clean client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s The traces we see in the VM's kernel are as below. [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds. [ 1080.553027] Tainted: GF O 3.13.0-34-generic #60~precise1-Ubuntu [ 1080.553157] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 1080.553295] jbd2/vdb-8 D 88003687e3e0 0 813 2 0x [ 1080.553298] 880444fadb48 0002 880455114440 880444fadfd8 [ 1080.553302] 00014440 00014440 88044a9317f0 88044b7917f0 [ 1080.553303] 880444fadb48 880455114cd8 88044b7917f0 811fc670 [ 1080.553307] Call Trace: [ 1080.553309] [811fc670] ? __wait_on_buffer+0x30/0x30 [ 1080.553311] [8175b8b9] schedule+0x29/0x70 [ 1080.553313] [8175b98f] io_schedule+0x8f/0xd0 [ 1080.553315] [811fc67e] sleep_on_buffer+0xe/0x20 [ 1080.553316] [8175c052] __wait_on_bit+0x62/0x90 [ 1080.553318] [811fc670] ? __wait_on_buffer+0x30/0x30 [ 1080.553320] [8175c0fc] out_of_line_wait_on_bit+0x7c/0x90 [ 1080.553322] [810aff70] ? wake_atomic_t_function+0x40/0x40 [ 1080.553324] [811fc66e] __wait_on_buffer+0x2e/0x30 [ 1080.553326] [8129806b] jbd2_journal_commit_transaction+0x136b/0x1520 [ 1080.553329] [810a1f75] ? sched_clock_local+0x25/0x90 [ 1080.553331] [8109a7b8] ? finish_task_switch+0x128/0x170 [ 1080.55] [8107891f] ? try_to_del_timer_sync+0x4f/0x70 [ 1080.553334] [8129c5d8] kjournald2+0xb8/0x240 [ 1080.553336] [810afef0] ? __wake_up_sync+0x20/0x20 [ 1080.553338] [8129c520] ? commit_timeout+0x10/0x10 [ 1080.553340] [8108fa79] kthread+0xc9/0xe0 [ 1080.553343] [8108f9b0] ? flush_kthread_worker+0xb0/0xb0 [ 1080.553346] [8176827c] ret_from_fork+0x7c/0xb0 [ 1080.553349] [8108f9b0] ? flush_kthread_worker+0xb0/0xb0 Thanks, Jeyaganesh. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] VM with rbd volume hangs on write during load
On 7/14/15, 4:56 PM, ceph-users on behalf of Wido den Hollander ceph-users-boun...@lists.ceph.com on behalf of w...@42on.com wrote: On 07/15/2015 01:17 AM, Jeya Ganesh Babu Jegatheesan wrote: Hi, We have a Openstack + Ceph cluster based on Giant release. We use ceph for the VMs volumes including the boot volumes. Under load, we see the write access to the volumes stuck from within the VM. The same would work after a VM reboot. The issue is seen with and without rbd cache. Let me know if this is some known issue and any way to debug further. The ceph cluster itself seems to be clean. We have currently disabled scrub and deep scrub. 'ceph -s' output as below. Are you seeing slow requests in the system? I dont see slow requests in the cluster. Are any of the disks under the OSDs 100% busy or close to it? Most of the OSDs use 20% of a core. There is no OSD process busy at 100%. Btw, the amount of PGs is rather high. You are at 88, while the formula recommends: num_osd * 100 / 3 = 14k (cluster total) We used 30 * num_osd per pool. We do have 4 pools, i believe thats the why the PG seems to be be high. Wido cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d health HEALTH_WARN noscrub,nodeep-scrub flag(s) set monmap e71: 9 mons at {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=1 0.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:67 89/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm 010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc0 11b,gngsvc011c,gngsvm011d osdmap e54600: 425 osds: 425 up, 425 in flags noscrub,nodeep-scrub pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects 402 TB used, 941 TB / 1344 TB avail 37620 active+clean client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s The traces we see in the VM's kernel are as below. [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds. [ 1080.553027] Tainted: GF O 3.13.0-34-generic #60~precise1-Ubuntu [ 1080.553157] echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 1080.553295] jbd2/vdb-8 D 88003687e3e0 0 813 2 0x [ 1080.553298] 880444fadb48 0002 880455114440 880444fadfd8 [ 1080.553302] 00014440 00014440 88044a9317f0 88044b7917f0 [ 1080.553303] 880444fadb48 880455114cd8 88044b7917f0 811fc670 [ 1080.553307] Call Trace: [ 1080.553309] [811fc670] ? __wait_on_buffer+0x30/0x30 [ 1080.553311] [8175b8b9] schedule+0x29/0x70 [ 1080.553313] [8175b98f] io_schedule+0x8f/0xd0 [ 1080.553315] [811fc67e] sleep_on_buffer+0xe/0x20 [ 1080.553316] [8175c052] __wait_on_bit+0x62/0x90 [ 1080.553318] [811fc670] ? __wait_on_buffer+0x30/0x30 [ 1080.553320] [8175c0fc] out_of_line_wait_on_bit+0x7c/0x90 [ 1080.553322] [810aff70] ? wake_atomic_t_function+0x40/0x40 [ 1080.553324] [811fc66e] __wait_on_buffer+0x2e/0x30 [ 1080.553326] [8129806b] jbd2_journal_commit_transaction+0x136b/0x1520 [ 1080.553329] [810a1f75] ? sched_clock_local+0x25/0x90 [ 1080.553331] [8109a7b8] ? finish_task_switch+0x128/0x170 [ 1080.55] [8107891f] ? try_to_del_timer_sync+0x4f/0x70 [ 1080.553334] [8129c5d8] kjournald2+0xb8/0x240 [ 1080.553336] [810afef0] ? __wake_up_sync+0x20/0x20 [ 1080.553338] [8129c520] ? commit_timeout+0x10/0x10 [ 1080.553340] [8108fa79] kthread+0xc9/0xe0 [ 1080.553343] [8108f9b0] ? flush_kthread_worker+0xb0/0xb0 [ 1080.553346] [8176827c] ret_from_fork+0x7c/0xb0 [ 1080.553349] [8108f9b0] ? flush_kthread_worker+0xb0/0xb0 Thanks, Jeyaganesh. 
-- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on
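As a sanity check on Wido's formula with this cluster's numbers (back-of-the-envelope arithmetic only, and it assumes all four pools are 3x replicated, which the 402 TB used vs. 134 TB data ratio suggests):

$ echo "425 * 100 / 3" | bc
14166        # suggested total PG count across all pools (~14k)
$ echo "37620 * 3 / 425" | bc
265          # PG replicas actually mapped to each OSD in this cluster

So the cluster is carrying roughly 265 PG copies per OSD versus the ~100 the guideline targets, which is the point Wido is making.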
[ceph-users] CPU Hyperthreading ?
Hi list, Do you recommend enabling or disabling hyper-threading on the CPU? Is it the same for Mon? OSD? RadosGW? Thanks Sent from my iPhone
Re: [ceph-users] SSL for tracker.ceph.com
On 07/14/2015 04:14 PM, Wido den Hollander wrote: Hi, Currently tracker.ceph.com doesn't have SSL enabled. Every time I log in I'm sending my password over plain text, which I'd rather not. Can we get SSL enabled on tracker.ceph.com? And while we are at it, can we enable IPv6 as well? :)
File a ... tracker ticket for it! :D I'm not sure what is involved with getting IPv6 on the rest of our servers, but we need to look into it. Particularly git.ceph.com. - Ken
[ceph-users] VM with rbd volume hangs on write during load
Hi, We have an OpenStack + Ceph cluster based on the Giant release. We use Ceph for the VM volumes, including the boot volumes. Under load, we see write access to the volumes get stuck from within the VM. The same would work after a VM reboot. The issue is seen with and without the rbd cache. Let me know if this is a known issue and if there is any way to debug further. The Ceph cluster itself seems to be clean. We have currently disabled scrub and deep scrub. 'ceph -s' output is below.
cluster eaaeaa55-a8e7-4531-a5eb-03d73028b59d health HEALTH_WARN noscrub,nodeep-scrub flag(s) set monmap e71: 9 mons at {gngsvc009a=10.163.43.1:6789/0,gngsvc009b=10.163.43.2:6789/0,gngsvc010a=10.163.43.5:6789/0,gngsvc010b=10.163.43.6:6789/0,gngsvc011a=10.163.43.9:6789/0,gngsvc011b=10.163.43.10:6789/0,gngsvc011c=10.163.43.11:6789/0,gngsvm010d=10.163.43.8:6789/0,gngsvm011d=10.163.43.12:6789/0}, election epoch 22246, quorum 0,1,2,3,4,5,6,7,8 gngsvc009a,gngsvc009b,gngsvc010a,gngsvc010b,gngsvm010d,gngsvc011a,gngsvc011b,gngsvc011c,gngsvm011d osdmap e54600: 425 osds: 425 up, 425 in flags noscrub,nodeep-scrub pgmap v13257438: 37620 pgs, 4 pools, 134 TB data, 35289 kobjects 402 TB used, 941 TB / 1344 TB avail 37620 active+clean client io 94059 kB/s rd, 313 MB/s wr, 4623 op/s
The traces we see in the VM's kernel are as below. [ 1080.552901] INFO: task jbd2/vdb-8:813 blocked for more than 120 seconds. [ 1080.553027] Tainted: GF O 3.13.0-34-generic #60~precise1-Ubuntu [ 1080.553157] echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message. [ 1080.553295] jbd2/vdb-8 D 88003687e3e0 0 813 2 0x [ 1080.553298] 880444fadb48 0002 880455114440 880444fadfd8 [ 1080.553302] 00014440 00014440 88044a9317f0 88044b7917f0 [ 1080.553303] 880444fadb48 880455114cd8 88044b7917f0 811fc670 [ 1080.553307] Call Trace: [ 1080.553309] [811fc670] ? __wait_on_buffer+0x30/0x30 [ 1080.553311] [8175b8b9] schedule+0x29/0x70 [ 1080.553313] [8175b98f] io_schedule+0x8f/0xd0 [ 1080.553315] [811fc67e] sleep_on_buffer+0xe/0x20 [ 1080.553316] [8175c052] __wait_on_bit+0x62/0x90 [ 1080.553318] [811fc670] ? __wait_on_buffer+0x30/0x30 [ 1080.553320] [8175c0fc] out_of_line_wait_on_bit+0x7c/0x90 [ 1080.553322] [810aff70] ? wake_atomic_t_function+0x40/0x40 [ 1080.553324] [811fc66e] __wait_on_buffer+0x2e/0x30 [ 1080.553326] [8129806b] jbd2_journal_commit_transaction+0x136b/0x1520 [ 1080.553329] [810a1f75] ? sched_clock_local+0x25/0x90 [ 1080.553331] [8109a7b8] ? finish_task_switch+0x128/0x170 [ 1080.55] [8107891f] ? try_to_del_timer_sync+0x4f/0x70 [ 1080.553334] [8129c5d8] kjournald2+0xb8/0x240 [ 1080.553336] [810afef0] ? __wake_up_sync+0x20/0x20 [ 1080.553338] [8129c520] ? commit_timeout+0x10/0x10 [ 1080.553340] [8108fa79] kthread+0xc9/0xe0 [ 1080.553343] [8108f9b0] ? flush_kthread_worker+0xb0/0xb0 [ 1080.553346] [8176827c] ret_from_fork+0x7c/0xb0 [ 1080.553349] [8108f9b0] ? flush_kthread_worker+0xb0/0xb0
Thanks, Jeyaganesh.
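A few commands that are often useful for ruling out a stuck or slow OSD when a VM hangs like this; the OSD id below is only a placeholder, and the daemon commands have to be run on the node hosting that OSD:

$ ceph health detail                        # lists any slow/blocked requests the monitors know about
$ ceph osd perf                             # per-OSD commit/apply latency snapshot
$ ceph daemon osd.12 dump_ops_in_flight     # requests currently in flight on one OSD (osd.12 is an example id)
$ ceph daemon osd.12 dump_historic_ops      # recent slowest operations on that OSD

If none of these show anything unusual while a VM is stuck, the client side (librbd/qemu) tends to become the more likely suspect than the cluster.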
Re: [ceph-users] Performance degradation after upgrade to hammer
Yes of course, thanks Mark.
Infrastructure: 5 servers with 10 SATA disks each (50 OSDs in total), 10 Gb connected, EC 2+1 on the rgw.buckets pool, 2 radosgw in an RR-DNS-like setup installed on 2 of the cluster servers. No SSD drives used.
We're using Cosbench to send:
- 8k object size: 100% read with 256 workers: better results with Hammer
- 8k object size: 80% read / 20% write with 256 workers: real degradation between Firefly and Hammer (divided by something like 10)
- 8k object size: 100% write with 256 workers: real degradation between Firefly and Hammer (divided by something like 10)
Thanks Sent from my iPhone
On 14 juil. 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote: On 07/14/2015 06:42 PM, Florent MONTHEL wrote: Hi All, I've just upgraded a Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to Hammer (Redhat Ceph 1.3). Usage: radosgw with Apache 2.4.19 in MPM prefork mode. I'm experiencing huge write performance degradation just after the upgrade (Cosbench). Did you already run performance tests between Hammer and Firefly? No problem with read performance, which was amazing.
Hi Florent, Can you talk a little bit about how your write tests are set up? How many concurrent IOs and what size? Also, do you see similar problems with rados bench? We have done some testing and haven't seen significant performance degradation except when switching to civetweb, which appears to perform deletes more slowly than what we saw with apache+fcgi. Mark Sent from my iPhone
Re: [ceph-users] Performance degradation after upgrade to hammer
Hi Florent, 10x degradation is definitely unusual! A couple of things to look at: Are 8K rados bench writes to the rgw.buckets pool slow? You can with something like: rados -p rgw.buckets bench 30 write -t 256 -b 8192 You may also want to try targeting a specific RGW server to make sure the RR-DNS setup isn't interfering (at least while debugging). It may also be worth creating a new replicated pool and try writes to that pool as well to see if you see much difference. Mark On 07/14/2015 07:17 PM, Florent MONTHEL wrote: Yes of course thanks Mark Infrastructure : 5 servers with 10 sata disks (50 osd at all) - 10gb connected - EC 2+1 on rgw.buckets pool - 2 radosgw RR-DNS like installed on 2 cluster servers No SSD drives used We're using Cosbench to send : - 8k object size : 100% read with 256 workers : better results with Hammer - 8k object size : 80% read - 20% write with 256 workers : real degradation between Firefly and Hammer (divided by something like 10) - 8k object size : 100% write with 256 workers : real degradation between Firefly and Hammer (divided by something like 10) Thanks Sent from my iPhone On 14 juil. 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote: On 07/14/2015 06:42 PM, Florent MONTHEL wrote: Hi All, I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM prefork mode I'm experiencing huge write performance degradation just after upgrade (Cosbench). Do you already run performance tests between Hammer and Firefly ? No problem with read performance that was amazing Hi Florent, Can you talk a little bit about how your write tests are setup? How many concurrent IOs and what size? Also, do you see similar problems with rados bench? We have done some testing and haven't seen significant performance degradation except when switching to civetweb which appears to perform deletes more slowly than what we saw with apache+fcgi. Mark Sent from my iPhone ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
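If it helps, a minimal way to act on Mark's suggestion of comparing against a plain replicated pool alongside the EC-backed rgw.buckets pool; the pool name and PG count here are made up for the example:

$ ceph osd pool create bench-repl 256 256 replicated      # temporary 3x replicated test pool (example name/pg count)
$ rados -p bench-repl bench 30 write -t 256 -b 8192       # same 8k, 256-thread write pattern as the Cosbench run
$ ceph osd pool delete bench-repl bench-repl --yes-i-really-really-mean-it    # clean up afterwards

If the replicated pool holds up under 8k writes while rgw.buckets does not, the regression is more likely in the EC or RGW path than in RADOS itself.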
Re: [ceph-users] mds0: Client failing to respond to cache pressure
Hi John, I cut the test down to a single client running only Ganesha NFS, without any Ceph drivers loaded on the Ceph FS client. After deleting all the files in the Ceph file system and rebooting all the nodes, I restarted the create-5-million-files test, using 2 NFS clients against the one Ceph file system node running Ganesha NFS. After a couple of hours I am seeing the "client ede-c2-gw01 failing to respond to cache pressure" error:
$ ceph -s cluster 6d8aae1e-1125-11e5-a708-001b78e265be health HEALTH_WARN mds0: Client ede-c2-gw01 failing to respond to cache pressure monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0} election epoch 22, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03 mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby osdmap e323: 8 osds: 8 up, 8 in pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects 182 GB used, 78459 MB / 263 GB avail 832 active+clean
Dumping the mds daemon shows inodes > inode_max:
# ceph daemon mds.ede-c2-mds02 perf dump mds { mds: { request: 21862302, reply: 21862302, reply_latency: { avgcount: 21862302, sum: 16728.480772060 }, forward: 0, dir_fetch: 13, dir_commit: 50788, dir_split: 0, inode_max: 100000, inodes: 100010, inodes_top: 0, inodes_bottom: 0, inodes_pin_tail: 100010, inodes_pinned: 100010, inodes_expired: 4308279, inodes_with_caps: 8, caps: 8, subtrees: 2, traverse: 30802465, traverse_hit: 26394836, traverse_forward: 0, traverse_discover: 0, traverse_dir_fetch: 0, traverse_remote_ino: 0, traverse_lock: 0, load_cent: 2186230200, q: 0, exported: 0, exported_inodes: 0, imported: 0, imported_inodes: 0 } }
Once this test finishes and I verify the files were all correctly written, I will retest using the Samba VFS interface, followed by the kernel test. Please let me know if there is more info you need and if you want me to open a ticket. Best regards Eric
On Mon, Jul 13, 2015 at 9:40 AM, Eric Eastman eric.east...@keepertech.com wrote: Thanks John. I will back the test down to the simple case of 1 client without the kernel driver and only running NFS Ganesha, and work forward till I trip the problem and report my findings. Eric
On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com wrote: On 13/07/2015 04:02, Eric Eastman wrote: Hi John, I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all nodes. This system is using 4 Ceph FS client systems. They all have the kernel driver version of CephFS loaded, but none are mounting the file system. All 4 clients are using the libcephfs VFS interface to Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to share out the Ceph file system. # ceph -s cluster 6d8aae1e-1125-11e5-a708-001b78e265be health HEALTH_WARN 4 near full osd(s) mds0: Client ede-c2-gw01 failing to respond to cache pressure mds0: Client ede-c2-gw02:cephfs failing to respond to cache pressure mds0: Client ede-c2-gw03:cephfs failing to respond to cache pressure monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0} election epoch 8, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03 mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby osdmap e272: 8 osds: 8 up, 8 in pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects 212 GB used, 48715 MB / 263 GB avail 832 active+clean client io 1379 kB/s rd, 20653 B/s wr, 98 op/s It would help if we knew whether it's the kernel clients or the userspace clients that are generating the warnings here.
You've probably already done this, but I'd get rid of any unused kernel client mounts to simplify the situation. We haven't tested the cache limit enforcement with NFS Ganesha, so there is a decent chance that it is broken. The Ganesha FSAL is doing ll_get/ll_put reference counting on inodes, so it seems quite possible that its cache is pinning things that we would otherwise be evicting in response to cache pressure. You mention Samba as well. You can see if the MDS cache is indeed exceeding its limit by looking at the output of: ceph daemon mds.<daemon id> perf dump mds ...where the inodes value tells you how many are in the cache, vs. inode_max. If you can, it would be useful to boil this down to a straightforward test case: if you start with a healthy cluster, mount a single Ganesha client, and do your 5 million file procedure, do you get the warning?
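To pull out just the two counters John mentions, something along these lines works against the MDS admin socket (this assumes jq is available on the MDS host and uses the MDS name from the dump above; the values shown are simply the ones from that dump):

# ceph daemon mds.ede-c2-mds02 perf dump mds | jq '.mds | {inodes, inode_max}'
{
  "inodes": 100010,
  "inode_max": 100000
}

A cache that sits pinned at or above inode_max while clients hold on to their caps is consistent with the "failing to respond to cache pressure" warning.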
Re: [ceph-users] Ruby bindings for Librados
On 07/13/2015 02:11 PM, Wido den Hollander wrote: On 07/13/2015 09:43 PM, Corin Langosch wrote: Hi Wido, I'm the dev of https://github.com/netskin/ceph-ruby and still use it in production on some systems. It has everything I need so I didn't develop any further. If you find any bugs or need new features, just open an issue and I'm happy to have a look. Ah, that's great! We should look into making a Ruby binding official and moving it to Ceph's Github project. That would make it more clear for end-users. I see that RADOS namespaces are currently not implemented in the Ruby bindings. Not many bindings have them though. Might be worth looking at. I'll give the current bindings a try btw! I'd like to see this happen too. Corin, would you be amenable to moving this under the ceph GitHub org? You'd still have control over it, similar to the way Wido manages https://github.com/ceph/phprados - Ken ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CPU Hyperthreading ?
I was getting better performance with HT enabled (Intel cpu) for ceph-osd. I guess for mon it doesn't matter, but, for RadosGW I didn't measure the difference...We are running our benchmark with HT enabled for all components though. Thanks Regards Somnath
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Florent MONTHEL Sent: Tuesday, July 14, 2015 5:19 PM To: ceph-users Subject: [ceph-users] CPU Hyperthreading ? Hi list Do you recommend to enable or disable hyper threading on CPU ? Is it the case for Mon ? Osd ? Radosgw ? Thanks Sent from my iPhone
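For anyone who wants to benchmark both configurations themselves, one way to compare with and without HT on Linux without a BIOS round-trip is to take the sibling hyper-threads offline; the CPU numbers below are only an example and will differ per machine:

$ lscpu | grep -E 'Thread|Core|Socket'                               # "Thread(s) per core: 2" means HT is on
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list    # e.g. "0,20": cpu20 is cpu0's HT sibling
# echo 0 > /sys/devices/system/cpu/cpu20/online                     # take the sibling thread offline (as root)

Re-enabling is the same write with 1, so both configurations can be tested on the same box between benchmark runs.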
Re: [ceph-users] mds0: Client failing to respond to cache pressure
I changed the mds_cache_size to 500000 from 100000 to get rid of the WARN temporarily. Now dumping the mds daemon shows this: inode_max: 500000, inodes: 124213. But I have no idea what to do if the inodes rise above 500000; change the mds_cache_size again? Thanks.
2015-07-15 11:06 GMT+08:00 Eric Eastman eric.east...@keepertech.com: Hi John, I cut the test down to a single client running only Ganesha NFS, without any Ceph drivers loaded on the Ceph FS client. After deleting all the files in the Ceph file system and rebooting all the nodes, I restarted the create-5-million-files test, using 2 NFS clients against the one Ceph file system node running Ganesha NFS. After a couple of hours I am seeing the "client ede-c2-gw01 failing to respond to cache pressure" error: $ ceph -s cluster 6d8aae1e-1125-11e5-a708-001b78e265be health HEALTH_WARN mds0: Client ede-c2-gw01 failing to respond to cache pressure monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0} election epoch 22, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03 mdsmap e1860: 1/1/1 up {0=ede-c2-mds02=up:active}, 2 up:standby osdmap e323: 8 osds: 8 up, 8 in pgmap v302142: 832 pgs, 4 pools, 162 GB data, 4312 kobjects 182 GB used, 78459 MB / 263 GB avail 832 active+clean Dumping the mds daemon shows inodes > inode_max: # ceph daemon mds.ede-c2-mds02 perf dump mds { mds: { request: 21862302, reply: 21862302, reply_latency: { avgcount: 21862302, sum: 16728.480772060 }, forward: 0, dir_fetch: 13, dir_commit: 50788, dir_split: 0, inode_max: 100000, inodes: 100010, inodes_top: 0, inodes_bottom: 0, inodes_pin_tail: 100010, inodes_pinned: 100010, inodes_expired: 4308279, inodes_with_caps: 8, caps: 8, subtrees: 2, traverse: 30802465, traverse_hit: 26394836, traverse_forward: 0, traverse_discover: 0, traverse_dir_fetch: 0, traverse_remote_ino: 0, traverse_lock: 0, load_cent: 2186230200, q: 0, exported: 0, exported_inodes: 0, imported: 0, imported_inodes: 0 } } Once this test finishes and I verify the files were all correctly written, I will retest using the Samba VFS interface, followed by the kernel test. Please let me know if there is more info you need and if you want me to open a ticket. Best regards Eric On Mon, Jul 13, 2015 at 9:40 AM, Eric Eastman eric.east...@keepertech.com wrote: Thanks John. I will back the test down to the simple case of 1 client without the kernel driver and only running NFS Ganesha, and work forward till I trip the problem and report my findings. Eric On Mon, Jul 13, 2015 at 2:18 AM, John Spray john.sp...@redhat.com wrote: On 13/07/2015 04:02, Eric Eastman wrote: Hi John, I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all nodes. This system is using 4 Ceph FS client systems. They all have the kernel driver version of CephFS loaded, but none are mounting the file system. All 4 clients are using the libcephfs VFS interface to Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to share out the Ceph file system.
# ceph -s cluster 6d8aae1e-1125-11e5-a708-001b78e265be health HEALTH_WARN 4 near full osd(s) mds0: Client ede-c2-gw01 failing to respond to cache pressure mds0: Client ede-c2-gw02:cephfs failing to respond to cache pressure mds0: Client ede-c2-gw03:cephfs failing to respond to cache pressure monmap e1: 3 mons at {ede-c2-mon01=10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0} election epoch 8, quorum 0,1,2 ede-c2-mon01,ede-c2-mon02,ede-c2-mon03 mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby osdmap e272: 8 osds: 8 up, 8 in pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects 212 GB used, 48715 MB / 263 GB avail 832 active+clean client io 1379 kB/s rd, 20653 B/s wr, 98 op/s It would help if we knew whether it's the kernel clients or the userspace clients that are generating the warnings here. You've probably already done this, but I'd get rid of any unused kernel client mounts to simplify the situation.
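For what it's worth, mds_cache_size can also be raised at runtime rather than by editing ceph.conf and restarting the MDS; the daemon name below is just the one used earlier in this thread:

# ceph daemon mds.ede-c2-mds02 config set mds_cache_size 500000          # on the MDS host, via the admin socket
# ceph tell mds.ede-c2-mds02 injectargs '--mds-cache-size 500000'        # or remotely, via the monitors

Raising the limit only hides the warning, though: if a client keeps pinning inodes (as John suspects for the Ganesha FSAL), the cache will simply grow toward whatever new limit is set.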
Re: [ceph-users] CPU Hyperthreading ?
Thanks for feed-back Somnath Sent from my iPhone On 14 juil. 2015, at 20:24, Somnath Roy somnath@sandisk.com wrote: I was getting better performance with HT enabled (Intel cpu) for ceph-osd. I guess for mon it doesn't matter, but, for RadosGW I didn't measure the difference...We are running our benchmark with HT enabled for all components though. Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Florent MONTHEL Sent: Tuesday, July 14, 2015 5:19 PM To: ceph-users Subject: [ceph-users] CPU Hyperthreading ? Hi list Do you recommend to enable or disable hyper threading on CPU ? Is it the case for Mon ? Osd ? Radosgw ? Thanks Sent from my iPhone
Re: [ceph-users] Ceph and Redhat Enterprise Virtualization (RHEV)
RHEV does not formally support Ceph yet. Future versions are looking to include Cinder support which will allow you to hook in Ceph. You should contact your RHEV contacts who can give an indication of the timeline for this. Neil On Tue, Jul 14, 2015 at 10:43 AM, Peter Michael Calum pe...@tdc.dk wrote: Hi, Does anyone know if it is possible to use Ceph storage in Redhat Enterprise Virtualization (RHEV), and connect it as a data domain in the Redhat Enterprise Virtualization Manager (RHEVM). My RHEV version and Hypervisors are the latest RHEV 6.5 version. Thanks, Peter Calum TDC ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph and Redhat Enterprise Virtualization (RHEV)
Hi, Does anyone know if it is possible to use Ceph storage in Redhat Enterprise Virtualization (RHEV), and connect it as a data domain in the Redhat Enterprise Virtualization Manager (RHEVM). My RHEV version and Hypervisors are the latest RHEV 6.5 version. Thanks, Peter Calum TDC ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com