[ceph-users] luminous 12.2.6 -> 12.2.7 active+clean+inconsistent PGs workaround (or wait for 12.2.8+ ?)
Hi,

For those facing (lots of) active+clean+inconsistent PGs after the luminous 12.2.6 metadata corruption and the 12.2.7 upgrade, I'd like to explain how I finally got rid of those.

Disclaimer: my cluster doesn't contain highly valuable data, and I can sort of recreate what it actually contains: VMs. The following is risky...

One reason I needed to fix those issues is that I faced I/O errors with pool overlays/tiering which were apparently related to the inconsistencies, and the only way I could get my VMs running again was to completely disable the SSD overlay, which is far from ideal. For those not feeling the need to fix this "harmless" issue, please stop reading. For the others, please understand the risks of the following... or wait for an official "pg repair" solution.

So:

1st step: since I was getting an ever growing list of damaged PGs, I decided to deep-scrub... all PGs. Yes. If you have 1+ PB of data... stop reading (or not?). How to do that (the pool list was lost from the original message, so substitute your own):

  # for j in <your pools>; do for i in `ceph pg ls-by-pool $j | cut -d " " -f 1 | tail -n +2`; do ceph pg deep-scrub $i; done; done

I think I already had a full list of damaged PGs until I upgraded to mimic and restarted the MONs/OSDs: I believe the daemon restarts caused ceph to forget about known inconsistencies. If you believe the number of damaged PGs is sort of stable for you, then skip step 1...

The 2nd step is sort of easy: apply the method described here:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021054.html

I tried to add some rados locking before overwriting the objects (4M rbd objects in my case), but was still able to overwrite a locked object even with "rados -p rbd lock get --lock-type exclusive"... maybe I haven't tried hard enough. It would have been great if it were possible to make sure the object was not overwritten between a get and a put :/ - that would make this procedure much safer...
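For reference, the per-object rewrite from the linked thread boils down to a get followed by a put of the same bytes, so that the overwrite refreshes the stored digest. Below is a minimal dry-run sketch that only prints the commands it would run: the pool name, the object names and the lock name are made-up examples, and as noted above the advisory lock did not actually prevent a concurrent overwrite in my tests.

```shell
# Dry-run sketch of the step-2 rewrite: print, don't execute, the rados
# commands for a couple of made-up object names. Drop the echos (and feed
# real object names in) to actually run it -- at your own risk, as said above.
POOL=rbd                      # example pool name
TMP=$(mktemp -d)
for obj in rbd_data.aaaa.0001 rbd_data.aaaa.0002; do
    # "repair-lock" is an example lock name; locking did not help in my tests
    echo "rados -p $POOL lock get --lock-type exclusive $obj repair-lock"
    echo "rados -p $POOL get $obj $TMP/$obj"
    echo "rados -p $POOL put $obj $TMP/$obj"
done
rmdir "$TMP"
```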
In my case, I had 2000+ damaged PGs, so I wrote a small script that should process those PGs and should try to apply the procedure:
https://gist.github.com/fschaer/cb851eae4f46287eaf30715e18f14524

My Ceph cluster has been healthy since Friday evening and I haven't seen any data corruption nor any hung VM...

Cheers
Frederic
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors
My cache pool seems affected by an old/closed bug... but I don't think this is (directly?) related to the current issue - it won't help anyway :-/
http://tracker.ceph.com/issues/12659

Since I got promote issues, I tried to flush only the affected rbd image: I got 6 unflushable objects...

rbd image 'dev7243':
  size 10240 MB in 2560 objects
  order 22 (4096 kB objects)
  block_name_prefix: rbd_data.194b8c238e1f29
  (...)

=> # for i in `rados -p ssd-hot-irfu-virt ls | egrep '^rbd_data.194b8c238e1f29'`; do rados -p ssd-hot-irfu-virt cache-flush $i; rados -p ssd-hot-irfu-virt cache-evict $i; done
error from cache-flush rbd_data.194b8c238e1f29.082f: (16) Device or resource busy
error from cache-flush rbd_data.194b8c238e1f29.082f: (16) Device or resource busy
error from cache-flush rbd_data.194b8c238e1f29.0926: (16) Device or resource busy
error from cache-flush rbd_data.194b8c238e1f29.0926: (16) Device or resource busy
(...)

Strange that the cache-evict error message is the same as the cache-flush one:

# rados -p ssd-hot-irfu-virt cache-evict rbd_data.194b8c238e1f29.082f
error from cache-flush rbd_data.194b8c238e1f29.082f: (16) Device or resource busy

Anyway: I stopped the VM and... I still can't flush the objects.
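The per-image loop above can be split into a testable filtering step: take a `rados ls` listing, keep only the objects whose names start with the image's block_name_prefix, and emit the flush/evict commands. In this sketch the listing is an inlined two-object sample (the second object belongs to another image), so it runs without touching a cluster; swap the sample for a real `rados -p <pool> ls` to use it.

```shell
# Filter a captured object listing by block_name_prefix and print the
# cache-flush/cache-evict commands. Object names are made-up samples.
prefix=rbd_data.194b8c238e1f29
pool=ssd-hot-irfu-virt
listing='rbd_data.194b8c238e1f29.082f
rbd_data.21e0fe2ae8944a.0001'

printf '%s\n' "$listing" | grep "^$prefix" | while read -r obj; do
    echo "rados -p $pool cache-flush $obj"
    echo "rados -p $pool cache-evict $obj"
done
```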
I don't think this is related anyway, as the OSD promote error is:

2018-07-25 10:51:44.386764 7fd27929b700 -1 log_channel(cluster) log [ERR] : 1.39 copy from 1:9c0e12cc:::rbd_data.1920e2238e1f29.0dfc:head to 1:9c0e12cc:::rbd_data.1920e2238e1f29.0dfc:head data digest 0x632451e5 != source 0x73dfd8ab
2018-07-25 10:51:44.386769 7fd27929b700 -1 osd.74 pg_epoch: 182580 pg[1.39( v 182580'38868939 (182579'38867404,182580'38868939] local-lis/les=182563/182564 n=342 ec=2726/2726 lis/c 182563/182563 les/c/f 182564/182564/0 182563/182563/182558) [74,71,19] r=0 lpr=182563 crt=182580'38868939 lcod 182580'38868938 mlcod 182580'38868938 active+clean] finish_promote unexpected promote error (5) Input/output error

And I don't see object rbd_data.1920e2238e1f29.0dfc (:head ?) in the unflushable objects...

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER Frederic
Sent: Wednesday, July 25, 2018 10:28
To: Dan van der Ster
Cc: ceph-users
Subject: [PROVENANCE INTERNET] Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

(...)
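Those "copy from ... data digest X != source Y" lines can be reduced to a per-object summary with a bit of sed, which helps when counting how many distinct objects are affected. The log line below is a shortened, made-up sample shaped like the errors above; nothing here talks to the cluster - point the pipeline at /var/log/ceph/ceph.log to use it for real.

```shell
# Extract (object, stored digest, source digest) from promote-error log lines.
log='cluster [ERR] 1.20 copy from 1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head to 1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head data digest 0x27451e3c != source 0x12c05014'

printf '%s\n' "$log" |
sed -n 's/.*copy from \([^ ]*\) to .* data digest \(0x[0-9a-f]*\) != source \(0x[0-9a-f]*\).*/\1 \2 \3/p'
# One line per error: object name, then the two mismatching digests.
```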
Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors
Hi again,

Now with all OSDs restarted, I'm getting:

health: HEALTH_ERR
        777 scrub errors
        Possible data damage: 36 pgs inconsistent
(...)
pgs: 4764 active+clean
     36 active+clean+inconsistent

But from what I could read up to now, this is what's expected and should auto-heal when objects are overwritten - fingers crossed, as pg repair or scrub doesn't seem to help.

New errors in the ceph logs include lines like the following, which I also hope/presume are expected - I still have posts to read on this list about omap and those errors:

2018-07-25 10:20:00.106227 osd.66 osd.66 192.54.207.75:6826/2430367 12 : cluster [ERR] 11.288 shard 207: soid 11:1155c332:::rbd_data.207dce238e1f29.0527:head data_digest 0xc8997a5b != data_digest 0x2ca15853 from auth oi 11:1155c332:::rbd_data.207dce238e1f29.0527:head(182554'240410 client.6084296.0:48463693 dirty|data_digest|omap_digest s 4194304 uv 49429318 dd 2ca15853 od alloc_hint [0 0 0])
2018-07-25 10:20:00.106230 osd.66 osd.66 192.54.207.75:6826/2430367 13 : cluster [ERR] 11.288 soid 11:1155c332:::rbd_data.207dce238e1f29.0527:head: failed to pick suitable auth object

But never mind: with the SSD cache in writeback, I just saw the same error again on one VM (only, for now); lots of these:

2018-07-25 10:15:19.841746 osd.101 osd.101 192.54.207.206:6859/3392654 116 : cluster [ERR] 1.20 copy from 1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head to 1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head data digest 0x27451e3c != source 0x12c05014

(osd.101 is an SSD from the cache pool)

=> yum update => I/O error => set the TIER pool to forward => yum update starts.
Weird, but if that happens only on this host, I can cope with it (I have 780+ scrub errors to handle now :/ )

And just to be sure ;)

[root@ceph10 ~]# ceph --admin-daemon /var/run/ceph/*osd*101* version
{"version":"12.2.7","release":"luminous","release_type":"stable"}

On the good side: this update is forcing us to dive into ceph internals: we'll be more ceph-aware tonight than this morning ;)

Cheers
Fred

-----Original Message-----
From: SCHAER Frederic
Sent: Wednesday, July 25, 2018 09:57
To: 'Dan van der Ster'
Cc: ceph-users
Subject: RE: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

(...)
Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors
Hi Dan,

Just checked again: arggghhh...

# grep AUTO_RESTART /etc/sysconfig/ceph
CEPH_AUTO_RESTART_ON_UPGRADE=no

So no :'( RPMs were upgraded, but OSDs were not restarted as I thought. Or at least not restarted with the new 12.2.7 binaries (but since the skip digest option was present in the running 12.2.6 OSDs, I guess the 12.2.6 OSDs did not understand that option).

I just restarted all of the OSDs: I will check the behavior again and report here - thanks for pointing me in the right direction!

Fred

-----Original Message-----
From: Dan van der Ster [mailto:d...@vanderster.com]
Sent: Tuesday, July 24, 2018 16:50
To: SCHAER Frederic
Cc: ceph-users
Subject: Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

`ceph versions` -- you're sure all the osds are running 12.2.7 ?

osd_skip_data_digest = true is supposed to skip any crc checks during reads. But maybe the cache tiering IO path is different and checks the crc anyway?

-- dan

On Tue, Jul 24, 2018 at 3:01 PM SCHAER Frederic wrote:
> (...)
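Dan's `ceph versions` check can be turned into a quick script: if more than one "ceph version ..." line shows up under the osd section, some daemons are still running old binaries. The JSON fragment below is a made-up sample of what a half-restarted cluster might report, standing in for real `ceph versions` output.

```shell
# Count distinct OSD binary versions in a (sample) `ceph versions` osd section.
# On a real cluster, replace the here-string with: ceph versions
osd_section='"osd": {
    "ceph version 12.2.6 (...) luminous (stable)": 20,
    "ceph version 12.2.7 (...) luminous (stable)": 100
}'

n=$(printf '%s\n' "$osd_section" | grep -c 'ceph version')
echo "distinct OSD versions: $n"
if [ "$n" -gt 1 ]; then
    echo "mixed versions: some OSDs still need a restart"
fi
```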
Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors
Oh my... Tried to yum upgrade in writeback mode and noticed the syslogs on the VM:

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1896024
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1896064
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895552
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895536
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895520
(...)

Ceph is also logging many errors:

2018-07-24 15:20:24.893872 osd.74 [ERR] 1.33 copy from 1:cd70e921:::rbd_data.21e0fe2ae8944a.:head to 1:cd70e921:::rbd_data.21e0fe2ae8944a.:head data digest 0x1480c7a1 != source 0xe1e7591b

[root@ceph0 ~]# egrep 'copy from.*to.*data digest' /var/log/ceph/ceph.log | wc -l
928

Setting the cache tier to forward mode again prevents the I/O errors. In writeback mode:

# yum update 2>&1 | tail
---> Package glibc-headers.x86_64 0:2.12-1.209.el6_9.2 will be updated
---> Package glibc-headers.x86_64 0:2.12-1.212.el6 will be an update
---> Package gmp.x86_64 0:4.3.1-12.el6 will be updated
---> Package gmp.x86_64 0:4.3.1-13.el6 will be an update
---> Package gnupg2.x86_64 0:2.0.14-8.el6 will be updated
---> Package gnupg2.x86_64 0:2.0.14-9.el6_10 will be an update
---> Package gnutls.x86_64 0:2.12.23-21.el6 will be updated
---> Package gnutls.x86_64 0:2.12.23-22.el6 will be an update
---> Package httpd.x86_64 0:2.2.15-60.sl6.6 will be updated
Error: disk I/O error

Each time I run a yum update, I get a bit farther in the yum update process. In forward mode: works as expected.

I haven't tried to flush the cache pool while in forward mode... yet...
Ugh :/

Regards

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER Frederic
Sent: Tuesday, July 24, 2018 15:01
To: ceph-users
Subject: [PROVENANCE INTERNET] [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

(...)
[ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors
Hi,

I read the 12.2.7 upgrade notes, and set "osd skip data digest = true" before I started upgrading from 12.2.6 on my Bluestore-only cluster. As far as I can tell, my OSDs all got restarted during the upgrade and all got the option enabled.

This is what I see for a specific OSD taken at random:

# ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show | grep data_digest
    "osd_skip_data_digest": "true",

This is what I see when I try to injectargs the data digest ignore option:

# ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1 | head
osd.0: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.1: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.2: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.3: osd_skip_data_digest = 'true' (not observed, change may require restart)
(...)

This has been like that since I upgraded to 12.2.7. I read in the release notes that the skip_data_digest option should be sufficient to ignore the 12.2.6 corruptions and that objects should auto-heal on rewrite...

However...

My config:
- Using tiering with an SSD hot storage tier
- HDDs for cold storage

And... I get I/O errors on some VMs when running commands as simple as "yum check-update".
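Since injectargs answers "not observed", the reliable check is the running value on each OSD's admin socket, as done for osd.68 above. Here is a sketch of that loop; the admin-socket query is stubbed with a sample `config show` line so it runs as-is, and on a real node you would replace the stub with `ceph --admin-daemon "$sock" config show`.

```shell
# Verify every local OSD actually observed osd_skip_data_digest.
# config_show is a stub standing in for the real admin-socket query.
config_show() { echo '    "osd_skip_data_digest": "true",'; }

for sock in /var/run/ceph/ceph-osd.68.asok; do   # example socket path
    if config_show "$sock" | grep -q '"osd_skip_data_digest": "true"'; then
        echo "$sock: skip_data_digest observed"
    else
        echo "$sock: NOT observed - restart this OSD"
    fi
done
```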
The qemu/kvm/libvirt logs (in /var/log/libvirt/qemu) show me these:

block I/O error in device 'drive-virtio-disk0': Input/output error (5)

In the ceph logs, I can see these errors:

2018-07-24 11:17:56.420391 osd.71 [ERR] 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.429936 osd.71 [ERR] 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 0x3bb26e16 != source 0xec476c54

(yes, my cluster is seen as healthy)

On the affected OSDs, I can see these errors:

2018-07-24 11:17:56.420349 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.420388 7f034642a700 -1 log_channel(cluster) log [ERR] : 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.420395 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected promote error (5) Input/output error
2018-07-24 11:17:56.429900 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.429934 7f034642a700 -1 log_channel(cluster) log [ERR] : 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.429939 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected promote error (5) Input/output error

And I don't know how to recover from that. Pool #1 is my SSD cache tier, hence pg 1.23 is on the SSD side.

I've tried setting the cache pool to "readforward" despite the "not well supported" warning and could immediately get back working VMs (no more I/O errors). But with no SSD tiering: not really useful. As soon as I tried setting the cache tier to writeback again, I got those I/O errors again... (not on the yum command, but in the meantime I've stopped and set out, then unset out osd.71, to check it with badblocks just in case...)

I still have to find how to reproduce the I/O error on an affected host to further debug/fix that issue... Any ideas?

Thanks && regards
[ceph-users] bluestore behavior on disks sector read errors
Hi,

Every now and then, sectors die on disks. When this happens on my bluestore (kraken) OSDs, I get 1 PG that becomes degraded. The exact status is:

HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 12.127 is active+clean+inconsistent, acting [141,67,85]

If I do a:

# rados list-inconsistent-obj 12.127 --format=json-pretty

I get:

(...)
"osd": 112,
"errors": [
    "read_error"
],
"size": 4194304

When this happens, I'm forced to manually run "ceph pg repair" on the inconsistent PGs after I made sure this was a read error: I feel this should not be a manual process.

If I go on the machine and look at the syslogs, I indeed see a sector read error happened once or twice. But if I try to read the sector manually, then I can - because it was reallocated on the disk, I presume. Last time this happened, I ran badblocks on the disk and it found no issue...

My questions therefore are:
- why doesn't bluestore retry reading the sector (in case of transient errors)? (maybe it does)
- why isn't the pg automatically fixed when a read error was detected?
- what will happen when the disks get old and reach up to 2048 bad sectors before the controllers/SMART declare them as "failure predicted"? I can't imagine manually fixing up to Nx2048 PGs in an infrastructure of N disks where N could reach the sky...

Ideas?

Thanks && regards
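The manual decision above - repair only after confirming the inconsistency is just a read error - can be sketched as a check on the `rados list-inconsistent-obj` output. The JSON below is a trimmed, made-up sample shaped like the report quoted above, and the digest-mismatch error names grepped for are assumptions about the kinds of errors one would not want to blindly repair; a real script should parse the JSON properly rather than grep it.

```shell
# Decide whether a PG looks safe to `ceph pg repair`: read_error present,
# and no digest-mismatch/missing errors. Dry run: only prints the command.
pgid=12.127
report='{ "inconsistents": [ { "shards": [
  { "osd": 112, "errors": [ "read_error" ], "size": 4194304 } ] } ] }'

if printf '%s\n' "$report" | grep -q '"read_error"' &&
   ! printf '%s\n' "$report" | grep -Eq '"(data_digest_mismatch|omap_digest_mismatch|missing)"'; then
    echo "ceph pg repair $pgid"
else
    echo "inspect PG $pgid by hand"
fi
```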
[ceph-users] ceph crush map rules for EC pools and out OSDs ?
Hi,

I have 5 data nodes (bluestore, kraken), each with 24 OSDs. I enabled the optimal crush tunables. I'd like to try to "really" use EC pools, but until now I've faced cluster lockups when I was using 3+2 EC pools with a host failure domain. When a host was down, for instance ;)

Since I'd like the erasure codes to be more than a "nice to have feature with 12+ ceph data nodes", I wanted to try this:
- Use a 14+6 EC rule
- And for each data chunk:
  o select 4 hosts
  o on these hosts, select 5 OSDs

In order to do that, I created this rule in the crush map:

rule 4hosts_20shards {
    ruleset 3
    type erasure
    min_size 20
    max_size 20
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 4 type host
    step chooseleaf indep 5 type osd
    step emit
}

I then created an EC pool with this erasure profile:

ceph osd erasure-code-profile set erasurep14_6_osd ruleset-failure-domain=osd k=14 m=6

I hoped this would allow for losing 1 host completely without locking the cluster, and I have the impression this is working... But. There's always a but ;)

I tried to make all OSDs down by stopping the ceph-osd daemons on one node. And according to ceph, the cluster is unhealthy. ceph health detail gives me, for instance, this (for the 3+2 and 14+6 pools):

pg 5.18b is active+undersized+degraded, acting [57,47,2147483647,23,133]
pg 9.186 is active+undersized+degraded, acting [2147483647,2147483647,2147483647,2147483647,2147483647,133,142,125,131,137,50,48,55,65,52,16,13,18,22,3]

My question therefore is: why aren't the down PGs remapped onto my 5th data node, since I made sure the 20 EC shards were spread onto 4 hosts only? I thought/hoped that because OSDs were down, the data would be rebuilt onto another OSD/host. I can understand the 3+2 EC pool cannot allocate OSDs on another host because 3+2=5 hosts already, but I don't understand why the 14+6 EC pool/PGs do not rebuild somewhere else?
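For what it's worth, the availability arithmetic behind the rule above can be written down as a small sanity check: with k=14, m=6 a PG needs any 14 of its 20 shards, and the rule packs 5 shards per host onto 4 hosts, so one whole host going down still leaves 15 readable shards.

```shell
# Sanity-check the 14+6 / 4-hosts-x-5-OSDs layout described above.
k=14; m=6
hosts_used=4; shards_per_host=5

total=$((k + m))                        # 20 shards per PG
left=$((total - shards_per_host))       # shards surviving one full host loss
echo "total=$total left_after_one_host=$left need=$k"
if [ "$left" -ge "$k" ]; then
    echo "one full host can be lost"
fi
```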
I do not find anything worthwhile in a "ceph pg query": the up and acting parts are equal and do contain the 2147483647 value (which means "none", as far as I understood).

I've also tried to "ceph osd out" all the OSDs from one host: in that case, the 3+2 EC PGs behave as previously, but the 14+6 EC PGs seem happy despite the fact they are still saying the out OSDs are up and acting.

Is my crush rule that wrong? Is it possible to do what I want?

Thanks for any hints...

Regards
Frederic
Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)
Hi,

I'm facing the same thing after I reinstalled a node directly in jewel... Reading:
http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/31917
I can confirm that running "udevadm trigger -c add -s block" fires the udev rules and gets ceph-osd up.

Thing is: I now have reinstalled boxes (CentOS 7.2.1511) which do not fire udev rules at boot and get no /dev/disk/by-parttypeuuid - and I fear there is none also just after installing the ceph RPMs, since the udev rules did not pre-exist - and other exact same boxes (same setup, same hardware, same partitions) which were upgraded from previous ceph versions, which do seem to work correctly - or so I think. All with rootfs on LVM...

I'll try to compare the 2 kinds of hosts to see if I can find something useful...

Regards

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of stephane.d...@orange.com
Sent: Friday, June 24, 2016 12:10
To: Loic Dachary
Cc: ceph-users
Subject: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

Hi Loïc,

Sorry for the delay. Well, it's a vanilla CentOS iso image downloaded from a centos.org mirror:

[root@hulk-stg ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

This issue happens after a Ceph upgrade from hammer; I haven't tested with a distro starting with a fresh Ceph install.

Thanks,
Stéphane

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Tuesday, June 21, 2016 14:48
To: DAVY Stephane OBS/OCB
Cc: ceph-users
Subject: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

On 16/06/2016 18:01, stephane.d...@orange.com wrote:
> Hi,
>
> Same issue with CentOS 7, I also put back this file in /etc/udev/rules.d.

Hi Stephane,

Could you please detail which version of CentOS 7 you are using?
I tried to reproduce the problem with CentOS 7.2 as found on the CentOS cloud images repository ( http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud-1511.qcow2 ) but it "works for me". Thanks ! > > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf > Of Alexandre DERUMIER > Sent: Thursday, June 16, 2016 17:53 > To: Karsten Heymann; Loris Cuoghi > Cc: Loic Dachary; ceph-users > Subject: Re: [ceph-users] osds udev rules not triggered on reboot > (jewel, jessie) > > Hi, > > I have the same problem with osd disks not mounted at boot on jessie > with ceph jewel > > workaround is to re-add 60-ceph-partuuid-workaround.rules file to udev > > http://tracker.ceph.com/issues/16351 > > > - Mail original - > De: "aderumier" > À: "Karsten Heymann" , "Loris Cuoghi" > > Cc: "Loic Dachary" , "ceph-users" > > Envoyé: Jeudi 28 Avril 2016 07:42:04 > Objet: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, > jessie) > > Hi, > they are missing target files in debian packages > > http://tracker.ceph.com/issues/15573 > https://github.com/ceph/ceph/pull/8700 > > I have also done some other trackers about packaging bug > > jewel: debian package: wrong /etc/default/ceph/ceph location > http://tracker.ceph.com/issues/15587 > > debian/ubuntu : TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES not specified in > /etc/default/cep > http://tracker.ceph.com/issues/15588 > > jewel: debian package: init.d script bug > http://tracker.ceph.com/issues/15585 > > > @CC loic dachary, maybe he could help to speed up packaging fixes > > - Mail original - > De: "Karsten Heymann" > À: "Loris Cuoghi" > Cc: "ceph-users" > Envoyé: Mercredi 27 Avril 2016 15:20:29 > Objet: Re: [ceph-users] osds udev rules not triggered on reboot > (jewel, jessie) > > 2016-04-27 15:18 GMT+02:00 Loris Cuoghi : >> Le 27/04/2016 14:45, Karsten Heymann a écrit : >>> one workaround I found was to add >>> >>> [Install] >>> WantedBy=ceph-osd.target >>> >>> to 
/lib/systemd/system/ceph-disk@.service and then manually enable >>> my disks with >>> >>> # systemctl enable ceph-disk\@dev-sdi1 # systemctl start >>> ceph-disk\@dev-sdi1 >>> >>> That way they at least are started at boot time. > >> Great! But only if the disks keep their device names, right ? > > Exactly. It's just a little workaround until the real issue is fixed. > > +Karsten > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
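The workaround threads above boil down to: if the by-parttypeuuid links never appeared at boot, re-fire the udev "add" events for block devices so the ceph rules match and ceph-osd comes up. A minimal sketch of that check (the `need_retrigger` helper name is mine; the directory and the udevadm invocation come from the thread):

```shell
#!/bin/sh
# Sketch: re-fire udev block rules when the by-parttypeuuid symlinks
# are missing, as described in the thread above.
need_retrigger() {
    # true when the symlink directory is absent or empty
    dir="${1:-/dev/disk/by-parttypeuuid}"
    [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]
}

if need_retrigger; then
    # long form of the command from the thread: udevadm trigger -c add -s block
    udevadm trigger --action=add --subsystem-match=block 2>/dev/null || true
fi
```

This only papers over the missing rules; the real fix is restoring the udev rules file (or the target files) the later messages point at.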
Re: [ceph-users] OSD Restart results in "unfound objects"
Hi, Same for me... unsetting the bitwise flag considerably lowered the number of unfound objects. I'll have to wait/check for the remaining 214 though... Cheers -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Samuel Just Sent: Thursday, June 2, 2016 01:20 To: Uwe Mesecke Cc: ceph-users Subject: Re: [ceph-users] OSD Restart results in "unfound objects" Yep, looks like the same issue: 2016-06-02 00:45:27.977064 7fc11b4e9700 10 osd.17 pg_epoch: 11108 pg[34.4a( v 11104'1080336 lc 11104'1080335 (11069'1077294,11104'1080336] local-les=11108 n=50593 ec=2051 les/c/f 11104/11104/0 11106/11107/11107) [17,13] r=0 lpr=11107 pi=11101-11106/3 crt=11104'1080336 lcod 0'0 mlcod 0'0 inactive m=1 u=1] search_for_missing 34:52a5cefb:::default.3653921.2__shadow_.69E1Tth4Y2Q7m0VKNbQdJe-9BgYks6I_1:head 11104'1080336 also missing on osd.13 (last_backfill MAX but with wrong sort order) Thanks! -Sam On Wed, Jun 1, 2016 at 4:04 PM, Uwe Mesecke wrote: > Hey Sam, > > glad you found the bug. As another data point I just did the whole round of > "healthy -> set sortbitwise -> osd restarts -> unfound objects -> unset > sortbitwise -> healthy" with the debug settings as described by you earlier. > > I uploaded the logfiles... > > https://www.dropbox.com/s/f5hhptbtocbxe1k/ceph-osd.13.log.gz > https://www.dropbox.com/s/kau9cjqfhmtpd89/ceph-osd.17.log.gz > > The PG with the unfound object is „34.4a“ and it seems as if there are similar > log messages as you noted in the issue. > > The cluster runs jewel 10.2.1 and was created a long time ago, I think it was > giant. > > Thanks again! > > Uwe > >> On 02.06.2016 at 00:19, Samuel Just wrote: >> >> http://tracker.ceph.com/issues/16113 >> >> I think I found the bug. Thanks for the report! Turning off >> sortbitwise should be an ok workaround for the moment. >> -Sam >> >> On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro >> wrote: >>> Yes, it was created as Hammer. 
>>> I haven't faced any issues on the upgrade (despite the well-known systemd issues), >>> and after that the cluster didn't show any suspicious behavior. >>> >>> >>> --- >>> Diego Castro / The CloudFather >>> GetupCloud.com - Eliminamos a Gravidade >>> >>> 2016-06-01 18:57 GMT-03:00 Samuel Just : Was this cluster upgraded to jewel? If so, at what version did it start? -Sam On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro wrote: > Hello Samuel, i'm a bit afraid of restarting my osd's again, i'll wait > until > the weekend to push the config. > BTW, i just unset the sortbitwise flag. > > > --- > Diego Castro / The CloudFather > GetupCloud.com - Eliminamos a Gravidade > > 2016-06-01 13:39 GMT-03:00 Samuel Just : >> >> Can either of you reproduce with logs? That would make it a lot >> easier to track down if it's a bug. I'd want >> >> debug osd = 20 >> debug ms = 1 >> debug filestore = 20 >> >> On all of the osds for a particular pg from when it is clean until it >> develops an unfound object. >> -Sam >> >> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro >> wrote: >>> Hello Uwe, i also have the sortbitwise flag enabled and i see exactly >>> the same behavior as yours. >>> Perhaps this is also the root of my issues, does anybody know if it is >>> safe to >>> disable it? >>> >>> >>> --- >>> Diego Castro / The CloudFather >>> GetupCloud.com - Eliminamos a Gravidade >>> >>> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke : > On 01.06.2016 at 10:25, Diego Castro > wrote: > > Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon. > Today my cluster suddenly went unhealthy with lots of stuck pg's > due to > unfound objects, no disk failures nor node crashes, it just went > bad. > > I managed to put the cluster back in a healthy state by marking lost > objects to delete "ceph pg mark_unfound_lost delete". 
> Regarding the fact that i have no idea why the cluster went bad, i > realized that restarting the osd daemons to unlock stuck clients put > the > cluster > back in an unhealthy state, and pgs went stuck again due to unfound objects. > > Does anyone have this issue? Hi, I also ran into that problem after upgrading to jewel. In my case I was able to somewhat correlate this behavior with setting the sortbitwise flag after the upgrade. When the flag is set, after some time these unfound objects are popping up. Restarting osds just makes it worse and/or
Re: [ceph-users] OSD Restart results in "unfound objects"
I do… In my case, I have collocated the MONs with some OSDs, and no later than Saturday, when I lost data again, I found out that one of the MON+OSD nodes had run out of memory and started killing ceph-mon on that node… At the same moment, all OSDs started to complain about not being able to see other OSDs on other machines. I suspect that when the node runs out of memory, bad things happen with, for instance, the network (no memory : no network buffers ?). But I can’t explain the unfound objects, as in my case, same as yours, nodes did not crash, and ceph-osd did not crash either – hence, I’m assuming no data was lost because of a sudden disk poweroff for instance, or because of any kernel or raid controller cache… For now, I’m considering moving the MONs onto dedicated nodes… hoping the out of memory was my issue. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Diego Castro Sent: Wednesday, June 1, 2016 10:25 To: ceph-users Subject: [ceph-users] OSD Restart results in "unfound objects" Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon. Today my cluster suddenly went unhealthy with lots of stuck pg's due to unfound objects, no disk failures nor node crashes, it just went bad. I managed to put the cluster back in a healthy state by marking lost objects to delete "ceph pg mark_unfound_lost delete". Regarding the fact that i have no idea why the cluster went bad, i realized that restarting the osd daemons to unlock stuck clients put the cluster back in an unhealthy state, and pgs went stuck again due to unfound objects. Does anyone have this issue? --- Diego Castro / The CloudFather GetupCloud.com - Eliminamos a Gravidade
[ceph-users] unfound objects - why and how to recover ? (bonus : jewel logs)
Hi, -- First, let me start with the bonus... I migrated from hammer => jewel and followed the migration instructions... but the migration instructions are missing this : #chown -R ceph:ceph /var/log/ceph I just discovered this was the reason I found no logs anywhere about my current issue :/ -- This is maybe the 3rd time this happens to me ... This time I'd like to try to understand what happens. So. ceph-10.2.0-0.el7.x86_64 + CentOS 7.2 here. Ceph health was happy, but any rbd operation was hanging - hence : ceph was hung, and so were the test VMs running on it. I placed my VM in an EC pool on top of which I overlaid an RBD pool with SSDs. The EC pool is defined as a 3+1 pool, with 5 hosts hosting the OSDs (and the failure domain is set to hosts). "Ceph -w" wasn't displaying new status lines as usual, but ceph health (detail) wasn't saying anything was wrong. After looking at one node, I found that ceph logs were empty on that node, so I decided to restart the OSDs on that one using : systemctl restart ceph-osd@* After I did that, ceph -w came back to life, but told me there was a dead MON - which I restarted too. 
I watched some kind of recovery happening, and after a few seconds/minutes, I now see : [root@ceph0 ~]# ceph health detail HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs stuck unclean; recovery 57/373846 objects degraded (0.015%); recovery 57/110920 unfound (0.051%) pg 691.65 is stuck unclean for 310704.556119, current state active+recovery_wait+degraded, last acting [44,99,69,9] pg 691.1e5 is stuck unclean for 493631.370697, current state active+recovering+degraded, last acting [77,43,20,99] pg 691.12a is stuck unclean for 14521.475478, current state active+recovering+degraded, last acting [42,56,7,106] pg 691.165 is stuck unclean for 14521.474525, current state active+recovering+degraded, last acting [21,71,24,117] pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound recovery 57/373846 objects degraded (0.015%) recovery 57/110920 unfound (0.051%) Damn. Last time this happened, I was forced to declare the PGs lost in order to recover a "healthy" ceph, because ceph does not want to revert PGs in EC pools. But one of the VMs started hanging randomly on disk IOs... This same VM is now down, and I can't remove its disk from rbd, it's hanging at 99% - I could work around that by renaming the file and re-installing the VM on a new disk, but anyway, I'd like to understand+fix+make sure this does not happen again. We sometimes suffer power cuts here : if restarting daemons kills ceph data, I cannot think of what would happen in case of a power cut... Back to the unfound objects. I have no down OSDs in the cluster (only 1 down - OSD.46 - which I put down myself, and whose weight I set to 0 last week). I can query the PGs, but I don't understand what I see in there. 
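While waiting on recovery it helps to track the unfound count without eyeballing the whole report; here is a small filter assuming the `recovery N/M unfound (x%)` summary line format shown above (the awk logic is my own sketch, not a ceph tool):

```shell
#!/bin/sh
# Print the number of unfound objects, read from `ceph health detail`
# output on stdin: finds the "N/M unfound" token and prints N.
unfound_count() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i == "unfound" && $(i-1) ~ /^[0-9]+\/[0-9]+$/) {
                split($(i-1), a, "/"); print a[1]; exit
            }
    }'
}

# usage sketch: ceph health detail | unfound_count
```

Running it periodically (e.g. under `watch`) gives a quick view of whether the unfound count is shrinking or growing.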
For instance : #ceph pg 691.65 query (...) "num_objects_missing": 0, "num_objects_degraded": 39, "num_objects_misplaced": 0, "num_objects_unfound": 39, "num_objects_dirty": 138, And then for 2 peers I see : "state": "active+undersized+degraded", ## undersized ??? (...) "num_objects_missing": 0, "num_objects_degraded": 138, "num_objects_misplaced": 138, "num_objects_unfound": 0, "num_objects_dirty": 138, "blocked_by": [], "up_primary": 44, "acting_primary": 44 If I look at the "missing" objects, I can see something on some OSDs : # ceph pg 691.165 list_missing (...) { "oid": { "oid": "rbd_data.8de32431bd7b7.0ea7", "key": "", "snapid": -2, "hash": 971513189, "max": 0, "pool": 691, "namespace": "" }, "need": "26521'22595", "have": "25922'22575", "locations": [] } All of the missing objects have this "need/have" discrepancy. I can see such objects in a "691.165" directory on secondary OSDs, but I do not see any 691.165 directory on the primary OSD (44)... ? For instance : [root@ceph0 ~]# ll /var/lib/ceph/osd/ceph-21/current/691.165s0_head/*8de32431bd7b7.0ea7* -rw-r--r-- 1 ceph ceph 1399392 May 15 13:18 /var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0ea7__head_39E81D65__2b3_5843_0 -rw-r--r-- 1 ceph ceph 1399392 May 27 11:07
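A quick way to survey those need/have discrepancies is to filter the `list_missing` JSON; a sketch using plain grep/sed (no jq assumed), matching the output format shown above:

```shell
#!/bin/sh
# Summarize missing objects from `ceph pg <pgid> list_missing` JSON:
# keep only the oid / need / have lines, stripped of indentation and
# trailing commas. A sketch, not a full JSON parser.
missing_summary() {
    grep -E '"(oid|need|have)":' | sed -e 's/^ *//' -e 's/,$//'
}

# usage sketch:
#   ceph pg 691.165 list_missing | missing_summary
```

As the earlier thread notes, when the objects stay unfound the last resort is `ceph pg <pgid> mark_unfound_lost delete` (revert being refused on EC pools), so it is worth checking the need/have versions on every OSD in the acting set first.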
Re: [ceph-users] jewel upgrade : MON unable to start
I believe this is because I did not read the instructions thoroughly enough... this is my first "live upgrade" -Original Message- From: Oleksandr Natalenko [mailto:oleksa...@natalenko.name] Sent: Monday, May 2, 2016 16:39 To: SCHAER Frederic <frederic.sch...@cea.fr>; ceph-us...@ceph.com Subject: Re: [ceph-users] jewel upgrade : MON unable to start Why do you upgrade osds first if it is necessary to upgrade mons before everything else? On May 2, 2016 5:31:43 PM GMT+03:00, SCHAER Frederic <frederic.sch...@cea.fr> wrote: >Hi, > >I'm < sort of > following the upgrade instructions on CentOS 7.2. >I upgraded 3 OSD nodes without too many issues, even if I would rewrite >those upgrade instructions to : > > >#chrony has ID 167 on my systems... this was set at install time ! but >I use NTP anyway. > >yum remove chrony > >sed -i -e '/chrony/d' /etc/passwd > >#there is no more "service ceph stop" possible after the yum update, so >I had to run it before. Or killall ceph daemons... > >service ceph stop > >yum -y update > >chown ceph:ceph /var/lib/ceph > >#this fixed some OSDs which failed to start because of permission denied >issues on the journals. > >chown -RL --dereference ceph:ceph /var/lib/ceph > >#not done automatically : > >systemctl enable ceph-osd.target ceph.target > >#systemctl start ceph-osd.target has absolutely no effect. Nor any >.target targets, at least for me, and right after the upgrade. > >ceph-disk activate-all > >Anyways. Now I'm trying to upgrade the MON nodes... and I'm facing an >issue. >I started with one MON and left the 2 others untouched (hammer). > >First, the mons did not want to start : >May 02 15:40:58 ceph2_snip_ ceph-mon[789124]: warning: unable to create >/var/run/ceph: (13) Permission denied > >No pb: I created and chowned the directory. 
>But I'm now still unable to start this MON, journalctl tells me : > >May 02 16:05:49 ceph2_snip ceph-mon[804583]: starting mon.ceph2 rank 2 >at _ipsnip_.72:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph2 fsid >70ac4a78-46c0-45e6-8ff9-878b37f50fa1 >May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: In function >'void FSMap::sanity() const' thread 7f774d7d94c0 time 2016-05-02 >16:05:49.487984 >May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED >assert(i.second.state == MDSMap::STATE_STANDBY) >May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0 >(3a9fba20ec743699b69bd0181dd6c54dc01c64b9) >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1: >(ceph::__ceph_assert_fail(char const*, char const*, int, char >const*)+0x85) [0x7f774de221e5] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity() >const+0x952) [0x7f774dd3f972] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3: >(MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4: >(PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5: >(Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6: >(Monitor::init_paxos()+0x95) [0x7f774da67955] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 7: >(Monitor::preinit()+0x949) [0x7f774da77b39] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 8: (main()+0x23e3) >[0x7f774da03e93] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 9: >(__libc_start_main()+0xf5) [0x7f774ad6fb15] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 10: (()+0x25e401) >[0x7f774da57401] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: NOTE: a copy of the >executable, or `objdump -rdS ` is needed to interpret this. 
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2016-05-02 16:05:49.490966 >7f774d7d94c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' >thread 7f774d7d94c0 time 2016-05-02 16:05:49.487984 >May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED >assert(i.second.state == MDSMap::STATE_STANDBY) >May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0 >(3a9fba20ec743699b69bd0181dd6c54dc01c64b9) >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1: >(ceph::__ceph_assert_fail(char const*, char const*, int, char >const*)+0x85) [0x7f774de221e5] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity() >const+0x952) [0x7f774dd3f972] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3: >(MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4: >(PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5: >(Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb] >May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6: >(Monitor::init_paxos()+0
[ceph-users] jewel upgrade : MON unable to start
Hi, I'm < sort of > following the upgrade instructions on CentOS 7.2. I upgraded 3 OSD nodes without too many issues, even if I would rewrite those upgrade instructions to : #chrony has ID 167 on my systems... this was set at install time ! but I use NTP anyway. yum remove chrony sed -i -e '/chrony/d' /etc/passwd #there is no more "service ceph stop" possible after the yum update, so I had to run it before. Or killall ceph daemons... service ceph stop yum -y update chown ceph:ceph /var/lib/ceph #this fixed some OSDs which failed to start because of permission denied issues on the journals. chown -RL --dereference ceph:ceph /var/lib/ceph #not done automatically : systemctl enable ceph-osd.target ceph.target #systemctl start ceph-osd.target has absolutely no effect. Nor any .target targets, at least for me, and right after the upgrade. ceph-disk activate-all Anyways. Now I'm trying to upgrade the MON nodes... and I'm facing an issue. I started with one MON and left the 2 others untouched (hammer). First, the mons did not want to start : May 02 15:40:58 ceph2_snip_ ceph-mon[789124]: warning: unable to create /var/run/ceph: (13) Permission denied No pb: I created and chowned the directory. 
But I'm now still unable to start this MON, journalctl tells me : May 02 16:05:49 ceph2_snip ceph-mon[804583]: starting mon.ceph2 rank 2 at _ipsnip_.72:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph2 fsid 70ac4a78-46c0-45e6-8ff9-878b37f50fa1 May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7f774d7d94c0 time 2016-05-02 16:05:49.487984 May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED assert(i.second.state == MDSMap::STATE_STANDBY) May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9) May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f774de221e5] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity() const+0x952) [0x7f774dd3f972] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3: (MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4: (PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5: (Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6: (Monitor::init_paxos()+0x95) [0x7f774da67955] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 7: (Monitor::preinit()+0x949) [0x7f774da77b39] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 8: (main()+0x23e3) [0x7f774da03e93] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 9: (__libc_start_main()+0xf5) [0x7f774ad6fb15] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 10: (()+0x25e401) [0x7f774da57401] May 02 16:05:49 ceph2_snip ceph-mon[804583]: NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2016-05-02 16:05:49.490966 7f774d7d94c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7f774d7d94c0 time 2016-05-02 16:05:49.487984 May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED assert(i.second.state == MDSMap::STATE_STANDBY) May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0 (3a9fba20ec743699b69bd0181dd6c54dc01c64b9) May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f774de221e5] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity() const+0x952) [0x7f774dd3f972] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3: (MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4: (PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5: (Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6: (Monitor::init_paxos()+0x95) [0x7f774da67955] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 7: (Monitor::preinit()+0x949) [0x7f774da77b39] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 8: (main()+0x23e3) [0x7f774da03e93] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 9: (__libc_start_main()+0xf5) [0x7f774ad6fb15] May 02 16:05:49 ceph2_snip ceph-mon[804583]: 10: (()+0x25e401) [0x7f774da57401] May 02 16:05:49 ceph2_snip ceph-mon[804583]: NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. May 02 16:05:49 ceph2_snip ceph-mon[804583]: 0> 2016-05-02 16:05:49.490966 7f774d7d94c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 7f774d7d94c0 time 2016-05-02 16:05:49.487984 May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED assert(i.second.state == MDSMap::STATE_STANDBY) (...) ? I'm now stuck with a half jewel/hammer cluster... DOH ! :'( I've seen a bug on the bugtracker, but I fail to find a work around ?
[ceph-users] ceph OSD down+out =>health ok => remove => PGs backfilling... ?
Hi, One simple/quick question. In my ceph cluster, I had a disk which was in predicted failure. It was so much in predicted failure that the ceph OSD daemon crashed. After the OSD crashed, ceph moved data correctly (or at least that's what I thought), and a ceph -s was giving a "HEALTH_OK". Perfect. I tried to tell ceph to mark the OSD down : it told me the OSD was already down... fine. Then I ran this : ID=43 ; ceph osd down $ID ; ceph auth del osd.$ID ; ceph osd rm $ID ; ceph osd crush remove osd.$ID And immediately after this, ceph told me : # ceph -s cluster 70ac4a78-46c0-45e6-8ff9-878b37f50fa1 health HEALTH_WARN 37 pgs backfilling 3 pgs stuck unclean recovery 12086/355688 objects misplaced (3.398%) monmap e2: 3 mons at {ceph0=192.54.207.70:6789/0,ceph1=192.54.207.71:6789/0,ceph2=192.54.207.72:6789/0} election epoch 938, quorum 0,1,2 ceph0,ceph1,ceph2 mdsmap e64: 1/1/1 up {0=ceph1=up:active}, 1 up:standby-replay, 1 up:standby osdmap e25455: 119 osds: 119 up, 119 in; 35 remapped pgs pgmap v5473702: 3212 pgs, 10 pools, 378 GB data, 97528 objects 611 GB used, 206 TB / 207 TB avail 12086/355688 objects misplaced (3.398%) 3175 active+clean 37 active+remapped+backfilling client io 192 kB/s rd, 1352 kB/s wr, 117 op/s Of course, I'm sure OSD 43 was the one that was down ;) My question therefore is : if ceph successfully and automatically migrated data off the down/out OSD, why is there anything happening once I tell ceph to forget about this osd ? Was the cluster not "HEALTH OK" after all ? (ceph-0.94.6-0.el7.x86_64 for now) Thanks && regards
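For what it's worth, `ceph osd crush remove` rewrites the CRUSH map (the host bucket loses the OSD's weight), so CRUSH recomputes placements even though the removed OSD held no data, which would explain the backfilling above. Draining the CRUSH weight first makes the later removal a placement no-op. A hedged sketch of that ordering, dry-run by default (the `run` wrapper is mine; set APPLY=1 to actually execute):

```shell
#!/bin/sh
# Sketch: remove a dead OSD with a single rebalance by zeroing its
# CRUSH weight first. Dry-run unless APPLY=1 is set.
ID="${ID:-43}"

run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

run ceph osd crush reweight "osd.$ID" 0   # triggers the (single) data move
# ... wait here for `ceph -s` to report HEALTH_OK again ...
run ceph osd out "$ID"
run ceph osd crush remove "osd.$ID"       # now a no-op for placement
run ceph auth del "osd.$ID"
run ceph osd rm "$ID"
```

With the one-liner from the post, the crush removal at the end is what kicks off the second round of data movement.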
[ceph-users] ceph startup issues : OSDs don't start
Hi, I'm sure I'm doing something wrong, I hope someone can enlighten me... I'm encountering many issues when I restart a ceph server (any ceph server). This is on CentOS 7.2, ceph-0.94.6-0.el7.x86_64. First : I have disabled abrt. I don't need abrt. But when I restart, I see these logs in the systemd-udevd journal : Apr 21 18:00:14 ceph4._snip_ python[1109]: detected unhandled Python exception in '/usr/sbin/ceph-disk' Apr 21 18:00:14 ceph4._snip_ python[1109]: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory Apr 21 18:00:14 ceph4._snip_ python[1174]: detected unhandled Python exception in '/usr/sbin/ceph-disk' Apr 21 18:00:14 ceph4._snip_ python[1174]: can't communicate with ABRT daemon, is it running? [Errno 2] No such file or directory How could I possibly debug these exceptions ? Could that be related to the osd hook that I'm using to put the SSDs in another root in the crush map (that hook is a bash script, but it's calling another helper python script that I made and which is trying to use megacli to identify the SSDs on a non-jbod controller... tricky thing.) ? 
Then, I see these kinds of errors for most if not all drives : Apr 21 18:00:47 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(err) '2016-04-21 18:00:47.115322 7fc408ff9700 0 -- :/885104093 >> __MON_IP__:6789/0 pipe(0x7fc48280 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc400012670).fault' Apr 21 18:00:50 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(err) '2016-04-21 18:00:50.115543 7fc408ef8700 0 -- :/885104093 >> __MON_IP__:6789/0 pipe(0x7fc40c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7fc4e1d0).fault' Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(out) 'failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.113 --keyring=/var/lib/ceph/osd/ceph-113/keyring osd crush create-or-move -- 113 1.81 host=ceph4 root=default'' Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1'(err) 'ceph-disk: Error: ceph osd start failed: Command '['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.113']' returned non-zero exit status 1' Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate /dev/sdt1' [1257] exit with return code 1 Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: adding watch on '/dev/sdt1' Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: created db file '/run/udev/data/b65:49' for '/devices/pci:00/:00:07.0/:03:00.0/host2/target2:2:6/2:2:6:0/block/sdt/sdt1' Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: passed unknown number of bytes to netlink monitor 0x7f4cec2f3240 Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: seq 2553 processed with 0 Please note that at that time of the boot, I think there is still no network as the interfaces are brought up later according to the network journal : Apr 21 18:02:16 ceph4._snip_ network[2904]: Bringing up interface p2p1: [ OK ] Apr 21 18:02:19 ceph4._snip_ network[2904]: Bringing up interface p2p2: [ OK ] => too bad for the OSD startups... 
I have to say I also disabled NetworkManager, and I'm using static network configuration files... but I don't know why the ceph init script would be called before network is up... ? But even if I had network, I'm having another issue : I'm wondering whether I'm hitting deadlocks somewhere... Apr 21 18:01:10 ceph4._snip_ systemd-udevd[779]: worker [792] /devices/pci:00/:00:07.0/:03:00.0/host2/target2:2:0/2:2:0:0/block/sdn/sdn2 is taking a long time (...) Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) 'SG_IO: bad/missing sense data, sb[]: 70 00 05 00' Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) ' 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00' Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) '' Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(out) '=== osd.107 === ' Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) '2016-04-21 18:01:54.707669 7f95801ac700 0 -- :/2141879112 >> __MON_IP__:6789/0 pipe(0x7f957c05f710 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f957c05bb40).fault' (...) Apr 21 18:02:12 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) '2016-04-21 18:02:12.709053 7f95801ac700 0 -- :/2141879112 >> __MON_IP__:6789/0 pipe(0x7f9570008280 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f95700056a0).fault' Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk activate-journal /dev/sdn2'(err) 'create-or-move updated item name 'osd.107' weight 1.81 at location {host=ceph4,root=default} to crush map' Apr 21 18:02:16 ceph4._snip_
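One way to sidestep the udev-vs-network race described above is to re-run `ceph-disk activate-all` (the same command used after the jewel upgrade later in this archive) from a oneshot unit ordered after the network comes up. The sketch below only prints the unit text; the unit name is hypothetical, and whether this fits a udev-driven hammer setup is an assumption to verify:

```shell
#!/bin/sh
# Emit a oneshot systemd unit that re-activates ceph OSDs once the
# network is online. Save the output as e.g.
#   /etc/systemd/system/ceph-activate-all.service
# then: systemctl daemon-reload && systemctl enable ceph-activate-all
emit_unit() {
cat <<'EOF'
[Unit]
Description=Re-activate ceph OSDs after the network is online
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ceph-disk activate-all

[Install]
WantedBy=multi-user.target
EOF
}

emit_unit
```

This does not fix the udev-time failures, it just retries the activation at a point in the boot where the MONs are reachable.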
Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)
Hi, Many thanks. Just tested: I could see the rbd_id object in the EC pool, and after promoting it I could see it in the SSD cache pool and could successfully list the image information, indeed. Cheers

-----Original Message-----
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: Wednesday, 24 February 2016 19:16
To: SCHAER Frederic <frederic.sch...@cea.fr>
Cc: ceph-us...@ceph.com; HONORE Pierre-Francois <pierre-francois.hon...@cea.fr>
Subject: Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

If you run "rados -p ls | grep "rbd_id." and don't see that object, you are experiencing that issue [1]. You can attempt to work around this issue by running "rados -p irfu-virt setomapval rbd_id. dummy value" to force-promote the object to the cache pool. I haven't tested / verified that will alleviate the issue, though. [1] http://tracker.ceph.com/issues/14762 -- Jason Dillaman

----- Original Message -----
> From: "SCHAER Frederic" <frederic.sch...@cea.fr>
> To: ceph-us...@ceph.com
> Cc: "HONORE Pierre-Francois" <pierre-francois.hon...@cea.fr>
> Sent: Wednesday, February 24, 2016 12:56:48 PM
> Subject: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)
> Hi,
> I just started testing VMs inside ceph this week, ceph-hammer 0.94-5 here.
> I built several pools, using pool tiering:
> - A small replicated SSD pool (5 SSDs only, but I thought it'd be better for IOPS, I intend to test the difference with disks only)
> - Overlaying a larger EC pool
> I just have 2 VMs in Ceph... and one of them is breaking something.
> The VM that is not breaking was migrated using qemu-img for creating the ceph volume, then migrating the data.
Its rbd format is 1 : > rbd image 'xxx-disk1': > size 20480 MB in 5120 objects > order 22 (4096 kB objects) > block_name_prefix: rb.0.83a49.3d1b58ba > format: 1 > The VM that’s failing has a rbd format 2 > this is what I had before things started breaking : > rbd image 'yyy-disk1': > size 10240 MB in 2560 objects > order 22 (4096 kB objects) > block_name_prefix: rbd_data.8ae1f47398c89 > format: 2 > features: layering, striping > flags: > stripe unit: 4096 kB > stripe count: 1 > The VM started behaving weirdly with a huge IOwait % during its install > (that’s to say it did not take long to go wrong ;) ) > Now, this is the only thing that I can get > [root@ceph0 ~]# rbd -p irfu-virt info yyy-disk1 > 2016-02-24 18:30:33.213590 7f00e6f6d7c0 -1 librbd::ImageCtx: error reading > image id: (95) Operation not supported > rbd: error opening image yyy-disk1: (95) Operation not supported > One thing to note : the VM * IS STILL * working : I can still do disk > operations, apparently. > During the VM installation, I realized I wrongly set the target SSD caching > size to 100Mbytes, instead of 100Gbytes, and ceph complained it was almost > full : > health HEALTH_WARN > 'ssd-hot-irfu-virt' at/near target max > My question is…… am I facing the bug as reported in this list thread with > title “Possible Cache Tier Bug - Can someone confirm” ? > Or did I do something wrong ? > The libvirt and kvm that are writing into ceph are the following : > libvirt -1.2.17-13.el7_2.3.x86_64 > qemu- kvm -1.5.3-105.el7_2.3.x86_64 > Any idea how I could recover the VM file, if possible ? > Please note I have no problem with deleting the VM and rebuilding it, I just > spawned it to test. 
> As a matter of fact, I just "virsh destroyed" the VM, to see if I could start it again... and I can't :
> # virsh start yyy
> error: Failed to start domain yyy
> error: internal error: process exited while connecting to monitor:
> 2016-02-24T17:49:59.262170Z qemu-kvm: -drive file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***==:auth_supported=cephx\;none:mon_host=_\:6789,if=none,id=drive-virtio-disk0,format=raw: error reading header from yyy-disk1
> 2016-02-24T17:49:59.263743Z qemu-kvm: -drive file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=A***==:auth_supported=cephx\;none:mon_host=___\:6789,if=none,id=drive-virtio-disk0,format=raw: could not open disk image rbd:irfu-virt/___-disk1:id=irfu-***==:auth_supported=cephx\;none:mon_host=___\:6789: Could not open 'rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***
> Ideas ?
> Thanks
> Frederic
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
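For reference, the force-promote workaround from Jason's reply above, with the pool and image names from this thread filled in as shell variables (the names elided in the quoted commands are placeholders; `irfu-virt`/`yyy-disk1` are the pool and affected image from this exchange). This is a sketch, not a verified recovery procedure:

```shell
# Force-promote an rbd image's header object into the cache tier by
# writing a dummy omap value to it (per http://tracker.ceph.com/issues/14762).
POOL=irfu-virt      # cache-tiered pool from this thread
IMAGE=yyy-disk1     # affected image from this thread
OBJ="rbd_id.$IMAGE" # the header object rbd fails to read

if command -v rados >/dev/null 2>&1; then
  # 1) is the header object visible at all?
  rados -p "$POOL" ls | grep -q "^$OBJ\$" || echo "$OBJ not listed in $POOL"
  # 2) force-promote it
  rados -p "$POOL" setomapval "$OBJ" dummy value
  # 3) does rbd read the header now?
  rbd -p "$POOL" info "$IMAGE"
else
  echo "rados/rbd not installed; commands shown for reference"
fi
```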
[ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)
Hi, I just started testing VMs inside ceph this week, ceph-hammer 0.94-5 here. I built several pools, using pool tiering: - A small replicated SSD pool (5 SSDs only, but I thought it'd be better for IOPS, I intend to test the difference with disks only) - Overlaying a larger EC pool I just have 2 VMs in Ceph... and one of them is breaking something. The VM that is not breaking was migrated using qemu-img for creating the ceph volume, then migrating the data. Its rbd format is 1 : rbd image 'xxx-disk1': size 20480 MB in 5120 objects order 22 (4096 kB objects) block_name_prefix: rb.0.83a49.3d1b58ba format: 1 The VM that's failing has a rbd format 2 this is what I had before things started breaking : rbd image 'yyy-disk1': size 10240 MB in 2560 objects order 22 (4096 kB objects) block_name_prefix: rbd_data.8ae1f47398c89 format: 2 features: layering, striping flags: stripe unit: 4096 kB stripe count: 1 The VM started behaving weirdly with a huge IOwait % during its install (that's to say it did not take long to go wrong ;) ) Now, this is the only thing that I can get [root@ceph0 ~]# rbd -p irfu-virt info yyy-disk1 2016-02-24 18:30:33.213590 7f00e6f6d7c0 -1 librbd::ImageCtx: error reading image id: (95) Operation not supported rbd: error opening image yyy-disk1: (95) Operation not supported One thing to note : the VM *IS STILL* working : I can still do disk operations, apparently. During the VM installation, I realized I wrongly set the target SSD caching size to 100Mbytes, instead of 100Gbytes, and ceph complained it was almost full : health HEALTH_WARN 'ssd-hot-irfu-virt' at/near target max My question is.. am I facing the bug as reported in this list thread with title "Possible Cache Tier Bug - Can someone confirm" ? Or did I do something wrong ? The libvirt and kvm that are writing into ceph are the following : libvirt-1.2.17-13.el7_2.3.x86_64 qemu-kvm-1.5.3-105.el7_2.3.x86_64 Any idea how I could recover the VM file, if possible ? 
Please note I have no problem with deleting the VM and rebuilding it, I just spawned it to test. As a matter of fact, I just "virsh destroyed" the VM, to see if I could start it again... and I can't :
# virsh start yyy
error: Failed to start domain yyy
error: internal error: process exited while connecting to monitor:
2016-02-24T17:49:59.262170Z qemu-kvm: -drive file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***==:auth_supported=cephx\;none:mon_host=_\:6789,if=none,id=drive-virtio-disk0,format=raw: error reading header from yyy-disk1
2016-02-24T17:49:59.263743Z qemu-kvm: -drive file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=A***==:auth_supported=cephx\;none:mon_host=___\:6789,if=none,id=drive-virtio-disk0,format=raw: could not open disk image rbd:irfu-virt/___-disk1:id=irfu-***==:auth_supported=cephx\;none:mon_host=___\:6789: Could not open 'rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***
Ideas ?
Thanks
Frederic
[ceph-users] Erasure Coding pool stuck at creation because of pre-existing crush ruleset ?
Hi, With 5 hosts, I could successfully create pools with k=4 and m=1, with the failure domain being set to "host". With 6 hosts, I could also create k=4,m=1 EC pools. But I suddenly failed with 6 hosts and k=5,m=1, or k=4,m=2 : the PGs were never created - I reused the pool name for my tests, and this seems to matter, see below :

HEALTH_WARN 512 pgs stuck inactive; 512 pgs stuck unclean
pg 159.70 is stuck inactive since forever, current state creating, last acting []
pg 159.71 is stuck inactive since forever, current state creating, last acting []
pg 159.72 is stuck inactive since forever, current state creating, last acting []

The pool is like this :
[root@ceph0 ~]# ceph osd pool get testec erasure_code_profile
erasure_code_profile: erasurep4_2_host
[root@ceph0 ~]# ceph osd erasure-code-profile get erasurep4_2_host
directory=/usr/lib64/ceph/erasure-code
k=4
m=2
plugin=isa
ruleset-failure-domain=host

The PG list is like this - all PGs are alike :
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
159.0 0 0 0 0 0 0 0 0 creating 0.00 0'0 0:0 [] -1 [] -1 0'0 2015-09-30 14:41:01.219196 0'0 2015-09-30 14:41:01.219196
159.1 0 0 0 0 0 0 0 0 creating 0.00 0'0 0:0 [] -1 [] -1 0'0 2015-09-30 14:41:01.219197 0'0 2015-09-30 14:41:01.219197

I can't dump a PG (but if it's on no OSD then...) :
[root@ceph0 ~]# ceph pg 159.0 dump
^CError EINTR: problem getting command descriptions from pg.159.0
? Hangs.
The OSD tree is like this :
-1  21.71997 root default
-2   3.62000     host ceph4
 9   1.81000         osd.9  up 1.0 1.0
15   1.81000         osd.15 up 1.0 1.0
-3   3.62000     host ceph0
 5   1.81000         osd.5  up 1.0 1.0
11   1.81000         osd.11 up 1.0 1.0
-4   3.62000     host ceph1
 6   1.81000         osd.6  up 1.0 1.0
12   1.81000         osd.12 up 1.0 1.0
-5   3.62000     host ceph2
 7   1.81000         osd.7  up 1.0 1.0
13   1.81000         osd.13 up 1.0 1.0
-6   3.62000     host ceph3
 8   1.81000         osd.8  up 1.0 1.0
14   1.81000         osd.14 up 1.0 1.0
-13  3.62000     host ceph5
10   1.81000         osd.10 up 1.0 1.0
16   1.81000         osd.16 up 1.0 1.0

Then, I dumped the crush ruleset and noticed the "max_size=5".
[root@ceph0 ~]# ceph osd pool get testec crush_ruleset
crush_ruleset: 1
[root@ceph0 ~]# ceph osd crush rule dump testec
{ "rule_id": 1,
  "rule_name": "testec",
  "ruleset": 1,
  "type": 3,
  "min_size": 3,
  "max_size": 5,

I thought I should not care, since I'm not creating a replicated pool, but... I then deleted the pool + deleted the "testec" ruleset, re-created the pool and... boom, PGs started being created !? Now, the ruleset looks like this :
[root@ceph0 ~]# ceph osd crush rule dump testec
{ "rule_id": 1,
  "rule_name": "testec",
  "ruleset": 1,
  "type": 3,
  "min_size": 3,
  "max_size": 6,
               ^^^

Is this a bug, or a "feature" (if so, I'd be glad if someone could shed some light on it) ? I'm presuming ceph is considering that an EC chunk is a replica, but I'm failing to understand the documentation : I did not select the crush ruleset when I created the pool. Still, the ruleset was chosen by default (by CRUSH?), and was not working... ? Thanks && regards
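A way to fix this without deleting the pool would have been to raise max_size on the rule by hand, using the standard crushtool round-trip. A minimal sketch -- the arithmetic is the point: an EC rule must allow as many "replicas" as the profile has chunks, i.e. max_size >= k+m:

```shell
# An EC rule needs max_size >= k + m (each chunk counts as one "replica").
K=4; M=2
NEED=$(( K + M ))
echo "rule max_size must be >= $NEED"   # here: 6, while the reused rule said 5

# Round-trip the crushmap to edit the rule (run on a mon/admin node):
if command -v crushtool >/dev/null 2>&1 && command -v ceph >/dev/null 2>&1; then
  ceph osd getcrushmap -o crush.bin
  crushtool -d crush.bin -o crush.txt
  # edit 'max_size 5' -> 'max_size 6' in the testec rule inside crush.txt, then:
  crushtool -c crush.txt -o crush.new
  ceph osd setcrushmap -i crush.new
fi
```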
Re: [ceph-users] Important security noticed regarding release signing key
-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Wido den Hollander
Sent: Monday, 21 September 2015 15:50
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Important security noticed regarding release signing key

On 21-09-15 15:05, SCHAER Frederic wrote:
> Hi,
>
> Forgive the question if the answer is obvious... It's been more than "an hour or so" and eu.ceph.com apparently still hasn't been re-signed, or at least what I checked wasn't :
>
> # rpm -qp --qf '%{RSAHEADER:pgpsig}' http://eu.ceph.com/rpm-hammer/el7/x86_64/ceph-0.94.3-0.el7.centos.x86_64.rpm
> RSA/SHA1, Wed 26 Aug 2015 09:57:17 PM CEST, Key ID 7ebfdd5d17ed316d
>
> Should this repository/mirror be discarded and should we (in EU) switch to download.ceph.com ?

I fixed eu.ceph.com by putting a Varnish HTTP cache in between which now links to ceph.com You can still use eu.ceph.com and should be able to do so. eu.ceph.com caches all traffic so that should be much snappier than downloading everything from download.ceph.com directly. Wido

[>- FS : -<] Many thanks for your quick reply and quick reaction ! Frederic
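To compare the key a downloaded package was signed with against the keys the local rpm database trusts, something like the following works. The short key ID rpm uses for the imported `gpg-pubkey` packages is the last 8 hex digits of the full ID shown above (`17ed316d` out of `7ebfdd5d17ed316d`):

```shell
# The "Key ID" rpm -qp reports is 16 hex digits; the gpg-pubkey package
# rpm creates on key import only carries the last 8.
FULL_KEYID=7ebfdd5d17ed316d     # from the rpm -qp output above
SHORT_KEYID=${FULL_KEYID#????????}
echo "$SHORT_KEYID"

# Compare against the keys the local rpm db trusts:
if command -v rpm >/dev/null 2>&1; then
  rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep -i "$SHORT_KEYID" \
    || echo "key $SHORT_KEYID not imported locally"
fi
```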
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
From: Jake Young [mailto:jak3...@gmail.com]
Sent: Wednesday, 29 July 2015 17:13
To: SCHAER Frederic <frederic.sch...@cea.fr>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic <frederic.sch...@cea.fr> wrote:

Hi again, So I have tried:
- changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
- changing the memory configuration, from advanced ecc mode to performance mode, boosting the memory bandwidth from 35GB/s to 40GB/s
- plugged a second 10GB/s link and setup a ceph internal network
- tried various tuned-adm profiles such as throughput-performance

This changed about nothing. If:
- the CPUs are not maxed out, and lowering the frequency doesn't change a thing
- the network is not maxed out
- the memory doesn't seem to have an impact
- network interrupts are spread across all 8 cpu cores and receive queues are OK
- disks are not used at their maximum potential (iostat shows my dd commands produce much more tps than the 4MB ceph transfers...)

Where can I possibly find a bottleneck ? I'm /(almost) out of ideas/ ... :'( Regards

Frederic, I was trying to optimize my ceph cluster as well and I looked at all of the same things you described, which didn't help my performance noticeably. The following network kernel tuning settings did help me significantly. This is my /etc/sysctl.conf file on all of my hosts: ceph mons, ceph osds and any client that connects to my ceph cluster.

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
#net.core.rmem_max = 56623104
#net.core.wmem_max = 56623104
# Use 128M buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
# Increase the length of the processor input queue
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 3
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0
# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0
# Recommended when jumbo frames are enabled
net.ipv4.tcp_mtu_probing = 1

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else. Let me know if that helps. Jake

[- FS : -] Hi, Thanks for suggesting these :] I finally got some time to try your kernel parameters… but that doesn't seem to help, at least for the EC pools. I'll need to re-add all the disk OSDs to be really sure, especially with the replicated pools – I'd like to see if at least the replicated pools are better, so that I can use them as frontend pools… Regards
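The buffer maxima in the quoted comment come from a bandwidth-delay product estimate. A rough sketch of the arithmetic (the ~43 ms RTT is an assumption chosen to illustrate the ~54M figure; on a low-latency LAN the buffers needed are far smaller):

```shell
# Bandwidth-delay product: bytes in flight = link rate * round-trip time.
GBITS=10        # link speed, Gbit/s
RTT_MS=43       # assumed RTT in ms (illustrative, not from the thread)
BDP_BYTES=$(( GBITS * 1000 * 1000 * 1000 / 8 * RTT_MS / 1000 ))
echo "$BDP_BYTES"   # ~54 MB, the order of magnitude of the 56623104 max above
```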
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi again, So I have tried:
- changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
- changing the memory configuration, from advanced ecc mode to performance mode, boosting the memory bandwidth from 35GB/s to 40GB/s
- plugged a second 10GB/s link and setup a ceph internal network
- tried various tuned-adm profiles such as throughput-performance

This changed about nothing. If:
- the CPUs are not maxed out, and lowering the frequency doesn't change a thing
- the network is not maxed out
- the memory doesn't seem to have an impact
- network interrupts are spread across all 8 cpu cores and receive queues are OK
- disks are not used at their maximum potential (iostat shows my dd commands produce much more tps than the 4MB ceph transfers...)

Where can I possibly find a bottleneck ? I'm /(almost) out of ideas/ ... :'( Regards

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of SCHAER Frederic
Sent: Friday, 24 July 2015 16:04
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

Hi, Thanks. I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - I can reach 100% cpu % for IRQs, but that's shared across all 8 physical cores. I also discovered turbostat, which showed me the R510s were not configured for performance in the bios (but dbpm - demand based power management), and were not bumping the CPUs frequency to 2.4GHz as they should... only apparently remaining at 1.6Ghz... But changing that did not improve things unfortunately. I now have CPUs using their xeon turbo frequency, but no throughput improvement. Looking at RPS/RSS, it looks like our Broadcom cards are configured correctly according to redhat, i.e : one receive queue per physical core, spreading the IRQ load everywhere. One thing I noticed though is that the dell BIOS allows to change IRQs...
but once you change the network card IRQ, it also changes the RAID card IRQ as well as many others, all sharing the same bios IRQ (that's therefore apparently a useless option). Weird. Still attempting to determine the bottleneck ;) Regards Frederic

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, 23 July 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

Your note that dd can do 2GB/s without networking makes me think that you should explore that. As you say, network interrupts can be problematic in some systems. The only thing I can think of that's been really bad in the past is that some systems process all network interrupts on cpu 0, and you probably want to make sure that it's splitting them across CPUs.

An IRQ overload would be very visible with atop. Splitting the IRQs will help, but it is likely to need some smarts. As in, irqbalance may spread things across NUMA nodes. A card with just one IRQ line will need RPS (Receive Packet Steering), irqbalance can't help it. For example, I have a compute node with such a single line card and Quad Opterons (64 cores, 8 NUMA nodes). The default is all interrupt handling on CPU0 and that is very little, except for eth2. So this gets a special treatment:
---
echo 4 > /proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default
---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2 cores, otherwise with this architecture just using 4 and 5 (same L2 cache) would be better.
Regards, Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com    Global OnLine Japan/Fusion Communications
http://www.gol.com/
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi, Thanks. I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - I can reach 100% cpu % for IRQs, but that's shared across all 8 physical cores. I also discovered turbostat, which showed me the R510s were not configured for performance in the bios (but dbpm - demand based power management), and were not bumping the CPUs frequency to 2.4GHz as they should... only apparently remaining at 1.6Ghz... But changing that did not improve things unfortunately. I now have CPUs using their xeon turbo frequency, but no throughput improvement. Looking at RPS/RSS, it looks like our Broadcom cards are configured correctly according to redhat, i.e : one receive queue per physical core, spreading the IRQ load everywhere. One thing I noticed though is that the dell BIOS allows to change IRQs... but once you change the network card IRQ, it also changes the RAID card IRQ as well as many others, all sharing the same bios IRQ (that's therefore apparently a useless option). Weird. Still attempting to determine the bottleneck ;) Regards Frederic

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, 23 July 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

Your note that dd can do 2GB/s without networking makes me think that you should explore that. As you say, network interrupts can be problematic in some systems. The only thing I can think of that's been really bad in the past is that some systems process all network interrupts on cpu 0, and you probably want to make sure that it's splitting them across CPUs.

An IRQ overload would be very visible with atop. Splitting the IRQs will help, but it is likely to need some smarts. As in, irqbalance may spread things across NUMA nodes. A card with just one IRQ line will need RPS (Receive Packet Steering), irqbalance can't help it.
For example, I have a compute node with such a single line card and Quad Opterons (64 cores, 8 NUMA nodes). The default is all interrupt handling on CPU0 and that is very little, except for eth2. So this gets a special treatment:
---
echo 4 > /proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default
---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2 cores, otherwise with this architecture just using 4 and 5 (same L2 cache) would be better.

Regards, Christian
--
Christian Balzer    Network/Systems Engineer
ch...@gol.com    Global OnLine Japan/Fusion Communications
http://www.gol.com/
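To make Christian's two values less magic: `smp_affinity_list` takes a plain CPU list, while `rps_cpus` takes a hex bitmask with one bit per CPU. A small sketch of where `f0` comes from (the IRQ number and interface name are machine-specific examples from the quoted mail):

```shell
# Build the rps_cpus hex mask for CPUs 4-7: set one bit per CPU.
MASK=0
for c in 4 5 6 7; do
  MASK=$(( MASK | (1 << c) ))
done
printf '%x\n' "$MASK"    # -> f0

# Then, as root (IRQ 106 / eth2 are examples, find yours in /proc/interrupts):
#   echo 4  > /proc/irq/106/smp_affinity_list
#   echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
```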
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi, Well I think the journaling would still appear in the dstat output, as that's still IOs : even if the user-side bandwidth indeed is cut in half, that should not be the case of disk IOs. For instance I just tried a replicated pool for the test, and got around 1300MiB/s in dstat for about 600MiB/s in the rados bench - I take it that indeed, with replication/size=2, there is a total of 2 replicas, so that's 1 user IO for 2 * [1 replica + 1 journal] / number of hosts = 600*2*2/2 = 1200MiB/s of IOs per host (+/- the approximations)... Using the dd flag oflag=sync indeed lowers the dstat values down to 1100-1300MiB/s. Still above what ceph uses with EC pools. I have tried to identify/watch interrupt issues (using the watch command), but I have to say I have failed until now. The Broadcom card is indeed spreading the load on the cpus:

# egrep 'CPU|p2p' /proc/interrupts
    CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
80: 881646372 1508 30 97328 0 10459270 2715 8753 0 12765 5100 9148 9420 0 PCI-MSI-edge p2p1
82: 179710 165107 94684 334842 210219 47403 270330 166877 3516 229043 709844660 16512 5088 2456312 12302 PCI-MSI-edge p2p1-fp-0
83: 12454 14073 5571 15196 5282 22301 11522 21299 4092581302069 1303 79810 705953243 1836 15190 883683 PCI-MSI-edge p2p1-fp-1
84: 6463 13994 57006 16200 16778 374815 558398 11902 695554360 94228 1252 18649 825684 7555 731875 190402 PCI-MSI-edge p2p1-fp-2
85: 163228 259899 143625 121326 107509 798435 168027 144088 75321 89962 55297 715175665 784356 53961 92153 92959 PCI-MSI-edge p2p1-fp-3
86: 233267453226792070827220797122540051748938 39492831684674 65008514098872704778 140711 160954 5910372981286 672487805 PCI-MSI-edge p2p1-fp-4
87: 33772 233318 136341 58163 506773 183451 18269706 52425 226509 22150 17026 176203 5942 681346619 270341 87435 PCI-MSI-edge p2p1-fp-5
88: 65103573 105514146 51193688 51330824 41771147 61202946 41053735 49301547 181380 73028922 39525 172439 155778 108065 154750931 26348797
PCI-MSI-edge p2p1-fp-6
89: 59287698 120778879 43446789 47063897 39634087 39463210 46582805 48786230 342778 82670325 135397 438041 318995 3642955 179107495 833932 PCI-MSI-edge p2p1-fp-7
90: 1804 4453 2434 19885 11527 9771 12724 2392840 12721439 1166 3354 560 69386 9233 PCI-MSI-edge p2p2
92: 6455149433007258203245273513 115645711838476 22200494039978 977482 15351931 9494511685983 772531 271810175312351954224 PCI-MSI-edge p2p2-fp-0

I don't know yet how to check if there are memory bandwidth/latency/whatever issues... Regards
Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi Gregory, Thanks for your replies. Let's take the 2 hosts config setup (3 MON + 3 idle MDS on same hosts). 2 dell R510 servers, CentOS 7.0.1406, dual xeon 5620 (8 cores+hyperthreading),16GB RAM, 2 or 1x10gbits/s Ethernet (same results with and without private 10gbits network), PERC H700 + 12 2TB SAS disks, and PERC H800 + 11 2TB SAS disks (one unused SSD...) The EC pool is defined with k=4, m=1 I set the failure domain to OSD for the test The OSDs are set up with XFS and a 10GB journal 1st partition (the single doomed-dell SSD was a bottleneck for 23 disks…) All disks are presently configured with a single-RAID0 because H700/H800 do not support JBOD. I have 5 clients (CentOS 7.1), 10gbits/s ethernet, all running this command : rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 --run-name bench_`hostname -s` --no-cleanup I'm aggregating the average bandwidth at the end of the tests. I'm monitoring the Ceph servers stats live with this dstat command: dstat -N p2p1,p2p2,total The network MTU is 9000 on all nodes. With this, the average client throughput is around 130MiB/s, i.e 650 MiB/s for the whole 2-nodes ceph cluster / 5 clients. I since have tried removing (ceph osd out/ceph osd crush reweight 0) either the H700 or the H800 disks, thus only using 11 or 12 disks per server, and I either get 550 MiB/s or 590MiB/s of aggregated clients bandwidth. Not much less considering I removed half disks ! I'm therefore starting to think I am CPU /memory bandwidth limited... ? 
That's not however what I am tempted to conclude (for the cpu at least) when I see the dstat output, as it says the cpus still sit idle or IO-waiting:

total-cpu-usage -dsk/total- --net/p2p1- --net/p2p2- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send: recv send: recv send| in out | int csw
 1  1 97  0 0 0| 586k 1870k|   0    0 :   0    0 :   0    0 | 49B 455B|816715k
29 17 24 27 0 3| 128k  734M| 367M 870k:   0    0 : 367M 870k|  0    0 | 61k 61k
30 17 34 16 0 3| 432k  750M| 229M 567k: 199M 168M: 427M 168M|  0    0 | 65k 68k
25 14 38 20 0 3|  16k  634M| 232M 654k: 162M 133M: 393M 134M|  0    0 | 56k 64k
19 10 46 23 0 2| 232k  463M| 244M 670k: 184M 138M: 428M 139M|  0    0 | 45k 55k
15  8 46 29 0 1| 368k  422M| 213M 623k: 149M 110M: 362M 111M|  0    0 | 35k 41k
25 17 37 19 0 3|  48k  584M| 139M 394k: 137M  90M: 276M  91M|  0    0 | 54k 53k

Could it be the interrupts or system context switches that cause this relatively poor performance per node ? PCI-E interactions with the PERC cards ? I know I can get way more disk throughput with dd (command below):

total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
 1  1 97 0 0 0| 595k 2059k|   0    0 | 634B 2886B|797115k
 1 93  0 3 0 3|   0  1722M|  49k  78k|   0    0 | 40k 47k
 1 93  0 3 0 3|   0  1836M|  40k  69k|   0    0 | 45k 57k
 1 95  0 2 0 2|   0  1805M|  40k  69k|   0    0 | 38k 34k
 1 94  0 3 0 2|   0  1864M|  37k  38k|   0    0 | 35k 24k
(…)

The dd command:
# use at your own risk #
FS_THR=64 ; FILE_MB=8 ; N_FS=`mount|grep ceph|wc -l` ; time (for i in `mount|grep ceph|awk '{print $3}'` ; do echo "writing $FS_THR times (threads) $[ 4 * FILE_MB ] mb on $i..." ; for j in `seq 1 $FS_THR` ; do dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] & done ; done ; wait) ; echo "wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads" ; rm -f /var/lib/ceph/osd/*/test.zero*

Hope I gave you more insight into what I'm trying to achieve, and where I'm failing ?
Regards

-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Wednesday, 22 July 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

We might also be able to help you improve or better understand your results if you can tell us exactly what tests you're conducting that are giving you these numbers. -Greg

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL <fmont...@flox-arts.net> wrote: Hi Frederic, When you have a Ceph cluster with 1 node you don't experience the network and communication overhead due to the distributed model. With 2 nodes and EC 4+1 you will have communication between 2 nodes but you will keep internal communication (2 chunks on first node and 3 chunks on second node). On your configuration the EC pool is setup with 4+1, so you will have for each write an overhead due to write spreading on 5 nodes (for 1 customer IO, you will experience 5 Ceph IO due to EC 4+1). It's the reason for that I think you're
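As an aside, the per-client figures from the five rados bench runs described above can be summed from the final "Bandwidth (MB/sec)" line each run prints. A trivial aggregation sketch (the sample figures are illustrative, chosen to add up to the ~650 MiB/s total mentioned in these mails):

```shell
# Sum the final "Bandwidth (MB/sec)" lines collected from each client.
cat > /tmp/bench-results.txt <<'EOF'
Bandwidth (MB/sec): 130.1
Bandwidth (MB/sec): 129.4
Bandwidth (MB/sec): 131.0
Bandwidth (MB/sec): 128.7
Bandwidth (MB/sec): 130.8
EOF
awk -F': ' '/Bandwidth/ { sum += $2 } END { printf "%.0f\n", sum }' /tmp/bench-results.txt
# -> 650
```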
[ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??
Hi, As I explained in various previous threads, I'm having a hard time getting the most out of my test ceph cluster. I'm benching things with rados bench. All Ceph hosts are on the same 10GB switch.

Basically, I know I can get about 1GB/s of disk write performance per host, when I bench things with dd (hundreds of dd threads) + iperf 10gbit inbound + iperf 10gbit outbound. I also can get 2GB/s or even more if I don't bench the network at the same time, so yes, there is a bottleneck between disks and network, but I can't identify which one, and it's not relevant for what follows anyway (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about this strange bottleneck though...). My hosts are each connected through a single 10Gbits/s link for now.

My problem is the following. Please note I see the same kind of poor performance with replicated pools... When testing EC pools, I ended up putting a 4+1 pool on a single node in order to track down the ceph bottleneck. On that node, I can get approximately 420MB/s write performance using rados bench, but that's fair enough since the dstat output shows that real data throughput on disks is about 800+MB/s (that's the ceph journal effect, I presume). I tested Ceph on my other standalone nodes : I can also get around 420MB/s, since they're identical. I'm testing things with 5 10Gbits/s clients, each running rados bench.

But what I really don't get is the following :
- With 1 host : throughput is 420MB/s
- With 2 hosts : I get 640MB/s. That's surely not 2x420MB/s.
- With 5 hosts : I get around 1375MB/s. That's far from the expected 2GB/s.

The network never is maxed out, nor are the disks or CPUs. The hosts' throughput I see with rados bench seems to match the dstat throughput. That's as if each additional host was only capable of adding 220MB/s of throughput. Compare this to the 1GB/s they are capable of (420MB/s with journals)... I'm therefore wondering what could possibly be so wrong with my setup ??
Why would it impact so much the performance to add hosts ? On the hardware side, I have Broadcam BCM57711 10-Gigabit PCIe cards. I know, not perfect, but not THAT bad neither... ? Any hint would be greatly appreciated ! Thanks Frederic Schaer ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
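The scaling numbers quoted above can be put side by side with the ideal linear scaling; this is just a sketch using the throughput figures measured in this thread (420MB/s per standalone host is the assumed per-host baseline, nothing here is authoritative):

```shell
#!/bin/sh
# Scaling efficiency of the rados bench figures quoted above,
# relative to ideal linear scaling at 420 MB/s per host.
single=420
# hosts observed_MBps
printf '%s\n' "1 420" "2 640" "5 1375" | while read -r hosts mbps; do
  ideal=$((hosts * single))
  # efficiency = observed / ideal, as a percentage
  eff=$(awk -v o="$mbps" -v i="$ideal" 'BEGIN { printf "%.0f", 100 * o / i }')
  echo "$hosts hosts: $mbps MB/s observed, $ideal MB/s ideal, ${eff}% efficiency"
done
```

This puts the 5-host case at roughly 65% of linear scaling, which matches the "each extra host only adds ~220MB/s" observation.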
Re: [ceph-users] read performance VS network usage
Hi Nick,

Thanks for your explanation. I have some doubts this is what's happening, but I'm first going to check what happens with disk IO on a clean pool with clean bench data (discarding any existing cache...). I'm using the following commands for creating the bench data (and benching writes) on all 5 clients:

rados -k ceph.client.admin.keyring -p testec bench 60 write -b 4194304 -t 16 --run-name bench_`hostname -s` --no-cleanup

Replace write with seq for the read bench. As you can see, I do specify the -b option, even though I'm wondering if it affects the read bench at all; the help seems unclear to me:

-b op_size set the size of write ops for put or benchmarking

Still, even if it didn't work and rados bench reads were issuing 4kB reads, how could this explain that all 5 servers receive 800MiB/s (and not megabits...) each, and that they only send on average what each client receives? Where would the extra ~400MiB (not bits) come from? If the OSDs were reconstructing data using the other hosts' data before sending it to the client, the OSD hosts would send much more data to their neighbour OSDs on the network than my average client throughput - and not roughly the same amount - wouldn't they? I took a look at the network interfaces, hoping this traffic would come from localhost, but it did not: it came in from the physical network interface... Still trying to understand ;)

Regards

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday, 23 April 2015 17:21
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: RE: read performance VS network usage

Hi Frederic,

If you are using EC pools, the primary OSD requests the remaining shards of the object from the other OSDs, reassembles it and then sends the data to the client. The entire object needs to be reconstructed even for a small IO operation, so 4kB reads could lead to quite a large IO amplification if you are using the default 4MB object sizes. I believe this is what you are seeing, although creating an RBD with smaller object sizes can help reduce this.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER Frederic
Sent: 23 April 2015 15:40
To: ceph-users@lists.ceph.com
Subject: [ceph-users] read performance VS network usage

Hi again,

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs drives). For these tests, I've set up a RAID0 on the 23 disks. For now, I'm not using SSDs, as I discovered my vendor apparently decreased their perfs on purpose... So: 5 server nodes, of which 3 are MONs too. I also have 5 clients. All of them have a single 10G NIC; I'm not using a private network. I'm testing EC pools with the failure domain set to host, k/m set to k=4/m=1, on the giant release (ceph-0.87.1-0.el7.centos.x86_64).

And... I just found out I have limited read performance. While I was watching the stats using dstat on one server node, I noticed that during the rados (read) bench, all the server nodes sent about 370MiB/s on the network, which is the average speed I get per server, but they also all received about 750-800MiB/s on that same network. And 800MB/s is about as much as you can get on a 10G link... I'm trying to understand why I see this inbound data flow:
- Why does a server node receive data at all during a read bench?
- Why is it about twice as much as the data the node is sending?
- Is this about verifying data integrity at read time?

I'm alone on the cluster, it's not used anywhere else. I will try tomorrow to see if adding a 2nd 10G port (with a private network this time) improves the performance, but I'm really curious to understand what the bottleneck is and what ceph is doing...

Looking at the write performance, I see the same kind of behaviour: nodes send about half the amount of data they receive (600MB/300MB), but this might be because this time the client only sends the real data and the erasure coding happens behind the scenes (or not?)

Any idea?
Regards
Frederic
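Nick's explanation above can be quantified. For a k=4/m=1 pool with the default 4MB objects, a single small client read forces the primary to reconstruct the whole object from its shards; a sketch of the arithmetic (figures assumed from the thread, not measured):

```shell
#!/bin/sh
# Rough IO amplification for a small read on an EC pool, following the
# explanation above: the whole object is reconstructed for every read.
# Assumed figures: 4MB default object size, k=4 data shards (k=4/m=1).
object_kb=4096       # 4MB RADOS object, in kB
k=4                  # data shards
read_kb=4            # a single 4kB client read

shard_kb=$((object_kb / k))            # size of one shard: 1024kB
remote_kb=$(( (k - 1) * shard_kb ))    # shards the primary fetches over the network
echo "shard size: ${shard_kb}kB"
echo "network into the primary per ${read_kb}kB read: ${remote_kb}kB"
echo "read amplification: $((object_kb / read_kb))x"
```

This also suggests why the server nodes receive data during a read bench: per object served to a client, roughly (k-1)/k of the object arrives at the primary from its neighbour OSDs.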
Re: [ceph-users] read performance VS network usage
OK, I must learn how to read dstat... I mistook the recv column for the send column...

total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 15  22  43  16   0   4| 343M 7916k| 252M  659M|   0     0 | 78k  122k
 15  18  45  18   0   4| 368M 4500k| 271M  592M|   0     0 | 82k  138k
(...)

I also notice that I see less network throughput with an MTU=9000. So... conclusion: the nodes indeed receive part of the data and send it back to the client (even with 4MB reads, if the bench takes the option). My last surprise is with the clients:

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   1  97   0   0   0| 718B  116k|   0     0 |   0     0 |1947  3148
 12  14  72   0   0   1|   0    28k| 764M 1910k|   0     0 | 25k   27k
 11  13  75   0   0   1|   0  4096B| 758M 1860k|   0     0 | 25k   27k
 13  14  71   0   0   1|   0  4096B| 785M 1815k|   0     0 | 25k   24k
 12  14  73   0   0   1|   0     0 | 839M 1960k|   0     0 | 25k   25k
 12  14  72   0   0   2|   0   548k| 782M 1873k|   0     0 | 24k   25k
 11  14  73   0   0   1|   0    44k| 782M 1924k|   0     0 | 25k   26k

They are also receiving much more data than what rados bench reports (around 275MB/s each)... would that be some sort of data amplification??

Regards

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER Frederic
Sent: Friday, 24 April 2015 10:03
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] read performance VS network usage

(quoted message trimmed; the full text appears earlier in this thread)
Re: [ceph-users] read performance VS network usage
And to reply to myself... The client's apparent network bandwidth is just dstat aggregating the bridge network interface and the physical interface, thus doubling the data... Ah ah ah.

Regards

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER Frederic
Sent: Friday, 24 April 2015 10:26
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] read performance VS network usage

(quoted message trimmed; the full text appears earlier in this thread)
[ceph-users] read performance VS network usage
Hi again,

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs drives). For these tests, I've set up a RAID0 on the 23 disks. For now, I'm not using SSDs, as I discovered my vendor apparently decreased their perfs on purpose... So: 5 server nodes, of which 3 are MONs too. I also have 5 clients. All of them have a single 10G NIC; I'm not using a private network. I'm testing EC pools, with the failure domain set to host. The EC pool k/m is set to k=4/m=1. I'm testing EC pools using the giant release (ceph-0.87.1-0.el7.centos.x86_64).

And... I just found out I have limited read performance. While I was watching the stats using dstat on one server node, I noticed that during the rados (read) bench, all the server nodes sent about 370MiB/s on the network, which is the average speed I get per server, but they also all received about 750-800MiB/s on that same network. And 800MB/s is about as much as you can get on a 10G link... I'm trying to understand why I see this inbound data flow:
- Why does a server node receive data at all during a read bench?
- Why is it about twice as much as the data the node is sending?
- Is this about verifying data integrity at read time?

I'm alone on the cluster, it's not used anywhere else. I will try tomorrow to see if adding a 2nd 10G port (with a private network this time) improves the performance, but I'm really curious to understand what the bottleneck is and what ceph is doing...

Looking at the write performance, I see the same kind of behaviour: nodes send about half the amount of data they receive (600MB/300MB), but this might be because this time the client only sends the real data and the erasure coding happens behind the scenes (or not?)

Any idea?
Regards
Frederic
[ceph-users] ceph-crush-location + SSD detection ?
Hi,

I've seen and read a few things about ceph-crush-location and I think that's what I need. What I need (want to try) is: a way to have SSDs in non-dedicated hosts, but also to put those SSDs in a dedicated ceph root. From what I read, using ceph-crush-location I could register a hostname with an SSD suffix in case the tool is called against an SSD... The thing is: I must make sure this really is an SSD, and this is where coding and experimenting come in. Hence, I'd like to know if someone already has a working implementation that detects whether the OSD is an SSD and, if so, appends a string to the hostname? I'm for instance wondering when this tool is called, whether the OSD is already mounted at that point (or should have been...), and what happens at boot...

I know I can get the OSD mountpoint using something like this on a running OSD:

# ceph --format xml --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_data | sed -e 's|</osd_data>.*||;s|.*<osd_data>||'
/var/lib/ceph/osd/ceph-0

I know I can find out if this is a disk or an SSD using for instance this:

[root@ceph0 ~]# cat /sys/block/sdy/queue/rotational
0
[root@ceph0 ~]# cat /sys/block/sda/queue/rotational
1

So I just have to associate the mountpoint with the device... provided the OSD is mounted when the tool is called. Anyone willing to share experience with ceph-crush-location?

Thanks
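The two ingredients above (mountpoint-to-device resolution, plus the sysfs rotational flag) combine naturally into a location hook. This is a hypothetical sketch only — the `-ssd` hostname suffix, the `ssd` root name, and the default OSD path are my assumptions, not an established convention:

```shell
#!/bin/sh
# Hypothetical crush location hook sketch: print a different host/root
# for SSD-backed OSDs. Assumes the OSD data dir is already mounted;
# the backing device is resolved from the mountpoint, then checked via
# /sys/block/<disk>/queue/rotational.
osd_data=${1:-/var/lib/ceph/osd/ceph-0}   # assumed path, adjust as needed

host=$(hostname -s)
# Device backing the mountpoint, e.g. /dev/sdy1
dev=$(df -P "$osd_data" 2>/dev/null | awk 'NR==2 { print $1 }')
# Strip /dev/ prefix and trailing partition number: /dev/sdy1 -> sdy
disk=$(basename "$dev" | sed 's/[0-9]*$//')
rot=$(cat "/sys/block/$disk/queue/rotational" 2>/dev/null)

if [ "$rot" = "0" ]; then
    echo "host=${host}-ssd root=ssd"
else
    echo "host=${host} root=default"
fi
```

Note the caveat from the post: this only works if the OSD filesystem is mounted when the hook runs, which is exactly the open question above.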
Re: [ceph-users] ceph-crush-location + SSD detection ?
-----Original Message-----
(...)
So I just have to associate the mountpoint with the device... provided the OSD is mounted when the tool is called. Anyone willing to share experience with ceph-crush-location?

Something like this? https://gist.github.com/wido/5d26d88366e28e25e23d

I've used that a couple of times.

Wido

[- FS : -] Exactly that... and you confirm at the same time that it works... Many thanks :] [- FS : -]
Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?
Hi,

Many thanks for the explanations. I hadn't used the nodcache option when mounting cephfs; it actually got there by default. My mount command is/was:

# mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret

I don't know what causes this option to be the default; maybe it's the kernel module I compiled from git (because there is no kmod-ceph or kmod-rbd in any RHEL-like distribution except RHEV). I'll try to update/check...

Concerning the rados pool ls, indeed: I created empty files in the pool, and they were not showing up, probably because they were just empty - when I create a non-empty file, I do see things in rados ls...

Thanks again
Frederic

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John Spray
Sent: Tuesday, 3 March 2015 17:15
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?

On 03/03/2015 15:21, SCHAER Frederic wrote:

> By the way: it looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm):
> [root@ceph0 ~]# ceph fs ls
> name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]
> (umount /mnt ...)
> [root@ceph0 ~]# ceph fs ls
> name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

This is probably #10288, which was fixed in 0.87.1

> So, I have this pool named root that I added to the cephfs filesystem. I then edited the filesystem xattrs:
> [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/root
> ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root
> I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool... but that is not the case.
> On another machine where I mounted cephfs using the client.puppet key, I can do this (the mount was done with the client.puppet key, not the admin one, which is not deployed on that node):
> 1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache)
> [root@dev7248 ~]# echo "not allowed" > /mnt/root/secret.notfailed
> [root@dev7248 ~]#
> [root@dev7248 ~]# cat /mnt/root/secret.notfailed
> not allowed

This is data you're seeing from the page cache; it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set.

> And I can even see the xattrs inherited from the parent dir:
> [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/root/secret.notfailed
> ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root
> Whereas on the node where I mounted cephfs as ceph admin, I get nothing:
> [root@ceph0 ~]# cat /mnt/root/secret.notfailed
> [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
> -rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed
> After some time, the file also gets empty on the puppet client host:
> [root@dev7248 ~]# cat /mnt/root/secret.notfailed
> [root@dev7248 ~]#
> (but the metadata remained ?)

Right -- eventually the cache goes away, and you see the true (empty) state of the file.

> Also, as an unprivileged user, I can get ownership of a secret file by changing the extended attribute:
> [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed
> [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
> getfattr: Removing leading '/' from absolute path names
> # file: mnt/root/secret.notfailed
> ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet

Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could be changed even if it had data, but that was about safety rather than permissions.

> Final question for those that read down here: it appears that before creating the cephfs filesystem, I used the puppet pool to store a test rbd instance. And it appears I cannot get the list of cephfs objects in that pool, whereas I can get those that are in the newly created root pool:
> [root@ceph0 ~]# rados -p puppet ls
> test.rbd
> rbd_directory
> [root@ceph0 ~]# rados -p root ls
> 10a.
> 10b.
> Bug, or feature ?

I didn't see anything in your earlier steps that would have led to any objects
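Following John's suggestion to test with IO that bypasses the page cache: a small sketch of how one might do that. The target path is assumed to be a cephfs mount as in the thread; `conv=fsync` makes dd report an error if the data never reaches the OSDs, instead of "succeeding" into the local cache (`oflag=direct` goes further but requires aligned IO):

```shell
#!/bin/sh
# Sketch: write a test file while forcing the data out of the page
# cache, so a cap/permission failure on the data pool shows up as an
# immediate dd error rather than a write that only appears to succeed.
# The default target is a local temp file just to make the invocation
# demonstrable; on a real test you'd point it at the cephfs mount
# (e.g. /mnt/root/secret.notfailed from the thread).
target=${1:-$(mktemp)}

if dd if=/dev/zero of="$target" bs=4k count=1 conv=fsync 2>/dev/null; then
    echo "write was durably stored: $(stat -c %s "$target") bytes"
else
    echo "write failed - caps likely forbid writing to this pool"
fi
```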
[ceph-users] cephfs filesystem layouts : authentication gotchas ?
Hi,

I am attempting to test the cephfs filesystem layouts. I created a user with rights to write only in one pool:

client.puppet
key: zzz
caps: [mon] allow r
caps: [osd] allow rwx pool=puppet

I also created another pool, in which I would assume this user is allowed to do nothing once I successfully configure things. By the way: it looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm):

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]
(umount /mnt ...)
[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

So, I have this pool named root that I added to the cephfs filesystem. I then edited the filesystem xattrs:

[root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
getfattr: Removing leading '/' from absolute path names
# file: mnt/root
ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root

I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool... but that is not the case.

On another machine where I mounted cephfs using the client.puppet key, I can do this (the mount was done with the client.puppet key, not the admin one, which is not deployed on that node):

1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache)
[root@dev7248 ~]# echo "not allowed" > /mnt/root/secret.notfailed
[root@dev7248 ~]#
[root@dev7248 ~]# cat /mnt/root/secret.notfailed
not allowed

And I can even see the xattrs inherited from the parent dir:

[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root

Whereas on the node where I mounted cephfs as ceph admin, I get nothing:

[root@ceph0 ~]# cat /mnt/root/secret.notfailed
[root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
-rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed

After some time, the file also gets empty on the puppet client host:

[root@dev7248 ~]# cat /mnt/root/secret.notfailed
[root@dev7248 ~]#

(but the metadata remained?)

Also, as an unprivileged user, I can get ownership of a secret file by changing the extended attribute:

[root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed
[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet

But fortunately, I haven't succeeded yet (?) in reading that file... My question therefore is: what am I doing wrong?

Final question for those that read down here: it appears that before creating the cephfs filesystem, I used the puppet pool to store a test rbd instance. And it appears I cannot get the list of cephfs objects in that pool, whereas I can get those that are in the newly created root pool:

[root@ceph0 ~]# rados -p puppet ls
test.rbd
rbd_directory
[root@ceph0 ~]# rados -p root ls
10a.
10b.

Bug, or feature?

Thanks, regards

P.S: ceph release:
[root@dev7248 ~]# rpm -qa '*ceph*'
kmod-libceph-3.10.0-0.1.20150130gitee04310.el7.centos.x86_64
libcephfs1-0.87-0.el7.centos.x86_64
ceph-common-0.87-0.el7.centos.x86_64
ceph-0.87-0.el7.centos.x86_64
kmod-ceph-3.10.0-0.1.20150130gitee04310.el7.centos.x86_64
ceph-fuse-0.87.1-0.el7.centos.x86_64
python-ceph-0.87-0.el7.centos.x86_64
[ceph-users] XFS recovery on boot : rogue mounts ?
Hi,

I rebooted a failed server, which is now showing a rogue filesystem mount. Actually, there were also several disks missing in the node, all reported as prepared by ceph-disk, but not activated.

[root@ceph2 ~]# grep /var/lib/ceph/tmp /etc/mtab
/dev/sdo1 /var/lib/ceph/tmp/mnt.usVRe8 xfs rw,noatime,attr2,inode64,noquota 0 0

This path does not exist, and after having to run ceph-disk activate-all, I can now see the OSD under its correct path (and the missing ones got mounted too):

[root@ceph2 ~]# grep sdo1 /etc/mtab
/dev/sdo1 /var/lib/ceph/tmp/mnt.usVRe8 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdo1 /var/lib/ceph/osd/ceph-53 xfs rw,noatime,attr2,inode64,noquota 0 0
[root@ceph2 ~]# ll /var/lib/ceph/tmp/mnt.usVRe8
ls: cannot access /var/lib/ceph/tmp/mnt.usVRe8: No such file or directory

I just looked at the logs, and it appears that this sdo disk performed an XFS recovery at boot:

Mar 2 11:33:45 ceph2 kernel: [ 21.479747] XFS (sdo1): Mounting Filesystem
Mar 2 11:33:45 ceph2 kernel: [ 21.641263] XFS (sdo1): Starting recovery (logdev: internal)
Mar 2 11:33:45 ceph2 kernel: [ 21.674451] XFS (sdo1): Ending recovery (logdev: internal)

I do not see any "Ending clean mount" line for this disk. If I check the syslogs, I can see OSDs are usually mounted twice, but not always, and sometimes they even aren't mounted at all:

[root@ceph2 ~]# zegrep 'XFS.*Ending clean' /var/log/messages.1.gz | sed -e 's/.*XFS/XFS/' | sort | uniq -c
2 XFS (sdb1): Ending clean mount
2 XFS (sdd1): Ending clean mount
1 XFS (sde1): Ending clean mount
2 XFS (sdg1): Ending clean mount
1 XFS (sdh1): Ending clean mount
1 XFS (sdi1): Ending clean mount
2 XFS (sdj1): Ending clean mount
3 XFS (sdk1): Ending clean mount
3 XFS (sdl1): Ending clean mount
4 XFS (sdm1): Ending clean mount

So: would there be an issue with disks that perform an XFS recovery at boot? I know that a reboot will clean things up, but rebooting isn't the cleanest thing to do...

Thanks
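The rogue /var/lib/ceph/tmp entry above (a mount-table line whose mountpoint no longer exists) can be detected without a reboot. A sketch, assuming the mtab format shown above; the lazy-umount suggestion is my own, not from the thread:

```shell
#!/bin/sh
# Sketch: list mount-table entries under /var/lib/ceph/tmp whose
# mountpoint directory no longer exists (the "rogue" mounts described
# above), and print a lazy-umount command that would detach them.
table=${1:-/etc/mtab}

awk '$2 ~ "^/var/lib/ceph/tmp/" { print $2 }' "$table" | while read -r mnt; do
    if [ ! -d "$mnt" ]; then
        echo "stale: $mnt  (umount -l '$mnt' would detach it)"
    fi
done
```

Run without arguments it scans the live mount table; it can also be pointed at a saved copy of mtab for inspection.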
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
Hi, Back on this. I finally found out a logic in the mapping. So after taking the time to note all the disks serial numbers on 3 different machines and 2 different OSes, I now know that my specific LSI SAS 2008 cards (no reference on them, but I think those are LSI sas 9207-8i) map the disks of the MD1000 in the reverse alphabetic order : sd{b..p} map to slot{14..0} There is absolutely nothing else that appears usable, except the sas_address of the disks which seems associated with slots. But even this one is different depending on machines, and the address - slot mapping does not seem very obvious at the very least... Good thing is that I now know that fun tools exist in packages such as sg3_tils, smp_utils and others like mpt-status... Next step is to try an md1200 ;) Thanks again Cheers -Message d'origine- De : JF Le Fillâtre [mailto:jean-francois.lefilla...@uni.lu] Envoyé : mercredi 19 novembre 2014 13:42 À : SCHAER Frederic Cc : ceph-users@lists.ceph.com Objet : Re: [ceph-users] jbod + SMART : how to identify failing disks ? Hello again, So whatever magic allows the Dell MD1200 to report the slot position for each disk isn't present in your JBODs. Time for something else. There are two sides to your problem: 1) Identifying which disk is where in your JBOD Quite easy. Again I'd go for a udev rule + script that will either rename the disks entirely, or create a symlink with a name like jbodX-slotY or something to figure out easily which is which. The mapping end-device-to-slot can be static in the script, so you need to identify once the order in which the kernel scans the slots and then you can map. But it won't survive a disk swap or a change of scanning order from a kernel upgrade, so it's not enough. 2) Finding a way of identification independent of hot-plugs and scan order That's the tricky part. If you remove a disk from your JBOD and replace it with another one, the other one will get another sdX name, and in my experience even another end_device-... 
name. But given that you want the new disk to have the exact same name or symlink as the previous one, you have to find something in the path of the device or (better) in the udev attributes that is immutable. If possible at all, it will depend on your specific hardware combination, so you will have to try for yourself. Suggested methodology: 1) write down the serial number of one drive in any slot, and figure out its device name (sdX) with smartctl -i /dev/sd... 2) grab the detailed /sys path name and list of udev attributes: readlink -f /sys/class/block/sdX udevadm info --attribute-walk /dev/sdX 3) pull that disk and replace it. Check the logs to see which is its new device name (sdY) 4) rerun the commands from #2 with sdY 5) compare the outputs and find something in the path or in the attributes that didn't change and is unique to that disk (ie not a common parent for example). If you have something that really didn't change, you're in luck. Either use the serial numbers or unplug and replug all disks one by one to figure out the mapping slot number / immutable item. Then write the udev rule. :) Thanks! JF On 19/11/14 11:29, SCHAER Frederic wrote: Hi Thanks. 
I hoped it would be it, but no ;) With this mapping :
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdb -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:0/end_device-1:1:0/target1:0:1/1:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdc -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:1/end_device-1:1:1/target1:0:2/1:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdd -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:2/end_device-1:1:2/target1:0:3/1:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sde -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:3/end_device-1:1:3/target1:0:4/1:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdf -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:4/end_device-1:1:4/target1:0:5/1:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdg -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:5/end_device-1:1:5/target1:0:6/1:0:6:0/block/sdg
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdh -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:6/end_device-1:1:6/target1:0:7/1:0:7:0/block/sdh
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdi -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1
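For what it's worth, the reverse-alphabetic rule above (sd{b..p} -> slot{14..0}) can be computed rather than memorized. The sketch below encodes only that assumption, valid for this specific LSI SAS 2008 / MD1000 combination; it happens to match the slots reported elsewhere in this thread (sdd on 12, sdk on 5, sdg on 9), but it must be re-verified on any other controller or enclosure:

```shell
# Sketch: compute the MD1000 slot from the device name, assuming the
# reverse alphabetic mapping observed above (sd{b..p} -> slot{14..0}).
# This holds only for this LSI SAS 2008 / MD1000 combination.
slot_for_dev() {
    dev=$1                                        # e.g. "sdb"
    letter=$(printf '%s' "$dev" | tail -c 1)      # last letter of the name
    # offset of the letter within b..p: b=0, c=1, ... p=14
    idx=$(( $(printf '%d' "'$letter") - $(printf '%d' "'b") ))
    echo $(( 14 - idx ))                          # sdb -> 14 ... sdp -> 0
}
```

For example, `slot_for_dev sdg` prints 9, matching the slot observed for sdg on that enclosure.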
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
Hi, Thanks. I hoped it would be it, but no ;) With this mapping :
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdb -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:0/end_device-1:1:0/target1:0:1/1:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdc -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:1/end_device-1:1:1/target1:0:2/1:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdd -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:2/end_device-1:1:2/target1:0:3/1:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sde -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:3/end_device-1:1:3/target1:0:4/1:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdf -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:4/end_device-1:1:4/target1:0:5/1:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdg -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:5/end_device-1:1:5/target1:0:6/1:0:6:0/block/sdg
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdh -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:6/end_device-1:1:6/target1:0:7/1:0:7:0/block/sdh
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdi -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:7/end_device-1:1:7/target1:0:8/1:0:8:0/block/sdi
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdj -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:0/end_device-1:2:0/target1:0:9/1:0:9:0/block/sdj
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdk -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:1/end_device-1:2:1/target1:0:10/1:0:10:0/block/sdk
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdl -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:2/end_device-1:2:2/target1:0:11/1:0:11:0/block/sdl
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdm -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:3/end_device-1:2:3/target1:0:12/1:0:12:0/block/sdm
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdn -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:4/end_device-1:2:4/target1:0:13/1:0:13:0/block/sdn
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdo -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:5/end_device-1:2:5/target1:0:14/1:0:14:0/block/sdo
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdp -> ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:6/end_device-1:2:6/target1:0:15/1:0:15:0/block/sdp
sdd was on physical slot 12, sdk was on slot 5, and sdg was on slot 9 (and I did not check the others)... so clearly this cannot be put in production as is, and I'll have to find a way. Regards

-----Original Message-----
From: Carl-Johan Schenström [mailto:carl-johan.schenst...@gu.se]
Sent: Monday, 17 November 2014 14:14
To: SCHAER Frederic; Scottix; Erik Logtenberg
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] jbod + SMART : how to identify failing disks ?

Hi! I'm fairly sure that the link targets in /sys/class/block were correct the last time I had to change a drive on a system with a Dell HBA connected to an MD1000, but perhaps I was just lucky. =/ I.e., # ls -l /sys/class/block/sdj lrwxrwxrwx.
1 root root 0 17 nov 13.54 /sys/class/block/sdj -> ../../devices/pci:20/:20:0a.0/:21:00.0/host7/port-7:0/expander-7:0/port-7:0:1/expander-7:2/port-7:2:6/end_device-7:2:6/target7:0:7/7:0:7:0/block/sdj would be first port on HBA, first expander, 7th slot (6, starting from 0). Don't take my word for it, though! -- Carl-Johan Schenström Driftansvarig / System Administrator Språkbanken Svensk nationell datatjänst / The Swedish Language Bank Swedish National Data Service Göteborgs universitet / University of Gothenburg carl-johan.schenst...@gu.se / +46 709 116769 From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of SCHAER Frederic frederic.sch...@cea.fr Sent: Friday, November 14, 2014 17:24 To: Scottix; Erik Logtenberg Cc: ceph-users@lists.ceph.com Subject: Re: [ceph
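The port-host:expander:phy decoding described in this thread can be pulled out of such a sysfs symlink target with standard tools. A sketch only: whether the last phy number actually corresponds to a physical slot is hardware dependent, and it did not hold on the MD1000 discussed above, so treat the result as a hint:

```shell
# Sketch: extract the phy number from the last port-<host>:<expander>:<phy>
# component of a /sys/class/block symlink target.  Hardware dependent --
# the phy == slot assumption failed on the MD1000 in this thread.
slot_from_path() {
    printf '%s\n' "$1" |
        grep -o 'port-[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*' |
        tail -n 1 | cut -d: -f3
}
# slot_from_path "$(readlink -f /sys/class/block/sdj)"
```

On the path quoted above (ending in port-7:2:6/end_device-7:2:6/...), this should print 6.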
[ceph-users] rogue mount in /var/lib/ceph/tmp/mnt.eml1yz ?
Hi, I rebooted a node (I'm doing some tests, and breaking many things ;) ), and I see I have :
[root@ceph0 ~]# mount|grep sdp1
/dev/sdp1 on /var/lib/ceph/tmp/mnt.eml1yz type xfs (rw,noatime,attr2,inode64,noquota)
/dev/sdp1 on /var/lib/ceph/osd/ceph-55 type xfs (rw,noatime,attr2,inode64,noquota)
[root@ceph0 ~]# ls -l /var/lib/ceph/tmp/mnt.eml1yz
ls: cannot access /var/lib/ceph/tmp/mnt.eml1yz: No such file or directory
In /var/lib/ceph/tmp, all I see is :
[root@ceph0 ~]# ll /var/lib/ceph/tmp/
total 0
-rw-r--r-- 1 root root 0 Nov 19 11:34 ceph-disk.activate.lock
-rw-r--r-- 1 root root 0 Oct 31 18:19 ceph-disk.prepare.lock
I think (but I'm not sure) that I already faced this before the reboot with another device - but since the device naming seems completely inconsistent on my systems (see another thread of mine), I can't say for sure it's not the same OSD that's buggy - in that case, I could try to just zap it and recreate it. Any idea what goes wrong in this case, or where I should look? (nothing useful in /var/log/ceph/ceph-osd.55.log) Ceph version is giant : ceph-0.87-0.el7.centos.x86_64 Thanks, regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
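Leftover ceph-disk temporary mounts like the mnt.eml1yz above can at least be enumerated from the mount table before deciding what to do. This is only a sketch; the lazy unmount shown in the comment is my assumption rather than an official cleanup procedure, so first check that the OSD using the same device is mounted and healthy at its normal location:

```shell
# Sketch: list leftover ceph-disk temporary mount points (mnt.XXXXXX
# under /var/lib/ceph/tmp) from mount(8) output, so they can be
# inspected and, if really stale, unmounted by hand.
find_tmp_mounts() {
    # expects `mount` output on stdin; the mount point is field 3
    awk '$3 ~ /^\/var\/lib\/ceph\/tmp\/mnt\./ { print $3 }'
}
# mount | find_tmp_mounts
# for m in $(mount | find_tmp_mounts); do umount -l "$m"; done  # careful!
```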
Re: [ceph-users] jbod + SMART : how to identify failing disks ?
Wow. Thanks. Not very operations-friendly though... Wouldn't it be just OK to pull the disk that we think is the bad one, check the serial number, and if it's not, just replug it and let the udev rules do their job and re-insert the disk into the ceph cluster ? (provided XFS doesn't freeze for good when we do that) Regards

-----Original Message-----
From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: Monday, 17 November 2014 22:32
To: SCHAER Frederic
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] jbod + SMART : how to identify failing disks ?

I use `dd` to force activity to the disk I want to replace, and watch the activity lights. That only works if your disks aren't 100% busy. If they are, stop the ceph-osd daemon, and see which drive stops having activity. Repeat until you're 100% confident that you're pulling the right drive.

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic frederic.sch...@cea.fr wrote: Hi, I'm used to RAID software giving me the failing disks' slots, and most often blinking the disks on the disk bays. I recently installed a DELL "6GB HBA SAS" JBOD card, said to be an LSI 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.). Since this is an LSI, I thought I'd use MegaCli to identify the disk slots, but MegaCli does not see the HBA card. Then I found the LSI "sas2ircu" utility, but again, this one fails at giving me the disk slots (it finds the disks, serials and so on, but the slot is always 0). Because of this, I'm going to head over to the disk bay and unplug the disk which I think corresponds to the alphabetical order in linux, and see if it's the correct one... But even if this is correct this time, it might not be next time. But this makes me wonder : how do you guys, Ceph users, manage your disks if you really have JBOD servers ?
I can't imagine having to guess slots like that each time, nor can I imagine creating serial-number stickers for every single disk I could have to manage... Is there any specific advice regarding JBOD cards people should (not) use in their systems ? Any magical way to "blink" a drive in linux ? Thanks, regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
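The `dd` trick described above can be wrapped in a tiny helper. It only reads, so it should be safe to run against a live disk; the device argument below is a placeholder for whichever disk you suspect:

```shell
# Sketch of the dd trick: generate read-only traffic on one suspect
# device and watch which activity LED lights up in the bay.  The second
# argument limits how much is read, in MiB blocks.
blink_by_reads() {
    dd if="$1" of=/dev/null bs=1M count="${2:-2048}" 2>/dev/null
}
# blink_by_reads /dev/sdh      # watch the enclosure while this runs
```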
[ceph-users] jbod + SMART : how to identify failing disks ?
Hi, I'm used to RAID software giving me the failing disks' slots, and most often blinking the disks on the disk bays. I recently installed a DELL 6GB HBA SAS JBOD card, said to be an LSI 2008 one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.). Since this is an LSI, I thought I'd use MegaCli to identify the disk slots, but MegaCli does not see the HBA card. Then I found the LSI sas2ircu utility, but again, this one fails at giving me the disk slots (it finds the disks, serials and so on, but the slot is always 0). Because of this, I'm going to head over to the disk bay and unplug the disk which I think corresponds to the alphabetical order in linux, and see if it's the correct one. But even if this is correct this time, it might not be next time. This makes me wonder : how do you guys, Ceph users, manage your disks if you really have JBOD servers ? I can't imagine having to guess slots like that each time, nor can I imagine creating serial-number stickers for every single disk I could have to manage... Is there any specific advice regarding JBOD cards people should (not) use in their systems ? Any magical way to blink a drive in linux ? Thanks, regards
Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
Hi Loic, Back on this issue... Using the epel package, I still get prepared-only disks, e.g. :
/dev/sdc :
/dev/sdc1 ceph data, prepared, cluster ceph, journal /dev/sdc2
/dev/sdc2 ceph journal, for /dev/sdc1
Looking at the udev output, I can see that there is no ACTION=add with ID_PART_ENTRY_TYPE=4fbd7e29-9d25-41b8-afd0-062c0ceff05d, only a change action. This was on a previously prepared disk, which I zapped. When I run partx -u /dev/sdc, then and only then does the kernel see the new partitions, and it also sees that the old ones disappeared - see the udev log excerpt attached. In this log are only the events that udev saw right when I ran 'partx -u' : nothing before, and nothing after. So it still looks like it's not partx -a that should be used on this system when running ceph-disk prepare, but partx -u... ? And of course, after the partx -u is run, the disk is activated. This is what I have :
[root@ceph1 ~]# rpm -qf /usr/sbin/partx
util-linux-2.23.2-16.el7.x86_64
[root@ceph1 ~]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)
[root@ceph1 ~]# rpm -qi ceph
Name: ceph Epoch : 1 Version : 0.80.5 Release : 8.el7 Architecture: x86_64 Install Date: Tue 28 Oct 2014 12:28:41 PM CET Group : System Environment/Base Size: 39154515 License : GPL-2.0 Signature : RSA/SHA256, Sat 23 Aug 2014 08:02:08 PM CEST, Key ID 6a2faea2352c64e5 Source RPM : ceph-0.80.5-8.el7.src.rpm Build Date : Fri 22 Aug 2014 02:36:05 AM CEST Build Host : buildhw-08.phx2.fedoraproject.org Regards
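To spot disks stuck in this "prepared" (never activated) state from a script, the `ceph-disk list` output can be parsed. This is only a sketch; the parsing is an assumption based purely on the output format quoted in this thread:

```shell
# Sketch: extract data partitions reported as "prepared" (not activated)
# from `ceph-disk list` output.  The format match is an assumption based
# on the listings quoted in this thread.
prepared_parts() {
    # expects `ceph-disk list` output on stdin; prints e.g. /dev/sdc1
    awk '/ceph data, prepared/ { print $1 }'
}
# ceph-disk list | prepared_parts
```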
Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]

The journal check failure: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected 244973de-7472-421c-bb25-4b09d3f8d441 and the udev logs DEBUG:ceph-disk:Journal /dev/sdc2 has OSD UUID 00000000-0000-0000-0000-000000000000 mean that /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdc2 fails to read the OSD UUID from /dev/sdc2, which means something went wrong when preparing the journal. It would be great if you could send the command you used to prepare the disk and its output (verbose if possible). I think you can reproduce the problem by zapping the disk with ceph-disk zap /dev/sdc and running partx -u if the corresponding entries in /dev/disk/by-partuuid have not been removed. That would also help me fix zap in the context of https://github.com/ceph/ceph/pull/2648 ... or have confirmation that it does not need fixing because it updates correctly on RHEL ;-) Cheers -- Loïc Dachary, Artisan Logiciel Libre

[- FS : -] Hi Loic, At first, some notes :
- I noticed I have to wait at least 1 sec before I run partx -u on a prepared disk, otherwise even with that, disks won't get properly handled by udev. Maybe some caching somewhere ?
- the '-u' option does not seem to exist under RHEL6... so maybe the RHEL6 behaviour was just to include the kernel update in the -a option, and not anymore ?
- this is CentOS 7, i.e. RHEL-like, but not pure RHEL7, even if very close.
I Zapped the disk (it seems to work as expected) : [root@ceph1 ~]# ll /dev/disk/by-partuuid/ total 0 lrwxrwxrwx 1 root root 10 Oct 9 15:57 668f92f5-df46-4052-92ba-e8b8f7efd2d9 - ../../sdb1 lrwxrwxrwx 1 root root 10 Oct 9 15:57 feb09ba1-30a2-44a8-a338-fef39ae6626a - ../../sdb2 [root@ceph1 ~] partx -u : [root@ceph1 ~]# partx -u /dev/sdc partx: specified range 1:0 does not make sense Right after that, I re-prepared the disk : [root@ceph1 ~]# parted -s /dev/sdc mklabel gpt [root@ceph1 ~]# ceph-disk -v prepare /dev/sdc INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_type INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_type INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size INFO:ceph-disk:Will colocate journal with data on /dev/sdc DEBUG:ceph-disk:Creating journal partition num 2 size 5120 on /dev/sdc INFO:ceph-disk:Running command: /usr/sbin/sgdisk --new=2:0:5120M --change-name=2:ceph journal --partition-guid=2:46bf261f-7ec3-485e-98c9-3c185de5efb8 --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdc Information: Moved requested sector from 34 to 2048 in order to align on 2048-sector boundaries. The operation has completed successfully. 
INFO:ceph-disk:calling partx on prepared device /dev/sdc INFO:ceph-disk:re-reading known partitions will display errors INFO:ceph-disk:Running command: /usr/sbin/partx -a /dev/sdc partx: /dev/sdc: error adding partition 2 INFO:ceph-disk:Running command: /usr/bin/udevadm settle DEBUG:ceph-disk:Journal is GPT partition /dev/disk/by-partuuid/46bf261f-7ec3-485e-98c9-3c185de5efb8 DEBUG:ceph-disk:Creating osd partition on /dev/sdc INFO:ceph-disk:Running command: /usr/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:436ac41b-8800-466e-98f5-098aa2c64ba9 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sdc Information: Moved requested sector from 10485761 to 10487808 in order to align on 2048-sector boundaries. The operation has completed successfully. INFO:ceph-disk:Running command: /usr/sbin/partprobe /dev/sdc INFO:ceph-disk:Running command: /usr/bin/udevadm settle DEBUG:ceph-disk:Creating xfs fs on /dev/sdc1 INFO:ceph-disk:Running command: /usr/sbin/mkfs -t xfs -f -i size=2048 -- /dev/sdc1 meta-data=/dev/sdc1 isize=2048 agcount=4, agsize=60686271 blks = sectsz=512 attr=2, projid32bit=1 = crc=0 data = bsize=4096 blocks=242745083, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=0 log =internal log bsize=4096 blocks=118527, version=2 = sectsz=512 sunit=0 blks, lazy-count=1 realtime =none extsz=4096 blocks=0,
Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
Hi Loic, Patched, and still not working (sorry)... I'm attaching the prepare output, and also a real udev debug output I captured using udevadm monitor --environment (udev.log file). I added a sync command in ceph-disk-udev (this did not change a thing), and I noticed that the udev script is called 3 times when adding one disk, and that the debug output was captured and then mixed all into one file. This may lead to log misinterpretation (race conditions ?)... I changed the logging a bit in order to get one file per call, and attached those logs to this mail. File timestamps are as follows :
File: '/var/log/udev_ceph.log.out.22706' Change: 2014-10-10 15:48:09.136386306 +0200
File: '/var/log/udev_ceph.log.out.22749' Change: 2014-10-10 15:48:11.502425395 +0200
File: '/var/log/udev_ceph.log.out.22750' Change: 2014-10-10 15:48:11.606427113 +0200
Actually, I can reproduce the UUID=0 thing with this command :
[root@ceph1 ~]# /usr/sbin/ceph-disk -v activate-journal /dev/sdc2
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdc2
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
DEBUG:ceph-disk:Journal /dev/sdc2 has OSD UUID 00000000-0000-0000-0000-000000000000
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000
error: /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000: No such file or directory
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000: Command '/sbin/blkid' returned non-zero exit status 2
Ah - to answer previous mails : - I tried to manually create the gpt partition table to see if things would improve, but this was not the case (I also tried to zero out the start and end of disks, and also to add random data) - running ceph-disk prepare twice does not work, it's just that once every 20 (?)
times it surprisingly does not fail on this hardware/os combination ;) Regards

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Friday, 10 October 2014 14:37
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

Hi Frederic, To be 100% sure it would be great if you could manually patch your local ceph-disk script and change 'partprobe', into 'partx', '-a', in https://github.com/ceph/ceph/blob/v0.80.6/src/ceph-disk#L1284 then ceph-disk zap, ceph-disk prepare, and hopefully it will show up as it should. It works for me on centos7 but ... Cheers

On 10/10/2014 14:33, Loic Dachary wrote: Hi Frederic, It looks like this is just because https://github.com/ceph/ceph/blob/v0.80.6/src/ceph-disk#L1284 should call partx instead of partprobe. The udev debug output makes this quite clear http://tracker.ceph.com/issues/9721 I think https://github.com/dachary/ceph/commit/8d914001420e5bfc1e12df2d4882bfe2e1719a5c#diff-788c3cea6213c27f5fdb22f8337096d5R1285 fixes it Cheers

On 09/10/2014 16:29, SCHAER Frederic wrote:
-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Thursday, 9 October 2014 16:20
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

On 09/10/2014 16:04, SCHAER Frederic wrote: Hi Loic, Back on sdb, as the sde output was from another machine on which I ran partx -u afterwards. To reply to your last question first: I think the SG_IO error comes from the fact that the disks are exported as single-disk RAID0 on a PERC 6/E, which does not support JBOD - this is decommissioned hardware on which I'd like to test and validate that we can use ceph for our use case... So back to the UUID. It's funny: I retried and ceph-disk prepare worked this time. I tried on another disk, and it failed. There is a difference in the output from ceph-disk: on the failing disk, I have these extra lines after the disks are prepared : (...)
realtime =none extsz=4096 blocks=0, rtextents=0 Warning: The kernel is still using the old partition table. The new table will be used at the next reboot. The operation has completed successfully. partx: /dev/sdc: error adding partitions 1-2 I didn't have the warning about the old partition tables on the disk that worked. So on this new disk, I have : [root@ceph1 ~]# mount /dev/sdc1 /mnt [root@ceph1 ~]# ll /mnt/ total 16 -rw-r--r-- 1 root root 37 Oct 9 15:58 ceph_fsid -rw-r--r-- 1 root root 37 Oct 9 15:58 fsid lrwxrwxrwx 1 root root 58 Oct 9 15:58 journal - /dev/disk/by-partuuid/5e50bb8b-0b99-455f-af71-10815a32bfbc -rw-r--r-- 1 root root 37 Oct 9 15:58 journal_uuid -rw-r--r-- 1 root root 21 Oct 9 15:58 magic [root@ceph1 ~]# cat /mnt/journal_uuid 5e50bb8b-0b99-455f-af71-10815a32bfbc [root@ceph1
[ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
Hi, I am setting up a test ceph cluster on decommissioned hardware (hence: not optimal, I know). I have installed CentOS 7, installed and set up ceph mons and OSD machines using puppet, and now I'm trying to add OSDs with the servers' OSD disks... and I have issues (of course ;) ) I used the Ceph RHEL7 RPMs (ceph-0.80.6-0.el7.x86_64). When I run ceph-disk prepare for a disk, I most of the time (but not always) get the partitions created, but not activated :
[root@ceph4 ~]# ceph-disk list|grep sdh
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sdh :
/dev/sdh1 ceph data, prepared, cluster ceph, journal /dev/sdh2
/dev/sdh2 ceph journal, for /dev/sdh1
I tried to debug the udev rules thinking they were not launched to activate the OSD, but they are, and they fail on this error :
+ ln -sf ../../sdh2 /dev/disk/by-partuuid/5b3bde8f-ccad-4093-a8a5-ad6413ae8931
+ mkdir -p /dev/disk/by-parttypeuuid
+ ln -sf ../../sdh2 /dev/disk/by-parttypeuuid/45b0969e-9b03-4f30-b4c6-b4b80ceff106.5b3bde8f-ccad-4093-a8a5-ad6413ae8931
+ case $ID_PART_ENTRY_TYPE in
+ /usr/sbin/ceph-disk -v activate-journal /dev/sdh2
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdh2
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
DEBUG:ceph-disk:Journal /dev/sdh2 has OSD UUID 00000000-0000-0000-0000-000000000000
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000
error: /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000: No such file or directory
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000: Command '/sbin/blkid' returned non-zero exit status 2
+ exit
+ exec
You'll notice the zeroed UUID...
Because of this, I looked at the output of ceph-disk prepare, and saw that partx complains at the end (this is the partx -a command) :
Warning: The kernel is still using the old partition table. The new table will be used at the next reboot.
The operation has completed successfully.
partx: /dev/sdh: error adding partitions 1-2
And indeed, running partx -a /dev/sdh does not change anything. But I just discovered that running partx -u /dev/sdh fixes everything, i.e. : right after I send this update command to the kernel, my debug logs show that the udev rule does everything fine and the OSD starts up. I'm therefore wondering what I did wrong: is it CentOS 7 that is misbehaving, or the kernel, or... ? Any reason why partx -a is used instead of partx -u ? I'd be glad to hear others' advice on this! Thanks, regards Frederic Schaer
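The manual workaround described in this thread (partx -u where ceph-disk runs partx -a) can be wrapped as follows. A sketch only, not a fix for ceph-disk itself: the RUN variable gives a dry-run mode, and the 1-second pause reflects the delay noted elsewhere in the thread before partx -u becomes effective:

```shell
# Sketch of the manual workaround: force the kernel to pick up the new
# partition table with `partx -u` (instead of the failing `partx -a`),
# then wait for udev to finish.  Set RUN=echo for a dry run.
reread_partitions() {
    disk=$1                  # whole disk, e.g. /dev/sdh
    ${RUN:-} sleep 1         # 1 s settle delay noted elsewhere in the thread
    ${RUN:-} partx -u "$disk"
    ${RUN:-} udevadm settle
}
```

`RUN=echo reread_partitions /dev/sdh` prints the commands without running them.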
Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
Hi Loic, With this example disk/machine that I left untouched until now : /dev/sdb : /dev/sdb1 ceph data, prepared, cluster ceph, osd.44, journal /dev/sdb2 /dev/sdb2 ceph journal, for /dev/sdb1 [root@ceph1 ~]# ll /dev/disk/by-partuuid/ total 0 lrwxrwxrwx 1 root root 10 Oct 9 15:09 2c27dbda-fbe3-48d6-80fe-b513e1c11702 - ../../sdb1 lrwxrwxrwx 1 root root 10 Oct 9 15:09 d2352e3b-f7f2-40c7-8273-8bfa8ab4206a - ../../sdb2 This is the blkid output : [root@ceph1 ~]# blkid /dev/sdb2 [root@ceph1 ~]# blkid /dev/sdb1 /dev/sdb1: UUID=c8feaaad-bd83-41a3-a82a-0a8727d0b067 TYPE=xfs PARTLABEL=ceph data PARTUUID=2c27dbda-fbe3-48d6-80fe-b513e1c11702 If I run partx -u /dev/sdb, then the filesystem will get activated and the OSD started. And sometimes, it just works without intervention, but that's the exception. I modified the udev script this morning, so I can give you the output of what happens when things go wrong : links are created, but somewhere the UUIDD is wrongly detected by ceph-osd, as far as I understand : Thu Oct 9 11:15:13 CEST 2014 + PARTNO=2 + NAME=sde2 + PARENT_NAME=sde ++ /usr/sbin/sgdisk --info=2 /dev/sde ++ grep 'Partition GUID code' ++ awk '{print $4}' ++ tr '[:upper:]' '[:lower:]' + ID_PART_ENTRY_TYPE=45b0969e-9b03-4f30-b4c6-b4b80ceff106 + '[' -z 45b0969e-9b03-4f30-b4c6-b4b80ceff106 ']' ++ /usr/sbin/sgdisk --info=2 /dev/sde ++ grep 'Partition unique GUID' ++ awk '{print $4}' ++ tr '[:upper:]' '[:lower:]' + ID_PART_ENTRY_UUID=a9e8d490-82a7-48c1-8ef1-aff92351c69c + mkdir -p /dev/disk/by-partuuid + ln -sf ../../sde2 /dev/disk/by-partuuid/a9e8d490-82a7-48c1-8ef1-aff92351c69c + mkdir -p /dev/disk/by-parttypeuuid + ln -sf ../../sde2 /dev/disk/by-parttypeuuid/45b0969e-9b03-4f30-b4c6-b4b80ceff106.a9e8d490-82a7-48c1-8ef1-aff92351c69c + case $ID_PART_ENTRY_TYPE in + /usr/sbin/ceph-disk -v activate-journal /dev/sde2 INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sde2 SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 
00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
DEBUG:ceph-disk:Journal /dev/sde2 has OSD UUID 00000000-0000-0000-0000-000000000000
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000
error: /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000: No such file or directory
ceph-disk: Cannot discover filesystem type: device /dev/disk/by-partuuid/00000000-0000-0000-0000-000000000000: Command '/sbin/blkid' returned non-zero exit status 2
+ exit
+ exec
regards Frederic. P.S. : in your puppet module, it seems impossible to specify osd disks by path, i.e. : ceph::profile::params::osds: '/dev/disk/by-path/pci-\:0a\:00.0-scsi-0\:2\:': (I tried without the backslashes too)

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Thursday, 9 October 2014 15:01
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

Bonjour, I'm not familiar with RHEL7 but willing to learn ;-) I recently ran into confusing situations regarding the content of /dev/disk/by-partuuid because partprobe was not called when it should have been (ubuntu). On RHEL, kpartx is used instead because partprobe reboots, apparently. What is the content of /dev/disk/by-partuuid on your machine ? ls -l /dev/disk/by-partuuid Cheers

On 09/10/2014 12:24, SCHAER Frederic wrote: Hi, I am setting up a test ceph cluster on decommissioned hardware (hence: not optimal, I know). I have installed CentOS 7, installed and set up ceph mons and OSD machines using puppet, and now I'm trying to add OSDs with the servers' OSD disks.
and I have issues (of course ;) ) I used the Ceph RHEL7 RPMs (ceph-0.80.6-0.el7.x86_64) When I run ceph-disk prepare for a disk, I most of the time (but not always) get the partitions created, but not activated : [root@ceph4 ~]# ceph-disk list|grep sdh WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt /dev/sdh : /dev/sdh1 ceph data, prepared, cluster ceph, journal /dev/sdh2 /dev/sdh2 ceph journal, for /dev/sdh1 I tried to debug udev rules thinking they were not launched to activate the OSD, but they are, and they fail on this error : + ln -sf ../../sdh2 /dev/disk/by-partuuid/5b3bde8f-ccad-4093-a8a5-ad6413ae8931 + mkdir -p /dev/disk/by-parttypeuuid + ln -sf ../../sdh2 /dev/disk/by-parttypeuuid/45b0969e-9b03-4f30-b4c6-b4b80ceff106.5b3bde8f-ccad-4093-a8a5-ad6413ae8931 + case $ID_PART_ENTRY_TYPE in + /usr/sbin/ceph-disk -v activate-journal /dev/sdh2 INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdh2 SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
Hi Loic,

Back on sdb, as the sde output was from another machine on which I ran partx -u afterwards.

To reply to your last question first: I think the SG_IO error comes from the fact that the disks are exported as single-disk RAID0 volumes on a PERC 6/E, which does not support JBOD. This is decommissioned hardware on which I'd like to test and validate that we can use ceph for our use case...

So, back to the UUID. It's funny: I retried and ceph-disk prepare worked this time. I tried on another disk, and it failed. There is a difference in the ceph-disk output: on the failing disk, I have these extra lines after the disks are prepared:

(...)
realtime =none  extsz=4096  blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.
partx: /dev/sdc: error adding partitions 1-2

I didn't have the warning about the old partition table on the disk that worked.

So on this new disk, I have:

[root@ceph1 ~]# mount /dev/sdc1 /mnt
[root@ceph1 ~]# ll /mnt/
total 16
-rw-r--r-- 1 root root 37 Oct  9 15:58 ceph_fsid
-rw-r--r-- 1 root root 37 Oct  9 15:58 fsid
lrwxrwxrwx 1 root root 58 Oct  9 15:58 journal -> /dev/disk/by-partuuid/5e50bb8b-0b99-455f-af71-10815a32bfbc
-rw-r--r-- 1 root root 37 Oct  9 15:58 journal_uuid
-rw-r--r-- 1 root root 21 Oct  9 15:58 magic
[root@ceph1 ~]# cat /mnt/journal_uuid
5e50bb8b-0b99-455f-af71-10815a32bfbc
[root@ceph1 ~]# sgdisk --info=1 /dev/sdc
Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: 244973DE-7472-421C-BB25-4B09D3F8D441
First sector: 10487808 (at 5.0 GiB)
Last sector: 1952448478 (at 931.0 GiB)
Partition size: 1941960671 sectors (926.0 GiB)
Attribute flags:
Partition name: 'ceph data'
[root@ceph1 ~]# sgdisk --info=2 /dev/sdc
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 5E50BB8B-0B99-455F-AF71-10815A32BFBC
First sector: 2048 (at 1024.0 KiB)
Last sector: 10485760 (at 5.0 GiB)
Partition size: 10483713 sectors (5.0 GiB)
Attribute flags:
Partition name: 'ceph journal'

Puzzling, isn't it ?

-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Thursday, October 9, 2014 15:37
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

What do sgdisk --info=1 /dev/sde and sgdisk --info=2 /dev/sde print ? It looks like the journal points to an incorrect location (you should see this by mounting /dev/sde1). Here is what I have on a cluster:

root@bm0015:~# ls -l /var/lib/ceph/osd/ceph-1/
total 56
-rw-r--r--   1 root root  192 Nov  2  2013 activate.monmap
-rw-r--r--   1 root root    3 Nov  2  2013 active
-rw-r--r--   1 root root   37 Nov  2  2013 ceph_fsid
drwxr-xr-x 114 root root 8192 Sep 14 11:01 current
-rw-r--r--   1 root root   37 Nov  2  2013 fsid
lrwxrwxrwx   1 root root   58 Nov  2  2013 journal -> /dev/disk/by-partuuid/7e811295-1b45-477d-907a-41c4c90d9687
-rw-r--r--   1 root root   37 Nov  2  2013 journal_uuid
-rw-------   1 root root   56 Nov  2  2013 keyring
-rw-r--r--   1 root root   21 Nov  2  2013 magic
-rw-r--r--   1 root root    6 Nov  2  2013 ready
-rw-r--r--   1 root root    4 Nov  2  2013 store_version
-rw-r--r--   1 root root   42 Dec 27  2013 superblock
-rw-r--r--   1 root root    0 May  2 14:01 upstart
-rw-r--r--   1 root root    2 Nov  2  2013 whoami
root@bm0015:~# cat /var/lib/ceph/osd/ceph-1/journal_uuid
7e811295-1b45-477d-907a-41c4c90d9687
root@bm0015:~#

I guess in your case the content of journal_uuid is 00000000-... etc. for some reason. Do you know where that

SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

comes from ?
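[Editor's note] In the sgdisk output above, /mnt/journal_uuid holds the journal partition GUID in lowercase while sgdisk prints it in uppercase. A minimal sketch of how the consistency check could be automated; the helper names are made up for illustration, the device/mount paths are examples, and the parsing assumes the `Partition unique GUID:` line format shown above:

```shell
#!/bin/sh
# Compare two UUIDs case-insensitively: journal_uuid is stored
# lowercase on disk, sgdisk prints partition GUIDs uppercase.
uuids_equal() {
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ "$a" = "$b" ]
}

# Extract the value of the "Partition unique GUID:" line from
# `sgdisk --info` output read on stdin.
guid_from_sgdisk() {
    awk -F': ' '/Partition unique GUID/ {print $2}'
}

# Usage sketch (requires the actual disk, paths are illustrative):
#   mount /dev/sdc1 /mnt
#   guid=$(sgdisk --info=2 /dev/sdc | guid_from_sgdisk)
#   uuids_equal "$(cat /mnt/journal_uuid)" "$guid" && echo "journal GUID OK"
```

If the two values disagree (beyond case), the journal symlink in the data partition points at the wrong journal partition.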
On 09/10/2014 15:20, SCHAER Frederic wrote:

Hi Loic,

With this example disk/machine that I left untouched until now, /dev/sdb:

/dev/sdb1 ceph data, prepared, cluster ceph, osd.44, journal /dev/sdb2
/dev/sdb2 ceph journal, for /dev/sdb1

[root@ceph1 ~]# ll /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 Oct  9 15:09 2c27dbda-fbe3-48d6-80fe-b513e1c11702 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Oct  9 15:09 d2352e3b-f7f2-40c7-8273-8bfa8ab4206a -> ../../sdb2

This is the blkid output:

[root@ceph1 ~]# blkid /dev/sdb2
[root@ceph1 ~]# blkid /dev/sdb1
/dev/sdb1: UUID=c8feaaad-bd83-41a3-a82a-0a8727d0b067 TYPE=xfs PARTLABEL=ceph data PARTUUID=2c27dbda-fbe3-48d6-80fe-b513e1c11702

If I run partx -u /dev/sdb, then the filesystem will get activated and the OSD started. And sometimes it just works without intervention, but that's the exception. I modified the udev script this morning, so I can give you the output of what happens when things go wrong: links are created, but somewhere the UUID is wrongly detected by ceph-osd, as far as I understand:

Thu Oct 9 11
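[Editor's note] The manual workaround described above (the by-partuuid symlinks only appear after `partx -u`, after which the OSD activates) can be sketched as a loop: detect partitions whose by-partuuid link is missing and ask the kernel to re-read the partition table. The disk list, the use of `udevadm settle`, and the helper names are assumptions for illustration, not from the thread:

```shell
#!/bin/sh
# Sketch: after ceph-disk prepare, check whether each GPT partition
# has a /dev/disk/by-partuuid symlink; if one is missing, re-read the
# partition table so udev can create the links. Disk list is an example.

# Print partition numbers from the table rows of `sgdisk --print`.
partition_numbers() {
    sgdisk --print "$1" | awk '$1 ~ /^[0-9]+$/ {print $1}'
}

# Succeed (return 0) if at least one partition of $1 lacks its
# by-partuuid symlink.
links_missing() {
    dev=$1
    for n in $(partition_numbers "$dev"); do
        uuid=$(sgdisk --info="$n" "$dev" \
            | awk -F': ' '/Partition unique GUID/ {print tolower($2)}')
        [ -e "/dev/disk/by-partuuid/$uuid" ] || return 0
    done
    return 1
}

# Only attempt this on a machine that actually has sgdisk and disks.
if command -v sgdisk >/dev/null 2>&1; then
    for dev in /dev/sdb /dev/sdc; do
        if links_missing "$dev"; then
            partx -u "$dev"     # update the kernel's partition table in place
            udevadm settle      # wait for udev to finish creating symlinks
        fi
    done
fi
```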
Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000
-----Original Message-----
From: Loic Dachary [mailto:l...@dachary.org]
Sent: Thursday, October 9, 2014 16:20
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

On 09/10/2014 16:04, SCHAER Frederic wrote:
[quoted message identical to the one above]
[...]

Puzzling, isn't it ?

Yes :-) Just to be 100% sure: when you try to activate this /dev/sdc, it shows an error and complains that the journal uuid is 00000000-000* etc. ? If so, could you copy your udev debug output ?

Cheers

[- FS : -] No, when I manually activate the disk instead of attempting to go the udev way, it seems to work:

[root@ceph1 ~]# ceph-disk activate /dev/sdc1
got monmap epoch 1
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2014-10-09 16:21:43.286288 7f2be6a027c0 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected 244973de-7472-421c-bb25-4b09d3f8d441, invalid (someone else's?) journal
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2014-10-09 16:21:43.301957 7f2be6a027c0 -1 filestore(/var/lib/ceph/tmp/mnt.4lJlzP) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-10-09 16:21:43.305941 7f2be6a027c0 -1 created object store /var/lib/ceph/tmp/mnt.4lJlzP journal /var/lib/ceph/tmp/mnt.4lJlzP/journal for osd.47 fsid 70ac4a78-46c0-45e6-8ff9-878b37f50fa1
2014-10-09 16:21:43.305992 7f2be6a027c0 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.4lJlzP/keyring: can't open /var/lib/ceph/tmp/mnt.4lJlzP/keyring: (2) No such file or directory
2014-10-09 16:21:43.306099 7f2be6a027c0 -1 created new key in keyring /var/lib/ceph/tmp/mnt.4lJlzP/keyring
added key for osd.47
=== osd.47 ===
create-or-move updating item name 'osd.47' weight 0.9 at location {host=ceph1,root=default} to crush map
Starting Ceph osd.47 on ceph1...
Running as unit run-12392.service.

The osd then appeared in the osd tree... I attached the logs to this email (I just added a set -x in the script called by udev, and redirected the output).

Regards

Attachment: udev_ceph.log.out

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
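[Editor's note] The journal check failure in the activate output above (ondisk fsid not matching the expected partition GUID) is what ceph-osd reports when the journal symlink leads to an uninitialized or wrong journal. Before retrying activation, it is worth verifying that the symlink actually resolves to an existing device node. A minimal sketch, assuming a mounted OSD data directory (the paths and helper name are illustrative):

```shell
#!/bin/sh
# Sketch: verify an OSD data directory's journal symlink resolves to
# an existing file/device before retrying `ceph-disk activate`.
journal_link_ok() {
    # $1 = mounted OSD data dir, e.g. /mnt or /var/lib/ceph/osd/ceph-47
    target=$(readlink -f "$1/journal") || return 1
    [ -e "$target" ]
}

if journal_link_ok /mnt; then
    echo "journal symlink resolves; retrying ceph-disk activate should be safe"
else
    echo "journal symlink missing or dangling; try partx -u on the journal disk first"
fi
```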