[ceph-users] luminous 12.2.6 -> 12.2.7 active+clean+inconsistent PGs workaround (or wait for 12.2.8+ ?)

2018-09-03 Thread SCHAER Frederic
Hi,

For those facing (lots of) active+clean+inconsistent PGs after the luminous 
12.2.6 metadata corruption and 12.2.7 upgrade, I'd like to explain how I 
finally got rid of those.

Disclaimer : my cluster doesn't contain highly valuable data, and I can sort of 
recreate what it actually contains : VMs. The following is risky...

One reason I needed to fix those issues is that I faced IO errors with pool 
overlays/tiering which were apparently related to the inconsistencies, and the 
only way I could get my VMs running again was to completely disable the SSD 
overlay, which is far from ideal.
For those not feeling the need to fix this "harmless" issue, please stop 
reading.
For the others, please understand the risks of the following... or wait for an 
official "pg repair" solution

So :

1st step :
since I was getting an ever-growing list of damaged PGs, I decided to 
deep-scrub... all PGs.
Yes. If you have 1+PB data... stop reading (or not ?).

How to do that :
# for j in  ; do for i in `ceph pg ls-by-pool $j |cut -d " " -f 
1|tail -n +2`; do ceph pg deep-scrub $i ; done ; done
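
For reference, here is a slightly more complete sketch of that loop, iterating 
over every pool returned by "ceph osd pool ls" (adapt the pool selection if you 
only want to scrub some pools) :

# Sketch : deep-scrub every PG of every pool. Pool names are taken from
# "ceph osd pool ls" ; replace that with an explicit pool list if needed.
for j in $(ceph osd pool ls); do
  for i in $(ceph pg ls-by-pool "$j" | awk 'NR>1 {print $1}'); do
    ceph pg deep-scrub "$i"
  done
done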

I think I already had a full list of damaged PGs until I upgraded to mimic and 
restarted the MONs/OSDs : I believe the daemon restarts caused ceph to forget 
about the known inconsistencies.
If you believe the number of damaged PGs is sort of stable for you then skip 
step 1...

2nd step is sort of easy : it is to apply the method described here :

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/021054.html

I tried to add some rados locking before overwriting the objects (4M rbd 
objects in my case), but was still able to overwrite a locked object even with 
"rados -p rbd lock get --lock-type exclusive" ... maybe I haven't tried hard 
enough.
It would have been great if it were possible to make sure the object was not 
overwritten between a get and a put :/ - that would make this procedure much 
safer...

In my case, I had 2000+ damaged PGs, so I wrote a small script that processes 
those PGs and tries to apply the procedure:
https://gist.github.com/fschaer/cb851eae4f46287eaf30715e18f14524
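
As far as I understand it, the core of that procedure boils down to reading 
each inconsistent object and writing it back unchanged, so that the data digest 
gets recomputed on the rewrite. A minimal, non-atomic sketch for a single PG 
(the pool name "rbd" is hypothetical, and the object names are assumed to be 
extractable from the list-inconsistent-obj JSON) :

# Sketch only : rewrite every inconsistent object of one PG in place.
# WARNING : this is not atomic - a client could overwrite the object between
# the get and the put, which is exactly the race discussed above.
pg=1.23        # hypothetical PG id
pool=rbd       # hypothetical pool name
for obj in $(rados list-inconsistent-obj "$pg" --format=json-pretty \
             | grep '"name"' | cut -d '"' -f 4 | sort -u); do
  rados -p "$pool" get "$obj" /tmp/"$obj"
  rados -p "$pool" put "$obj" /tmp/"$obj"
  rm -f /tmp/"$obj"
done
# re-check the PG afterwards
ceph pg deep-scrub "$pg"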

My Ceph cluster has been healthy since Friday evening and I haven't seen any 
data corruption nor any hung VM...

Cheers
Frederic
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

2018-07-25 Thread SCHAER Frederic
My cache pool seems affected by an old/closed bug... but I don't think this is 
(directly ?) related to the current issue - though this won't help anyway :-/
http://tracker.ceph.com/issues/12659

Since I got promote issues, I tried to flush only the affected rbd image : I 
got 6 unflush-able objects...

rbd image 'dev7243':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.194b8c238e1f29
(...)

=>

# for i in `rados -p ssd-hot-irfu-virt ls |egrep '^rbd_data.194b8c238e1f29'`; 
do rados -p ssd-hot-irfu-virt cache-flush $i ; rados -p ssd-hot-irfu-virt 
cache-evict $i ; done
error from cache-flush rbd_data.194b8c238e1f29.082f: (16) Device or 
resource busy
error from cache-flush rbd_data.194b8c238e1f29.082f: (16) Device or 
resource busy
error from cache-flush rbd_data.194b8c238e1f29.0926: (16) Device or 
resource busy
error from cache-flush rbd_data.194b8c238e1f29.0926: (16) Device or 
resource busy
(...)

Strange that the cache-evict error message is the same as the cache flush one...
# rados -p ssd-hot-irfu-virt cache-evict 
rbd_data.194b8c238e1f29.082f
error from cache-flush rbd_data.194b8c238e1f29.082f: (16) Device or 
resource busy

Anyway : I stopped the VM and... I still can't flush the objects.
I don't think this is related anyway, as the OSD promote error is :

2018-07-25 10:51:44.386764 7fd27929b700 -1 log_channel(cluster) log [ERR] : 
1.39 copy from 1:9c0e12cc:::rbd_data.1920e2238e1f29.0dfc:head to 
1:9c0e12cc:::rbd_data.1920e2238e1f29.0dfc:head data digest 0x632451e5 != source 0x73dfd8ab
2018-07-25 10:51:44.386769 7fd27929b700 -1 osd.74 pg_epoch: 182580 pg[1.39( v 
182580'38868939 (182579'38867404,182580'38868939] local-lis/les=182563/182564 
n=342 ec=2726/2726 lis/c 182563/182563 les/c/f 182564/182564/0 182563/
182563/182558) [74,71,19] r=0 lpr=182563 crt=182580'38868939 lcod 
182580'38868938 mlcod 182580'38868938 active+clean] finish_promote unexpected 
promote error (5) Input/output error

And I don't see object rbd_data.1920e2238e1f29.0dfc (:head ?) in 
the unflush-able objects...

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of SCHAER 
Frederic
Sent: Wednesday, July 25, 2018 10:28
To: Dan van der Ster 
Cc: ceph-users 
Subject: [PROVENANCE INTERNET] Re: [ceph-users] 12.2.7 + osd skip data digest + 
bluestore + I/O errors

Hi again,

Now with all OSDs restarted, I'm getting 
health: HEALTH_ERR
777 scrub errors
Possible data damage: 36 pgs inconsistent
(...)
pgs: 4764 active+clean
 36   active+clean+inconsistent

But from what I could read up to now, this is what's expected and should 
auto-heal when objects are overwritten  - fingers crossed as pg repair or scrub 
doesn't seem to help.
New errors in the ceph logs include lines like the following, which I also 
hope/presume are expected - I still have posts to read on this list about omap 
and those  errors :
2018-07-25 10:20:00.106227 osd.66 osd.66 192.54.207.75:6826/2430367 12 : 
cluster [ERR] 11.288 shard 207: soid 
11:1155c332:::rbd_data.207dce238e1f29.0527:head data_digest 
0xc8997a5b != data_digest 0x2ca15853 from auth oi 
11:1155c332:::rbd_data.207dce238e1f29.0527:head(182554'240410 
client.6084296.0:48463693 dirty|data_digest|omap_digest s 4194304 uv 49429318 
dd 2ca15853 od  alloc_hint [0 0 0])
2018-07-25 10:20:00.106230 osd.66 osd.66 192.54.207.75:6826/2430367 13 : 
cluster [ERR] 11.288 soid 
11:1155c332:::rbd_data.207dce238e1f29.0527:head: failed to pick 
suitable auth object

But never mind : with the SSD cache in writeback, I just saw the same error 
again on one VM (only) for now :
(lots of these)
2018-07-25 10:15:19.841746 osd.101 osd.101 192.54.207.206:6859/3392654 116 : 
cluster [ERR] 1.20 copy from 
1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head to 
1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head data digest 
0x27451e3c != source 0x12c05014

(osd.101 is a SSD from the cache pool)

=> yum update => I/O error => Set the TIER pool to forward => yum update starts.

Weird, but if that happens only on this host, I can cope with it (I have 780+ 
scrub errors to handle now :/ )

And just to be sure ;)

[root@ceph10 ~]# ceph --admin-daemon /var/run/ceph/*osd*101* version
{"version":"12.2.7","release":"luminous","release_type":"stable"}

On the good side : this update is forcing us to dive into ceph internals : 
we'll be more ceph-aware tonight than this morning ;)

Cheers
Fred

-Original Message-
From: SCHAER Frederic 
Sent: Wednesday, July 25, 2018 09:57
To: 'Dan van der Ster' 
Cc: ceph-users 
Subject: RE: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

Hi Dan,

Just checked again : argggh

Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

2018-07-25 Thread SCHAER Frederic
Hi again,

Now with all OSDs restarted, I'm getting 
health: HEALTH_ERR
777 scrub errors
Possible data damage: 36 pgs inconsistent
(...)
pgs: 4764 active+clean
 36   active+clean+inconsistent

But from what I could read up to now, this is what's expected and should 
auto-heal when objects are overwritten  - fingers crossed as pg repair or scrub 
doesn't seem to help.
New errors in the ceph logs include lines like the following, which I also 
hope/presume are expected - I still have posts to read on this list about omap 
and those  errors :
2018-07-25 10:20:00.106227 osd.66 osd.66 192.54.207.75:6826/2430367 12 : 
cluster [ERR] 11.288 shard 207: soid 
11:1155c332:::rbd_data.207dce238e1f29.0527:head data_digest 
0xc8997a5b != data_digest 0x2ca15853 from auth oi 
11:1155c332:::rbd_data.207dce238e1f29.0527:head(182554'240410 
client.6084296.0:48463693 dirty|data_digest|omap_digest s 4194304 uv 49429318 
dd 2ca15853 od  alloc_hint [0 0 0])
2018-07-25 10:20:00.106230 osd.66 osd.66 192.54.207.75:6826/2430367 13 : 
cluster [ERR] 11.288 soid 
11:1155c332:::rbd_data.207dce238e1f29.0527:head: failed to pick 
suitable auth object

But never mind : with the SSD cache in writeback, I just saw the same error 
again on one VM (only) for now :
(lots of these)
2018-07-25 10:15:19.841746 osd.101 osd.101 192.54.207.206:6859/3392654 116 : 
cluster [ERR] 1.20 copy from 
1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head to 
1:06dd6812:::rbd_data.194b8c238e1f29.07a3:head data digest 
0x27451e3c != source 0x12c05014

(osd.101 is a SSD from the cache pool)

=> yum update => I/O error => Set the TIER pool to forward => yum update starts.

Weird, but if that happens only on this host, I can cope with it (I have 780+ 
scrub errors to handle now :/ )

And just to be sure ;)

[root@ceph10 ~]# ceph --admin-daemon /var/run/ceph/*osd*101* version
{"version":"12.2.7","release":"luminous","release_type":"stable"}

On the good side : this update is forcing us to dive into ceph internals : 
we'll be more ceph-aware tonight than this morning ;)

Cheers
Fred

-Original Message-
From: SCHAER Frederic 
Sent: Wednesday, July 25, 2018 09:57
To: 'Dan van der Ster' 
Cc: ceph-users 
Subject: RE: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

Hi Dan,

Just checked again : arggghhh...

# grep AUTO_RESTART /etc/sysconfig/ceph
CEPH_AUTO_RESTART_ON_UPGRADE=no

So no :'(
RPMs were upgraded, but OSDs were not restarted as I thought. Or at least not 
restarted with the new 12.2.7 binaries (and even though the skip digest option 
was present on the running 12.2.6 OSDs, I guess the 12.2.6 OSDs did not 
understand that option).

I just restarted all of the OSDs : I will check again the behavior and report 
here - thanks for pointing me in the good direction !

Fred

-Original Message-
From: Dan van der Ster [mailto:d...@vanderster.com] 
Sent: Tuesday, July 24, 2018 16:50
To: SCHAER Frederic 
Cc: ceph-users 
Subject: Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

`ceph versions` -- you're sure all the osds are running 12.2.7 ?

osd_skip_data_digest = true is supposed to skip any crc checks during reads.
But maybe the cache tiering IO path is different and checks the crc anyway?

-- dan


On Tue, Jul 24, 2018 at 3:01 PM SCHAER Frederic  wrote:
>
> Hi,
>
>
>
> I read the 12.2.7 upgrade notes, and set “osd skip data digest = true” before 
> I started upgrading from 12.2.6 on my Bluestore-only cluster.
>
> As far as I can tell, my OSDs all got restarted during the upgrade and all 
> got the option enabled :
>
>
>
> This is what I see for a specific OSD taken at random:
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show|grep 
> data_digest
>
> "osd_skip_data_digest": "true",
>
>
>
> This is what I see when I try to injectarg the option data digest ignore 
> option :
>
>
>
> # ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1|head
>
> osd.0: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> osd.1: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> osd.2: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> osd.3: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> (…)
>
>
>
> This has been like that since I upgraded to 12.2.7.
>
> I read in the release notes that the skip_data_digest option should be 
> sufficient to ignore the 12.2.6 corruptions and that objects should auto-heal 
> on rewrite…
>
>
>
> However…
>
>
>
> My config :
>
> -  Using tiering with an SSD hot storage

Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

2018-07-25 Thread SCHAER Frederic
Hi Dan,

Just checked again : arggghhh...

# grep AUTO_RESTART /etc/sysconfig/ceph
CEPH_AUTO_RESTART_ON_UPGRADE=no

So no :'(
RPMs were upgraded, but OSDs were not restarted as I thought. Or at least not 
restarted with the new 12.2.7 binaries (and even though the skip digest option 
was present on the running 12.2.6 OSDs, I guess the 12.2.6 OSDs did not 
understand that option).

I just restarted all of the OSDs : I will check again the behavior and report 
here - thanks for pointing me in the good direction !

Fred

-Original Message-
From: Dan van der Ster [mailto:d...@vanderster.com] 
Sent: Tuesday, July 24, 2018 16:50
To: SCHAER Frederic 
Cc: ceph-users 
Subject: Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

`ceph versions` -- you're sure all the osds are running 12.2.7 ?

osd_skip_data_digest = true is supposed to skip any crc checks during reads.
But maybe the cache tiering IO path is different and checks the crc anyway?

-- dan


On Tue, Jul 24, 2018 at 3:01 PM SCHAER Frederic  wrote:
>
> Hi,
>
>
>
> I read the 12.2.7 upgrade notes, and set “osd skip data digest = true” before 
> I started upgrading from 12.2.6 on my Bluestore-only cluster.
>
> As far as I can tell, my OSDs all got restarted during the upgrade and all 
> got the option enabled :
>
>
>
> This is what I see for a specific OSD taken at random:
>
> # ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show|grep 
> data_digest
>
> "osd_skip_data_digest": "true",
>
>
>
> This is what I see when I try to injectarg the option data digest ignore 
> option :
>
>
>
> # ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1|head
>
> osd.0: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> osd.1: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> osd.2: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> osd.3: osd_skip_data_digest = 'true' (not observed, change may require 
> restart)
>
> (…)
>
>
>
> This has been like that since I upgraded to 12.2.7.
>
> I read in the release notes that the skip_data_digest option should be 
> sufficient to ignore the 12.2.6 corruptions and that objects should auto-heal 
> on rewrite…
>
>
>
> However…
>
>
>
> My config :
>
> -  Using tiering with an SSD hot storage tier
>
> -  HDDs for cold storage
>
>
>
> And… I get I/O errors on some VMs when running some commands as simple as 
> “yum check-update”.
>
>
>
> The qemu/kvm/libvirt logs show me these (in : /var/log/libvirt/qemu) :
>
>
>
> block I/O error in device 'drive-virtio-disk0': Input/output error (5)
>
>
>
> In the ceph logs, I can see these errors :
>
>
>
> 2018-07-24 11:17:56.420391 osd.71 [ERR] 1.23 copy from 
> 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
> 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
> 0x3bb26e16 != source 0xec476c54
>
> 2018-07-24 11:17:56.429936 osd.71 [ERR] 1.23 copy from 
> 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
> 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
> 0x3bb26e16 != source 0xec476c54
>
>
>
> (yes, my cluster is seen as healthy)
>
>
>
> On the affected OSDs, I can see these errors :
>
>
>
> 2018-07-24 11:17:56.420349 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
> 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
> n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
> 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
> 182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data 
> digest 0x3bb26e16 != source 0xec476c54
>
> 2018-07-24 11:17:56.420388 7f034642a700 -1 log_channel(cluster) log [ERR] : 
> 1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
> 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
> 0x3bb26e16 != source 0xec476c54
>
> 2018-07-24 11:17:56.420395 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
> 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
> n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
> 182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
> 182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected 
> promote error (5) Input/output error
>
> 2018-07-24 11:17:56.429900 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
> 182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
> n=344 ec=2726/2726 lis/c 182298/182298 les/c/

Re: [ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

2018-07-24 Thread SCHAER Frederic
Oh my...

Tried to yum upgrade in writeback mode and noticed the syslogs on the VM :

Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1896024
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1896064
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895552
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895536
Jul 24 15:16:57 dev7240 kernel: end_request: I/O error, dev vda, sector 1895520
(...)

Ceph is also logging many errors :

2018-07-24 15:20:24.893872 osd.74 [ERR] 1.33 copy from 
1:cd70e921:::rbd_data.21e0fe2ae8944a.:head to 
1:cd70e921:::rbd_data.21e0fe2ae8944a.:head data digest 
0x1480c7a1 != source 0xe1e7591b
[root@ceph0 ~]# egrep 'copy from.*to.*data digest' /var/log/ceph/ceph.log |wc -l
928

Setting the cache tier to forward mode again prevents the IO errors :

In writeback mode :

# yum update 2>&1|tail
---> Package glibc-headers.x86_64 0:2.12-1.209.el6_9.2 will be updated
---> Package glibc-headers.x86_64 0:2.12-1.212.el6 will be an update
---> Package gmp.x86_64 0:4.3.1-12.el6 will be updated
---> Package gmp.x86_64 0:4.3.1-13.el6 will be an update
---> Package gnupg2.x86_64 0:2.0.14-8.el6 will be updated
---> Package gnupg2.x86_64 0:2.0.14-9.el6_10 will be an update
---> Package gnutls.x86_64 0:2.12.23-21.el6 will be updated
---> Package gnutls.x86_64 0:2.12.23-22.el6 will be an update
---> Package httpd.x86_64 0:2.2.15-60.sl6.6 will be updated
Error: disk I/O error


=> Each time I run a yum update, I get a bit farther in the yum update process.

In forward mode : works as expected
I haven't tried to flush the cache pool while in forward mode... yet...
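
For the record, the mode switch itself is just something like the following (a 
sketch - ssd-hot-irfu-virt is the hot pool named earlier in this thread, and 
the confirmation flag may or may not be required depending on the ceph 
version) :

# switch the cache tier to forward mode...
ceph osd tier cache-mode ssd-hot-irfu-virt forward --yes-i-really-mean-it
# ... and back to writeback once things look sane again
ceph osd tier cache-mode ssd-hot-irfu-virt writeback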

Ugh :/

Regards


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of SCHAER 
Frederic
Sent: Tuesday, July 24, 2018 15:01
To: ceph-users 
Subject: [PROVENANCE INTERNET] [ceph-users] 12.2.7 + osd skip data digest + 
bluestore + I/O errors

Hi,

I read the 12.2.7 upgrade notes, and set "osd skip data digest = true" before I 
started upgrading from 12.2.6 on my Bluestore-only cluster.
As far as I can tell, my OSDs all got restarted during the upgrade and all got 
the option enabled :

This is what I see for a specific OSD taken at random:
# ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show|grep 
data_digest
"osd_skip_data_digest": "true",

This is what I see when I try to injectarg the option data digest ignore option 
:

# ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1|head
osd.0: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.1: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.2: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.3: osd_skip_data_digest = 'true' (not observed, change may require restart)
(...)

This has been like that since I upgraded to 12.2.7.
I read in the release notes that the skip_data_digest option should be sufficient 
to ignore the 12.2.6 corruptions and that objects should auto-heal on rewrite...

However...

My config :

-  Using tiering with an SSD hot storage tier

-  HDDs for cold storage

And... I get I/O errors on some VMs when running some commands as simple as 
"yum check-update".

The qemu/kvm/libvirt logs show me these (in : /var/log/libvirt/qemu) :


block I/O error in device 'drive-virtio-disk0': Input/output error (5)

In the ceph logs, I can see these errors :


2018-07-24 11:17:56.420391 osd.71 [ERR] 1.23 copy from 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.429936 osd.71 [ERR] 1.23 copy from 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54

(yes, my cluster is seen as healthy)

On the affected OSDs, I can see these errors :

2018-07-24 11:17:56.420349 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data 
digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.420388 7f034642a700 -1 log_channel(cluster) log [ERR] : 
1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.420395 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
n=344 ec=2726/2726 lis/c 182298/18

[ceph-users] 12.2.7 + osd skip data digest + bluestore + I/O errors

2018-07-24 Thread SCHAER Frederic
Hi,

I read the 12.2.7 upgrade notes, and set "osd skip data digest = true" before I 
started upgrading from 12.2.6 on my Bluestore-only cluster.
As far as I can tell, my OSDs all got restarted during the upgrade and all got 
the option enabled :

This is what I see for a specific OSD taken at random:
# ceph --admin-daemon /var/run/ceph/ceph-osd.68.asok config show|grep 
data_digest
"osd_skip_data_digest": "true",

This is what I see when I try to injectarg the option data digest ignore option 
:

# ceph tell osd.* injectargs '--osd_skip_data_digest=true' 2>&1|head
osd.0: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.1: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.2: osd_skip_data_digest = 'true' (not observed, change may require restart)
osd.3: osd_skip_data_digest = 'true' (not observed, change may require restart)
(...)

This has been like that since I upgraded to 12.2.7.
I read in the release notes that the skip_data_digest option should be sufficient 
to ignore the 12.2.6 corruptions and that objects should auto-heal on rewrite...

However...

My config :

-  Using tiering with an SSD hot storage tier

-  HDDs for cold storage

And... I get I/O errors on some VMs when running some commands as simple as 
"yum check-update".

The qemu/kvm/libvirt logs show me these (in : /var/log/libvirt/qemu) :


block I/O error in device 'drive-virtio-disk0': Input/output error (5)

In the ceph logs, I can see these errors :


2018-07-24 11:17:56.420391 osd.71 [ERR] 1.23 copy from 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54

2018-07-24 11:17:56.429936 osd.71 [ERR] 1.23 copy from 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54

(yes, my cluster is seen as healthy)

On the affected OSDs, I can see these errors :

2018-07-24 11:17:56.420349 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data 
digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.420388 7f034642a700 -1 log_channel(cluster) log [ERR] : 
1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.420395 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected 
promote error (5) Input/output error
2018-07-24 11:17:56.429900 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
182367'46340723 mlcod 182367'46340723 active+clean] process_copy_chunk data 
digest 0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.429934 7f034642a700 -1 log_channel(cluster) log [ERR] : 
1.23 copy from 1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head to 
1:c590b9d7:::rbd_data.1920e2238e1f29.00e7:head data digest 
0x3bb26e16 != source 0xec476c54
2018-07-24 11:17:56.429939 7f034642a700 -1 osd.71 pg_epoch: 182367 pg[1.23( v 
182367'46340724 (182367'46339152,182367'46340724] local-lis/les=182298/182299 
n=344 ec=2726/2726 lis/c 182298/182298 les/c/f 182299/182299/0 
182298/182298/43896) [71,101,74] r=0 lpr=182298 crt=182367'46340724 lcod 
182367'46340723 mlcod 182367'46340723 active+clean] finish_promote unexpected 
promote error (5) Input/output error

And I don't know how to recover from that.
Pool #1 is my SSD cache tier, hence pg 1.23 is on the SSD side.

I've tried setting the cache pool to "readforward" despite the "not well 
supported" warning and could immediately get back working VMs (no more I/O 
errors).
But with no SSD tiering : not really useful.

As soon as I've tried setting the cache tier to writeback again, I got those 
I/O errors again... (not on the yum command, but in the meantime I've stopped 
and set out, then unset out osd.71 to check it with badblocks just in case...)
I still have to find how to reproduce the io error on an affected host to 
further try to debug/fix that issue...

Any ideas ?

Thanks && regards

___
ceph-users mailing list

[ceph-users] bluestore behavior on disks sector read errors

2017-06-27 Thread SCHAER Frederic
Hi,

Every now and then, sectors die on disks.
When this happens on my bluestore (kraken) OSDs, I get 1 PG that becomes 
degraded.
The exact status is :


HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

pg 12.127 is active+clean+inconsistent, acting [141,67,85]

If I do a # rados list-inconsistent-obj 12.127 --format=json-pretty
I get :
(...)

"osd": 112,

"errors": [

"read_error"

],

"size": 4194304

When this happens, I'm forced to manually run "ceph pg repair" on the 
inconsistent PGs after I made sure this was a read error : I feel this should 
not be a manual process.
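
A sketch of how this could be scripted, assuming jq is available and that the 
list-inconsistent-obj JSON keeps the per-shard "errors" arrays shown above 
(only PGs whose sole error type is a read_error would get repaired ; the pool 
name is hypothetical) :

# Sketch : repair only the inconsistent PGs whose inconsistencies are pure
# read errors. jq and the exact JSON layout are assumptions.
pool=mypool    # hypothetical pool name
for pg in $(rados list-inconsistent-pg "$pool" | jq -r '.[]'); do
  errs=$(rados list-inconsistent-obj "$pg" --format=json \
         | jq -r '.inconsistents[].shards[].errors[]' | sort -u)
  if [ "$errs" = "read_error" ]; then
    ceph pg repair "$pg"
  fi
done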

If I go on the machine and look at the syslogs, I indeed see a sector read 
error happened once or twice.
But if I try to read the sector manually, then I can - presumably because it 
was reallocated on the disk.
Last time this happened, I ran badblocks on the disk and it found no issue...

My questions therefore are :

why doesn't bluestore retry reading the sector (in case of transient errors) ? 
(maybe it does)
why isn't the pg automatically fixed when a read error was detected ?
what will happen when the disks get old and reach up to 2048 bad sectors before 
the controllers/smart declare them as "failure predicted" ?
I can't imagine manually fixing up to Nx2048 PGs in an infrastructure of N 
disks where N could reach the sky...

Ideas ?

Thanks && regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph crush map rules for EC pools and out OSDs ?

2017-03-01 Thread SCHAER Frederic
Hi,

I have 5 data nodes (bluestore, kraken), each with 24 OSDs.
I enabled the optimal crush tunables.
I'd like to try to "really" use EC pools, but until now I've faced cluster 
lockups when I was using 3+2 EC pools with a host failure domain.
When a host was down for instance ;)

Since I'd like the erasure codes to be more than a "nice to have feature with 
12+ ceph data nodes", I wanted to try this :


-  Use a 14+6 EC rule

-  And for each data chunk:

o  select 4 hosts

o  On these hosts, select 5 OSDs

In order to do that, I created this rule in the crush map :

rule 4hosts_20shards {
ruleset 3
type erasure
min_size 20
max_size 20
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default
step choose indep 4 type host
step chooseleaf indep 5 type osd
step emit
}

I then created an EC pool with this erasure profile :
ceph osd erasure-code-profile set erasurep14_6_osd  ruleset-failure-domain=osd 
k=14 m=6
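
For completeness, the pool then has to be created against that profile and 
bound to the custom crush rule - a sketch, with an arbitrary PG count :

# Sketch : create the EC pool using the profile above and the 4hosts_20shards
# rule (1024 PGs is an arbitrary example value).
ceph osd pool create ec14_6_pool 1024 1024 erasure erasurep14_6_osd 4hosts_20shards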

I hoped this would allow for losing 1 host completely without locking the 
cluster, and I have the impression this is working...
But. There's always a but ;)

I tried to bring all the OSDs of one node down by stopping its ceph-osd daemons.
And according to ceph, the cluster is unhealthy.
The ceph health detail gives me for instance this (for the 3+2 and 14+6 pools) :

pg 5.18b is active+undersized+degraded, acting [57,47,2147483647,23,133]
pg 9.186 is active+undersized+degraded, acting 
[2147483647,2147483647,2147483647,2147483647,2147483647,133,142,125,131,137,50,48,55,65,52,16,13,18,22,3]

My question therefore is : why aren't the down PGs remapped onto my 5th data 
node since I made sure the 20 EC shards were spread onto 4 hosts only ?
I thought/hoped that because osds were down, the data would be rebuilt onto 
another OSD/host ?
I can understand the 3+2 EC pool cannot allocate OSDs on another host because 
3+2=5 already uses all the hosts, but I don't understand why the 14+6 EC 
pool/pgs do not rebuild somewhere else ?

I do not find anything worthwhile in a "ceph pg query" : the up and acting parts 
are equal and do contain the 2147483647 value (which means none, as far as I 
understood).

I've also tried to "ceph osd out" all the OSDs from one host : in that case, 
the 3+2 EC PGs behave as previously, but the 14+6 EC PGs seem happy despite 
the fact they are still saying the out OSDs are up and acting.
Is my crush rule that wrong ?
Is it possible to do what I want ?

Thanks for any hints...

Regards
Frederic

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

2016-06-24 Thread SCHAER Frederic
Hi,

I'm facing the same thing after I reinstalled a node directly in jewel... 

Reading : http://thread.gmane.org/gmane.comp.file-systems.ceph.devel/31917
I can confirm that running : "udevadm trigger -c add -s block " fires the udev 
rules and gets ceph-osd up.
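
A possible stop-gap (untested sketch) would be to replay that trigger once at 
boot with a small oneshot systemd unit :

# Sketch : re-fire the "add" events for block devices at boot so the ceph
# udev rules get a chance to run.
cat > /etc/systemd/system/ceph-udev-trigger.service <<'EOF'
[Unit]
Description=Re-trigger block device udev events for ceph-disk
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/udevadm trigger -c add -s block

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable ceph-udev-trigger.service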

Thing is : I now have reinstalled boxes (CentOS 7.2.1511) which do not fire 
udev rules at boot and get no /dev/disk/by-parttypeuuid - and I fear there is 
none either just after installing the ceph RPMs, since the udev rules did not 
pre-exist - while other boxes with the exact same setup, hardware and 
partitions, which were upgraded from previous ceph versions, do seem to 
work correctly - or so I think.
All with rootfs on LVM...

I'll try to compare the 2 kinds of hosts to see if I can find something useful 
...

Regards


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of 
stephane.d...@orange.com
Sent: Friday, June 24, 2016 12:10
To: Loic Dachary 
Cc: ceph-users 
Subject: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, jessie)

Hi Loïc,

Sorry for the delay. Well, it's a vanilla CentOS ISO image downloaded from a 
centos.org mirror:
[root@hulk-stg ~]# cat /etc/redhat-release
CentOS Linux release 7.2.1511 (Core)

This issue happens after a Ceph upgrade from hammer; I haven't tested this 
distro starting with a fresh Ceph install.

Thanks,

Stéphane

-Original Message-
From: Loic Dachary [mailto:l...@dachary.org] 
Sent: Tuesday, June 21, 2016 14:48
To: DAVY Stephane OBS/OCB
Cc: ceph-users
Subject: Re: [ceph-users] osds udev rules not triggered on reboot (jewel, 
jessie)



On 16/06/2016 18:01, stephane.d...@orange.com wrote:
> Hi,
> 
> Same issue with Centos 7, I also put back this file in /etc/udev/rules.d. 

Hi Stephane,

Could you please detail which version of CentOS 7 you are using ? I tried to 
reproduce the problem with CentOS 7.2 as found on the CentOS cloud images 
repository ( 
http://cloud.centos.org/centos/7/images/CentOS-7-x86_64-GenericCloud-1511.qcow2 
) but it "works for me".

Thanks !

> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Alexandre DERUMIER
> Sent: Thursday, June 16, 2016 17:53
> To: Karsten Heymann; Loris Cuoghi
> Cc: Loic Dachary; ceph-users
> Subject: Re: [ceph-users] osds udev rules not triggered on reboot 
> (jewel, jessie)
> 
> Hi,
> 
> I have the same problem with osd disks not mounted at boot on jessie 
> with ceph jewel
> 
> workaround is to re-add 60-ceph-partuuid-workaround.rules file to udev
> 
> http://tracker.ceph.com/issues/16351
> 
> 
- Original Message -
From: "aderumier" 
To: "Karsten Heymann" , "Loris Cuoghi" 

Cc: "Loic Dachary" , "ceph-users" 

Sent: Thursday, April 28, 2016 07:42:04
Subject: Re: [ceph-users] osds udev rules not triggered on reboot (jewel,   
> jessie)
> 
> Hi,
> they are missing target files in debian packages
> 
> http://tracker.ceph.com/issues/15573
> https://github.com/ceph/ceph/pull/8700
> 
> I have also done some other trackers about packaging bug
> 
> jewel: debian package: wrong /etc/default/ceph/ceph location
> http://tracker.ceph.com/issues/15587
> 
> debian/ubuntu : TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES not specified in 
> /etc/default/cep
> http://tracker.ceph.com/issues/15588
> 
> jewel: debian package: init.d script bug
> http://tracker.ceph.com/issues/15585
> 
> 
> @CC loic dachary, maybe he could help to speed up packaging fixes
> 
- Original Message -
From: "Karsten Heymann" 
To: "Loris Cuoghi" 
Cc: "ceph-users" 
Sent: Wednesday, April 27, 2016 15:20:29
Subject: Re: [ceph-users] osds udev rules not triggered on reboot 
> (jewel, jessie)
> 
> 2016-04-27 15:18 GMT+02:00 Loris Cuoghi : 
>> On 27/04/2016 14:45, Karsten Heymann wrote : 
>>> one workaround I found was to add
>>>
>>> [Install]
>>> WantedBy=ceph-osd.target
>>>
>>> to /lib/systemd/system/ceph-disk@.service and then manually enable 
>>> my disks with
>>>
>>> # systemctl enable ceph-disk\@dev-sdi1 # systemctl start 
>>> ceph-disk\@dev-sdi1
>>>
>>> That way they at least are started at boot time. 
> 
>> Great! But only if the disks keep their device names, right ? 
> 
> Exactly. It's just a little workaround until the real issue is fixed. 
> 
> +Karsten

Re: [ceph-users] OSD Restart results in "unfound objects"

2016-06-02 Thread SCHAER Frederic
Hi,

Same for me... unsetting the bitwise flag considerably lowered the number of 
unfound objects.
I'll have to wait/check for the remaining 214 though...
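
(For reference, toggling the flag is just the following - a sketch :)

# temporarily drop the flag while the bug is being fixed...
ceph osd unset sortbitwise
# ... and set it again later
ceph osd set sortbitwise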

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Samuel 
Just
Sent: Thursday, June 2, 2016 01:20
To: Uwe Mesecke 
Cc: ceph-users 
Subject: Re: [ceph-users] OSD Restart results in "unfound objects"

Yep, looks like the same issue:

2016-06-02 00:45:27.977064 7fc11b4e9700 10 osd.17 pg_epoch: 11108
pg[34.4a( v 11104'1080336 lc 11104'1080335
(11069'1077294,11104'1080336] local-les=11108 n=50593 ec=2051 les/c/f
11104/11104/0 11106/11107/11107) [17,13] r=0 lpr=11107
pi=11101-11106/3 crt=11104'1080336 lcod 0'0 mlcod 0'0 inactive m=1
u=1] search_for_missing
34:52a5cefb:::default.3653921.2__shadow_.69E1Tth4Y2Q7m0VKNbQdJe-9BgYks6I_1:head
11104'1080336 also missing on osd.13 (last_backfill MAX but with wrong
sort order)

Thanks!
-Sam

On Wed, Jun 1, 2016 at 4:04 PM, Uwe Mesecke  wrote:
> Hey Sam,
>
> glad you found the bug. As another data point a just did the whole round of 
> "healthy -> set sortbitwise -> osd restarts -> unfound objects -> unset 
> sortbitwise -> healthy" with the debug settings as described by you earlier.
>
> I uploaded the logfiles...
>
> https://www.dropbox.com/s/f5hhptbtocbxe1k/ceph-osd.13.log.gz
> https://www.dropbox.com/s/kau9cjqfhmtpd89/ceph-osd.17.log.gz
>
> The PG with the unfound object is „34.4a“ and it seems as there are similar 
> log messages as you noted in the issue.
>
> The cluster runs jewel 10.2.1 and was created a long time ago, I think it was 
> giant.
>
> Thanks again!
>
> Uwe
>
>> On 02.06.2016 at 00:19, Samuel Just wrote :
>>
>> http://tracker.ceph.com/issues/16113
>>
>> I think I found the bug.  Thanks for the report!  Turning off
>> sortbitwise should be an ok workaround for the moment.
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 3:00 PM, Diego Castro
>>  wrote:
>>> Yes, it was created as Hammer.
>>> I haven't faced any issues on the upgrade (despite the well know systemd),
>>> and after that the cluster didn't show any suspicious behavior.
>>>
>>>
>>> ---
>>> Diego Castro / The CloudFather
>>> GetupCloud.com - Eliminamos a Gravidade
>>>
>>> 2016-06-01 18:57 GMT-03:00 Samuel Just :

 Was this cluster upgraded to jewel?  If so, at what version did it start?
 -Sam

 On Wed, Jun 1, 2016 at 1:48 PM, Diego Castro
  wrote:
> Hello Samuel, i'm bit afraid of restarting my osd's again, i'll wait
> until
> the weekend to push the config.
> BTW, i just unset sortbitwise flag.
>
>
> ---
> Diego Castro / The CloudFather
> GetupCloud.com - Eliminamos a Gravidade
>
> 2016-06-01 13:39 GMT-03:00 Samuel Just :
>>
>> Can either of you reproduce with logs?  That would make it a lot
>> easier to track down if it's a bug.  I'd want
>>
>> debug osd = 20
>> debug ms = 1
>> debug filestore = 20
>>
>> On all of the osds for a particular pg from when it is clean until it
>> develops an unfound object.
>> -Sam
>>
>> On Wed, Jun 1, 2016 at 5:36 AM, Diego Castro
>>  wrote:
>>> Hello Uwe, i also have sortbitwise flag enable and i have the exactly
>>> behavior of yours.
>>> Perhaps this is also the root of my issues, does anybody knows if is
>>> safe to
>>> disable it?
>>>
>>>
>>> ---
>>> Diego Castro / The CloudFather
>>> GetupCloud.com - Eliminamos a Gravidade
>>>
>>> 2016-06-01 7:17 GMT-03:00 Uwe Mesecke :


> On 01.06.2016 at 10:25, Diego Castro wrote :
>
> Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon.
> Today my cluster suddenly went unhealth with lots of stuck pg's
> due
> unfound objects, no disks failures nor node crashes, it just went
> bad.
>
> I managed to put the cluster on health state again by marking lost
> objects to delete "ceph pg  mark_unfound_lost delete".
> Regarding the fact that i have no idea why the cluster gone bad, i
> realized restarting the osd' daemons to unlock stuck clients put
> the
> cluster
> on unhealth and pg gone stuck again due unfound objects.
>
> Does anyone have this issue?

 Hi,

 I also ran into that problem after upgrading to jewel. In my case I
 was
 able to somewhat correlate this behavior with setting the
 sortbitwise
 flag
 after the upgrade. When the flag is set, after some time these
 unfound
 objects are popping up. Restarting osds just makes it worse and/or
 

Re: [ceph-users] OSD Restart results in "unfound objects"

2016-06-01 Thread SCHAER Frederic
I do…

In my case, I have co-located the MONs with some OSDs, and just last Saturday, 
when I lost data again, I found out that one of the MON+OSD nodes ran 
out of memory and started killing ceph-mon on that node…
At the same moment, all OSDs started to complain about not being able to see 
other OSDs on other machines.

I suspect that when the node runs out of memory, bad things happen with, for 
instance, the network (no memory : no network buffer ?). But I can’t explain the 
unfound objects, as in my case, same as yours, nodes did not crash, and 
ceph-osd did not crash either – hence, I’m assuming no data was lost because 
of a sudden disk poweroff for instance, or because of any kernel or raid 
controller cache…

For now, I’m considering moving the MONs onto dedicated nodes … hoping the out 
of memory was my issue.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On behalf of Diego 
Castro
Sent: Wednesday, June 1, 2016 10:25
To: ceph-users 
Subject: [ceph-users] OSD Restart results in "unfound objects"

Hello, i have a cluster running Jewel 10.2.0, 25 OSD's + 4 Mon.
Today my cluster suddenly went unhealth with lots of stuck pg's  due unfound 
objects, no disks failures nor node crashes, it just went bad.

I managed to put the cluster on health state again by marking lost objects to 
delete "ceph pg  mark_unfound_lost delete".
Regarding the fact that i have no idea why the cluster gone bad, i realized 
restarting the osd' daemons to unlock stuck clients put the cluster on unhealth 
and pg gone stuck again due unfound objects.

Does anyone have this issue?

---
Diego Castro / The CloudFather
GetupCloud.com - Eliminamos a Gravidade
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] unfound objects - why and how to recover ? (bonus : jewel logs)

2016-05-27 Thread SCHAER Frederic
Hi,

--
First, let me start with the bonus...
I migrated from hammer => jewel and followed the migration instructions... but 
the migration instructions are missing this :
#chown -R ceph:ceph /var/log/ceph
I just discovered this was the reason I found no logs anywhere about my current 
issue :/
--

This is maybe the 3rd time this happens to me ... This time I'd like to try to 
understand what happens.

So. ceph-10.2.0-0.el7.x86_64+Cent0S 7.2 here.
Ceph health was happy, but any rbd operation was hanging - hence : ceph was 
hung, and so were the test VMs running on it.

I placed my VM in an EC pool on top of which I overlaid an RBD pool with SSDs.
The EC pool is defined as being a 3+1 pool, with 5 hosts hosting the OSDs (and 
the failure domain is set to hosts)

"Ceph -w" wasn't displaying new status lines as usual, but ceph health (detail) 
wasn't saying anything was wrong.
After looking at one node, I found that its ceph logs were empty, so I 
decided to restart the OSDs on that one using : systemctl restart ceph-osd@*

After I did that, ceph -w came back to life, but told me there was a dead 
MON - which I restarted too.
I watched some kind of recovery happening, and after a few seconds/minutes, I 
now see :

[root@ceph0 ~]# ceph health detail
HEALTH_WARN 4 pgs degraded; 3 pgs recovering; 1 pgs recovery_wait; 4 pgs stuck 
unclean; recovery 57/373846 objects degraded (0.015%); recovery 57/110920 
unfound (0.051%)
pg 691.65 is stuck unclean for 310704.556119, current state 
active+recovery_wait+degraded, last acting [44,99,69,9]
pg 691.1e5 is stuck unclean for 493631.370697, current state 
active+recovering+degraded, last acting [77,43,20,99]
pg 691.12a is stuck unclean for 14521.475478, current state 
active+recovering+degraded, last acting [42,56,7,106]
pg 691.165 is stuck unclean for 14521.474525, current state 
active+recovering+degraded, last acting [21,71,24,117]
pg 691.165 is active+recovering+degraded, acting [21,71,24,117], 15 unfound
pg 691.12a is active+recovering+degraded, acting [42,56,7,106], 1 unfound
pg 691.1e5 is active+recovering+degraded, acting [77,43,20,99], 2 unfound
pg 691.65 is active+recovery_wait+degraded, acting [44,99,69,9], 39 unfound
recovery 57/373846 objects degraded (0.015%)
recovery 57/110920 unfound (0.051%)

Damn.
Last time this happened, I was forced to declare the PGs lost in order to 
recover a "healthy" ceph, because ceph does not want to revert PGs in EC pools. 
But one of the VMs started hanging randomly on disk IOs...
This same VM is now down, and I can't remove its disk from rbd, it's hanging at 
99% - I could work around that by renaming the file and re-installing the VM on 
a new disk, but anyway, I'd like to understand+fix+make sure this does not 
happen again.
We sometimes suffer power cuts here : if restarting daemons kills ceph data, I 
cannot think of what would happen in case of power cut...

Back to the unfound objects. I have no down OSD that is still in the cluster 
(only 1 is down - OSD.46, which I took down myself - and I set its weight to 0 
last week).
I can query the PGs, but I don't understand what I see in there.
For instance :

#ceph pg 691.65 query
(...)
"num_objects_missing": 0,
"num_objects_degraded": 39,
"num_objects_misplaced": 0,
"num_objects_unfound": 39,
"num_objects_dirty": 138,

And then for 2 peers I see :
"state": "active+undersized+degraded", ## undersized ???
(...)
"num_objects_missing": 0,
"num_objects_degraded": 138,
"num_objects_misplaced": 138,
"num_objects_unfound": 0,
"num_objects_dirty": 138,
"blocked_by": [],
"up_primary": 44,
"acting_primary": 44


If I look at the "missing" objects, I can see something on some OSDs :
# ceph pg 691.165 list_missing
(...)
{
"oid": {
"oid": "rbd_data.8de32431bd7b7.0ea7",
"key": "",
"snapid": -2,
"hash": 971513189,
"max": 0,
"pool": 691,
"namespace": ""
},
"need": "26521'22595",
"have": "25922'22575",
"locations": []
}

All of the missing objects have this "need/have" discrepancy.

I can see such objects in a "691.165" directory on secondary OSDs, but I do not 
see any 691.165 directory on the primary OSD (44)... ?
For instance :
[root@ceph0 ~]# ll 
/var/lib/ceph/osd/ceph-21/current/691.165s0_head/*8de32431bd7b7.0ea7*
-rw-r--r-- 1 ceph ceph 1399392 May 15 13:18 
/var/lib/ceph/osd/ceph-21/current/691.165s0_head/rbd\udata.8de32431bd7b7.0ea7__head_39E81D65__2b3_5843_0
-rw-r--r-- 1 ceph ceph 1399392 May 27 11:07 

Re: [ceph-users] jewel upgrade : MON unable to start

2016-05-02 Thread SCHAER Frederic
I believe this is because I did not read the instructions thoroughly enough... 
this is my first "live upgrade" 


-Original Message-
From: Oleksandr Natalenko [mailto:oleksa...@natalenko.name] 
Sent: Monday, May 2, 2016 16:39
To: SCHAER Frederic <frederic.sch...@cea.fr>; ceph-us...@ceph.com
Subject: Re: [ceph-users] jewel upgrade : MON unable to start

Why do you upgrade osds first if it is necessary to upgrade mons before 
everything else?

On May 2, 2016 5:31:43 PM GMT+03:00, SCHAER Frederic <frederic.sch...@cea.fr> 
wrote:
>Hi,
>
>I'm < sort of > following the upgrade instructions on CentOS 7.2.
>I upgraded 3 OSD nodes without too many issues, even if I would rewrite
>those upgrade instructions to :
>
>
>#chrony has ID 167 on my systems... this was set at install time ! but
>I use NTP anyway.
>
>yum remove chrony
>
>sed -i -e '/chrony/d' /etc/passwd
>
>#there is no more "service ceph stop" possible after the yum update, so
>I had to run it before. Or killall ceph daemons...
>
>service ceph stop
>
>yum -y update
>
>chown ceph:ceph /var/lib/ceph
>
>#this fixed some OSDs which failed to start because of permission denied
>issues on the journals.
>
>chown -RL --dereference ceph:ceph /var/lib/ceph
>
>#not done automatically :
>
>systemctl enable ceph-osd.target ceph.target
>
>#systemctl start ceph-osd.target has absolutely no effect. Nor any
>.target targets, at least for me, and right after the upgrade.
>
>ceph-disk activate-all
>
>Anyways. Now I'm trying to upgrade the MON nodes... and I'm facing an
>issue.
>I started with one MON and left the 2 others untouched (hammer).
>
>First, the mons did not want to start :
>May 02 15:40:58 ceph2_snip_ ceph-mon[789124]: warning: unable to create
>/var/run/ceph: (13) Permission denied
>
>No pb : I created and chowned the directory.
>But I'm now still unable to start this MON, journalctl tells me :
>
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: starting mon.ceph2 rank 2
>at _ipsnip_.72:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph2 fsid
>70ac4a78-46c0-45e6-8ff9-878b37f50fa1
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: In function
>'void FSMap::sanity() const' thread 7f774d7d94c0 time 2016-05-02
>16:05:49.487984
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED
>assert(i.second.state == MDSMap::STATE_STANDBY)
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0
>(3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1:
>(ceph::__ceph_assert_fail(char const*, char const*, int, char
>const*)+0x85) [0x7f774de221e5]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity()
>const+0x952) [0x7f774dd3f972]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3:
>(MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4:
>(PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5:
>(Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6:
>(Monitor::init_paxos()+0x95) [0x7f774da67955]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 7:
>(Monitor::preinit()+0x949) [0x7f774da77b39]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 8: (main()+0x23e3)
>[0x7f774da03e93]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 9:
>(__libc_start_main()+0xf5) [0x7f774ad6fb15]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 10: (()+0x25e401)
>[0x7f774da57401]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: NOTE: a copy of the
>executable, or `objdump -rdS ` is needed to interpret this.
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2016-05-02 16:05:49.490966
>7f774d7d94c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const'
>thread 7f774d7d94c0 time 2016-05-02 16:05:49.487984
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED
>assert(i.second.state == MDSMap::STATE_STANDBY)
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0
>(3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1:
>(ceph::__ceph_assert_fail(char const*, char const*, int, char
>const*)+0x85) [0x7f774de221e5]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity()
>const+0x952) [0x7f774dd3f972]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3:
>(MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4:
>(PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5:
>(Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb]
>May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6:
>(Monitor::init_paxos()+0

[ceph-users] jewel upgrade : MON unable to start

2016-05-02 Thread SCHAER Frederic
Hi,

I'm < sort of > following the upgrade instructions on CentOS 7.2.
I upgraded 3 OSD nodes without too many issues, even if I would rewrite those 
upgrade instructions to :


#chrony has ID 167 on my systems... this was set at install time ! but I use 
NTP anyway.

yum remove chrony

sed -i -e '/chrony/d' /etc/passwd

#there is no more "service ceph stop" possible after the yum update, so I had 
to run it before. Or killall ceph daemons...

service ceph stop

yum -y update

chown ceph:ceph /var/lib/ceph

#this fixed some OSDs which failed to start because of permission denied issues 
on the journals.

chown -RL --dereference ceph:ceph /var/lib/ceph

#not done automatically :

systemctl enable ceph-osd.target ceph.target

#systemctl start ceph-osd.target has absolutely no effect. Nor any .target 
targets, at least for me, and right after the upgrade.

ceph-disk activate-all

Anyways. Now I'm trying to upgrade the MON nodes... and I'm facing an issue.
I started with one MON and left the 2 others untouched (hammer).

First, the mons did not want to start :
May 02 15:40:58 ceph2_snip_ ceph-mon[789124]: warning: unable to create 
/var/run/ceph: (13) Permission denied

No pb : I created and chowned the directory.
But I'm now still unable to start this MON, journalctl tells me :

May 02 16:05:49 ceph2_snip ceph-mon[804583]: starting mon.ceph2 rank 2 at 
_ipsnip_.72:6789/0 mon_data /var/lib/ceph/mon/ceph-ceph2 fsid 
70ac4a78-46c0-45e6-8ff9-878b37f50fa1
May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: In function 'void 
FSMap::sanity() const' thread 7f774d7d94c0 time 2016-05-02 16:05:49.487984
May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED 
assert(i.second.state == MDSMap::STATE_STANDBY)
May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0 
(3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x85) [0x7f774de221e5]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity() const+0x952) 
[0x7f774dd3f972]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3: 
(MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4: 
(PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5: 
(Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6: (Monitor::init_paxos()+0x95) 
[0x7f774da67955]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 7: (Monitor::preinit()+0x949) 
[0x7f774da77b39]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 8: (main()+0x23e3) [0x7f774da03e93]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 9: (__libc_start_main()+0xf5) 
[0x7f774ad6fb15]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 10: (()+0x25e401) [0x7f774da57401]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2016-05-02 16:05:49.490966 
7f774d7d94c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 
7f774d7d94c0 time 2016-05-02 16:05:49.487984
May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED 
assert(i.second.state == MDSMap::STATE_STANDBY)
May 02 16:05:49 ceph2_snip ceph-mon[804583]: ceph version 10.2.0 
(3a9fba20ec743699b69bd0181dd6c54dc01c64b9)
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 1: (ceph::__ceph_assert_fail(char 
const*, char const*, int, char const*)+0x85) [0x7f774de221e5]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 2: (FSMap::sanity() const+0x952) 
[0x7f774dd3f972]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 3: 
(MDSMonitor::update_from_paxos(bool*)+0x490) [0x7f774db5cba0]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 4: 
(PaxosService::refresh(bool*)+0x1a5) [0x7f774dacdda5]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 5: 
(Monitor::refresh_from_paxos(bool*)+0x15b) [0x7f774da674bb]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 6: (Monitor::init_paxos()+0x95) 
[0x7f774da67955]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 7: (Monitor::preinit()+0x949) 
[0x7f774da77b39]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 8: (main()+0x23e3) [0x7f774da03e93]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 9: (__libc_start_main()+0xf5) 
[0x7f774ad6fb15]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 10: (()+0x25e401) [0x7f774da57401]
May 02 16:05:49 ceph2_snip ceph-mon[804583]: NOTE: a copy of the executable, or 
`objdump -rdS ` is needed to interpret this.
May 02 16:05:49 ceph2_snip ceph-mon[804583]: 0> 2016-05-02 16:05:49.490966 
7f774d7d94c0 -1 mds/FSMap.cc: In function 'void FSMap::sanity() const' thread 
7f774d7d94c0 time 2016-05-02 16:05:49.487984
May 02 16:05:49 ceph2_snip ceph-mon[804583]: mds/FSMap.cc: 607: FAILED 
assert(i.second.state == MDSMap::STATE_STANDBY)
(...)


I'm now stuck with a half jewel/hammer cluster... DOH ! :'(

I've seen a bug on the bugtracker, but I fail to find a workaround ?

[ceph-users] ceph OSD down+out =>health ok => remove => PGs backfilling... ?

2016-04-26 Thread SCHAER Frederic
Hi,

One simple/quick question.
In my ceph cluster, I had a disk which was in predicted failure. It was so much 
in predicted failure that the ceph OSD daemon crashed.

After the OSD crashed, ceph moved data correctly (or at least that's what I 
thought), and a ceph -s was giving a "HEALTH_OK".
Perfect.
I tried to tell ceph to mark the OSD down : it told me the OSD was already 
down... fine.

Then I ran this :
ID=43 ; ceph osd down $ID ; ceph auth del osd.$ID ; ceph osd rm $ID ; ceph osd 
crush remove osd.$ID

And immediately after this, ceph told me :
# ceph -s
cluster 70ac4a78-46c0-45e6-8ff9-878b37f50fa1
 health HEALTH_WARN
37 pgs backfilling
3 pgs stuck unclean
recovery 12086/355688 objects misplaced (3.398%)
 monmap e2: 3 mons at 
{ceph0=192.54.207.70:6789/0,ceph1=192.54.207.71:6789/0,ceph2=192.54.207.72:6789/0}
election epoch 938, quorum 0,1,2 ceph0,ceph1,ceph2
 mdsmap e64: 1/1/1 up {0=ceph1=up:active}, 1 up:standby-replay, 1 up:standby
 osdmap e25455: 119 osds: 119 up, 119 in; 35 remapped pgs
  pgmap v5473702: 3212 pgs, 10 pools, 378 GB data, 97528 objects
611 GB used, 206 TB / 207 TB avail
12086/355688 objects misplaced (3.398%)
3175 active+clean
  37 active+remapped+backfilling
  client io 192 kB/s rd, 1352 kB/s wr, 117 op/s

Of course, I'm sure the OSD 43 was the one that was down ;)
My question therefore is :

If ceph successfully and automatically migrated data off the down/out OSD, why 
is there even anything happening once I tell ceph to forget about this osd ?
Was the cluster not "HEALTH OK" after all ?

(ceph-0.94.6-0.el7.x86_64 for now)
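For reference, a commonly suggested way to avoid this second data movement is to 
drain the OSD through CRUSH first and only remove it once the cluster is clean 
again - a sketch, assuming osd.43 and that you can wait for the rebalance :

ID=43
ceph osd crush reweight osd.$ID 0   # triggers the remapping now, while the OSD item still exists
ceph -s                             # wait until all PGs are active+clean again
ceph osd out $ID
ceph auth del osd.$ID
ceph osd rm $ID
ceph osd crush remove osd.$ID       # removing the now zero-weighted item should not move data anymore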

Thanks && regards

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph startup issues : OSDs don't start

2016-04-21 Thread SCHAER Frederic
Hi,

I'm sure I'm doing something wrong, I hope someone can enlighten me...
I'm encountering many issues when I restart a ceph server (any ceph server).

This is on CentOS 7.2, ceph-0.94.6-0.el7.x86_64.

First : I have disabled abrt. I don't need abrt.
But when I restart, I see these logs in the systemd-udevd journal :


Apr 21 18:00:14 ceph4._snip_ python[1109]: detected unhandled Python exception 
in '/usr/sbin/ceph-disk'
Apr 21 18:00:14 ceph4._snip_ python[1109]: can't communicate with ABRT daemon, 
is it running? [Errno 2] No such file or directory
Apr 21 18:00:14 ceph4._snip_ python[1174]: detected unhandled Python exception 
in '/usr/sbin/ceph-disk'
Apr 21 18:00:14 ceph4._snip_ python[1174]: can't communicate with ABRT daemon, 
is it running? [Errno 2] No such file or directory

How could I possibly debug these exceptions ?
Could that be related to the osd hook that I'm using to put the SSDs in another 
root in the crush map (that hook is a bash script, but it's calling another 
helper python script that I made and which is trying to use megacli to identify 
the SSDs on a non-jbod controller... tricky thing.) ?
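For what it's worth, the hook itself does not need to be big ; a minimal sketch 
of such a crush location hook (the /usr/local/bin/is_ssd helper is hypothetical, 
it stands for the megacli logic) could look like :

#!/bin/bash
# set in ceph.conf with : osd crush location hook = /usr/local/bin/ceph-crush-location-ssd
# ceph calls the hook as : <hook> --cluster <name> --id <osd id> --type osd
# and expects the CRUSH location on stdout
while [ $# -ge 1 ]; do
  case "$1" in
    --id) shift; OSD_ID="$1" ;;
  esac
  shift
done
if /usr/local/bin/is_ssd "$OSD_ID"; then
  echo "root=ssd host=$(hostname -s)-ssd"
else
  echo "root=default host=$(hostname -s)"
fi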


Then, I see this kind of error for most if not all drives :


Apr 21 18:00:47 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate 
/dev/sdt1'(err) '2016-04-21 18:00:47.115322 7fc408ff9700  0 -- :/885104093 >> 
__MON_IP__:6789/0 pipe(0x7fc48280 sd=6 :0 s=1 pgs=0 cs=0 l=1 
c=0x7fc400012670).fault'
Apr 21 18:00:50 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate 
/dev/sdt1'(err) '2016-04-21 18:00:50.115543 7fc408ef8700  0 -- :/885104093 >> 
__MON_IP__:6789/0 pipe(0x7fc40c00 sd=6 :0 s=1 pgs=0 cs=0 l=1 
c=0x7fc4e1d0).fault'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate 
/dev/sdt1'(out) 'failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf 
--name=osd.113 --keyring=/var/lib/ceph/osd/ceph-113/keyring osd crush 
create-or-move -- 113 1.81 host=ceph4 root=default''
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate 
/dev/sdt1'(err) 'ceph-disk: Error: ceph osd start failed: Command 
'['/usr/sbin/service', 'ceph', '--cluster', 'ceph', 'start', 'osd.113']' 
returned non-zero exit status 1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: '/usr/sbin/ceph-disk-activate 
/dev/sdt1' [1257] exit with return code 1
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: adding watch on '/dev/sdt1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: created db file 
'/run/udev/data/b65:49' for 
'/devices/pci:00/:00:07.0/:03:00.0/host2/target2:2:6/2:2:6:0/block/sdt/sdt1'
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: passed unknown number of bytes 
to netlink monitor 0x7f4cec2f3240
Apr 21 18:00:52 ceph4._snip_ systemd-udevd[876]: seq 2553 processed with 0

Please note that at that time of the boot, I think there is still no network as 
the interfaces are brought up later according to the network journal :

Apr 21 18:02:16 ceph4._snip_ network[2904]: Bringing up interface p2p1:  [  OK  
]
Apr 21 18:02:19 ceph4._snip_ network[2904]: Bringing up interface p2p2:  [  OK  
]

=> too bad for the OSD startups... I have to say I also disabled 
NetworkManager, and I'm using static network configuration files... but I don't 
know why the ceph init script would be called before network is up... ?
But even if I had network, I'm having another issue : I'm wondering whether I'm 
hitting deadlocks somewhere...
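For the network-ordering part above, a crude workaround could be to re-run the 
OSD activation once the network is really up, instead of relying only on the 
udev-triggered activation at boot - an untested sketch (unit name is arbitrary) :

cat > /etc/systemd/system/ceph-activate-all.service <<'EOF'
[Unit]
Description=Re-activate prepared ceph OSDs once the network is up
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ceph-disk activate-all

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable ceph-activate-all.service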

Apr 21 18:01:10 ceph4._snip_ systemd-udevd[779]: worker [792] 
/devices/pci:00/:00:07.0/:03:00.0/host2/target2:2:0/2:2:0:0/block/sdn/sdn2
 is taking a long time
(...)
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(err) 'SG_IO: bad/missing sense data, sb[]:  70 00 
05 00'
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(err) ' 00 00 00 0b 00 00 00 00 20 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00'
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(err) ''
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(out) '=== osd.107 === '
Apr 21 18:01:54 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(err) '2016-04-21 18:01:54.707669 7f95801ac700  0 -- 
:/2141879112 >> __MON_IP__:6789/0 pipe(0x7f957c05f710 sd=4 :0 s=1 pgs=0 cs=0 
l=1 c=0x7f957c05bb40).fault'
(...)
Apr 21 18:02:12 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(err) '2016-04-21 18:02:12.709053 7f95801ac700  0 -- 
:/2141879112 >> __MON_IP__:6789/0 pipe(0x7f9570008280 sd=4 :0 s=1 pgs=0 cs=0 
l=1 c=0x7f95700056a0).fault'
Apr 21 18:02:16 ceph4._snip_ systemd-udevd[792]: '/usr/sbin/ceph-disk 
activate-journal /dev/sdn2'(err) 'create-or-move updated item name 'osd.107' 
weight 1.81 at location {host=ceph4,root=default} to crush map'
Apr 21 18:02:16 ceph4._snip_ 


Re: [ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

2016-02-27 Thread SCHAER Frederic
Hi,

Many thanks.
Just tested : I could see the rbd_id object in the EC pool, and after promoting 
it I could see it in the SSD cache pool and could successfully list the image 
information, indeed.
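For the record, the whole check/workaround boils down to something like this - a 
sketch with the image name used below (yyy-disk1) and my pool names, adapt as 
needed ; I haven't checked whether the extra 'dummy' omap key needs to be cleaned 
up afterwards :

rados -p ssd-hot-irfu-virt ls | grep '^rbd_id.yyy-disk1'     # is the image header object in the cache tier ?
rados -p irfu-virt setomapval rbd_id.yyy-disk1 dummy value   # if not, force-promote it
rbd -p irfu-virt info yyy-disk1                              # should now work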

Cheers

-Original Message-
From: Jason Dillaman [mailto:dilla...@redhat.com]
Sent: Wednesday, February 24, 2016 19:16
To: SCHAER Frederic <frederic.sch...@cea.fr>
Cc: ceph-us...@ceph.com; HONORE Pierre-Francois <pierre-francois.hon...@cea.fr>
Subject: Re: [ceph-users] ceph hammer : rbd info/Status : operation not 
supported (95) (EC+RBD tier pools)

If you run "rados -p  ls | grep "rbd_id." and don't see 
that object, you are experiencing that issue [1].

You can attempt to work around this issue by running "rados -p irfu-virt 
setomapval rbd_id. dummy value" to force-promote the object to the 
cache pool.  I haven't tested / verified that will alleviate the issue, though.

[1] http://tracker.ceph.com/issues/14762

-- 

Jason Dillaman 

----- Original Message - 

> From: "SCHAER Frederic" <frederic.sch...@cea.fr>
> To: ceph-us...@ceph.com
> Cc: "HONORE Pierre-Francois" <pierre-francois.hon...@cea.fr>
> Sent: Wednesday, February 24, 2016 12:56:48 PM
> Subject: [ceph-users] ceph hammer : rbd info/Status : operation not supported
> (95) (EC+RBD tier pools)

> Hi,

> I just started testing VMs inside ceph this week, ceph-hammer 0.94-5 here.

> I built several pools, using pool tiering:
> - A small replicated SSD pool (5 SSDs only, but I thought it’d be better for
> IOPS, I intend to test the difference with disks only)
> - Overlaying a larger EC pool

> I just have 2 VMs in Ceph… and one of them is breaking something.
> The VM that is not breaking was migrated using qemu-img for creating the ceph
> volume, then migrating the data. Its rbd format is 1 :
> rbd image 'xxx-disk1':
> size 20480 MB in 5120 objects
> order 22 (4096 kB objects)
> block_name_prefix: rb.0.83a49.3d1b58ba
> format: 1

> The VM that’s failing has a rbd format 2
> this is what I had before things started breaking :
> rbd image 'yyy-disk1':
> size 10240 MB in 2560 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.8ae1f47398c89
> format: 2
> features: layering, striping
> flags:
> stripe unit: 4096 kB
> stripe count: 1

> The VM started behaving weirdly with a huge IOwait % during its install
> (that’s to say it did not take long to go wrong ;) )
> Now, this is the only thing that I can get

> [root@ceph0 ~]# rbd -p irfu-virt info yyy-disk1
> 2016-02-24 18:30:33.213590 7f00e6f6d7c0 -1 librbd::ImageCtx: error reading
> image id: (95) Operation not supported
> rbd: error opening image yyy-disk1: (95) Operation not supported

> One thing to note : the VM * IS STILL * working : I can still do disk
> operations, apparently.
> During the VM installation, I realized I wrongly set the target SSD caching
> size to 100Mbytes, instead of 100Gbytes, and ceph complained it was almost
> full :
> health HEALTH_WARN
> 'ssd-hot-irfu-virt' at/near target max

> My question is…… am I facing the bug as reported in this list thread with
> title “Possible Cache Tier Bug - Can someone confirm” ?
> Or did I do something wrong ?

> The libvirt and kvm that are writing into ceph are the following :
> libvirt -1.2.17-13.el7_2.3.x86_64
> qemu- kvm -1.5.3-105.el7_2.3.x86_64

> Any idea how I could recover the VM file, if possible ?
> Please note I have no problem with deleting the VM and rebuilding it, I just
> spawned it to test.
> As a matter of fact, I just “virsh destroyed” the VM, to see if I could start
> it again… and I cant :

> # virsh start yyy
> error: Failed to start domain yyy
> error: internal error: process exited while connecting to monitor:
> 2016-02-24T17:49:59.262170Z qemu-kvm: -drive
> file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***==:auth_supported=cephx\;none:mon_host=_\:6789,if=none,id=drive-virtio-disk0,format=raw:
> error reading header from yyy-disk1
> 2016-02-24T17:49:59.263743Z qemu-kvm: -drive
> file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=A***==:auth_supported=cephx\;none:mon_host=___\:6789,if=none,id=drive-virtio-disk0,format=raw:
> could not open disk image
> rbd:irfu-virt/___-disk1:id=irfu-***==:auth_supported=cephx\;none:mon_host=___\:6789:
> Could not open 'rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***

> Ideas ?
> Thanks
> Frederic

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph hammer : rbd info/Status : operation not supported (95) (EC+RBD tier pools)

2016-02-24 Thread SCHAER Frederic
Hi,

I just started testing VMs inside ceph this week, ceph-hammer 0.94-5 here.

I built several pools, using pool tiering:

-  A small replicated SSD pool (5 SSDs only, but I thought it'd be 
better for IOPS, I intend to test the difference with disks only)

-  Overlaying a larger EC pool

I just have 2 VMs in Ceph... and one of them is breaking something.
The VM that is not breaking was migrated using qemu-img for creating the ceph 
volume, then migrating the data. Its rbd format is 1 :
rbd image 'xxx-disk1':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.83a49.3d1b58ba
format: 1

The VM that's failing has a rbd format 2
this is what I had before things started breaking :
rbd image 'yyy-disk1':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8ae1f47398c89
format: 2
features: layering, striping
flags:
stripe unit: 4096 kB
stripe count: 1


The VM started behaving weirdly with a huge IOwait % during its install (that's 
to say it did not take long to go wrong ;) )
Now, this is the only thing that I can get

[root@ceph0 ~]# rbd -p irfu-virt info yyy-disk1
2016-02-24 18:30:33.213590 7f00e6f6d7c0 -1 librbd::ImageCtx: error reading 
image id: (95) Operation not supported
rbd: error opening image yyy-disk1: (95) Operation not supported

One thing to note : the VM *IS STILL* working : I can still do disk operations, 
apparently.
During the VM installation, I realized I wrongly set the target SSD caching 
size to 100Mbytes, instead of 100Gbytes, and ceph complained it was almost full 
:
 health HEALTH_WARN
'ssd-hot-irfu-virt' at/near target max

My question is.. am I facing the bug as reported in this list thread with 
title "Possible Cache Tier Bug - Can someone confirm" ?
Or did I do something wrong ?

The libvirt and kvm that are writing into ceph are the following :
libvirt-1.2.17-13.el7_2.3.x86_64
qemu-kvm-1.5.3-105.el7_2.3.x86_64

Any idea how I could recover the VM file, if possible ?
Please note I have no problem with deleting the VM and rebuilding it, I just 
spawned it to test.
As a matter of fact, I just "virsh destroyed" the VM, to see if I could start 
it again... and I cant :

# virsh start yyy
error: Failed to start domain yyy
error: internal error: process exited while connecting to monitor: 
2016-02-24T17:49:59.262170Z qemu-kvm: -drive 
file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***==:auth_supported=cephx\;none:mon_host=_\:6789,if=none,id=drive-virtio-disk0,format=raw:
 error reading header from yyy-disk1
2016-02-24T17:49:59.263743Z qemu-kvm: -drive 
file=rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=A***==:auth_supported=cephx\;none:mon_host=___\:6789,if=none,id=drive-virtio-disk0,format=raw:
 could not open disk image 
rbd:irfu-virt/___-disk1:id=irfu-***==:auth_supported=cephx\;none:mon_host=___\:6789:
 Could not open 'rbd:irfu-virt/yyy-disk1:id=irfu-virt:key=***

Ideas ?
Thanks
Frederic
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure Coding pool stuck at creation because of pre-existing crush ruleset ?

2015-09-30 Thread SCHAER Frederic
Hi,

With 5 hosts, I could successfully create pools with k=4 and m=1, with the 
failure domain being set to "host".
With 6 hosts, I could also create k=4,m=1 EC pools.
But I suddenly failed with 6 hosts and k=5,m=1, or k=4,m=2 : the PGs were never 
created (I reused the pool name for my tests, and this seems to matter, see 
below) ??

HEALTH_WARN 512 pgs stuck inactive; 512 pgs stuck unclean
pg 159.70 is stuck inactive since forever, current state creating, last acting 
[]
pg 159.71 is stuck inactive since forever, current state creating, last acting 
[]
pg 159.72 is stuck inactive since forever, current state creating, last acting 
[]

The pool is like this :
[root@ceph0 ~]# ceph osd pool get testec erasure_code_profile
erasure_code_profile: erasurep4_2_host
[root@ceph0 ~]# ceph osd erasure-code-profile get erasurep4_2_host
directory=/usr/lib64/ceph/erasure-code
k=4
m=2
plugin=isa
ruleset-failure-domain=host


The PG list is like this - all PGs are alike- :
pg_stat objects mip degrmispunf bytes   log disklog state   
state_stamp v   reportedup  up_primary  acting  
acting_primary  last_scrub  scrub_stamp last_deep_scrub deep_scrub_stamp
159.0   0   0   0   0   0   0   0   0   
creating0.000'0 0:0 []  -1  []  -1  
0'0 2015-09-30 14:41:01.219196  0'0 2015-09-30 14:41:01.219196
159.1   0   0   0   0   0   0   0   0   
creating0.000'0 0:0 []  -1  []  -1  
0'0 2015-09-30 14:41:01.219197  0'0 2015-09-30 14:41:01.219197


I can't dump a PG (but if it's on no OSD then...)
[root@ceph0 ~]# ceph pg 159.0 dump
^CError EINTR: problem getting command descriptions from pg.159.0

=> Hangs.

The OSD tree is like this :
-1 21.71997 root default
-2  3.62000 host ceph4
  9  1.81000 osd.9   up  1.0  1.0
15  1.81000 osd.15  up  1.0  1.0
-3  3.62000 host ceph0
  5  1.81000 osd.5   up  1.0  1.0
11  1.81000 osd.11  up  1.0  1.0
-4  3.62000 host ceph1
  6  1.81000 osd.6   up  1.0  1.0
12  1.81000 osd.12  up  1.0  1.0
-5  3.62000 host ceph2
  7  1.81000 osd.7   up  1.0  1.0
13  1.81000 osd.13  up  1.0  1.0
-6  3.62000 host ceph3
  8  1.81000 osd.8   up  1.0  1.0
14  1.81000 osd.14  up  1.0  1.0
-13  3.62000 host ceph5
10  1.81000 osd.10  up  1.0  1.0
16  1.81000 osd.16  up  1.0  1.0


Then, I dumped the crush ruleset and noticed the "max_size=5".
[root@ceph0 ~]# ceph osd pool get testec crush_ruleset
crush_ruleset: 1
[root@ceph0 ~]# ceph osd crush rule dump testec
{
"rule_id": 1,
"rule_name": "testec",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 5,

I thought I should not care, since I'm not creating a replicated pool but...
I then deleted the pool + deleted the "testec" ruleset, re-created the pool 
and... boom, PGs started being created !?

Now, the ruleset looks like this :
[root@ceph0 ~]# ceph osd crush rule dump testec
{
"rule_id": 1,
"rule_name": "testec",
"ruleset": 1,
"type": 3,
"min_size": 3,
"max_size": 6,
   ^^^

Is this a bug, or a "feature" (if so, I'd be glad if someone could shed some 
light on it ?) ?
I'm presuming ceph is considering that an EC chunk is a replica, but I'm 
failing to understand the documentation : I did not select the crush ruleset 
when I created the pool.
Still, the ruleset was chosen by default (by CRUSH?) , and was not working... ?
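In case it helps someone else, the max_size of the autogenerated rule can also be 
raised by hand by editing the crushmap - sketch, untested on a loaded cluster :

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt : in the "rule testec" section, set max_size to k+m (or simply 10)
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new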

Thanks && regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Important security noticed regarding release signing key

2015-09-21 Thread SCHAER Frederic
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Monday, September 21, 2015 15:50
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Important security noticed regarding release signing 
key



On 21-09-15 15:05, SCHAER Frederic wrote:
> Hi,
> 
> Forgive the question if the answer is obvious... It's been more than "an hour 
> or so" and eu.ceph.com apparently still hasn't been re-signed or at least 
> what I checked wasn't :
> 
> # rpm -qp --qf '%{RSAHEADER:pgpsig}' 
> http://eu.ceph.com/rpm-hammer/el7/x86_64/ceph-0.94.3-0.el7.centos.x86_64.rpm
> RSA/SHA1, Wed 26 Aug 2015 09:57:17 PM CEST, Key ID 7ebfdd5d17ed316d
> 
> Should this repository/mirror be discarded and should we (in EU) switch to 
> download.ceph.com ?

I fixed eu.ceph.com by putting a Varnish HTTP cache in between which now
links to ceph.com

You can still use eu.ceph.com and should be able to do so.

eu.ceph.com caches all traffic so that should be much snappier then
downloading everything from download.ceph.com directly.

Wido

[>- FS : -<] Many thanks for your quick reply and quick reaction !

Frederic 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-08-07 Thread SCHAER Frederic

From: Jake Young [mailto:jak3...@gmail.com]
Sent: Wednesday, July 29, 2015 17:13
To: SCHAER Frederic <frederic.sch...@cea.fr>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Tue, Jul 28, 2015 at 11:48 AM, SCHAER Frederic 
<frederic.sch...@cea.fr> wrote:

 Hi again,

 So I have tried
 - changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
 - changing the memory configuration, from advanced ecc mode to performance 
 mode, boosting the memory bandwidth from 35GB/s to 40GB/s
 - plugged a second 10GB/s link and setup a ceph internal network
 - tried various tuned-adm profile such as throughput-performance

 This changed about nothing.

 If
 - the CPUs are not maxed out, and lowering the frequency doesn't change a 
 thing
 - the network is not maxed out
 - the memory doesn't seem to have an impact
 - network interrupts are spread across all 8 cpu cores and receive queues are 
 OK
 - disks are not used at their maximum potential (iostat shows my dd commands 
 produce much more tps than the 4MB ceph transfers...)

 Where can I possibly find a bottleneck ?

 I'm /(almost) out of ideas/ ... :'(

 Regards


Frederic,

I was trying to optimize my ceph cluster as well and I looked at all of the 
same things you described, which didn't help my performance noticeably.

The following network kernel tuning settings did help me significantly.

This is my /etc/sysctl.conf file on all of  my hosts: ceph mons, ceph osds and 
any client that connects to my ceph cluster.

# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
#net.core.rmem_max = 56623104
#net.core.wmem_max = 56623104
# Use 128M buffers
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
# Also increase the max packet backlog
net.core.somaxconn = 1024
# Increase the length of the processor input queue
net.core.netdev_max_backlog = 25
net.ipv4.tcp_max_syn_backlog = 3
net.ipv4.tcp_max_tw_buckets = 200
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10

# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0

# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

# Disable source routing and redirects
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.accept_source_route = 0

# Recommended when jumbo frames are enabled
net.ipv4.tcp_mtu_probing = 1
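(For reference, once placed in /etc/sysctl.conf these can be applied to a 
running box with something like "sysctl -p /etc/sysctl.conf", and spot-checked 
afterwards with e.g. "sysctl net.core.rmem_max".)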

I have 40 Gbps links on my osd nodes, and 10 Gbps links on everything else.

Let me know if that helps.

Jake
[- FS : -]
Hi,

Thanks for suggesting these :]

I finally got some time to try your kernel parameters… but that doesn’t seem to 
help at least for the EC pools.
I’ll need to re-add all the disk OSDs to be really sure, especially with the 
replicated pools – I’d like to see if at least the replicated pools are better, 
so that I can use them as frontend pools…

Regards



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-28 Thread SCHAER Frederic
Hi again,

So I have tried 
- changing the cpus frequency : either 1.6GHZ, or 2.4GHZ on all cores
- changing the memory configuration, from advanced ecc mode to performance 
mode, boosting the memory bandwidth from 35GB/s to 40GB/s
- plugged a second 10GB/s link and setup a ceph internal network
- tried various tuned-adm profile such as throughput-performance

This changed about nothing.

If 
- the CPUs are not maxed out, and lowering the frequency doesn't change a thing
- the network is not maxed out
- the memory doesn't seem to have an impact
- network interrupts are spread across all 8 cpu cores and receive queues are OK
- disks are not used at their maximum potential (iostat shows my dd commands 
produce much more tps than the 4MB ceph transfers...)

Where can I possibly find a bottleneck ?

I'm /(almost) out of ideas/ ... :'(

Regards

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: Friday, July 24, 2015 16:04
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] Ceph 0.94 (and lower) 
performance on 1 hosts ??

Hi,

Thanks.
I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - 
I can reach 100% cpu % for IRQs, but that's shared across all 8 physical cores.
I also discovered turbostat which showed me the R510s were not configured for 
performance in the bios (but dbpm - demand based power management), and were 
not bumping the CPUs frequency to 2.4GHz as they should... only apparently 
remaining at 1.6Ghz...

But changing that did not improve things unfortunately. I now have CPUs using 
their Xeon turbo frequency, but no throughput improvement.

Looking at RPS/ RSS, it looks like our Broadcom cards are configured correctly 
according to redhat, i.e : one receive queue per physical core, spreading the 
IRQ load everywhere.
One thing I noticed though is that the dell BIOS allows to change IRQs... but 
once you change the network card IRQ, it also changes the RAID card IRQ as well 
as many others, all sharing the same bios IRQ (that's therefore apparently a 
useless option). Weird.

Still attempting to determine the bottleneck ;)

Regards
Frederic

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, July 23, 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

 Your note that dd can do 2GB/s without networking makes me think that
 you should explore that. As you say, network interrupts can be
 problematic in some systems. The only thing I can think of that's been
 really bad in the past is that some systems process all network
 interrupts on cpu 0, and you probably want to make sure that it's
 splitting them across CPUs.


An IRQ overload would be very visible with atop.

Splitting the IRQs will help, but it is likely to need some smarts.

As in, irqbalance may spread things across NUMA nodes.

A card with just one IRQ line will need RPS (Receive Packet Steering),
irqbalance can't help it.

For example, I have a compute node with such a single line card and Quad
Opterons (64 cores, 8 NUMA nodes).

The default is all interrupt handling on CPU0 and that is very little,
except for eth2. So this gets a special treatment:
---
echo 4 > /proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default

---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2
cores, otherwise with this architecture just using 4 and 5 (same L2 cache)
would be better.

Regards,

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-24 Thread SCHAER Frederic
Hi,

Thanks.
I did not know about atop, nice tool... and I don't seem to be IRQ overloaded - 
I can reach 100% cpu % for IRQs, but that's shared across all 8 physical cores.
I also discovered turbostat which showed me the R510s were not configured for 
performance in the bios (but dbpm - demand based power management), and were 
not bumping the CPUs frequency to 2.4GHz as they should... only apparently 
remaining at 1.6Ghz...

But changing that did not improve things unfortunately. I now have CPUs using 
their Xeon turbo frequency, but no throughput improvement.

Looking at RPS/ RSS, it looks like our Broadcom cards are configured correctly 
according to redhat, i.e : one receive queue per physical core, spreading the 
IRQ load everywhere.
One thing I noticed though is that the dell BIOS allows to change IRQs... but 
once you change the network card IRQ, it also changes the RAID card IRQ as well 
as many others, all sharing the same bios IRQ (that's therefore apparently a 
useless option). Weird.

Still attempting to determine the bottleneck ;)

Regards
Frederic

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, July 23, 2015 14:18
To: ceph-users@lists.ceph.com
Cc: Gregory Farnum; SCHAER Frederic
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

On Thu, 23 Jul 2015 11:14:22 +0100 Gregory Farnum wrote:

 Your note that dd can do 2GB/s without networking makes me think that
 you should explore that. As you say, network interrupts can be
 problematic in some systems. The only thing I can think of that's been
 really bad in the past is that some systems process all network
 interrupts on cpu 0, and you probably want to make sure that it's
 splitting them across CPUs.


An IRQ overload would be very visible with atop.

Splitting the IRQs will help, but it is likely to need some smarts.

As in, irqbalance may spread things across NUMA nodes.

A card with just one IRQ line will need RPS (Receive Packet Steering),
irqbalance can't help it.

For example, I have a compute node with such a single line card and Quad
Opterons (64 cores, 8 NUMA nodes).

The default is all interrupt handling on CPU0 and that is very little,
except for eth2. So this gets a special treatment:
---
echo 4 > /proc/irq/106/smp_affinity_list
---
Pinning the IRQ for eth2 to CPU 4 by default

---
echo f0 > /sys/class/net/eth2/queues/rx-0/rps_cpus
---
giving RPS CPUs 4-7 to work with. At peak times it needs more than 2
cores, otherwise with this architecture just using 4 and 5 (same L2 cache)
would be better.

Regards,

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-23 Thread SCHAER Frederic
Hi,

Well I think the journaling would still appear in the dstat output, as that's 
still IOs : even if the user-side bandwidth indeed is cut in half, that should 
not be the case for disk IO.
For instance I just tried a replicated pool for the test, and got around 
1300MiB/s in dstat for about 600MiB/s in the rados bench - I take it that, 
with replication/size=2, each user IO becomes 2 replica writes plus their 
journal writes, spread over the hosts : 600 * 2 replicas * 2 (data + journal) 
/ 2 hosts = 1200MiB/s of IOs per host (+/- the approximations)...

Using the dd flag oflag=sync indeed lowers the dstat values down to 
1100-1300MiB/s. Still above what ceph uses with EC pools .

I have tried to identify/watch interrupt issues (using the watch command), but 
I have to say I have failed until now.
The Broadcom card is indeed spreading the load on the cpus:

# egrep 'CPU|p2p' /proc/interrupts
CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   
CPU6   CPU7   CPU8   CPU9   CPU10  CPU11  CPU12  
CPU13  CPU14  CPU15
  80: 881646372   1508 30  97328  0  
10459270   2715   8753  0  12765   5100   
9148   9420  0   PCI-MSI-edge  p2p1
  82: 179710 165107  94684 334842 210219  47403 
270330 166877   3516 229043  709844660  16512   5088   
2456312  12302   PCI-MSI-edge  p2p1-fp-0
  83:  12454  14073   5571  15196   5282  22301  
11522  21299 4092581302069   1303  79810  705953243   
1836  15190 883683   PCI-MSI-edge  p2p1-fp-1
  84:   6463  13994  57006  16200  16778 374815 
558398  11902  695554360  94228   1252  18649 825684   
7555 731875 190402   PCI-MSI-edge  p2p1-fp-2
  85: 163228 259899 143625 121326 107509 798435 
168027 144088  75321  89962  55297  715175665 784356  
53961  92153  92959   PCI-MSI-edge  p2p1-fp-3
  86:233267453226792070827220797122540051748938
39492831684674 65008514098872704778 140711 160954 
5910372981286  672487805   PCI-MSI-edge  p2p1-fp-4
  87:  33772 233318 136341  58163 506773 183451   
18269706  52425 226509  22150  17026 176203   5942  
681346619 270341  87435   PCI-MSI-edge  p2p1-fp-5
  88:   65103573  105514146   51193688   51330824   41771147   61202946   
41053735   49301547 181380   73028922  39525 172439 155778 
108065  154750931   26348797   PCI-MSI-edge  p2p1-fp-6
  89:   59287698  120778879   43446789   47063897   39634087   39463210   
46582805   48786230 342778   82670325 135397 438041 318995
3642955  179107495 833932   PCI-MSI-edge  p2p1-fp-7
  90:   1804   4453   2434  19885  11527   9771  
12724   2392840  12721439   1166   3354
560  69386   9233   PCI-MSI-edge  p2p2
  92:6455149433007258203245273513   115645711838476
22200494039978 977482   15351931 9494511685983 772531
271810175312351954224   PCI-MSI-edge  p2p2-fp-0

I don't know yet how to check if there are memory bandwidth/latency/whatever 
issues...

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-22 Thread SCHAER Frederic
Hi Gregory,



Thanks for your replies.

Let's take the 2 hosts config setup (3 MON + 3 idle MDS on same hosts).



2 dell R510 servers, CentOS 7.0.1406, dual xeon 5620 (8 
cores+hyperthreading),16GB RAM, 2 or 1x10gbits/s Ethernet (same results with 
and without private 10gbits network), PERC H700 + 12 2TB SAS disks, and PERC 
H800 + 11 2TB SAS disks (one unused SSD...)

The EC pool is defined with k=4, m=1

I set the failure domain to OSD for the test

The OSDs are set up with XFS and a 10GB journal 1st partition (the single 
doomed-dell SSD was a bottleneck for 23 disks…)

All disks are presently configured with a single-RAID0 because H700/H800 do not 
support JBOD.



I have 5 clients (CentOS 7.1), 10gbits/s ethernet, all running this command :

rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 
--run-name bench_`hostname -s` --no-cleanup

I'm aggregating the average bandwidth at the end of the tests.

I'm monitoring the Ceph servers stats live with this dstat command: dstat -N 
p2p1,p2p2,total

The network MTU is 9000 on all nodes.
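For completeness, the EC profile/pool described above can be recreated on hammer 
with something like this (sketch ; profile/pool names and PG counts are 
arbitrary) :

ceph osd erasure-code-profile set ec41 k=4 m=1 ruleset-failure-domain=osd
ceph osd pool create testec 1024 1024 erasure ec41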



With this, the average client throughput is around 130MiB/s, i.e 650 MiB/s for 
the whole 2-nodes ceph cluster / 5 clients.

I since have tried removing (ceph osd out/ceph osd crush reweight 0) either the 
H700 or the H800 disks, thus only using 11 or 12 disks per server, and I either 
get 550 MiB/s or 590MiB/s of aggregated clients bandwidth. Not much less 
considering I removed half disks !

I'm therefore starting to think I am CPU /memory bandwidth limited... ?



That's not however what I am tempted to conclude (for the cpu at least) when I 
see the dstat output, as it says the cpus still sit idle or IO waiting :



total-cpu-usage -dsk/total- --net/p2p1net/p2p2---net/total- 
---paging-- ---system--

usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   
out | int   csw

  1   1  97   0   0   0| 586k 1870k|   0 0 :   0 0 :   0 0 |  49B  
455B|816715k

29  17  24  27   0   3| 128k  734M| 367M  870k:   0 0 : 367M  870k|   0 
0 |  61k   61k

30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0 
0 |  65k   68k

25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0 
0 |  56k   64k

19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0 
0 |  45k   55k

15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0 
0 |  35k   41k

25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0 
0 |  54k   53k



Could it be the interrupts or system context switches that cause this 
relatively poor performance per node ?

PCI-E interactions with the PERC cards ?

I know I can get way more disk throughput with dd (command below)

total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--

usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw

  1   1  97   0   0   0| 595k 2059k|   0 0 | 634B 2886B|797115k

  1  93   0   3   0   3|   0  1722M|  49k   78k|   0 0 |  40k   47k

  1  93   0   3   0   3|   0  1836M|  40k   69k|   0 0 |  45k   57k

  1  95   0   2   0   2|   0  1805M|  40k   69k|   0 0 |  38k   34k

  1  94   0   3   0   2|   0  1864M|  37k   38k|   0 0 |  35k   24k

(…)



Dd command : # use at your own risk # FS_THR=64 ; FILE_MB=8 ; N_FS=`mount|grep 
ceph|wc -l` ; time (for i in `mount|grep ceph|awk '{print $3}'` ; do echo 
"writing $FS_THR times (threads)  $[ 4 * FILE_MB ]  mb on $i..." ; for j in 
`seq 1 $FS_THR` ; do dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M 
count=$[ FILE_MB / 4 ] & done ; done ; wait) ; echo wrote $[ N_FS * FILE_MB * 
FS_THR ] MB on $N_FS FS with $FS_THR threads ; rm -f 
/var/lib/ceph/osd/*/test.zero*





Hope I gave you more insights on what I’m trying to achieve, and where I’m 
failing ?



Regards





-Original Message-
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Wednesday, July 22, 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??



We might also be able to help you improve or better understand your

results if you can tell us exactly what tests you're conducting that

are giving you these numbers.

-Greg



On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL 
<fmont...@flox-arts.net> wrote:

 Hi Frederic,

 When you have a Ceph cluster with 1 node you don't experience network and
 communication overhead due to the distributed model.
 With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but
 you will keep internal communication (2 chunks on the first node and 3 chunks
 on the second node).
 On your configuration the EC pool is set up with 4+1, so for each write you
 will have overhead due to the write spreading over 5 nodes (for 1 customer IO,
 you will experience 5 Ceph IOs due to EC 4+1).
 It's the reason for that I think you're

[ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-20 Thread SCHAER Frederic
Hi,

As I explained in various previous threads, I'm having a hard time getting the 
most out of my test ceph cluster.
I'm benching things with rados bench.
All Ceph hosts are on the same 10GB switch.

Basically, I know I can get about 1GB/s of disk write performance per host, 
when I bench things with dd (hundreds of dd threads) +iperf 10gbit 
inbound+iperf 10gbit outbound.
I also can get 2GB/s or even more if I don't bench the network at the same 
time, so yes, there is a bottleneck between disks and network, but I can't 
identify which one, and it's not relevant for what follows anyway
(Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about 
this strange bottleneck though...)

My hosts each are connected though a single 10Gbits/s link for now.

My problem is the following. Please note I see the same kind of poor 
performance with replicated pools...
When testing EC pools, I ended putting a 4+1 pool on a single node in order to 
track down the ceph bottleneck.
On that node, I can get approximately 420MB/s write performance using rados 
bench, but that's fair enough since the dstat output shows that real data 
throughput on disks is about 800+MB/s (that's the ceph journal effect, I 
presume).

I tested Ceph on my other standalone nodes : I can also get around 420MB/s, 
since they're identical.
I'm testing things with 5 10Gbits/s clients, each running rados bench.

But what I really don't get is the following :


-  With 1 host : throughput is 420MB/s

-  With 2 hosts : I get 640MB/s. That's surely not 2x420MB/s.

-  With 5 hosts : I get around 1375MB/s . That's far from the expected 
2GB/s.

The network never is maxed out, nor are the disks or CPUs.
The hosts throughput I see with rados bench seems to match the dstat throughput.
That's as if each additional host was only capable of adding 220MB/s of 
throughput. Compare this to the 1GB/s they are capable of (420MB/s with 
journals)...

I'm therefore wondering what could possibly be so wrong with my setup ??
Why would it impact so much the performance to add hosts ?

On the hardware side, I have Broadcam BCM57711 10-Gigabit PCIe cards.
I know, not perfect, but not THAT bad either... ?

Any hint would be greatly appreciated !

Thanks
Frederic Schaer
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance VS network usage

2015-04-24 Thread SCHAER Frederic
Hi Nick,

Thanks for your explanation.
I have some doubts this is what's happening, but I'm going to first check what 
happens with disks IO with a clean pool and clean bench data (discarding any 
existing cache...)

I'm using the following commands for creating the bench data (and benching 
writes) on all 5 clients :
rados -k ceph.client.admin.keyring -p testec bench 60 write -b 4194304 -t 16 
--run-name bench_`hostname -s` --no-cleanup

Replace write with seq for the read bench.
As you can see, I do specify the -b option, even though I'm wondering if this 
one affects the read bench, the help seems unclear to me:
-b op_size set the size of write ops for put or benchmarking
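Concretely the read pass is run like this (same run-name, so that seq finds the 
objects written above ; to my understanding -b only matters for write/put, and 
seq re-reads the objects at the size they were written with) :

rados -k ceph.client.admin.keyring -p testec bench 60 seq -t 16 --run-name bench_`hostname -s`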

Still, even if it didn't work and if rados bench reads were issuing 4kb reads, 
how could this explain that all 5 servers receive 800MiB/s (and not megabits... 
) each, and that they only send on the average what each client receives ?
Where would the extra ~400MiB (not bits) come from ?
If the OSDs were reconstructing data using the other hosts data before sending 
that to the client, this would mean the OSD hosts would send much more data to 
their neighbor OSDs on the network than my average client throughput -and not 
roughly the same amount-, wouldn't it ?
I took a look at the network interfaces, hoping this would come from localhost, 
but this did not : this came in from the physical network interface...

Still trying to understand ;)

Regards

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday, April 23, 2015 17:21
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: RE: read performance VS network usage

Hi Frederic,

If you are using EC pools, the primary OSD requests the remaining shards of the 
object from the other OSD's, reassembles it and then sends the data to the 
client. The entire object needs to be reconstructed even for a small IO 
operation, so 4kb reads could lead to quite a large IO amplification if you are 
using the default 4MB object sizes. I believe this is what you are seeing, 
although creating a RBD with smaller object sizes can help reduce this.

Nick
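If someone wants to try Nick's suggestion, the object size is chosen at image 
creation time ; a sketch with 1MB objects (order 20) instead of the default 4MB, 
pool and image names being made up :

rbd create testimg --pool rbd --size 10240 --order 20
rbd info rbd/testimg   # should report : order 20 (1024 kB objects)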

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: 23 April 2015 15:40
To: ceph-users@lists.ceph.com
Subject: [ceph-users] read performance VS network usage

Hi again,

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs drives). 
For these tests, I've setup a RAID0 on the 23 disks.
For now, I'm not using SSDs as I discovered my vendor apparently decreased 
their perfs on purpose...

So : 5 server nodes of which 3 are MONS too.
I also have 5 clients.
All of them have a single 10G NIC,  I'm not using a private network.
I'm testing EC pools, with the failure domain set to hosts.
The EC pool k/m is set to k=4/m=1
I'm testing EC pools using the giant release (ceph-0.87.1-0.el7.centos.x86_64)

And... I just found out I had limited read performance.
While I was watching the stats using dstat on one server node, I noticed that 
during the rados (read) bench, all the server nodes sent about 370MiB/s on the 
network, which is the average speed I get per server, but they also all 
received about 750-800MiB/s on that same network. And 800MB/s is about as much 
as you can get on a 10G link...

I'm trying to understand why I see this inbound data flow ?

-  Why does a server node receive data at all during a read bench ?

-  Why is it about twice as much as the data the node is sending ?

-  Is this about verifying data integrity at read time ?

I'm alone on the cluster, it's not used anywhere else.
I will try tomorrow to see if adding a 2nd 10G port (with a private network 
this time) improves the performance, but I'm really curious here to understand 
what's the bottleneck and what's ceph doing... ?

Looking at the write performance, I see the same kind of behavior : nodes send 
about half the amount of data they receive (600MB/300MB), but this might be 
because this time the client only sends the real data and the erasure coding 
happens behind the scenes (or not ?)

Any idea ?

Regards
Frederic


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance VS network usage

2015-04-24 Thread SCHAER Frederic
OK, I must learn how to read dstat...
I took the recv column for the send column...
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
15  22  43  16   0   4| 343M 7916k| 252M  659M|   0 0 |  78k  122k
15  18  45  18   0   4| 368M 4500k| 271M  592M|   0 0 |  82k  138k
(...)

I also notice that I see less network throughput with an MTU=9000.
So... conclusion : the nodes indeed receive part of the data and send it back 
to the client (even with 4MB reads, if the bench takes the option).

My last surprise is with the clients :
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   1  97   0   0   0| 718B  116k|   0 0 |   0 0 |1947  3148
12  14  72   0   0   1|   028k| 764M 1910k|   0 0 |  25k   27k
11  13  75   0   0   1|   0  4096B| 758M 1860k|   0 0 |  25k   27k
13  14  71   0   0   1|   0  4096B| 785M 1815k|   0 0 |  25k   24k
12  14  73   0   0   1|   0 0 | 839M 1960k|   0 0 |  25k   25k
12  14  72   0   0   2|   0   548k| 782M 1873k|   0 0 |  24k   25k
11  14  73   0   0   1|   044k| 782M 1924k|   0 0 |  25k   26k


They are also receiving much more data than what rados bench reports (around 
275MB/s each)... would that be some sort of data amplification ??

Regards

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: Friday, April 24, 2015 10:03
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] read performance VS network usage

Hi Nick,

Thanks for your explanation.
I have some doubts this is what's happening, but I'm going to first check what 
happens with disks IO with a clean pool and clean bench data (discarding any 
existing cache...)

I'm using the following commands for creating the bench data (and benching 
writes) on all 5 clients :
rados -k ceph.client.admin.keyring -p testec bench 60 write -b 4194304 -t 16 
--run-name bench_`hostname -s` --no-cleanup

Replace write with seq for the read bench.
As you can see, I do specify the -b option, even though I'm wondering if this 
one affects the read bench, the help seems unclear to me:
-b op_size set the size of write ops for put or benchmarking

Still, even if it didn't work and if rados bench reads were issuing 4kb reads, 
how could this explain that all 5 servers receive 800MiB/s (and not megabits... 
) each, and that they only send on the average what each client receives ?
Where would the extra ~400MiB (not bits) come from ?
If the OSDs were reconstructing data using the other hosts data before sending 
that to the client, this would mean the OSD hosts would send much more data to 
their neighbor OSDs on the network than my average client throughput -and not 
roughly the same amount-, wouldn't it ?
I took a look at the network interfaces, hoping this would come from localhost, 
but this did not : this came in from the physical network interface...

Still trying to understand ;)

Regards

From: Nick Fisk [mailto:n...@fisk.me.uk]
Sent: Thursday, April 23, 2015 17:21
To: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: RE: read performance VS network usage

Hi Frederic,

If you are using EC pools, the primary OSD requests the remaining shards of the 
object from the other OSD's, reassembles it and then sends the data to the 
client. The entire object needs to be reconstructed even for a small IO 
operation, so 4kb reads could lead to quite a large IO amplification if you are 
using the default 4MB object sizes. I believe this is what you are seeing, 
although creating a RBD with smaller object sizes can help reduce this.

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: 23 April 2015 15:40
To: ceph-users@lists.ceph.com
Subject: [ceph-users] read performance VS network usage

Hi again,

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs drives). 
For these tests, I've setup a RAID0 on the 23 disks.
For now, I'm not using SSDs as I discovered my vendor apparently decreased 
their perfs on purpose...

So : 5 server nodes of which 3 are MONS too.
I also have 5 clients.
All of them have a single 10G NIC,  I'm not using a private network.
I'm testing EC pools, with the failure domain set to hosts.
The EC pool k/m is set to k=4/m=1
I'm testing EC pools using the giant release (ceph-0.87.1-0.el7.centos.x86_64)

And... I just found out I had limited read performance.
While I was watching the stats using dstat on one server node, I noticed that 
during the rados (read) bench, all the server nodes sent about 370MiB/s on the 
network, which is the average speed I get per server, but they also all 
received about 750-800MiB/s on that same network. And 800MB/s is about as much 
as you can get on a 10G link...

I'm trying to understand why I see this inbound data flow

Re: [ceph-users] read performance VS network usage

2015-04-24 Thread SCHAER Frederic
And to reply to myself...

The client apparent network bandwidth is just the fact that dstat aggregates 
the bridge network interface and the physical interface, thus doubling the 
data...

Ah ah ah.
Regards

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: Friday, April 24, 2015 10:26
To: ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] read performance VS network usage

OK, I must learn how to read dstat...
I took the recv column for the send column...
total-cpu-usage -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
15  22  43  16   0   4| 343M 7916k| 252M  659M|   0 0 |  78k  122k
15  18  45  18   0   4| 368M 4500k| 271M  592M|   0 0 |  82k  138k
(...)

I also notice that I see less network throughput with an MTU=9000.
So... conclusion : the nodes indeed receive part of the data and send it back 
to the client (even with 4MB reads, if the bench takes the option).

My last surprise is with the clients :
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  2   1  97   0   0   0| 718B  116k|   0 0 |   0 0 |1947  3148
12  14  72   0   0   1|   028k| 764M 1910k|   0 0 |  25k   27k
11  13  75   0   0   1|   0  4096B| 758M 1860k|   0 0 |  25k   27k
13  14  71   0   0   1|   0  4096B| 785M 1815k|   0 0 |  25k   24k
12  14  73   0   0   1|   0 0 | 839M 1960k|   0 0 |  25k   25k
12  14  72   0   0   2|   0   548k| 782M 1873k|   0 0 |  24k   25k
11  14  73   0   0   1|   044k| 782M 1924k|   0 0 |  25k   26k


They are also receiving much more data than what rados bench reports (around 
275MB/s each)... would that be some sort of data amplification ??

Regards

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: Friday, April 24, 2015 10:03
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: [PROVENANCE INTERNET] Re: [ceph-users] read performance VS network usage

Hi Nick,

Thanks for your explanation.
I have some doubts this is what's happening, but I'm going to first check what 
happens with disks IO with a clean pool and clean bench data (discarding any 
existing cache...)

I'm using the following commands for creating the bench data (and benching 
writes) on all 5 clients :
rados -k ceph.client.admin.keyring -p testec bench 60 write -b 4194304 -t 16 
--run-name bench_`hostname -s` --no-cleanup

Replace write with seq for the read bench.
As you can see, I do specify the -b option, even though I'm wondering if this 
one affects the read bench, the help seems unclear to me:
-b op_size set the size of write ops for put or benchmarking

Still, even if it didn't work and rados bench reads were issuing 4kb reads, how 
could this explain that all 5 servers each receive 800MiB/s (and not megabits... 
), and that they only send on average what each client receives ?
Where would the extra ~400MiB (not bits) come from ?
If the OSDs were reconstructing data using the other hosts' data before sending 
it to the client, this would mean the OSD hosts send much more data to 
their neighbor OSDs on the network than my average client throughput - and not 
roughly the same amount -, wouldn't it ?
I took a look at the network interfaces, hoping this would come from localhost, 
but this did not : this came in from the physical network interface...

Still trying to understand ;)

Regards

De : Nick Fisk [mailto:n...@fisk.me.uk]
Envoyé : jeudi 23 avril 2015 17:21
À : SCHAER Frederic; ceph-users@lists.ceph.com
Objet : RE: read performance VS network usage

Hi Frederic,

If you are using EC pools, the primary OSD requests the remaining shards of the 
object from the other OSDs, reassembles it and then sends the data to the 
client. The entire object needs to be reconstructed even for a small IO 
operation, so 4kb reads could lead to quite a large IO amplification if you are 
using the default 4MB object sizes. I believe this is what you are seeing, 
although creating an RBD with smaller object sizes can help reduce this.
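A quick back-of-envelope sketch of that amplification, with purely illustrative 
numbers (the defaults discussed in this thread) :

# k=4/m=1 pool, 4MB objects, 4kB client reads -- a rough sketch, not a measurement
obj_kb=4096   # default RBD object size
io_kb=4       # small client read
k=4           # EC data chunks
echo "shard size              : $((obj_kb / k)) kB"
echo "kB read on the OSD side : $obj_kb (the whole object) per $io_kb kB client read"
echo "read amplification      : $((obj_kb / io_kb))x"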

Nick

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of SCHAER 
Frederic
Sent: 23 April 2015 15:40
To: ceph-users@lists.ceph.com
Subject: [ceph-users] read performance VS network usage

Hi again,

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs drives). 
For these tests, I've setup a RAID0 on the 23 disks.
For now, I'm not using SSDs as I discovered my vendor apparently decreased 
their perfs on purpose...

So : 5 server nodes of which 3 are MONS too.
I also have 5 clients.
All of them have a single 10G NIC,  I'm not using a private network.
I'm testing EC pools, with the failure domain set to hosts.
The EC pool k/m is set to k=4/m=1
I'm testing EC pools using the giant release (ceph-0.87.1-0.el7.centos.x86_64)

[ceph-users] read performance VS network usage

2015-04-23 Thread SCHAER Frederic
Hi again,

On my testbed, I have 5 ceph nodes, each containing 23 OSDs (2TB btrfs drives). 
For these tests, I've setup a RAID0 on the 23 disks.
For now, I'm not using SSDs as I discovered my vendor apparently decreased 
their perfs on purpose...

So : 5 server nodes of which 3 are MONS too.
I also have 5 clients.
All of them have a single 10G NIC,  I'm not using a private network.
I'm testing EC pools, with the failure domain set to hosts.
The EC pool k/m is set to k=4/m=1
I'm testing EC pools using the giant release (ceph-0.87.1-0.el7.centos.x86_64)

And... I just found out I had limited read performance.
While I was watching the stats using dstat on one server node, I noticed that 
during the rados (read) bench, all the server nodes sent about 370MiB/s on the 
network, which is the average speed I get per server, but they also all 
received about 750-800MiB/s on that same network. And 800MB/s is about as much 
as you can get on a 10G link...

I'm trying to understand why I see this inbound data flow ?

-  Why does a server node receive data at all during a read bench ?

-  Why is it about twice as much as the data the node is sending ?

-  Is this about verifying data integrity at read time ?

I'm alone on the cluster, it's not used anywhere else.
I will try tomorrow to see if adding a 2nd 10G port (with a private network 
this time) improves the performance, but I'm really curious here to understand 
what's the bottleneck and what's ceph doing... ?

Looking at the write performance, I see the same kind of behavior : nodes send 
about half the amount of data they receive (600MB/300MB), but this might be 
because this time the client only sends the real data and the erasure coding 
happens behind the scenes (or not ?)

Any idea ?

Regards
Frederic
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-crush-location + SSD detection ?

2015-04-22 Thread SCHAER Frederic
Hi,

I've seen and read a few things about ceph-crush-location and I think that's 
what I need.
What I need (want to try) is : a way to have SSDs in non-dedicated hosts, but 
also to put those SSDs in a dedicated ceph root.

From what I read, using ceph-crush-location I could add a hostname with an SSD 
suffix in case the tool is called against an SSD... thing is : I must make sure 
this really is an SSD, and this is where coding and experimenting come in.

Hence, I'd like to know if someone already has a working implementation that 
detects whether the OSD is an SSD and, if so, appends a string to the 
hostname ?
I'm for instance wondering when this tool is called, whether the OSD is already 
mounted (or should have been...), and what happens at boot ...

I know I can get the OSD mountpoint using something like that on a running OSD :
#ceph --format xml --admin-daemon /var/run/ceph/ceph-osd.0.asok config get 
osd_data|sed -e 's|./osd_data.*||;s|.*osd_data.||'
/var/lib/ceph/osd/ceph-0

I know I can find out if this is a disk or a SSD using for instance this :
[root@ceph0 ~]# cat /sys/block/sdy/queue/rotational
0
[root@ceph0 ~]# cat /sys/block/sda/queue/rotational
1

So I just have to associate the mountpoint with the device... provided OSD is 
mounted when the tool is called.
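Something along these lines is roughly what I have in mind - a very rough, untested 
sketch (it assumes the OSD id comes in as the first argument, that the data dir is the 
default /var/lib/ceph/osd/ceph-$ID and is already mounted when the hook runs, and the 
ssd/default root names are just placeholders) :

#!/bin/bash
ID=$1                                        # assumption: OSD id passed as first argument
OSD_DATA=/var/lib/ceph/osd/ceph-$ID          # assumption: default osd_data path
DEV=$(df -P "$OSD_DATA" | awk 'END {print $1}')   # e.g. /dev/sdy1
DISK=$(basename "$DEV" | sed 's/[0-9]*$//')       # strip the partition number -> sdy
if [ "$(cat /sys/block/$DISK/queue/rotational)" = "0" ]; then
    echo "host=$(hostname -s)-ssd root=ssd"       # placeholder crush location
else
    echo "host=$(hostname -s) root=default"
fi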
Anyone willing to share experience with ceph-crush-location ?

Thanks

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-crush-location + SSD detection ?

2015-04-22 Thread SCHAER Frederic


-Message d'origine-
(...)
 So I just have to associate the mountpoint with the device... provided OSD is 
 mounted when the tool is called.
 Anyone willing to share experience with ceph-crush-location ?
 

Something like this? https://gist.github.com/wido/5d26d88366e28e25e23d

I've used that a couple of times.

Wido

[- FS : -] Exactly That... and you confirm at the same time it works ... Many 
thanks :]
[- FS : -] 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?

2015-03-04 Thread SCHAER Frederic
Hi,

Many thanks for the explanations.
I haven't explicitly used the nodcache option when mounting cephfs; it actually 
got there by default 

My mount command is/was :
# mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret

I don't know what causes this option to be default, maybe it's the kernel 
module I compiled from git (because there is no kmod-ceph or kmod-rbd in any 
RHEL-like distributions except RHEV), I'll try to update/check ...

Concerning the rados pool ls, indeed : I created empty files in the pool, and 
they were not showing up probably because they were just empty - but when I 
create a non empty file, I see things in rados ls...

Thanks again
Frederic


-Message d'origine-
De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de John 
Spray
Envoyé : mardi 3 mars 2015 17:15
À : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?



On 03/03/2015 15:21, SCHAER Frederic wrote:

 By the way : looks like the ceph fs ls command is inconsistent when 
 the cephfs is mounted (I used a locally compiled kmod-ceph rpm):

 [root@ceph0 ~]# ceph fs ls

 name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]

 (umount /mnt .)

 [root@ceph0 ~]# ceph fs ls

 name: cephfs_puppet, metadata pool: puppet_metadata, data pools: 
 [puppet root ]

This is probably #10288, which was fixed in 0.87.1

 So, I have this pool named root that I added in the cephfs filesystem.

 I then edited the filesystem xattrs :

 [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root

 getfattr: Removing leading '/' from absolute path names

 # file: mnt/root

 ceph.dir.layout=stripe_unit=4194304 stripe_count=1 
 object_size=4194304 pool=root

 I'm therefore assuming client.puppet should not be allowed to write or 
 read anything in /mnt/root, which belongs to the root pool. but that 
 is not the case.

 On another machine where I mounted cephfs using the client.puppet key, 
 I can do this :

 The mount was done with the client.puppet key, not the admin one that 
 is not deployed on that node :

 1.2.3.4:6789:/ on /mnt type ceph 
 (rw,relatime,name=puppet,secret=hidden,nodcache)

 [root@dev7248 ~]# echo not allowed > /mnt/root/secret.notfailed

 [root@dev7248 ~]#

 [root@dev7248 ~]# cat /mnt/root/secret.notfailed

 not allowed

This is data you're seeing from the page cache, it hasn't been written 
to RADOS.

You have used the nodcache setting, but that doesn't mean what you 
think it does (it was about caching dentries, not data).  It's actually 
not even used in recent kernels (http://tracker.ceph.com/issues/11009).

You could try the nofsc option, but I don't know exactly how much 
caching that turns off -- the safer approach here is probably to do your 
testing using I/Os that have O_DIRECT set.
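For instance, something along these lines should bypass the client page cache 
(an untested sketch, sizes arbitrary) :

dd if=/dev/zero of=/mnt/root/secret.notfailed bs=4M count=1 oflag=direct
dd if=/mnt/root/secret.notfailed of=/dev/null bs=4M iflag=direct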

 And I can even see the xattrs inherited from the parent dir :

 [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed

 getfattr: Removing leading '/' from absolute path names

 # file: mnt/root/secret.notfailed

 ceph.file.layout=stripe_unit=4194304 stripe_count=1 
 object_size=4194304 pool=root

 Whereas on the node where I mounted cephfs as ceph admin, I get nothing :

 [root@ceph0 ~]# cat /mnt/root/secret.notfailed

 [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed

 -rw-r--r-- 1 root root 12 Mar  3 15:27 /mnt/root/secret.notfailed

 After some time, the file also gets empty on the puppet client host :

 [root@dev7248 ~]# cat /mnt/root/secret.notfailed

 [root@dev7248 ~]#

 (but the metadata remained ?)

Right -- eventually the cache goes away, and you see the true (empty) 
state of the file.

 Also, as an unprivileged user, I can get ownership of a secret file 
 by changing the extended attribute :

 [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet 
 /mnt/root/secret.notfailed

 [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed

 getfattr: Removing leading '/' from absolute path names

 # file: mnt/root/secret.notfailed

 ceph.file.layout=stripe_unit=4194304 stripe_count=1 
 object_size=4194304 pool=puppet

Well, you're not really getting ownership of anything here: you're 
modifying the file's metadata, which you are entitled to do (pool 
permissions have nothing to do with file metadata).  There was a recent 
bug where a file's pool layout could be changed even if it had data, but 
that was about safety rather than permissions.

 Final question for those that read down here : it appears that before 
 creating the cephfs filesystem, I used the puppet pool to store a 
 test rbd instance.

 And it appears I cannot get the list of cephfs objects in that pool, 
 whereas I can get those that are on the newly created root pool :

 [root@ceph0 ~]# rados -p puppet ls

 test.rbd

 rbd_directory

 [root@ceph0 ~]# rados -p root ls

 10a.

 10b.

 Bug, or feature ?


I didn't see anything in your earlier steps that would have led to any 
objects

[ceph-users] cephfs filesystem layouts : authentication gotchas ?

2015-03-03 Thread SCHAER Frederic
Hi,

I am attempting to test the cephfs filesystem layouts.
I created a user with rights to write only in one pool :

client.puppet
key:zzz
caps: [mon] allow r
caps: [osd] allow rwx pool=puppet

I also created another pool in which I would assume this user is allowed to do 
nothing after I successfully configure things.
By the way : looks like the ceph fs ls command is inconsistent when the 
cephfs is mounted (I used a locally compiled kmod-ceph rpm):

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]
(umount /mnt ...)
[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

So, I have this pool named root that I added in the cephfs filesystem.
I then edited the filesystem xattrs :

[root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
getfattr: Removing leading '/' from absolute path names
# file: mnt/root
ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=root

I'm therefore assuming client.puppet should not be allowed to write or read 
anything in /mnt/root, which belongs to the root pool... but that is not the 
case.
On another machine where I mounted cephfs using the client.puppet key, I can do 
this :

The mount was done with the client.puppet key, not the admin one that is not 
deployed on that node :
1.2.3.4:6789:/ on /mnt type ceph 
(rw,relatime,name=puppet,secret=hidden,nodcache)

[root@dev7248 ~]# echo not allowed > /mnt/root/secret.notfailed
[root@dev7248 ~]#
[root@dev7248 ~]# cat /mnt/root/secret.notfailed
not allowed

And I can even see the xattrs inherited from the parent dir :
[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=root

Whereas on the node where I mounted cephfs as ceph admin, I get nothing :
[root@ceph0 ~]# cat /mnt/root/secret.notfailed
[root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
-rw-r--r-- 1 root root 12 Mar  3 15:27 /mnt/root/secret.notfailed

After some time, the file also gets empty on the puppet client host :
[root@dev7248 ~]# cat /mnt/root/secret.notfailed
[root@dev7248 ~]#
(but the metadata remained ?)

Also, as an unprivileged user, I can get ownership of a secret file by 
changing the extended attribute :

[root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet 
/mnt/root/secret.notfailed
[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 
pool=puppet

But fortunately, I haven't succeeded yet (?) in reading that file...
My question therefore is : what am I doing wrong ?

Final question for those that read down here : it appears that before creating 
the cephfs filesystem, I used the puppet pool to store a test rbd instance.
And it appears I cannot get the list of cephfs objects in that pool, whereas I 
can get those that are on the newly created root pool :

[root@ceph0 ~]# rados -p puppet ls
test.rbd
rbd_directory
[root@ceph0 ~]# rados -p root ls
10a.
10b.

Bug, or feature ?

Thanks & regards


P.S : ceph release :

[root@dev7248 ~]# rpm -qa '*ceph*'
kmod-libceph-3.10.0-0.1.20150130gitee04310.el7.centos.x86_64
libcephfs1-0.87-0.el7.centos.x86_64
ceph-common-0.87-0.el7.centos.x86_64
ceph-0.87-0.el7.centos.x86_64
kmod-ceph-3.10.0-0.1.20150130gitee04310.el7.centos.x86_64
ceph-fuse-0.87.1-0.el7.centos.x86_64
python-ceph-0.87-0.el7.centos.x86_64
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] XFS recovery on boot : rogue mounts ?

2015-03-02 Thread SCHAER Frederic
Hi,

I rebooted a failed server, which is now showing a rogue filesystem mount.
Actually, there were also several disks missing in the node, all reported as 
prepared by ceph-disk, but not activated.

[root@ceph2 ~]# grep /var/lib/ceph/tmp /etc/mtab
/dev/sdo1 /var/lib/ceph/tmp/mnt.usVRe8 xfs rw,noatime,attr2,inode64,noquota 0 0

This path does not exist, and after having to run ceph-disk activate-all, I can 
now see the OSD under its correct path (and the missing ones got mounted too) :

[root@ceph2 ~]# grep sdo1 /etc/mtab
/dev/sdo1 /var/lib/ceph/tmp/mnt.usVRe8 xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/sdo1 /var/lib/ceph/osd/ceph-53 xfs rw,noatime,attr2,inode64,noquota 0 0
[root@ceph2 ~]# ll /var/lib/ceph/tmp/mnt.usVRe8
ls: cannot access /var/lib/ceph/tmp/mnt.usVRe8: No such file or directory

I just looked at the logs, and it appears that this sdo disk performed an XFS 
recovery at boot :

Mar  2 11:33:45 ceph2 kernel: [   21.479747] XFS (sdo1): Mounting Filesystem
Mar  2 11:33:45 ceph2 kernel: [   21.641263] XFS (sdo1): Starting recovery 
(logdev: internal)
Mar  2 11:33:45 ceph2 kernel: [   21.674451] XFS (sdo1): Ending recovery 
(logdev: internal)

I do not see any "Ending clean mount" line for this disk.
If I check the syslogs, I can see OSDs are usually mounted twice, but not 
always, and sometimes they aren't even mounted at all :

[root@ceph2 ~]# zegrep 'XFS.*Ending clean' /var/log/messages.1.gz |sed -e 
s/.*XFS/XFS/|sort |uniq -c
  2 XFS (sdb1): Ending clean mount
  2 XFS (sdd1): Ending clean mount
  1 XFS (sde1): Ending clean mount
  2 XFS (sdg1): Ending clean mount
  1 XFS (sdh1): Ending clean mount
  1 XFS (sdi1): Ending clean mount
  2 XFS (sdj1): Ending clean mount
  3 XFS (sdk1): Ending clean mount
  3 XFS (sdl1): Ending clean mount
  4 XFS (sdm1): Ending clean mount

So : would there be an issue with disks that perform an XFS recovery at boot ?
I know that a reboot will clean things up, but rebooting isn't the cleanest thing 
to do...

Thanks
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-12-11 Thread SCHAER Frederic
Hi,

Back on this.
I finally found the logic in the mapping.

So after taking the time to note all the disks' serial numbers on 3 different 
machines and 2 different OSes, I now know that my specific LSI SAS 2008 cards 
(no reference on them, but I think those are LSI SAS 9207-8i) map the disks of 
the MD1000 in reverse alphabetical order :

sd{b..p} map to slot{14..0}
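In other words, a quick sketch to print the mapping I ended up with :

i=14; for d in /dev/sd{b..p}; do echo "$d -> slot $i"; i=$((i-1)); done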

There is absolutely nothing else that appears usable, except the sas_address of 
the disks, which seems associated with the slots.
But even this one differs from machine to machine, and the address-to-slot 
mapping does not seem obvious, to say the least...

Good thing is that I now know that fun tools exist in packages such as 
sg3_utils, smp_utils and others like mpt-status...
Next step is to try an MD1200 ;)

Thanks again
Cheers

-Message d'origine-
De : JF Le Fillâtre [mailto:jean-francois.lefilla...@uni.lu] 
Envoyé : mercredi 19 novembre 2014 13:42
À : SCHAER Frederic
Cc : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] jbod + SMART : how to identify failing disks ?


Hello again,

So whatever magic allows the Dell MD1200 to report the slot position for
each disk isn't present in your JBODs. Time for something else.

There are two sides to your problem:

1) Identifying which disk is where in your JBOD

Quite easy. Again I'd go for a udev rule + script that will either
rename the disks entirely, or create a symlink with a name like
jbodX-slotY or something to figure out easily which is which. The
mapping end-device-to-slot can be static in the script, so you need to
identify once the order in which the kernel scans the slots and then you
can map.

But it won't survive a disk swap or a change of scanning order from a
kernel upgrade, so it's not enough.

2) Finding a way of identification independent of hot-plugs and scan order

That's the tricky part. If you remove a disk from your JBOD and replace
it with another one, the other one will get another sdX name, and in
my experience even another end_device-... name. But given that you
want the new disk to have the exact same name or symlink as the previous
one, you have to find something in the path of the device or (better) in
the udev attributes that is immutable.

If possible at all, it will depend on your specific hardware
combination, so you will have to try for yourself.

Suggested methodology:

1) write down the serial number of one drive in any slot, and figure out
its device name (sdX) with smartctl -i /dev/sd...

2) grab the detailed /sys path name and list of udev attributes:
readlink -f /sys/class/block/sdX
udevadm info --attribute-walk /dev/sdX

3) pull that disk and replace it. Check the logs to see which is its new
device name (sdY)

4) rerun the commands from #2 with sdY

5) compare the outputs and find something in the path or in the
attributes that didn't change and is unique to that disk (ie not a
common parent for example).

If you have something that really didn't change, you're in luck. Either
use the serial numbers or unplug and replug all disks one by one to
figure out the mapping slot number / immutable item.

Then write the udev rule. :)
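For example, something along these lines (the serial number and the symlink name 
are completely made up - replace them with whatever immutable attribute you found) :

cat > /etc/udev/rules.d/99-jbod-slots.rules <<'EOF'
KERNEL=="sd?", SUBSYSTEM=="block", ENV{ID_SERIAL_SHORT}=="Z1X2C3V4", SYMLINK+="jbod0-slot05"
EOF
udevadm control --reload-rules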

Thanks!
JF



On 19/11/14 11:29, SCHAER Frederic wrote:
 Hi
 
 Thanks.
 I hoped it would be it, but no ;)
 
 With this mapping :
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdb - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:0/end_device-1:1:0/target1:0:1/1:0:1:0/block/sdb
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdc - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:1/end_device-1:1:1/target1:0:2/1:0:2:0/block/sdc
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdd - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:2/end_device-1:1:2/target1:0:3/1:0:3:0/block/sdd
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sde - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:3/end_device-1:1:3/target1:0:4/1:0:4:0/block/sde
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdf - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:4/end_device-1:1:4/target1:0:5/1:0:5:0/block/sdf
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdg - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:5/end_device-1:1:5/target1:0:6/1:0:6:0/block/sdg
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdh - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:6/end_device-1:1:6/target1:0:7/1:0:7:0/block/sdh
 lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdi - 
 ../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1

Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-19 Thread SCHAER Frederic
Hi

Thanks.
I hoped it would be it, but no ;)

With this mapping :
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdb - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:0/end_device-1:1:0/target1:0:1/1:0:1:0/block/sdb
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdc - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:1/end_device-1:1:1/target1:0:2/1:0:2:0/block/sdc
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdd - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:2/end_device-1:1:2/target1:0:3/1:0:3:0/block/sdd
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sde - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:3/end_device-1:1:3/target1:0:4/1:0:4:0/block/sde
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdf - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:4/end_device-1:1:4/target1:0:5/1:0:5:0/block/sdf
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdg - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:5/end_device-1:1:5/target1:0:6/1:0:6:0/block/sdg
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdh - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:6/end_device-1:1:6/target1:0:7/1:0:7:0/block/sdh
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdi - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:0/expander-1:1/port-1:1:7/end_device-1:1:7/target1:0:8/1:0:8:0/block/sdi
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdj - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:0/end_device-1:2:0/target1:0:9/1:0:9:0/block/sdj
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdk - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:1/end_device-1:2:1/target1:0:10/1:0:10:0/block/sdk
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdl - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:2/end_device-1:2:2/target1:0:11/1:0:11:0/block/sdl
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdm - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:3/end_device-1:2:3/target1:0:12/1:0:12:0/block/sdm
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdn - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:4/end_device-1:2:4/target1:0:13/1:0:13:0/block/sdn
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdo - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:5/end_device-1:2:5/target1:0:14/1:0:14:0/block/sdo
lrwxrwxrwx 1 root root 0 Nov 12 12:31 /sys/class/block/sdp - 
../../devices/pci:00/:00:04.0/:0a:00.0/host1/port-1:0/expander-1:0/port-1:0:1/expander-1:2/port-1:2:6/end_device-1:2:6/target1:0:15/1:0:15:0/block/sdp

sdd was on physical slot 12, sdk was on slot 5, and sdg was on slot 9 (and I 
did not check the others)...
so clearly this cannot be put in production as is and I'll have to find a way.

Regards


-Message d'origine-
De : Carl-Johan Schenström [mailto:carl-johan.schenst...@gu.se] 
Envoyé : lundi 17 novembre 2014 14:14
À : SCHAER Frederic; Scottix; Erik Logtenberg
Cc : ceph-users@lists.ceph.com
Objet : RE: [ceph-users] jbod + SMART : how to identify failing disks ?

Hi!

I'm fairly sure that the link targets in /sys/class/block were correct the last 
time I had to change a drive on a system with a Dell HBA connected to an 
MD1000, but perhaps I was just lucky. =/

I.e.,

# ls -l /sys/class/block/sdj
lrwxrwxrwx. 1 root root 0 17 nov 13.54 /sys/class/block/sdj - 
../../devices/pci:20/:20:0a.0/:21:00.0/host7/port-7:0/expander-7:0/port-7:0:1/expander-7:2/port-7:2:6/end_device-7:2:6/target7:0:7/7:0:7:0/block/sdj

would be first port on HBA, first expander, 7th slot (6, starting from 0). 
Don't take my word for it, though!

-- 
Carl-Johan Schenström
Driftansvarig / System Administrator
Språkbanken & Svensk nationell datatjänst /
The Swedish Language Bank & Swedish National Data Service
Göteborgs universitet / University of Gothenburg
carl-johan.schenst...@gu.se / +46 709 116769


From: ceph-users ceph-users-boun...@lists.ceph.com on behalf of SCHAER 
Frederic frederic.sch...@cea.fr
Sent: Friday, November 14, 2014 17:24
To: Scottix; Erik Logtenberg
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph

[ceph-users] rogue mount in /var/lib/ceph/tmp/mnt.eml1yz ?

2014-11-19 Thread SCHAER Frederic
Hi,

I rebooted a node (I'm doing some tests, and breaking many things ;) ), I see I 
have :

[root@ceph0 ~]# mount|grep sdp1
/dev/sdp1 on /var/lib/ceph/tmp/mnt.eml1yz type xfs 
(rw,noatime,attr2,inode64,noquota)
/dev/sdp1 on /var/lib/ceph/osd/ceph-55 type xfs 
(rw,noatime,attr2,inode64,noquota)

[root@ceph0 ~]# ls -l /var/lib/ceph/tmp/mnt.eml1yz
ls: cannot access /var/lib/ceph/tmp/mnt.eml1yz: No such file or directory

In /var/lib/ceph/tmp, all I see is :
[root@ceph0 ~]# ll /var/lib/ceph/tmp/
total 0
-rw-r--r-- 1 root root 0 Nov 19 11:34 ceph-disk.activate.lock
-rw-r--r-- 1 root root 0 Oct 31 18:19 ceph-disk.prepare.lock

I think (but I'm not sure) that I already faced this before the reboot  with 
another device - but since the device naming seems completely inconsistent on 
my systems (see another thread of mine), I can't say for sure it's not the same 
OSD that's buggy - in that case, I could try to just zap it and recreate it.

Any idea what goes wrong in this case or where I could look at ? (nothing 
usefull in /var/log/ceph/ceph-osd.55.log)
Ceph version is giant : ceph-0.87-0.el7.centos.x86_64

Thanks & regards

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-18 Thread SCHAER Frederic
Wow. Thanks
Not very operations friendly though…

Wouldn’t it be just OK to pull the disk that we think is the bad one, check the 
serial number, and if not, just replug and let the udev rules do their job and 
re-insert the disk in the ceph cluster ?
(provided XFS doesn’t freeze for good when we do that)

Regards

De : Craig Lewis [mailto:cle...@centraldesktop.com]
Envoyé : lundi 17 novembre 2014 22:32
À : SCHAER Frederic
Cc : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] jbod + SMART : how to identify failing disks ?

I use `dd` to force activity to the disk I want to replace, and watch the 
activity lights.  That only works if your disks aren't 100% busy.  If they are, 
stop the ceph-osd daemon, and see which drive stops having activity.  Repeat 
until you're 100% confident that you're pulling the right drive.
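For example, something like this (reads only, so it is harmless; replace sdX with the 
suspect drive) :

dd if=/dev/sdX of=/dev/null bs=1M count=10000 iflag=direct   # ~10GB of reads, enough to keep the activity LED busy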

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic 
frederic.sch...@cea.fr wrote:
Hi,

I’m used to RAID software giving me the failing disks  slots, and most often 
blinking the disks on the disk bays.
I recently installed a  DELL “6GB HBA SAS” JBOD card, said to be an LSI 2008 
one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T) .

Since this is an LSI, I thought I’d use MegaCli to identify the disks slot, but 
MegaCli does not see the HBA card.
Then I found the LSI “sas2ircu” utility, but again, this one fails at giving me 
the disk slots (it finds the disks, serials and others, but slot is always 0)
Because of this, I’m going to head over to the disk bay and unplug the disk 
which I think corresponds to the alphabetical order in linux, and see if it’s 
the correct one…. But even if this is correct this time, it might not be next 
time.

But this makes me wonder : how do you guys, Ceph users, manage your disks if 
you really have JBOD servers ?
I can’t imagine having to guess slots like that each time, and I can’t imagine 
creating serial number stickers for every single disk I could have to 
manage either …
Is there any specific advice regarding JBOD cards people should (not) use in 
their systems ?
Any magical way to “blink” a drive in linux ?

Thanks & regards

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-12 Thread SCHAER Frederic
Hi,

I'm used to RAID software giving me the failing disks & slots, and most often 
blinking the disks on the disk bays.
I recently installed a DELL "6GB HBA SAS" JBOD card, said to be an LSI 2008 
one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.).

Since this is an LSI, I thought I'd use MegaCli to identify the disks slot, but 
MegaCli does not see the HBA card.
Then I found the LSI sas2ircu utility, but again, this one fails at giving me 
the disk slots (it finds the disks, serials and others, but slot is always 0)
Because of this, I'm going to head over to the disk bay and unplug the disk 
which I think corresponds to the alphabetical order in linux, and see if it's 
the correct one... But even if this is correct this time, it might not be next 
time.

But this makes me wonder : how do you guys, Ceph users, manage your disks if 
you really have JBOD servers ?
I can't imagine having to guess slots like that each time, and I can't imagine 
creating serial number stickers for every single disk I could have to 
manage either ...
Is there any specific advice regarding JBOD cards people should (not) use in 
their systems ?
Any magical way to "blink" a drive in linux ?

Thanks & regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-30 Thread SCHAER Frederic
Hi loic,

Back on this issue...
Using the EPEL package, I still get "prepared-only" disks, e.g. :

/dev/sdc :
 /dev/sdc1 ceph data, prepared, cluster ceph, journal /dev/sdc2
 /dev/sdc2 ceph journal, for /dev/sdc1

Looking at the udev output, I can see that there is no ACTION=add with 
ID_PART_ENTRY_TYPE=4fbd7e29-9d25-41b8-afd0-062c0ceff05d, only a change 
action.
This was on a previously prepared disk, which I zapped.

When I run partx -u /dev/sdc, then and only then does the kernel see the new 
partitions, and it also sees that the old ones have disappeared - see the partial 
udev log attached.
This log contains only the events that udev saw right when I ran 'partx -u' : 
nothing before, and nothing after.

This still looks like it's not partx -a that should be used on this system 
when running ceph-disk prepare, but partx -u... ?
And of course, after partx -u is run, the disk is activated.
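So the manual workaround I currently apply after each prepare boils down to this :

ceph-disk prepare /dev/sdc
sleep 2              # without a short pause, even partx -u sometimes isn't enough
partx -u /dev/sdc    # tell the kernel about the new partition table ; udev then activates the OSD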

This is what I have :

[root@ceph1 ~]# rpm -qf /usr/sbin/partx
util-linux-2.23.2-16.el7.x86_64

[root@ceph1 ~]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)

[root@ceph1 ~]# rpm -qi ceph
Name: ceph
Epoch   : 1
Version : 0.80.5
Release : 8.el7
Architecture: x86_64
Install Date: Tue 28 Oct 2014 12:28:41 PM CET
Group   : System Environment/Base
Size: 39154515
License : GPL-2.0
Signature   : RSA/SHA256, Sat 23 Aug 2014 08:02:08 PM CEST, Key ID 
6a2faea2352c64e5
Source RPM  : ceph-0.80.5-8.el7.src.rpm
Build Date  : Fri 22 Aug 2014 02:36:05 AM CEST
Build Host  : buildhw-08.phx2.fedoraproject.org

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-10 Thread SCHAER Frederic
-Message d'origine-
De : Loic Dachary [mailto:l...@dachary.org] 

The failure 

journal check: ondisk fsid ---- doesn't match 
expected 244973de-7472-421c-bb25-4b09d3f8d441

and the udev logs

DEBUG:ceph-disk:Journal /dev/sdc2 has OSD UUID 
----

means /usr/bin/ceph-osd -i 0 --get-journal-uuid --osd-journal /dev/sdc2

fails to read the OSD UUID from /dev/sdc2 which means something went wrong when 
preparing the journal.

It would be great if you could send the command you used to prepare the disk 
and the output (verbose if possible). I think you can reproduce the problem by 
zapping the disk with ceph-disk zap /dev/sdc and running partx -u if the 
corresponding entries in /dev/disk/by-partuuid have not been removed. That 
would also help me fix zap in the context of 
https://github.com/ceph/ceph/pull/2648 ... or have confirmation that it does 
not need fixing because it updates correctly on RHEL ;-)

Cheers

-- 
Loïc Dachary, Artisan Logiciel Libre

[- FS : -] 
Hi Loic,

At first, some notes :
-  I noticed I have to wait at least 1sec before I run partx -u on a prepared 
disk, otherwise even with that disks won't get properly handled by udev.
Maybe some caching somewhere ?
- the '-u' option does not seem to exist under RHEL6... so maybe the RHEL6 
behaviour was just to include the kernel updating in the -a option, and not 
anymore ?
- this is CentOS 7, i.e RHEL like, but not pure RHEL7 even if very close.


I Zapped the disk (it seems to work as expected) : 
[root@ceph1 ~]# ll /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 Oct  9 15:57 668f92f5-df46-4052-92ba-e8b8f7efd2d9 - 
../../sdb1
lrwxrwxrwx 1 root root 10 Oct  9 15:57 feb09ba1-30a2-44a8-a338-fef39ae6626a - 
../../sdb2
[root@ceph1 ~]

partx -u :

[root@ceph1 ~]# partx -u /dev/sdc
partx: specified range 1:0 does not make sense


Right after that, I re-prepared the disk :

[root@ceph1 ~]# parted -s /dev/sdc mklabel gpt
[root@ceph1 ~]# ceph-disk -v  prepare /dev/sdc
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph 
--show-config-value=fsid
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_mkfs_type
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_fs_type
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_mkfs_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_fs_mkfs_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_mount_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. 
--lookup osd_fs_mount_options_xfs
INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph 
--show-config-value=osd_journal_size
INFO:ceph-disk:Will colocate journal with data on /dev/sdc
DEBUG:ceph-disk:Creating journal partition num 2 size 5120 on /dev/sdc
INFO:ceph-disk:Running command: /usr/sbin/sgdisk --new=2:0:5120M 
--change-name=2:ceph journal 
--partition-guid=2:46bf261f-7ec3-485e-98c9-3c185de5efb8 
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdc
Information: Moved requested sector from 34 to 2048 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
INFO:ceph-disk:calling partx on prepared device /dev/sdc
INFO:ceph-disk:re-reading known partitions will display errors
INFO:ceph-disk:Running command: /usr/sbin/partx -a /dev/sdc
partx: /dev/sdc: error adding partition 2
INFO:ceph-disk:Running command: /usr/bin/udevadm settle
DEBUG:ceph-disk:Journal is GPT partition 
/dev/disk/by-partuuid/46bf261f-7ec3-485e-98c9-3c185de5efb8
DEBUG:ceph-disk:Creating osd partition on /dev/sdc
INFO:ceph-disk:Running command: /usr/sbin/sgdisk --largest-new=1 
--change-name=1:ceph data 
--partition-guid=1:436ac41b-8800-466e-98f5-098aa2c64ba9 
--typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sdc
Information: Moved requested sector from 10485761 to 10487808 in
order to align on 2048-sector boundaries.
The operation has completed successfully.
INFO:ceph-disk:Running command: /usr/sbin/partprobe /dev/sdc
INFO:ceph-disk:Running command: /usr/bin/udevadm settle
DEBUG:ceph-disk:Creating xfs fs on /dev/sdc1
INFO:ceph-disk:Running command: /usr/sbin/mkfs -t xfs -f -i size=2048 -- 
/dev/sdc1
meta-data=/dev/sdc1  isize=2048   agcount=4, agsize=60686271 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0
data =   bsize=4096   blocks=242745083, imaxpct=25
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal log   bsize=4096   blocks=118527, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, 

Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-10 Thread SCHAER Frederic
Hi Loic,

Patched, and still not working (sorry)...
I'm attaching the prepare output, and also a different, real udev debug 
output I captured using "udevadm monitor --environment" (udev.log file)

I added a sync command in ceph-disk-udev (this did not change a thing), and I 
noticed that the udev script is called 3 times when adding one disk, and that the 
debug output was captured and then all mixed into one file.
This may lead to log misinterpretation (race conditions ?)...
I changed the logging a bit in order to get one file per call and attached 
those logs to this mail.

File timestamps are as follows :
  File: '/var/log/udev_ceph.log.out.22706'
Change: 2014-10-10 15:48:09.136386306 +0200
  File: '/var/log/udev_ceph.log.out.22749'
Change: 2014-10-10 15:48:11.502425395 +0200
  File: '/var/log/udev_ceph.log.out.22750'
Change: 2014-10-10 15:48:11.606427113 +0200

Actually, I can reproduce the UUID=0 thing with this command :

[root@ceph1 ~]# /usr/sbin/ceph-disk -v activate-journal /dev/sdc2
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid 
--osd-journal /dev/sdc2
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
DEBUG:ceph-disk:Journal /dev/sdc2 has OSD UUID 
----
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- 
/dev/disk/by-partuuid/----
error: /dev/disk/by-partuuid/----: No such file 
or directory
ceph-disk: Cannot discover filesystem type: device 
/dev/disk/by-partuuid/----: Command 
'/sbin/blkid' returned non-zero exit status 2

Ah - to answer previous mails :
- I tried to manually create the gpt partition table to see if things would 
improve, but this was not the case (I also tried to zero out the start and end 
of disks, and also to add random data)
- running ceph-disk prepare twice does not work, it's just that once every 20 
(?) times it surprisingly does not fail on this hardware/os combination ;)

Regards

-Message d'origine-
De : Loic Dachary [mailto:l...@dachary.org] 
Envoyé : vendredi 10 octobre 2014 14:37
À : SCHAER Frederic; ceph-users@lists.ceph.com
Objet : Re: [ceph-users] ceph-dis prepare : 
UUID=----

Hi Frederic,

To be 100% sure it would be great if you could manually patch your local 
ceph-disk script and change 'partprobe', into 'partx', '-a', in 
https://github.com/ceph/ceph/blob/v0.80.6/src/ceph-disk#L1284
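i.e. something as crude as this should do it (back up the file first, and note the 
installed path may differ on your system) :

sed -i.bak "s/'partprobe',/'partx', '-a',/" /usr/sbin/ceph-disk

then :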

ceph-disk zap
ceph-disk prepare

and hopefully it will show up as it should. It works for me on centos7 but ...

Cheers

On 10/10/2014 14:33, Loic Dachary wrote:
 Hi Frederic,
 
 It looks like this is just because 
 https://github.com/ceph/ceph/blob/v0.80.6/src/ceph-disk#L1284 should call 
 partx instead of partprobe. The udev debug output makes this quite clear 
 http://tracker.ceph.com/issues/9721
 
 I think 
 https://github.com/dachary/ceph/commit/8d914001420e5bfc1e12df2d4882bfe2e1719a5c#diff-788c3cea6213c27f5fdb22f8337096d5R1285
  fixes it
 
 Cheers
 
 On 09/10/2014 16:29, SCHAER Frederic wrote:


 -Message d'origine-
 De : Loic Dachary [mailto:l...@dachary.org] 
 Envoyé : jeudi 9 octobre 2014 16:20
 À : SCHAER Frederic; ceph-users@lists.ceph.com
 Objet : Re: [ceph-users] ceph-dis prepare : 
 UUID=----



 On 09/10/2014 16:04, SCHAER Frederic wrote:
 Hi Loic,

 Back on sdb, as the sde output was from another machine on which I ran 
 partx -u afterwards.
 To reply to your last question first : I think the SG_IO error comes from the 
 fact that disks are exported as a single disks RAID0 on a PERC 6/E, which 
 does not support JBOD - this is decommissioned hardware on which I'd like 
 to test and validate we can use ceph for our use case...

 So back on the  UUID.
 It's funny : I retried and ceph-disk prepare worked this time. I tried on 
 another disk, and it failed.
 There is a difference in the output from ceph-disk : on the failing disk, I 
 have these extra lines after disks are prepared :

 (...)
 realtime =none   extsz=4096   blocks=0, rtextents=0
 Warning: The kernel is still using the old partition table.
 The new table will be used at the next reboot.
 The operation has completed successfully.
 partx: /dev/sdc: error adding partitions 1-2

 I didn't have the warning about the old partition tables on the disk that 
 worked. 
 So on this new disk, I have :

 [root@ceph1 ~]# mount /dev/sdc1 /mnt
 [root@ceph1 ~]# ll /mnt/
 total 16
 -rw-r--r-- 1 root root 37 Oct  9 15:58 ceph_fsid
 -rw-r--r-- 1 root root 37 Oct  9 15:58 fsid
 lrwxrwxrwx 1 root root 58 Oct  9 15:58 journal - 
 /dev/disk/by-partuuid/5e50bb8b-0b99-455f-af71-10815a32bfbc
 -rw-r--r-- 1 root root 37 Oct  9 15:58 journal_uuid
 -rw-r--r-- 1 root root 21 Oct  9 15:58 magic

 [root@ceph1 ~]# cat /mnt/journal_uuid
 5e50bb8b-0b99-455f-af71-10815a32bfbc

 [root@ceph1

[ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-09 Thread SCHAER Frederic
Hi,

I am setting up a test ceph cluster, on decommissioned  hardware (hence : not 
optimal, I know).
I have installed CentOS7, installed and setup ceph mons and OSD machines using 
puppet, and now I'm trying to add OSDs with the servers OSD disks... and I have 
issues (of course ;) )
I used the Ceph RHEL7 RPMs (ceph-0.80.6-0.el7.x86_64)

When I run ceph-disk prepare for a disk, I most of the time (but not always) 
get the partitions created, but not activated :

[root@ceph4 ~]# ceph-disk list|grep sdh
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying 
sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sdh :
/dev/sdh1 ceph data, prepared, cluster ceph, journal /dev/sdh2
/dev/sdh2 ceph journal, for /dev/sdh1

I tried to debug udev rules thinking they were not launched to activate the 
OSD, but they are, and they fail on this error :

+ ln -sf ../../sdh2 /dev/disk/by-partuuid/5b3bde8f-ccad-4093-a8a5-ad6413ae8931
+ mkdir -p /dev/disk/by-parttypeuuid
+ ln -sf ../../sdh2 
/dev/disk/by-parttypeuuid/45b0969e-9b03-4f30-b4c6-b4b80ceff106.5b3bde8f-ccad-4093-a8a5-ad6413ae8931
+ case $ID_PART_ENTRY_TYPE in
+ /usr/sbin/ceph-disk -v activate-journal /dev/sdh2
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid 
--osd-journal /dev/sdh2
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
DEBUG:ceph-disk:Journal /dev/sdh2 has OSD UUID 
----
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- 
/dev/disk/by-partuuid/----
error: /dev/disk/by-partuuid/----: No such file 
or directory
ceph-disk: Cannot discover filesystem type: device 
/dev/disk/by-partuuid/----: Command 
'/sbin/blkid' returned non-zero exit status 2
+ exit
+ exec

You'll notice the zeroed UUID...
Because of this, I looked at the output of ceph-disk prepare, and saw that 
partx complains at the end (this is the partx -a command) :

Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.
partx: /dev/sdh: error adding partitions 1-2

And indeed, running partx -a /dev/sdh does not change anything.
But I just discovered that running partx -u /dev/sdh will fix everything 

I.e : right after I send this update command to the kernel, my debug logs show 
that the udev rule does everything fine and the OSD starts up.

I'm therefore wondering what I did wrong ?
Is it CentOS 7 that is misbehaving, or the kernel, or... ?
Any reason why partx -a is used instead of partx -u ?

I'd be glad to hear others advice on this !
Thanks & regards

Frederic Schaer

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-09 Thread SCHAER Frederic
Hi Loic,

With this example disk/machine that I left untouched until now :

/dev/sdb :
 /dev/sdb1 ceph data, prepared, cluster ceph, osd.44, journal /dev/sdb2
 /dev/sdb2 ceph journal, for /dev/sdb1

[root@ceph1 ~]# ll /dev/disk/by-partuuid/
total 0
lrwxrwxrwx 1 root root 10 Oct  9 15:09 2c27dbda-fbe3-48d6-80fe-b513e1c11702 - 
../../sdb1
lrwxrwxrwx 1 root root 10 Oct  9 15:09 d2352e3b-f7f2-40c7-8273-8bfa8ab4206a - 
../../sdb2

This is the blkid output :

[root@ceph1 ~]# blkid  /dev/sdb2
[root@ceph1 ~]# blkid  /dev/sdb1
/dev/sdb1: UUID=c8feaaad-bd83-41a3-a82a-0a8727d0b067 TYPE=xfs 
PARTLABEL=ceph data PARTUUID=2c27dbda-fbe3-48d6-80fe-b513e1c11702

If I run partx -u /dev/sdb, then the filesystem will get activated and the 
OSD started.
And sometimes, it just works without intervention, but that's the exception.

I modified the udev script this morning, so I can give you the output of what 
happens when things go wrong : links are created, but somewhere the UUID is 
wrongly detected by ceph-osd, as far as I understand :

Thu Oct  9 11:15:13 CEST 2014
+ PARTNO=2
+ NAME=sde2
+ PARENT_NAME=sde
++ /usr/sbin/sgdisk --info=2 /dev/sde
++ grep 'Partition GUID code'
++ awk '{print $4}'
++ tr '[:upper:]' '[:lower:]'
+ ID_PART_ENTRY_TYPE=45b0969e-9b03-4f30-b4c6-b4b80ceff106
+ '[' -z 45b0969e-9b03-4f30-b4c6-b4b80ceff106 ']'
++ /usr/sbin/sgdisk --info=2 /dev/sde
++ grep 'Partition unique GUID'
++ awk '{print $4}'
++ tr '[:upper:]' '[:lower:]'
+ ID_PART_ENTRY_UUID=a9e8d490-82a7-48c1-8ef1-aff92351c69c
+ mkdir -p /dev/disk/by-partuuid
+ ln -sf ../../sde2 /dev/disk/by-partuuid/a9e8d490-82a7-48c1-8ef1-aff92351c69c
+ mkdir -p /dev/disk/by-parttypeuuid
+ ln -sf ../../sde2 
/dev/disk/by-parttypeuuid/45b0969e-9b03-4f30-b4c6-b4b80ceff106.a9e8d490-82a7-48c1-8ef1-aff92351c69c
+ case $ID_PART_ENTRY_TYPE in
+ /usr/sbin/ceph-disk -v activate-journal /dev/sde2
INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid 
--osd-journal /dev/sde2
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
DEBUG:ceph-disk:Journal /dev/sde2 has OSD UUID 
----
INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE -ovalue -- 
/dev/disk/by-partuuid/----
error: /dev/disk/by-partuuid/----: No such file 
or directory
ceph-disk: Cannot discover filesystem type: device 
/dev/disk/by-partuuid/----: Command 
'/sbin/blkid' returned non-zero exit status 2
+ exit
+ exec

regards

Frederic.

P.S : in your puppet module, it seems impossible to specify osd disks by path, 
i.e : 
ceph::profile::params::osds:
  '/dev/disk/by-path/pci-\:0a\:00.0-scsi-0\:2\:':
(I tried without the backslashes too)

-Message d'origine-
De : Loic Dachary [mailto:l...@dachary.org] 
Envoyé : jeudi 9 octobre 2014 15:01
À : SCHAER Frederic; ceph-users@lists.ceph.com
Objet : Re: [ceph-users] ceph-dis prepare : 
UUID=----

Bonjour,

I'm not familiar with RHEL7 but willing to learn ;-) I recently ran into 
confusing situations regarding the content of /dev/disk/by-partuuid because 
partprobe was not called when it should have (ubuntu). On RHEL, kpartx is used 
instead because partprobe reboots, apparently. What is the content of 
/dev/disk/by-partuuid on your machine ?

ls -l /dev/disk/by-partuuid 

Cheers

On 09/10/2014 12:24, SCHAER Frederic wrote:
 Hi,
 
  
 
 I am setting up a test ceph cluster, on decommissioned  hardware (hence : not 
 optimal, I know).
 
 I have installed CentOS7, installed and setup ceph mons and OSD machines 
 using puppet, and now I'm trying to add OSDs with the servers OSD disks. and 
 I have issues (of course ;) )
 
 I used the Ceph RHEL7 RPMs (ceph-0.80.6-0.el7.x86_64)
 
  
 
 When I run ceph-disk prepare for a disk, I most of the time (but not 
 always) get the partitions created, but not activated :
 
  
 
 [root@ceph4 ~]# ceph-disk list|grep sdh
 
 WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying 
 sgdisk; may not correctly identify ceph volumes with dmcrypt
 
 /dev/sdh :
 
 /dev/sdh1 ceph data, prepared, cluster ceph, journal /dev/sdh2
 
 /dev/sdh2 ceph journal, for /dev/sdh1
 
  
 
 I tried to debug udev rules thinking they were not launched to activate the 
 OSD, but they are, and they fail on this error :
 
  
 
 + ln -sf ../../sdh2 /dev/disk/by-partuuid/5b3bde8f-ccad-4093-a8a5-ad6413ae8931
 
 + mkdir -p /dev/disk/by-parttypeuuid
 
 + ln -sf ../../sdh2 
 /dev/disk/by-parttypeuuid/45b0969e-9b03-4f30-b4c6-b4b80ceff106.5b3bde8f-ccad-4093-a8a5-ad6413ae8931
 
 + case $ID_PART_ENTRY_TYPE in
 
 + /usr/sbin/ceph-disk -v activate-journal /dev/sdh2
 
 INFO:ceph-disk:Running command: /usr/bin/ceph-osd -i 0 --get-journal-uuid 
 --osd-journal /dev/sdh2
 
 SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-09 Thread SCHAER Frederic
Hi Loic,

Back on sdb, as the sde output was from another machine on which I ran partx -u 
afterwards.
To reply to your last question first : I think the SG_IO error comes from the fact 
that disks are exported as a single disks RAID0 on a PERC 6/E, which does not 
support JBOD - this is decommissioned hardware on which I'd like to test and 
validate we can use ceph for our use case...

So back on the  UUID.
It's funny : I retried and ceph-disk prepare worked this time. I tried on 
another disk, and it failed.
There is a difference in the output from ceph-disk : on the failing disk, I 
have these extra lines after disks are prepared :

(...)
realtime =none   extsz=4096   blocks=0, rtextents=0
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
The operation has completed successfully.
partx: /dev/sdc: error adding partitions 1-2

I didn't have the warning about the old partition tables on the disk that 
worked. 
So on this new disk, I have :

[root@ceph1 ~]# mount /dev/sdc1 /mnt
[root@ceph1 ~]# ll /mnt/
total 16
-rw-r--r-- 1 root root 37 Oct  9 15:58 ceph_fsid
-rw-r--r-- 1 root root 37 Oct  9 15:58 fsid
lrwxrwxrwx 1 root root 58 Oct  9 15:58 journal - 
/dev/disk/by-partuuid/5e50bb8b-0b99-455f-af71-10815a32bfbc
-rw-r--r-- 1 root root 37 Oct  9 15:58 journal_uuid
-rw-r--r-- 1 root root 21 Oct  9 15:58 magic

[root@ceph1 ~]# cat /mnt/journal_uuid
5e50bb8b-0b99-455f-af71-10815a32bfbc

[root@ceph1 ~]# sgdisk --info=1 /dev/sdc
Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
Partition unique GUID: 244973DE-7472-421C-BB25-4B09D3F8D441
First sector: 10487808 (at 5.0 GiB)
Last sector: 1952448478 (at 931.0 GiB)
Partition size: 1941960671 sectors (926.0 GiB)
Attribute flags: 
Partition name: 'ceph data'

[root@ceph1 ~]# sgdisk --info=2 /dev/sdc
Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
Partition unique GUID: 5E50BB8B-0B99-455F-AF71-10815A32BFBC
First sector: 2048 (at 1024.0 KiB)
Last sector: 10485760 (at 5.0 GiB)
Partition size: 10483713 sectors (5.0 GiB)
Attribute flags: 
Partition name: 'ceph journal'

Puzzling, isn't it ?


-Message d'origine-
De : Loic Dachary [mailto:l...@dachary.org] 
Envoyé : jeudi 9 octobre 2014 15:37
À : SCHAER Frederic; ceph-users@lists.ceph.com
Objet : Re: [ceph-users] ceph-dis prepare : 
UUID=----


Does what do sgdisk --info=1 /dev/sde and sgdisk --info=2 /dev/sde print ?

It looks like the journal points to an incorrect location (you should see this 
by mounting /dev/sde1). Here is what I have on a cluster

root@bm0015:~# ls -l /var/lib/ceph/osd/ceph-1/
total 56
-rw-r--r--   1 root root  192 Nov  2  2013 activate.monmap
-rw-r--r--   1 root root3 Nov  2  2013 active
-rw-r--r--   1 root root   37 Nov  2  2013 ceph_fsid
drwxr-xr-x 114 root root 8192 Sep 14 11:01 current
-rw-r--r--   1 root root   37 Nov  2  2013 fsid
lrwxrwxrwx   1 root root   58 Nov  2  2013 journal - 
/dev/disk/by-partuuid/7e811295-1b45-477d-907a-41c4c90d9687
-rw-r--r--   1 root root   37 Nov  2  2013 journal_uuid
-rw---   1 root root   56 Nov  2  2013 keyring
-rw-r--r--   1 root root   21 Nov  2  2013 magic
-rw-r--r--   1 root root6 Nov  2  2013 ready
-rw-r--r--   1 root root4 Nov  2  2013 store_version
-rw-r--r--   1 root root   42 Dec 27  2013 superblock
-rw-r--r--   1 root root0 May  2 14:01 upstart
-rw-r--r--   1 root root2 Nov  2  2013 whoami
root@bm0015:~# cat /var/lib/ceph/osd/ceph-1/journal_uuid
7e811295-1b45-477d-907a-41c4c90d9687
root@bm0015:~#

I guess in your case the content of journal_uuid is 0- etc. for some 
reason.

Do you know where that

SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

comes from ?

On 09/10/2014 15:20, SCHAER Frederic wrote:
 Hi Loic,
 
 With this example disk/machine that I left untouched until now :
 
 /dev/sdb :
  /dev/sdb1 ceph data, prepared, cluster ceph, osd.44, journal /dev/sdb2
  /dev/sdb2 ceph journal, for /dev/sdb1
 
 [root@ceph1 ~]# ll /dev/disk/by-partuuid/
 total 0
 lrwxrwxrwx 1 root root 10 Oct  9 15:09 2c27dbda-fbe3-48d6-80fe-b513e1c11702 
 - ../../sdb1
 lrwxrwxrwx 1 root root 10 Oct  9 15:09 d2352e3b-f7f2-40c7-8273-8bfa8ab4206a 
 - ../../sdb2
 
 This is the blkid output :
 
 [root@ceph1 ~]# blkid  /dev/sdb2
 [root@ceph1 ~]# blkid  /dev/sdb1
 /dev/sdb1: UUID=c8feaaad-bd83-41a3-a82a-0a8727d0b067 TYPE=xfs 
 PARTLABEL=ceph data PARTUUID=2c27dbda-fbe3-48d6-80fe-b513e1c11702
 
 If I run partx -u /dev/sdb, then the filesystem will get activated and the 
 OSD started.
 And sometimes, it just works without intervention, but that's the exception.
 
 I modified the udev script this morning, so I can give you the output of what 
 happens when things go wrong : links are created, but somewhere the UUIDD is 
 wrongly detected by ceph-osd, as far as I understand :
 
 Thu Oct  9 11

Re: [ceph-users] ceph-dis prepare : UUID=00000000-0000-0000-0000-000000000000

2014-10-09 Thread SCHAER Frederic


-Message d'origine-
De : Loic Dachary [mailto:l...@dachary.org] 
Envoyé : jeudi 9 octobre 2014 16:20
À : SCHAER Frederic; ceph-users@lists.ceph.com
Objet : Re: [ceph-users] ceph-dis prepare : 
UUID=----



On 09/10/2014 16:04, SCHAER Frederic wrote:
 Hi Loic,
 
 Back on sdb, as the sde output was from another machine on which I ran partx 
 -u afterwards.
 To reply to your last question first : I think the SG_IO error comes from the 
 fact that the disks are exported as single-disk RAID0 volumes on a PERC 6/E, which 
 does not support JBOD - this is decommissioned hardware on which I'd like to 
 test and validate that we can use ceph for our use case...
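 
 If it helps, this is roughly how I check what the controller exposes (just a 
 sketch ; the megaraid device number 0 is a guess, it depends on the slot) :
 
 lspci | grep -i raid                 # should list the PERC 6/E (LSI MegaRAID based)
 smartctl -i -d megaraid,0 /dev/sdc   # query the physical disk behind the RAID0 virtual disk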
 
 So back on the  UUID.
 It's funny : I retried and ceph-disk prepare worked this time. I tried on 
 another disk, and it failed.
 There is a difference in the output from ceph-disk : on the failing disk, I 
 have these extra lines after disks are prepared :
 
 (...)
 realtime =none   extsz=4096   blocks=0, rtextents=0
 Warning: The kernel is still using the old partition table.
 The new table will be used at the next reboot.
 The operation has completed successfully.
 partx: /dev/sdc: error adding partitions 1-2
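 
 When that happens, my manual workaround (just a sketch, assuming the disk is 
 /dev/sdc) is to force a partition table re-read before activating :
 
 partx -u /dev/sdc            # or : partprobe /dev/sdc
 udevadm settle               # wait for udev to finish creating the symlinks
 ceph-disk activate /dev/sdc1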
 
 I didn't have the warning about the old partition tables on the disk that 
 worked. 
 So on this new disk, I have :
 
 [root@ceph1 ~]# mount /dev/sdc1 /mnt
 [root@ceph1 ~]# ll /mnt/
 total 16
 -rw-r--r-- 1 root root 37 Oct  9 15:58 ceph_fsid
 -rw-r--r-- 1 root root 37 Oct  9 15:58 fsid
 lrwxrwxrwx 1 root root 58 Oct  9 15:58 journal -> /dev/disk/by-partuuid/5e50bb8b-0b99-455f-af71-10815a32bfbc
 -rw-r--r-- 1 root root 37 Oct  9 15:58 journal_uuid
 -rw-r--r-- 1 root root 21 Oct  9 15:58 magic
 
 [root@ceph1 ~]# cat /mnt/journal_uuid
 5e50bb8b-0b99-455f-af71-10815a32bfbc
 
 [root@ceph1 ~]# sgdisk --info=1 /dev/sdc
 Partition GUID code: 4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D (Unknown)
 Partition unique GUID: 244973DE-7472-421C-BB25-4B09D3F8D441
 First sector: 10487808 (at 5.0 GiB)
 Last sector: 1952448478 (at 931.0 GiB)
 Partition size: 1941960671 sectors (926.0 GiB)
 Attribute flags: 
 Partition name: 'ceph data'
 
 [root@ceph1 ~]# sgdisk --info=2 /dev/sdc
 Partition GUID code: 45B0969E-9B03-4F30-B4C6-B4B80CEFF106 (Unknown)
 Partition unique GUID: 5E50BB8B-0B99-455F-AF71-10815A32BFBC
 First sector: 2048 (at 1024.0 KiB)
 Last sector: 10485760 (at 5.0 GiB)
 Partition size: 10483713 sectors (5.0 GiB)
 Attribute flags: 
 Partition name: 'ceph journal'
 
 Puzzling, isn't it ?
 
 

Yes :-) Just to be 100% sure, when you try to activate this /dev/sdc it shows 
an error and complains that the journal uuid is 00000000-000* etc ? If so could you 
copy your udev debug output ?
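
For instance (a rough sketch, assuming a systemd based box ; adjust the disk name), 
something like this should capture enough detail :

udevadm control --log-priority=debug              # make udevd verbose
partx -u /dev/sdc                                 # re-trigger the events for the partitions
journalctl -u systemd-udevd -n 200                # collect the debug output
udevadm test --action=add /sys/class/block/sdc1   # dry-run the rules for the data partition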

Cheers

[- FS : -]  

No, when I manually activate the disk instead of attempting to go the udev way, 
it seems to work :
[root@ceph1 ~]# ceph-disk activate /dev/sdc1
got monmap epoch 1
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2014-10-09 16:21:43.286288 7f2be6a027c0 -1 journal check: ondisk fsid 
00000000-0000-0000-0000-000000000000 doesn't match expected 
244973de-7472-421c-bb25-4b09d3f8d441, invalid (someone else's?) journal
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SG_IO: bad/missing sense data, sb[]:  70 00 05 00 00 00 00 0b 00 00 00 00 20 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
2014-10-09 16:21:43.301957 7f2be6a027c0 -1 
filestore(/var/lib/ceph/tmp/mnt.4lJlzP) could not find 
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-10-09 16:21:43.305941 7f2be6a027c0 -1 created object store 
/var/lib/ceph/tmp/mnt.4lJlzP journal /var/lib/ceph/tmp/mnt.4lJlzP/journal for 
osd.47 fsid 70ac4a78-46c0-45e6-8ff9-878b37f50fa1
2014-10-09 16:21:43.305992 7f2be6a027c0 -1 auth: error reading file: 
/var/lib/ceph/tmp/mnt.4lJlzP/keyring: can't open 
/var/lib/ceph/tmp/mnt.4lJlzP/keyring: (2) No such file or directory
2014-10-09 16:21:43.306099 7f2be6a027c0 -1 created new key in keyring 
/var/lib/ceph/tmp/mnt.4lJlzP/keyring
added key for osd.47
=== osd.47 ===
create-or-move updating item name 'osd.47' weight 0.9 at location 
{host=ceph1,root=default} to crush map
Starting Ceph osd.47 on ceph1...
Running as unit run-12392.service.

The osd then appeared in the osd tree...
I attached the logs to this email (I just added a set -x in the script called 
by udev, and redirected the output)
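
For reference, the debug hack is nothing fancy, roughly this at the top of the 
script called by the udev rule (paths are from memory and may differ) :

#!/bin/bash
exec >> /var/log/udev_ceph.log 2>&1   # keep everything udev runs in one file
set -x                                # trace every command executed by the script
# ... the rest of the original activation script, unchanged ...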

Regards


Attachment: udev_ceph.log.out
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com