Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-16 Thread Lars Täuber
Thanks Paul for the judgement.

Tue, 16 Apr 2019 10:13:03 +0200
Paul Emmerich  ==> Lars Täuber  :
> Seems in line with what I'd expect for the hardware.
> 
> Your hardware seems to be way overspecced, you'd be fine with half the
> RAM, half the CPU and way cheaper disks.

Do you mean all the components of the cluster or only the OSD-nodes?
Before making the requirements I had only read about mirroring clusters. I was 
afraid of the CPUs being too slow to calculate the erasure codes we planned to 
use.


> In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

This is a really good hint, because we just started to plan the extension.

> 
> I'd probably only use the 25G network for both networks instead of
> using both. Splitting the network usually doesn't help.

This is something I was told to do, because reconstruction of failed 
OSDs/disks would have a heavy impact on the backend network.
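For reference, consolidating would just mean pointing both networks at the
same 25G subnet in ceph.conf, or dropping the cluster network entirely (a
minimal sketch with a made-up subnet):

[global]
public network  = 192.168.40.0/24
# either omit the next line, or point it at the same 25G fabric:
cluster network = 192.168.40.0/24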


> 
> Paul
> 

Thanks again.

Lars


Re: [ceph-users] Glance client and RBD export checksum mismatch

2019-04-16 Thread Brayan Perera
Dear All,

Thanks for the input.

We have created another pool and copied all the objects from images pool to it.

New pool : images_new

Once that was done, we ran the script against the new pool and received the
expected checksum. So the issue only happens with the pool currently
integrated with glance.

When comparing both pools' configurations, we received the same output.
===
[root@storage01moc pool_details]# ceph osd pool get images all
size: 4
min_size: 2
crash_replay_interval: 0
pg_num: 256
pgp_num: 256
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0

[root@storage01moc pool_details]# ceph osd pool get images_new all
size: 4
min_size: 2
crash_replay_interval: 0
pg_num: 256
pgp_num: 256
crush_rule: replicated_rule
hashpspool: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
auid: 0
fast_read: 0
===

So we are thinking of swapping the images pool with the images_new pool and
testing with glance.
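The swap itself should only be two renames while glance is stopped (a sketch,
assuming glance refers to the pool simply by the name "images"):

ceph osd pool rename images images_old
ceph osd pool rename images_new images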


Thanks & Regards,
Brayan

On Fri, Apr 12, 2019 at 11:34 AM Erik McCormick
 wrote:
>
>
>
> On Thu, Apr 11, 2019, 8:53 AM Jason Dillaman  wrote:
>>
>> On Thu, Apr 11, 2019 at 8:49 AM Erik McCormick
>>  wrote:
>> >
>> >
>> >
>> > On Thu, Apr 11, 2019, 8:39 AM Erik McCormick  
>> > wrote:
>> >>
>> >>
>> >>
>> >> On Thu, Apr 11, 2019, 12:07 AM Brayan Perera  
>> >> wrote:
>> >>>
>> >>> Dear Jason,
>> >>>
>> >>>
>> >>> Thanks for the reply.
>> >>>
>> >>> We are using python 2.7.5
>> >>>
>> >>> Yes. script is based on openstack code.
>> >>>
>> >>> As suggested, we have tried chunk_size 32 and 64, and both giving same
>> >>> incorrect checksum value.
>> >>
>> >>
>> >> The value of rbd_store_chunk_size in glance is expressed in MB and then 
>> >> converted to mb. I think the default is 8, so you would want 8192 if 
>> >> you're trying to match what the image was uploaded with.
>> >
>> >
>> > Sorry, that should have been "...converted to KB."
>>
>> Wouldn't it be converted to bytes since all rbd API methods are in bytes? [1]
>
>
> Well yeah in the end that's true. Old versions I recall just passed a KB 
> number, but now it's
>
> self.chunk_size = CONF.rbd_store_chunk_size * 1024 * 1024
>
> My main point though was just that glance defaults to 8 MB chunks, which is
> much larger than what the OP was using.
>
>>
>> >>
>> >>>
>> >>> We tried to copy same image in different pool and resulted same
>> >>> incorrect checksum.
>> >>>
>> >>>
>> >>> Thanks & Regards,
>> >>> Brayan
>> >>>
>> >>> On Wed, Apr 10, 2019 at 6:21 PM Jason Dillaman  
>> >>> wrote:
>> >>> >
>> >>> > On Wed, Apr 10, 2019 at 1:46 AM Brayan Perera 
>> >>> >  wrote:
>> >>> > >
>> >>> > > Dear All,
>> >>> > >
>> >>> > > Ceph Version : 12.2.5-2.ge988fb6.el7
>> >>> > >
>> >>> > > We are facing an issue on glance which have backend set to ceph, when
>> >>> > > we try to create an instance or volume out of an image, it throws
>> >>> > > checksum error.
>> >>> > > When we use rbd export and use md5sum, value is matching with glance 
>> >>> > > checksum.
>> >>> > >
>> >>> > > When we use following script, it provides same error checksum as 
>> >>> > > glance.
>> >>> >
>> >>> > What version of Python are you using?
>> >>> >
>> >>> > > We have used below images for testing.
>> >>> > > 1. Failing image (checksum mismatch): 
>> >>> > > ffed4088-74e1-4f22-86cb-35e7e97c377c
>> >>> > > 2. Passing image (checksum identical): 
>> >>> > > c048f0f9-973d-4285-9397-939251c80a84
>> >>> > >
>> >>> > > Output from storage node:
>> >>> > >
>> >>> > > 1. Failing image: ffed4088-74e1-4f22-86cb-35e7e97c377c
>> >>> > > checksum from glance database: 34da2198ec7941174349712c6d2096d8
>> >>> > > [root@storage01moc ~]# python test_rbd_format.py
>> >>> > > ffed4088-74e1-4f22-86cb-35e7e97c377c admin
>> >>> > > Image size: 681181184
>> >>> > > checksum from ceph: b82d85ae5160a7b74f52be6b5871f596
>> >>> > > Remarks: checksum is different
>> >>> > >
>> >>> > > 2. Passing image: c048f0f9-973d-4285-9397-939251c80a84
>> >>> > > checksum from glance database: 4f977f748c9ac2989cff32732ef740ed
>> >>> > > [root@storage01moc ~]# python test_rbd_format.py
>> >>> > > c048f0f9-973d-4285-9397-939251c80a84 admin
>> >>> > > Image size: 1411121152
>> >>> > > checksum from ceph: 4f977f748c9ac2989cff32732ef740ed
>> >>> > > Remarks: checksum is identical
>> >>> > >
>> >>> > > Wondering whether this issue is from ceph python libs or from ceph 
>> >>> > > itself.
>> >>> > >
>> >>> > > Please note that we do not have ceph pool tiering configured.
>> >>> > >
>> >>> > > Please let us know whether anyone faced similar issue and any fixes 
>> >>> > > for this.
>> >>> > >
>> >>> > > test_rbd_format.py
>> >>> > > ===
>> >>> > > import rados, sys, rbd
>> >>> > >
>> >>> > > image_id = 

Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-16 Thread Can Zhang
Thanks for your suggestions.

I tried to build libfio_ceph_objectstore.so, but it fails to load:

```
$ LD_LIBRARY_PATH=./lib ./bin/fio --enghelp=libfio_ceph_objectstore.so

fio: engine libfio_ceph_objectstore.so not loadable
IO engine libfio_ceph_objectstore.so not found
```

I managed to print the dlopen error, it said:

```
dlopen error: ./lib/libfio_ceph_objectstore.so: undefined symbol:
_ZTIN13PriorityCache8PriCacheE
```

I found a not-so-relevant issue (https://tracker.ceph.com/issues/38360);
the error seems to be caused by mixed versions. My build environment is
CentOS 7.5.1804 with SCL devtoolset-7, and ceph is the latest master branch.
Does anyone know about this symbol?
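In case it helps: the symbol demangles to the typeinfo of
PriorityCache::PriCache, so one quick check is whether the libraries in the
build directory actually export it (library name assumed, adjust to your
build tree):

$ c++filt _ZTIN13PriorityCache8PriCacheE
typeinfo for PriorityCache::PriCache
$ nm -D ./lib/libceph-common.so* | grep _ZTIN13PriorityCache8PriCacheE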


Best,
Can Zhang



On Tue, Apr 16, 2019 at 8:37 PM Igor Fedotov  wrote:
>
> Besides already mentioned store_test.cc one can also use ceph
> objectstore fio plugin
> (https://github.com/ceph/ceph/tree/master/src/test/fio) to access
> standalone BlueStore instance from FIO benchmarking tool.
>
>
> Thanks,
>
> Igor
>
> On 4/16/2019 7:58 AM, Can ZHANG wrote:
> > Hi,
> >
> > I'd like to run a standalone Bluestore instance so as to test and tune
> > its performance. Are there any tools about it, or any suggestions?
> >
> >
> >
> > Best,
> > Can Zhang
> >


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-16 Thread Matt Benjamin
Hi Francois,

Why is using an explicit unix socket problematic for you?  For what it
does, that decision has always seemed sensible.

Matt

On Tue, Apr 16, 2019 at 7:04 PM Francois Lafont
 wrote:
>
> Hi @all,
>
> On 4/9/19 12:43 PM, Francois Lafont wrote:
>
> > I have tried this config:
> >
> > -
> > rgw enable ops log  = true
> > rgw ops log socket path = /tmp/opslog
> > rgw log http headers= http_x_forwarded_for
> > -
> >
> > and I have logs in the socket /tmp/opslog like this:
> >
> > -
> > {"bucket":"test1","time":"2019-04-09 
> > 09:41:18.188350Z","time_local":"2019-04-09 
> > 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET
> >  /?prefix=toto/=%2F 
> > HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk
> >  1.05 ( http://www.dragondisk.com 
> > )","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},
> > -
> >
> > I can see the IP address of the client in the value of 
> > HTTP_X_FORWARDED_FOR, that's cool.
> >
> > But I don't understand why there is a specific socket to log that? I'm 
> > using radosgw in a Docker container (installed via ceph-ansible) and I have 
> > logs of the "radosgw" daemon in the "/var/log/syslog" file of my host (I'm 
> > using the Docker "syslog" log-driver).
> >
> > 1. Why is there a _separate_ log source for that? Indeed, in 
> > "/var/log/syslog" I have already some logs of civetweb. For instance:
> >
> >  2019-04-09 12:33:45.926 7f02e021c700  1 civetweb: 0x55876dc9c000: 
> > 10.111.222.51 - - [09/Apr/2019:12:33:45 +0200] "GET 
> > /?prefix=toto/=%2F HTTP/1.1" 200 1014 - DragonDisk 1.05 ( 
> > http://www.dragondisk.com )
>
> The fact that radosgw uses a separate log source for "ops log" (ie a specific 
> Unix socket) is still very mysterious for me.
>
>
> > 2. In my Docker container context, is it possible to put the logs above in 
> > the file "/var/log/syslog" of my host, in other words is it possible to 
> > make sure to log this in stdout of the daemon "radosgw"?
>
> It seems to me impossible to put ops log in the stdout of the "radosgw" 
> process (or, if it's possible, I have not found). So I have made a 
> workaround. I have set:
>
>  rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.asok
>
> in my ceph.conf and I have created a daemon (via a systemd unit file) which 
> runs this loop:
>
>  while true;
>  do
>  netcat -U "/var/run/ceph/rgw-opslog.asok" | logger -t "rgwops" -p 
> "local5.notice"
>  done
>
> to retrieve logs in syslog. It's not very satisfying but it works.
>
> --
> François (flaf)



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309


Re: [ceph-users] Try to log the IP in the header X-Forwarded-For with radosgw behind haproxy

2019-04-16 Thread Francois Lafont

Hi @all,

On 4/9/19 12:43 PM, Francois Lafont wrote:


I have tried this config:

-
rgw enable ops log  = true
rgw ops log socket path = /tmp/opslog
rgw log http headers    = http_x_forwarded_for
-

and I have logs in the socket /tmp/opslog like this:

-
{"bucket":"test1","time":"2019-04-09 09:41:18.188350Z","time_local":"2019-04-09 11:41:18.188350","remote_addr":"10.111.222.51","user":"flaf","operation":"GET","uri":"GET /?prefix=toto/=%2F 
HTTP/1.1","http_status":"200","error_code":"","bytes_sent":832,"bytes_received":0,"object_size":0,"total_time":39,"user_agent":"DragonDisk 1.05 ( http://www.dragondisk.com 
)","referrer":"","http_x_headers":[{"HTTP_X_FORWARDED_FOR":"10.111.222.55"}]},
-

I can see the IP address of the client in the value of HTTP_X_FORWARDED_FOR, 
that's cool.

But I don't understand why there is a specific socket to log that? I'm using radosgw in a Docker container 
(installed via ceph-ansible) and I have logs of the "radosgw" daemon in the 
"/var/log/syslog" file of my host (I'm using the Docker "syslog" log-driver).

1. Why is there a _separate_ log source for that? Indeed, in "/var/log/syslog" 
I have already some logs of civetweb. For instance:

     2019-04-09 12:33:45.926 7f02e021c700  1 civetweb: 0x55876dc9c000: 10.111.222.51 - - 
[09/Apr/2019:12:33:45 +0200] "GET /?prefix=toto/=%2F HTTP/1.1" 200 
1014 - DragonDisk 1.05 ( http://www.dragondisk.com )


The fact that radosgw uses a separate log source for "ops log" (ie a specific 
Unix socket) is still very mysterious for me.



2. In my Docker container context, is it possible to put the logs above in the file 
"/var/log/syslog" of my host, in other words is it possible to make sure to log this in 
stdout of the daemon "radosgw"?


It seems to me impossible to put the ops log on the stdout of the "radosgw" 
process (or, if it's possible, I have not found how). So I have made a 
workaround. I have set:

rgw_ops_log_socket_path = /var/run/ceph/rgw-opslog.asok

in my ceph.conf and I have created a daemon (via a systemd unit file) which 
runs this loop:

while true;
do
netcat -U "/var/run/ceph/rgw-opslog.asok" | logger -t "rgwops" -p 
"local5.notice"
done

to retrieve logs in syslog. It's not very satisfying but it works.
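For completeness, the loop can also be handed to systemd directly so it gets
restarted automatically (a sketch only; unit name and paths assumed):

# /etc/systemd/system/rgw-opslog-forward.service
[Unit]
Description=Forward radosgw ops log socket to syslog
After=network.target

[Service]
ExecStart=/bin/sh -c 'netcat -U /var/run/ceph/rgw-opslog.asok | logger -t rgwops -p local5.notice'
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target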

--
François (flaf)


Re: [ceph-users] showing active config settings

2019-04-16 Thread Brad Hubbard
$ ceph config set osd osd_recovery_max_active 4
$ ceph daemon osd.0 config diff|grep -A5 osd_recovery_max_active
"osd_recovery_max_active": {
"default": 3,
"mon": 4,
"override": 4,
"final": 4
},

On Wed, Apr 17, 2019 at 5:29 AM solarflow99  wrote:
>
> I wish there was a way to query the running settings from one of the MGR 
> hosts, and it doesn't help that ansible doesn't even copy the keyring to the 
> OSD nodes so commands there wouldn't work anyway.
> I'm still puzzled why it doesn't show any change when I run this no matter 
> what I set it to:
>
> # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>
> in fact it doesn't matter if I use an OSD number that doesn't exist, same 
> thing if I use ceph get
>
>
>
> On Tue, Apr 16, 2019 at 1:18 AM Brad Hubbard  wrote:
>>
>> On Tue, Apr 16, 2019 at 6:03 PM Paul Emmerich  wrote:
>> >
>> > This works, it just says that it *might* require a restart, but this
>> > particular option takes effect without a restart.
>>
>> We've already looked at changing the wording once to make it more palatable.
>>
>> http://tracker.ceph.com/issues/18424
>>
>> >
>> > Implementation detail: this message shows up if there's no internal
>> > function to be called when this option changes, so it can't be sure if
>> > the change is actually doing anything because the option might be
>> > cached or only read on startup. But in this case this option is read
>> > in the relevant path every time and no notification is required. But
>> > the injectargs command can't know that.
>>
>> Right on all counts. The functions are referred to as observers and
>> register to be notified if the value changes, hence "not observed."
>>
>> >
>> > Paul
>> >
>> > On Mon, Apr 15, 2019 at 11:38 PM solarflow99  wrote:
>> > >
>> > > Then why doesn't this work?
>> > >
>> > > # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
>> > > osd.0: osd_recovery_max_active = '4' (not observed, change may require 
>> > > restart)
>> > > osd.1: osd_recovery_max_active = '4' (not observed, change may require 
>> > > restart)
>> > > osd.2: osd_recovery_max_active = '4' (not observed, change may require 
>> > > restart)
>> > > osd.3: osd_recovery_max_active = '4' (not observed, change may require 
>> > > restart)
>> > > osd.4: osd_recovery_max_active = '4' (not observed, change may require 
>> > > restart)
>> > >
>> > > # ceph -n osd.1 --show-config | grep osd_recovery_max_active
>> > > osd_recovery_max_active = 3
>> > >
>> > >
>> > >
>> > > On Wed, Apr 10, 2019 at 7:21 AM Eugen Block  wrote:
>> > >>
>> > >> > I always end up using "ceph --admin-daemon
>> > >> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to get 
>> > >> > what
>> > >> > is in effect now for a certain daemon.
>> > >> > Needs you to be on the host of the daemon of course.
>> > >>
>> > >> Me too, I just wanted to try what OP reported. And after trying that,
>> > >> I'll keep it that way. ;-)
>> > >>
>> > >>
>> > >> Zitat von Janne Johansson :
>> > >>
>> > >> > Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :
>> > >> >
>> > >> >> > If you don't specify which daemon to talk to, it tells you what the
>> > >> >> > defaults would be for a random daemon started just now using the 
>> > >> >> > same
>> > >> >> > config as you have in /etc/ceph/ceph.conf.
>> > >> >>
>> > >> >> I tried that, too, but the result is not correct:
>> > >> >>
>> > >> >> host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
>> > >> >> osd_recovery_max_active = 3
>> > >> >>
>> > >> >
>> > >> > I always end up using "ceph --admin-daemon
>> > >> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to get 
>> > >> > what
>> > >> > is in effect now for a certain daemon.
>> > >> > Needs you to be on the host of the daemon of course.
>> > >> >
>> > >> > --
>> > >> > May the most significant bit of your life be positive.
>> > >>
>> > >>
>> > >>
>>
>>
>>
>> --
>> Cheers,
>> Brad



-- 
Cheers,
Brad


Re: [ceph-users] showing active config settings

2019-04-16 Thread solarflow99
I wish there was a way to query the running settings from one of the MGR
hosts, and it doesn't help that ansible doesn't even copy the keyring to
the OSD nodes so commands there wouldn't work anyway.
I'm still puzzled why it doesn't show any change when I run this no matter
what I set it to:

# ceph -n osd.1 --show-config | grep osd_recovery_max_active
osd_recovery_max_active = 3

in fact it doesn't matter if I use an OSD number that doesn't exist, same
thing if I use ceph get



On Tue, Apr 16, 2019 at 1:18 AM Brad Hubbard  wrote:

> On Tue, Apr 16, 2019 at 6:03 PM Paul Emmerich 
> wrote:
> >
> > This works, it just says that it *might* require a restart, but this
> > particular option takes effect without a restart.
>
> We've already looked at changing the wording once to make it more
> palatable.
>
> http://tracker.ceph.com/issues/18424
>
> >
> > Implementation detail: this message shows up if there's no internal
> > function to be called when this option changes, so it can't be sure if
> > the change is actually doing anything because the option might be
> > cached or only read on startup. But in this case this option is read
> > in the relevant path every time and no notification is required. But
> > the injectargs command can't know that.
>
> Right on all counts. The functions are referred to as observers and
> register to be notified if the value changes, hence "not observed."
>
> >
> > Paul
> >
> > On Mon, Apr 15, 2019 at 11:38 PM solarflow99 
> wrote:
> > >
> > > Then why doesn't this work?
> > >
> > > # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> > > osd.0: osd_recovery_max_active = '4' (not observed, change may require
> restart)
> > > osd.1: osd_recovery_max_active = '4' (not observed, change may require
> restart)
> > > osd.2: osd_recovery_max_active = '4' (not observed, change may require
> restart)
> > > osd.3: osd_recovery_max_active = '4' (not observed, change may require
> restart)
> > > osd.4: osd_recovery_max_active = '4' (not observed, change may require
> restart)
> > >
> > > # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> > > osd_recovery_max_active = 3
> > >
> > >
> > >
> > > On Wed, Apr 10, 2019 at 7:21 AM Eugen Block  wrote:
> > >>
> > >> > I always end up using "ceph --admin-daemon
> > >> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to
> get what
> > >> > is in effect now for a certain daemon.
> > >> > Needs you to be on the host of the daemon of course.
> > >>
> > >> Me too, I just wanted to try what OP reported. And after trying that,
> > >> I'll keep it that way. ;-)
> > >>
> > >>
> > >> Zitat von Janne Johansson :
> > >>
> > >> > Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :
> > >> >
> > >> >> > If you don't specify which daemon to talk to, it tells you what
> the
> > >> >> > defaults would be for a random daemon started just now using the
> same
> > >> >> > config as you have in /etc/ceph/ceph.conf.
> > >> >>
> > >> >> I tried that, too, but the result is not correct:
> > >> >>
> > >> >> host1:~ # ceph -n osd.1 --show-config | grep
> osd_recovery_max_active
> > >> >> osd_recovery_max_active = 3
> > >> >>
> > >> >
> > >> > I always end up using "ceph --admin-daemon
> > >> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to
> get what
> > >> > is in effect now for a certain daemon.
> > >> > Needs you to be on the host of the daemon of course.
> > >> >
> > >> > --
> > >> > May the most significant bit of your life be positive.
> > >>
> > >>
> > >>
>
>
>
> --
> Cheers,
> Brad
>


Re: [ceph-users] Cannot quiet "pools have many more objects per pg than average" warning

2019-04-16 Thread Sergei Genchev
On Tue, Apr 16, 2019 at 1:28 PM Paul Emmerich  wrote:
>
> I think the warning is triggered by the mgr daemon and not the mon,
> try setting it there
>
Thank you Paul.

How do I set it in the mgr daemon?
I tried:
  ceph tell mon.* injectargs '--mgr_pg_warn_max_object_skew 0'
  ceph tell mgr.* injectargs '--mon_pg_warn_max_object_skew 0'
  ceph tell mgr.* injectargs '--mgr_pg_warn_max_object_skew 0'
All of these commands error out.
in ceph.conf I tried:
[mgr]
mon pg warn max object skew = 0

But it did not help either :-(
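One more thing left on my list to try (a sketch, assuming the centralized
config in Mimic delivers this option to the mgr and that restarting the
active mgr makes it re-read it; the systemd unit name depends on the
deployment):

ceph config set mgr mon_pg_warn_max_object_skew 10
systemctl restart ceph-mgr@<id>    # on the active mgr host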


Re: [ceph-users] Limits of mds bal fragment size max

2019-04-16 Thread Kjetil Joergensen
On Fri, Apr 12, 2019 at 10:31 AM Benjeman Meekhof  wrote:
>
> We have a user syncing data with some kind of rsync + hardlink based
> system creating/removing large numbers of hard links.  We've
> encountered many of the issues with stray inode re-integration as
> described in the thread and tracker below.
>
> As noted one fix is to increase mds_bal_fragment_size_max so the stray
> directories can accommodate the high stray count.  We blew right
> through 200,000, then 300,000, and at this point I'm wondering if
> there is an upper safe limit on this parameter?   If I go to something
> like 1mil to work with this use case will I have other problems?

I'd recommend trying to find a solution that doesn't require you to tweak this.

We ended up essentially doing a "repository of origin files", and maybe abusing
rsync --link-dest (I don't quite recall). This was a case where changes were
always additive at the file level: files never change and are only ever added,
never removed. So we didn't have to worry about "garbage collecting" it, and
the amounts were also pretty small.

Assuming it doesn't fragment the stray directories, your primary problem is
going to be omap sizes:

Problems we've run into with large omaps:
- Replication of omaps isn't fast if you ever have to do recovery
  (which you will).
- LevelDB/RocksDB compaction for large sets is painful - the bigger, the more
  painful. This is the kind of thing that'll creep up on you: you may not
  notice it until you have a multi-minute compaction, which ends up blocking
  requests at the affected osd(s) for the duration.
- OSDs being flagged as down due to the above, once the omaps get
  sufficiently large.
- Specifically for ceph-mds and stray directories, potentially higher
  memory usage.
- Back on hammer we suspected we'd found some replication corner-cases where
  we ended up with omaps out of sync (inconsistent objects, which required
  some surgery with ceph-objectstore-tool); this happened infrequently. Given
  that you're essentially exceeding "recommended" limits, you are more likely
  to find corner-cases/bugs though.

In terms of "actual numbers" - I'm hesitant to commit to anything, at
some point we did run
mds with bal fragment size max of 10M. We didn't notice any problems,
this could well be
because this was the cluster that were the target of every experiments
- it was a very noisy
environment with relatively low expectations. Where we *really*
noticed the omaps I think
were well over 10M, although since it came from radosgw on jewel, it
crept up on us and
didn't appear on our radars until we had blocked requests an osd for
minutes ending up
affecting i.e. rbd.
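If you do raise the limit, it's worth keeping an eye on the stray count so
the next ceiling doesn't surprise you (a sketch; run it on the host of the
active mds, daemon name assumed):

ceph daemon mds.<name> perf dump | grep num_strays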

> Background:
> https://www.spinics.net/lists/ceph-users/msg51985.html
> http://tracker.ceph.com/issues/38849
>
> thanks,
> Ben



-- 
Kjetil Joergensen 
SRE, Medallia Inc


Re: [ceph-users] Cannot quiet "pools have many more objects per pg than average" warning

2019-04-16 Thread Paul Emmerich
I think the warning is triggered by the mgr daemon and not the mon,
try setting it there


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Apr 16, 2019 at 8:18 PM Sergei Genchev  wrote:
>
>  Hi,
> I am getting a health warning about many more objects for PG than
> average. Seems to be common with RadosGW, where pools other than data
> contain very small number of objects.
>
> ceph@ola-s3-stg:/etc/ceph$ ceph health detail
> HEALTH_WARN 1 pools have many more objects per pg than average
> MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
> pool s3dev.rgw.buckets.data objects per pg (22675) is more than
> 84.2937 times cluster average (269)
>
> According to documentation, I should be able to quiet this warning by
> setting mon_pg_warn_max_object_skew.
> I did try to set  this settings to both 0 and 1 (docs are unclear
> about 0 value), but I cannot make this warning disappear.
>
> ceph@ola-s3-stg:/etc/ceph$ ceph tell mon.* injectargs
> '--mon_pg_warn_max_object_skew 0'
> mon.olaxsa-cephmon00: injectargs:
> mon.olaxsa-cephmon01: injectargs:
> mon.olaxsa-cephmon02: injectargs:
> ceph@ola-s3-stg:/etc/ceph$ ceph health
> HEALTH_WARN 1 pools have many more objects per pg than average
> ceph@ola-s3-stg:/etc/ceph$ ceph tell mon.* injectargs
> '--mon_pg_warn_max_object_skew 10'
> mon.olaxsa-cephmon00: injectargs:
> mon.olaxsa-cephmon01: injectargs:
> mon.olaxsa-cephmon02: injectargs:
> ceph@ola-s3-stg:/etc/ceph$ ceph health
> HEALTH_WARN 1 pools have many more objects per pg than average
>
> I have also tried to set  these values in cepf.conf and restart
> monitors, but warning is still there
> [global]
> .
> mon pg warn max object skew = 1
> mon pg warn max object skew = 0
>
> Where should I look next?
>
> Thank you!
>
> ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)


[ceph-users] Cannot quiet "pools have many more objects per pg than average" warning

2019-04-16 Thread Sergei Genchev
 Hi,
I am getting a health warning about many more objects per PG than
average. This seems to be common with RadosGW, where pools other than the
data pool contain a very small number of objects.

ceph@ola-s3-stg:/etc/ceph$ ceph health detail
HEALTH_WARN 1 pools have many more objects per pg than average
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
pool s3dev.rgw.buckets.data objects per pg (22675) is more than
84.2937 times cluster average (269)

According to documentation, I should be able to quiet this warning by
setting mon_pg_warn_max_object_skew.
I did try to set this setting to both 0 and 1 (the docs are unclear
about the 0 value), but I cannot make this warning disappear.

ceph@ola-s3-stg:/etc/ceph$ ceph tell mon.* injectargs
'--mon_pg_warn_max_object_skew 0'
mon.olaxsa-cephmon00: injectargs:
mon.olaxsa-cephmon01: injectargs:
mon.olaxsa-cephmon02: injectargs:
ceph@ola-s3-stg:/etc/ceph$ ceph health
HEALTH_WARN 1 pools have many more objects per pg than average
ceph@ola-s3-stg:/etc/ceph$ ceph tell mon.* injectargs
'--mon_pg_warn_max_object_skew 10'
mon.olaxsa-cephmon00: injectargs:
mon.olaxsa-cephmon01: injectargs:
mon.olaxsa-cephmon02: injectargs:
ceph@ola-s3-stg:/etc/ceph$ ceph health
HEALTH_WARN 1 pools have many more objects per pg than average

I have also tried to set these values in ceph.conf and restart the
monitors, but the warning is still there:
[global]
.
mon pg warn max object skew = 1
mon pg warn max object skew = 0

Where should I look next?

Thank you!

ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)


Re: [ceph-users] NFS-Ganesha CEPH_FSAL | potential locking issue

2019-04-16 Thread Jeff Layton
On Tue, Apr 16, 2019 at 10:36 AM David C  wrote:
>
> Hi All
>
> I have a single export of my cephfs using the ceph_fsal [1]. A CentOS 7 
> machine mounts a sub-directory of the export [2] and is using it for the home 
> directory of a user (e.g everything under ~ is on the server).
>
> This works fine until I start a long sequential write into the home directory 
> such as:
>
> dd if=/dev/zero of=~/deleteme bs=1M count=8096
>
> This saturates the 1GbE link on the client which is great but during the 
> transfer, apps that are accessing files in home start to lock up. Google 
> Chrome for example, which puts it's config in ~/.config/google-chrome/,  
> locks up during the transfer, e.g I can't move between tabs, as soon as the 
> transfer finishes, Chrome goes back to normal. Essentially the desktop 
> environment reacts as I'd expect if the server was to go away. I'm using the 
> MATE DE.
>
> However, if I mount a separate directory from the same export on the machine 
> [3] and do the same write into that directory, my desktop experience isn't 
> affected.
>
> I hope that makes some sense, it's a bit of a weird one to describe. This 
> feels like a locking issue to me, although I can't explain why a single write 
> into the root of a mount would affect access to other files under that same 
> mount.
>

It's not a single write. You're doing 8G worth of 1M I/Os. The server
then has to do all of those to the OSD backing store.

> [1] CephFS export:
>
> EXPORT
> {
> Export_ID=100;
> Protocols = 4;
> Transports = TCP;
> Path = /;
> Pseudo = /ceph/;
> Access_Type = RW;
> Attr_Expiration_Time = 0;
> Disable_ACL = FALSE;
> Manage_Gids = TRUE;
> Filesystem_Id = 100.1;
> FSAL {
> Name = CEPH;
> }
> }
>
> [2] Home directory mount:
>
> 10.10.10.226:/ceph/homes/username on /homes/username type nfs4 
> (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
>
> [3] Test directory mount:
>
> 10.10.10.226:/ceph/testing on /tmp/testing type nfs4 
> (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
>
> Versions:
>
> Luminous 12.2.10
> nfs-ganesha-2.7.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64
>
> Ceph.conf on nfs-ganesha server:
>
> [client]
> mon host = 10.10.10.210:6789, 10.10.10.211:6789, 10.10.10.212:6789
> client_oc_size = 8388608000
> client_acl_type=posix_acl
> client_quota = true
> client_quota_df = true
>

No magic bullets here, I'm afraid.

Sounds like ganesha is probably just too swamped with write requests
to do much else, but you'll probably want to do the legwork starting
with the hanging application, and figure out what it's doing that
takes so long. Is it some syscall? Which one?

From there you can start looking at statistics in the NFS client to
see what's going on there. Are certain RPCs taking longer than they
should? Which ones?

Once you know what's going on with the client, you can better tell
what's going on with the server.
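Something along these lines is usually enough to narrow it down (a sketch;
the PID and mount point are placeholders, and the tools come from
strace/nfs-utils):

# which syscall the hanging app is stuck in, and for how long
strace -f -T -p <pid-of-hanging-app>
# per-op RPC counts and latencies for that NFS mount
mountstats /homes/username
# or rolling per-mount stats while the dd runs
nfsiostat 1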
-- 
Jeff Layton 


Re: [ceph-users] Limiting osd process memory use in nautilus.

2019-04-16 Thread Patrick Hein
I am using osd_memory_target (
http://docs.ceph.com/docs/nautilus/rados/configuration/bluestore-config-ref/#automatic-cache-sizing)
for this purpose and it works flawlessly. It kept working after upgrading from
mimic to nautilus.
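Setting it is straightforward (a sketch; the value is in bytes and the
default is 4 GiB, so pick something that fits your RAM budget):

ceph config set osd osd_memory_target 2147483648
# or per host in ceph.conf under [osd]:
#   osd_memory_target = 2147483648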

Jonathan Proulx  wrote on Tue, 16 Apr 2019, 18:06:

> Hi All,
>
> I have a a few servers that are a bit undersized on RAM for number of
> osds they run.
>
> When we swithced to bluestore about 1yr ago I'd "fixed" this (well
> kept them from OOMing) by setting bluestore_cache_size_ssd and
> bluestore_cache_size_hdd, this worked.
>
> after upgrading to Nautilus the OSDs again are running away and OOMing
> out.
>
> I noticed osd_memory_target_cgroup_limit_ratio": "0.80" so tried
> setting 'MemoryHigh' and 'MemoryMax' in the unit file. But the osd
> process still happily runs right upto that line and lets the OS deal
> with it (and it deals harshly).
>
> currently I have:
>
> "bluestore_cache_size": "0",
> "bluestore_cache_size_hdd": "1073741824",
> "bluestore_cache_size_ssd": "1073741824",
>
> and
> MemoryHigh=2560M
> MemoryMax=3072M
>
> and processes keep running right upto that 3G line and getting smacked
> down which is causing performance issues as they thrash and I suspect
> some scrub issues I've seen recently.
>
> I guess my next traw to grab at is to set "bluestore_cache_size" but
> is there something I'm missing here?
>
> Thanks,
> -Jon


Re: [ceph-users] Limiting osd process memory use in nautilus.

2019-04-16 Thread Adam Tygart
As of 13.2.3, you should use 'osd_memory_target' instead of
'bluestore_cache_size'
--
Adam


On Tue, Apr 16, 2019 at 10:28 AM Jonathan Proulx  wrote:
>
> Hi All,
>
> I have a a few servers that are a bit undersized on RAM for number of
> osds they run.
>
> When we swithced to bluestore about 1yr ago I'd "fixed" this (well
> kept them from OOMing) by setting bluestore_cache_size_ssd and
> bluestore_cache_size_hdd, this worked.
>
> after upgrading to Nautilus the OSDs again are running away and OOMing
> out.
>
> I noticed osd_memory_target_cgroup_limit_ratio": "0.80" so tried
> setting 'MemoryHigh' and 'MemoryMax' in the unit file. But the osd
> process still happily runs right upto that line and lets the OS deal
> with it (and it deals harshly).
>
> currently I have:
>
> "bluestore_cache_size": "0",
> "bluestore_cache_size_hdd": "1073741824",
> "bluestore_cache_size_ssd": "1073741824",
>
> and
> MemoryHigh=2560M
> MemoryMax=3072M
>
> and processes keep running right upto that 3G line and getting smacked
> down which is causing performance issues as they thrash and I suspect
> some scrub issues I've seen recently.
>
> I guess my next traw to grab at is to set "bluestore_cache_size" but
> is there something I'm missing here?
>
> Thanks,
> -Jon


[ceph-users] Limiting osd process memory use in nautilus.

2019-04-16 Thread Jonathan Proulx
Hi All,

I have a few servers that are a bit undersized on RAM for the number of
OSDs they run.

When we switched to bluestore about 1 yr ago I'd "fixed" this (well,
kept them from OOMing) by setting bluestore_cache_size_ssd and
bluestore_cache_size_hdd; this worked.

After upgrading to Nautilus the OSDs are again running away and OOMing
out.

I noticed "osd_memory_target_cgroup_limit_ratio": "0.80", so I tried
setting 'MemoryHigh' and 'MemoryMax' in the unit file. But the osd
process still happily runs right up to that line and lets the OS deal
with it (and it deals harshly).

currently I have:

"bluestore_cache_size": "0",
"bluestore_cache_size_hdd": "1073741824",
"bluestore_cache_size_ssd": "1073741824",

and
MemoryHigh=2560M
MemoryMax=3072M

and processes keep running right up to that 3G line and getting smacked
down, which is causing performance issues as they thrash, and I suspect
it is behind some scrub issues I've seen recently.

I guess my next straw to grab at is to set "bluestore_cache_size", but
is there something I'm missing here?

Thanks,
-Jon


Re: [ceph-users] Multi-site replication speed

2019-04-16 Thread Casey Bodley

Hi Brian,

On 4/16/19 1:57 AM, Brian Topping wrote:
On Apr 15, 2019, at 5:18 PM, Brian Topping > wrote:


If I am correct, how do I trigger the full sync?


Apologies for the noise on this thread. I came to discover the 
`radosgw-admin [meta]data sync init` command. That got me to 
something that looked like this for several hours:



[root@master ~]# radosgw-admin  sync status
          realm 54bb8477-f221-429a-bbf0-76678c767b5f (example)
      zonegroup 8e33f5e9-02c8-4ab8-a0ab-c6a37c2bcf07 (us)
           zone b6e32bc8-f07e-4971-b825-299b5181a5f0 (secondary)
  metadata sync preparing for full sync
                full sync: 64/64 shards
                full sync: 0 entries to sync
                incremental sync: 0/64 shards
                metadata is behind on 64 shards
                behind shards: 
[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63]

      data sync source: 35835cb0-4639-43f4-81fd-624d40c7dd6f (master)
                        preparing for full sync
                        full sync: 1/128 shards
                        full sync: 0 buckets to sync
                        incremental sync: 127/128 shards
                        data is behind on 1 shards
                        behind shards: [0]


I also had the data sync showing a list of “behind shards”, but both 
of them sat in “preparing for full sync” for several hours, so I tried 
`radosgw-admin [meta]data sync run`. My sense is that was a bad idea, 
but neither of the commands seem to be documented and the thread I 
found them on indicated they wouldn’t damage the source data.


QUESTIONS at this point:

1) What is the best sequence of commands to properly start the sync? 
Does init just set things up and do nothing until a run is started?
The sync is always running. Each shard starts with full sync (where it 
lists everything on the remote, and replicates each), then switches to 
incremental sync (where it polls the replication logs for changes). The 
'metadata sync init' command clears the sync status, but this isn't 
synchronized with the metadata sync process running in radosgw(s) - so 
the gateways need to restart before they'll see the new status and 
restart the full sync. The same goes for 'data sync init'.
2) Are there commands I should run before that to clear out any 
previous bad runs?

Just restart gateways, and you should see progress via 'sync status'.
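So roughly, on the secondary zone (a sketch; the radosgw systemd unit name
depends on your deployment):

radosgw-admin metadata sync init
radosgw-admin data sync init
systemctl restart ceph-radosgw@rgw.<id>
radosgw-admin sync status     # watch the shards move from full to incremental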


*Thanks very kindly for any assistance. *As I didn’t really see any 
documentation outside of setting up the realms/zones/groups, it seems 
like this would be useful information for others that follow.


best, Brian



[ceph-users] NFS-Ganesha CEPH_FSAL | potential locking issue

2019-04-16 Thread David C
Hi All

I have a single export of my cephfs using the ceph_fsal [1]. A CentOS 7
machine mounts a sub-directory of the export [2] and is using it for the
home directory of a user (e.g everything under ~ is on the server).

This works fine until I start a long sequential write into the home
directory such as:

dd if=/dev/zero of=~/deleteme bs=1M count=8096

This saturates the 1GbE link on the client which is great but during the
transfer, apps that are accessing files in home start to lock up. Google
Chrome for example, which puts it's config in ~/.config/google-chrome/,
locks up during the transfer, e.g I can't move between tabs, as soon as the
transfer finishes, Chrome goes back to normal. Essentially the desktop
environment reacts as I'd expect if the server was to go away. I'm using
the MATE DE.

However, if I mount a separate directory from the same export on the
machine [3] and do the same write into that directory, my desktop
experience isn't affected.

I hope that makes some sense, it's a bit of a weird one to describe. This
feels like a locking issue to me, although I can't explain why a single
write into the root of a mount would affect access to other files under
that same mount.

[1] CephFS export:

EXPORT
{
Export_ID=100;
Protocols = 4;
Transports = TCP;
Path = /;
Pseudo = /ceph/;
Access_Type = RW;
Attr_Expiration_Time = 0;
Disable_ACL = FALSE;
Manage_Gids = TRUE;
Filesystem_Id = 100.1;
FSAL {
Name = CEPH;
}
}

[2] Home directory mount:

10.10.10.226:/ceph/homes/username on /homes/username type nfs4
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)

[3] Test directory mount:

10.10.10.226:/ceph/testing on /tmp/testing type nfs4
(rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)

Versions:

Luminous 12.2.10
nfs-ganesha-2.7.1-0.1.el7.x86_64
nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64

Ceph.conf on nfs-ganesha server:

[client]
mon host = 10.10.10.210:6789, 10.10.10.211:6789, 10.10.10.212:6789
client_oc_size = 8388608000
client_acl_type=posix_acl
client_quota = true
client_quota_df = true

Thanks,
David


Re: [ceph-users] HW failure cause client IO drops

2019-04-16 Thread Darius Kasparavičius
Hello,

Are you using a BBU backed raid controller? It sounds more like your
write cache is acting up if you are using one. Can you check what your
raid controller is showing? I have sometimes seen raid controllers
performing consistency checks or patrol read on single drive raid0.
You can disable that if it's running.
If it's lsi based controller you can use this "MegaCli64 -AdpPR -Dsbl
-aALL" for stopping patrol reads or "MegaCli64 -LDCC -Stop -lall
-aall" for stopping consistency check. You can also have a BBU learn
cycle active. Which discharges and charges the battery back up
disabling writeback cache. If it's running the cycle unfortunately,
but you will not be able to enable writeback cache. I recommend
enabling read cache and controller readahead. Use "MegaCli64
-LDSetProp -RA -Immediate -Lall -aAll" to enable read ahead and
"MegaCli64 -LDSetProp -Cached -Immediate -Lall -aAll" to enable cache
on I/O.

Now I wouldn't do this, but you can force writeback mode even with the
BBU off. YOU CAN AND YOU WILL LOSE ALL THE OSDS ON THE NODE IF
SOMETHING BAD HAPPENS. Use at your own risk and discretion: "MegaCli64
-LDSetProp -CachedBadBBU -Immediate -Lall -aAll" .

If these options didn't work. Respond and we will try to help you.

On Tue, Apr 16, 2019 at 3:27 PM M Ranga Swami Reddy
 wrote:
>
> Its Smart Storage battery, which was disabled due to high ambient temperature.
> All OSD processes/daemon working as is...but those OSDs not responding to 
> other OSD due to high CPU utilization..
> Don't observe the clock skew issue.
>
> On Tue, Apr 16, 2019 at 12:49 PM Marco Gaiarin  wrote:
>>
>> Mandi! M Ranga Swami Reddy
>>   In chel di` si favelave...
>>
>> > Hello - Recevenlt we had an issue with storage node's battery failure, 
>> > which
>> > cause ceph client IO dropped to '0' bytes. Means ceph cluster couldn't 
>> > perform
>> > IO operations on the cluster till the node takes out. This is not expected 
>> > from
>> > Ceph, as some HW fails, those respective OSDs should mark as out/down and 
>> > IO
>> > should go as is..
>> > Please let me know if anyone seen the similar behavior and is this issue
>> > resolved?
>>
>> 'battery' mean 'CMOS battery'?
>>
>>
>> OSDs and MONs need accurate clock sync between them. So, if a node
>> reboot with a clock skew more than (AFAI Remember well) 5 seconds, OSD
>> does not start.
>>
>> Provide a stable NTP server for all your OSDs and MONs, and restart
>> OSDs after clock are in sync.
>>
>> --
>> dott. Marco Gaiarin GNUPG Key ID: 
>> 240A3D66
>>   Associazione ``La Nostra Famiglia''  
>> http://www.lanostrafamiglia.it/
>>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento 
>> (PN)
>>   marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f 
>> +39-0434-842797
>>
>> Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
>>   http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
>> (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)


Re: [ceph-users] BlueStore bitmap allocator under Luminous and Mimic

2019-04-16 Thread Igor Fedotov


On 4/15/2019 4:17 PM, Wido den Hollander wrote:


On 4/15/19 2:55 PM, Igor Fedotov wrote:

Hi Wido,

the main driver for this backport was multiple complaints about write ops
latency increasing over time.

E.g. see thread named:  "ceph osd commit latency increase over time,
until restart" here.

Or http://tracker.ceph.com/issues/38738

Most symptoms showed Stupid Allocator as a root cause for that.

Hence we've got a decision to backport bitmap allocator which should
work a fix/workaround.


I see, that makes things clear. Anything users should take into account
when setting:

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Writing this here for archival purposes so that users who have the same
question can find it easily.


Nothing specific, but a somewhat different memory usage pattern: the stupid 
allocator has a more dynamic memory usage approach, while the bitmap allocator 
is absolutely static in this respect. So depending on the use case an OSD 
might require more or less RAM. E.g. on a fresh deployment the stupid 
allocator's memory requirements are most probably lower than the bitmap 
allocator's. But RAM usage for the bitmap allocator doesn't change as the 
OSD evolves, while that of the stupid allocator might grow unexpectedly high.


FWIW the resulting disk fragmentation might be different too. The same applies 
to their performance, but I'm not sure whether the latter is visible with the 
full Ceph stack.
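After restarting an OSD with these options you can confirm via the admin
socket what it is actually using (sketch):

ceph daemon osd.0 config get bluestore_allocator
ceph daemon osd.0 config get bluefs_allocator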





Wido


Thanks,

Igor


On 4/15/2019 3:39 PM, Wido den Hollander wrote:

Hi,

With the release of 12.2.12 the bitmap allocator for BlueStore is now
available under Mimic and Luminous.

[osd]
bluestore_allocator = bitmap
bluefs_allocator = bitmap

Before setting this in production: What might the implications be and
what should be thought of?

  From what I've read the bitmap allocator seems to be better in read
performance and uses less memory.

In Nautilus bitmap is the default, but L and M still default to stupid.

Since the bitmap allocator was backported there must be a use-case to
use the bitmap allocator instead of stupid.

Thanks!

Wido


Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-16 Thread Igor Fedotov
Besides the already mentioned store_test.cc, one can also use the ceph 
objectstore fio plugin 
(https://github.com/ceph/ceph/tree/master/src/test/fio) to access a 
standalone BlueStore instance from the FIO benchmarking tool.
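A job file for that engine looks roughly like this (a sketch from memory;
option names may differ between versions, so treat src/test/fio/ceph-bluestore.fio
in your tree as the authoritative example - the 'conf' option points at a
small ceph.conf that sets osd_objectstore = bluestore and an osd data path):

[global]
ioengine=libfio_ceph_objectstore.so   # must be resolvable via LD_LIBRARY_PATH
conf=ceph-bluestore.conf              # minimal ceph.conf describing the BlueStore instance
rw=randwrite
iodepth=16
bs=64k
size=256m

[bluestore-job]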



Thanks,

Igor

On 4/16/2019 7:58 AM, Can ZHANG wrote:

Hi,

I'd like to run a standalone Bluestore instance so as to test and tune 
its performance. Are there any tools about it, or any suggestions?




Best,
Can Zhang



Re: [ceph-users] HW failure cause client IO drops

2019-04-16 Thread Wido den Hollander


On 4/16/19 2:27 PM, M Ranga Swami Reddy wrote:
> Its Smart Storage battery, which was disabled due to high ambient
> temperature.
> All OSD processes/daemon working as is...but those OSDs not responding
> to other OSD due to high CPU utilization..
> Don't observe the clock skew issue.
> 

As the I/O was stalling on those OSDs you needed to wait for the OSDs to
commit suicide which can take up to 600 seconds.

The storage subsystem underneath the OSDs was causing problems and those
are very hard to work around.

If OSDs commit suicide too quickly you could get a snowball effect if
you have disks which are sluggish or overloaded.

Ceph can handle this just fine after these OSDs have committed suicide
or you stop them manually. But by nature Ceph is a synchronous system
and it will wait for all the OSDs to commit a write if it comes in.

The behavior in this case was fully expected and normal.

Wido

> On Tue, Apr 16, 2019 at 12:49 PM Marco Gaiarin  > wrote:
> 
> Mandi! M Ranga Swami Reddy
>   In chel di` si favelave...
> 
> > Hello - Recevenlt we had an issue with storage node's battery
> failure, which
> > cause ceph client IO dropped to '0' bytes. Means ceph cluster
> couldn't perform
> > IO operations on the cluster till the node takes out. This is not
> expected from
> > Ceph, as some HW fails, those respective OSDs should mark as
> out/down and IO
> > should go as is..
> > Please let me know if anyone seen the similar behavior and is this
> issue
> > resolved?
> 
> 'battery' mean 'CMOS battery'?
> 
> 
> OSDs and MONs need accurate clock sync between them. So, if a node
> reboot with a clock skew more than (AFAI Remember well) 5 seconds, OSD
> does not start.
> 
> Provide a stable NTP server for all your OSDs and MONs, and restart
> OSDs after clock are in sync.
> 
> -- 
> dott. Marco Gaiarin                                     GNUPG Key
> ID: 240A3D66
>   Associazione ``La Nostra Famiglia''         
> http://www.lanostrafamiglia.it/
>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al
> Tagliamento (PN)
>   marco.gaiarin(at)lanostrafamiglia.it  
>  t +39-0434-842711   f +39-0434-842797
> 
>                 Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
>       http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
>         (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)


Re: [ceph-users] HW failure cause client IO drops

2019-04-16 Thread M Ranga Swami Reddy
It's the Smart Storage battery, which was disabled due to high ambient
temperature.
All OSD processes/daemons are working as is, but those OSDs are not responding
to other OSDs due to high CPU utilization.
We don't observe the clock skew issue.

On Tue, Apr 16, 2019 at 12:49 PM Marco Gaiarin  wrote:

> Mandi! M Ranga Swami Reddy
>   In chel di` si favelave...
>
> > Hello - Recevenlt we had an issue with storage node's battery failure,
> which
> > cause ceph client IO dropped to '0' bytes. Means ceph cluster couldn't
> perform
> > IO operations on the cluster till the node takes out. This is not
> expected from
> > Ceph, as some HW fails, those respective OSDs should mark as out/down
> and IO
> > should go as is..
> > Please let me know if anyone seen the similar behavior and is this issue
> > resolved?
>
> 'battery' mean 'CMOS battery'?
>
>
> OSDs and MONs need accurate clock sync between them. So, if a node
> reboot with a clock skew more than (AFAI Remember well) 5 seconds, OSD
> does not start.
>
> Provide a stable NTP server for all your OSDs and MONs, and restart
> OSDs after clock are in sync.
>
> --
> dott. Marco Gaiarin GNUPG Key ID:
> 240A3D66
>   Associazione ``La Nostra Famiglia''
> http://www.lanostrafamiglia.it/
>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento
> (PN)
>   marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f
> +39-0434-842797
>
> Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
>   http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
> (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 17:05, Paul Emmerich  wrote:
>
> No, the problem is that a storage system should never tell a client
> that it has written data if it cannot guarantee that the data is still
> there if one device fails.
[...]

Ah, now I got your point.

Anyway, it should be the user's choice (with a warning probably, but
still). I can easily (but with a heavy heart) recall what happened twice,
and not too long ago, when someone decided "we know better what to do than
the users^W pilots do". Too many similar decisions were (and still are)
popping up in the IT industry too. Of course always "for good reasons" --
who'd have a doubt(?)...

Oh, and BTW -- it is not possible to change EC(2,1)'s 3/3 to 3/2 in
Luminous, is it?

-- 
End of message. Next message?


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Paul Emmerich
No, the problem is that a storage system should never tell a client
that it has written data if it cannot guarantee that the data is still
there if one device fails.

Scenario: one OSD is down for whatever reason and another one fails.
You've now lost all writes that happened while one OSD was down.
You never want this scenario if you care about your data.

I've worked on a lot of broken clusters and all cases of data loss
where related to running with min_size 1 or erasure coding with this
setting.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Apr 16, 2019 at 12:02 PM Igor Podlesny  wrote:
>
> On Tue, 16 Apr 2019 at 16:52, Paul Emmerich  wrote:
> > On Tue, Apr 16, 2019 at 11:50 AM Igor Podlesny  wrote:
> > > On Tue, 16 Apr 2019 at 14:46, Paul Emmerich  
> > > wrote:
> [...]
> > > Looked at it, didn't see any explanation of your point of view. If
> > > there're 2 active data instances
> > > (and 3rd is missing) how is it different to replicated pools with 3/2 
> > > config(?)
> >
> > each of these "copies" has only half the data
>
> Still not seeing how come.
>
> EC(2, 1) is conceptually RAID5 on 3 devices. You're basically saying
> that if one disk of those 3 disks is missing
> you can't safely write to 2 others that are still in. But CEPH's EC
> has no partial update issue, has it?
>
> Can you elaborate?
>
> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 16:52, Paul Emmerich  wrote:
> On Tue, Apr 16, 2019 at 11:50 AM Igor Podlesny  wrote:
> > On Tue, 16 Apr 2019 at 14:46, Paul Emmerich  wrote:
[...]
> > Looked at it, didn't see any explanation of your point of view. If
> > there're 2 active data instances
> > (and 3rd is missing) how is it different to replicated pools with 3/2 
> > config(?)
>
> each of these "copies" has only half the data

Still not seeing how that follows.

EC(2, 1) is conceptually RAID5 on 3 devices. You're basically saying
that if one of those 3 disks is missing you can't safely write to the
2 others that are still in. But Ceph's EC has no partial-update issue,
does it?

Can you elaborate?

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Paul Emmerich
On Tue, Apr 16, 2019 at 11:50 AM Igor Podlesny  wrote:
>
> On Tue, 16 Apr 2019 at 14:46, Paul Emmerich  wrote:
> > Sorry, I just realized I didn't answer your original question.
> [...]
>
> No problemo. -- I've figured out the answer to my own question earlier 
> anyways.
> And actually gave a hint today
>
>   http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034278.html
>
> based on those findings.
>
> > Regarding min_size: yes, you are right about a 2+1 pool being created
> > with min_size 2 by default in the latest Nautilus release.
> > This seems like a bug to me, I've opened a ticket here:
> > http://tracker.ceph.com/issues/39307
>
> Looked at it, didn't see any explanation of your point of view. If
> there're 2 active data instances
> (and 3rd is missing) how is it different to replicated pools with 3/2 
> config(?)

each of these "copies" has only half the data


Paul

>
> [... overquoting removed ...]
>
> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Igor Podlesny
On Tue, 16 Apr 2019 at 14:46, Paul Emmerich  wrote:
> Sorry, I just realized I didn't answer your original question.
[...]

No problemo -- I've figured out the answer to my own question earlier anyway,
and actually gave a hint today

  http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034278.html

based on those findings.

> Regarding min_size: yes, you are right about a 2+1 pool being created
> with min_size 2 by default in the latest Nautilus release.
> This seems like a bug to me, I've opened a ticket here:
> http://tracker.ceph.com/issues/39307

Looked at it, but didn't see any explanation of your point of view. If
there are 2 active data instances (and the 3rd is missing), how is it
different from replicated pools with a 3/2 config?

[... overquoting removed ...]

-- 
End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: HW failure cause client IO drops

2019-04-16 Thread M Ranga Swami Reddy
The OSD processes/daemons kept running as is, so ceph did not mark those OSDs
down or out.
But as the battery failed, the temperature went up, CPU utilization increased,
and OSD response times grew, so the other OSDs failed to get responses in
time, causing extremely slow or no IO...
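
For the record, a sketch of how such a half-alive OSD can be taken out of
service by hand while the node is being repaired (the OSD id is a
placeholder):

ceph osd out 12                # stop mapping data to it and start rebalancing
systemctl stop ceph-osd@12     # or stop the daemon so its peers mark it down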



On Tue, Apr 16, 2019 at 12:23 PM Eugen Block  wrote:

> Good morning,
>
> the OSDs are usually marked out after 10 minutes, that's when
> rebalancing starts. But the I/O should not drop during that time, this
> could be related to your pool configuration. If you have a replicated
> pool of size 3 and also set min_size to 3 the I/O would pause if a
> node or OSD fails. So more information about the cluster would help,
> can you share that?
>
> ceph osd tree
> ceph osd pool ls detail
>
> Were all pools affected or just specific pools?
>
> Regards,
> Eugen
>
>
> Zitat von M Ranga Swami Reddy :
>
> > Hello - Recently we had an issue with storage node's battery failure,
> > which cause ceph client IO dropped to '0' bytes. Means ceph cluster
> > couldn't perform IO operations on the cluster till the node takes out.
> This
> > is not expected from Ceph, as some HW fails, those respective OSDs should
> > mark as out/down and IO should go as is..
> >
> > Please let me know if anyone seen the similar behavior and is this issue
> > resolved?
> >
> > Thanks
> > Swami
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-16 Thread Brad Hubbard
On Tue, Apr 16, 2019 at 6:03 PM Paul Emmerich  wrote:
>
> This works, it just says that it *might* require a restart, but this
> particular option takes effect without a restart.

We've already looked at changing the wording once to make it more palatable.

http://tracker.ceph.com/issues/18424

>
> Implementation detail: this message shows up if there's no internal
> function to be called when this option changes, so it can't be sure if
> the change is actually doing anything because the option might be
> cached or only read on startup. But in this case this option is read
> in the relevant path every time and no notification is required. But
> the injectargs command can't know that.

Right on all counts. The functions are referred to as observers and
register to be notified if the value changes, hence "not observed."
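
For anyone following along, a sketch of how to read the value a daemon is
actually using right now (run on the host of that OSD; the default admin
socket location is assumed):

ceph daemon osd.1 config get osd_recovery_max_active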

>
> Paul
>
> On Mon, Apr 15, 2019 at 11:38 PM solarflow99  wrote:
> >
> > Then why doesn't this work?
> >
> > # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> > osd.0: osd_recovery_max_active = '4' (not observed, change may require 
> > restart)
> > osd.1: osd_recovery_max_active = '4' (not observed, change may require 
> > restart)
> > osd.2: osd_recovery_max_active = '4' (not observed, change may require 
> > restart)
> > osd.3: osd_recovery_max_active = '4' (not observed, change may require 
> > restart)
> > osd.4: osd_recovery_max_active = '4' (not observed, change may require 
> > restart)
> >
> > # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> > osd_recovery_max_active = 3
> >
> >
> >
> > On Wed, Apr 10, 2019 at 7:21 AM Eugen Block  wrote:
> >>
> >> > I always end up using "ceph --admin-daemon
> >> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to get 
> >> > what
> >> > is in effect now for a certain daemon.
> >> > Needs you to be on the host of the daemon of course.
> >>
> >> Me too, I just wanted to try what OP reported. And after trying that,
> >> I'll keep it that way. ;-)
> >>
> >>
> >> Zitat von Janne Johansson :
> >>
> >> > Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :
> >> >
> >> >> > If you don't specify which daemon to talk to, it tells you what the
> >> >> > defaults would be for a random daemon started just now using the same
> >> >> > config as you have in /etc/ceph/ceph.conf.
> >> >>
> >> >> I tried that, too, but the result is not correct:
> >> >>
> >> >> host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> >> >> osd_recovery_max_active = 3
> >> >>
> >> >
> >> > I always end up using "ceph --admin-daemon
> >> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to get 
> >> > what
> >> > is in effect now for a certain daemon.
> >> > Needs you to be on the host of the daemon of course.
> >> >
> >> > --
> >> > May the most significant bit of your life be positive.
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-16 Thread Paul Emmerich
Seems in line with what I'd expect for the hardware.

Your hardware seems to be way overspecced, you'd be fine with half the
RAM, half the CPU and way cheaper disks.
In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

I'd probably use only the 25G network for both the public and the cluster
network instead of running both. Splitting the network usually doesn't help.
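
If matching read numbers are wanted for the comparison, the write run can be
kept around and replayed (a sketch; the pool name is a placeholder, and
--no-cleanup on the write pass is what leaves objects for the read passes):

rados bench -p testpool 60 write -t 28 --no-cleanup
rados bench -p testpool 60 seq -t 28     # sequential reads of those objects
rados bench -p testpool 60 rand -t 28    # random reads
rados -p testpool cleanup                # remove the benchmark objects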


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Apr 8, 2019 at 2:16 PM Lars Täuber  wrote:
>
> Hi there,
>
> i'm new to ceph and just got my first cluster running.
> Now i'd like to know if the performance we get is expectable.
>
> Is there a website with benchmark results somewhere where i could have a look 
> to compare with our HW and our results?
>
> This are the results:
> rados bench single threaded:
> # rados bench 10 write --rbd-cache=false -t 1
>
> Object size:4194304
> Bandwidth (MB/sec): 53.7186
> Stddev Bandwidth:   3.86437
> Max bandwidth (MB/sec): 60
> Min bandwidth (MB/sec): 48
> Average IOPS:   13
> Stddev IOPS:0.966092
> Average Latency(s): 0.0744599
> Stddev Latency(s):  0.00911778
>
> nearly maxing out one (idle) client with 28 threads
> # rados bench 10 write --rbd-cache=false -t 28
>
> Bandwidth (MB/sec): 850.451
> Stddev Bandwidth:   40.6699
> Max bandwidth (MB/sec): 904
> Min bandwidth (MB/sec): 748
> Average IOPS:   212
> Stddev IOPS:10.1675
> Average Latency(s): 0.131309
> Stddev Latency(s):  0.0318489
>
> four concurrent benchmarks on four clients each with 24 threads:
> Bandwidth (MB/sec): 396 376 381 389
> Stddev Bandwidth:   30  25  22  22
> Max bandwidth (MB/sec): 440 420 416 428
> Min bandwidth (MB/sec): 352 348 344 364
> Average IOPS:   99  94  95  97
> Stddev IOPS:7.5 6.3 5.6 5.6
> Average Latency(s): 0.240.250.250.24
> Stddev Latency(s):  0.120.150.150.14
>
> summing up: write mode
> ~1500 MB/sec Bandwidth
> ~385 IOPS
> ~0.25s Latency
>
> rand mode:
> ~3500 MB/sec
> ~920 IOPS
> ~0.154s Latency
>
>
>
> Maybe someone could judge our numbers. I am actually very satisfied with the 
> values.
>
> The (mostly idle) cluster is build from these components:
> * 10GB frontend network, bonding two connections to mon-, mds- and osd-nodes
> ** no bonding to clients
> * 25GB backend network, bonding two connections to osd-nodes
>
>
> cluster:
> * 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
> * 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128MB RAM
> * 7x OSD-nodes, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
> ** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per chassis 
> for later growth)
> ** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM ~116 GiB per OSD 
> for DB and WAL
>
> erasure encoded pool: (made for CephFS)
> * plugin=clay k=5 m=2 d=6 crush-failure-domain=host
>
> Thanks and best regards
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] showing active config settings

2019-04-16 Thread Paul Emmerich
This works, it just says that it *might* require a restart, but this
particular option takes effect without a restart.

Implementation detail: this message shows up when there is no internal
function registered to be called when the option changes, so injectargs
can't be sure the change actually does anything; the option might be
cached or only read on startup. In this case, however, the option is read
in the relevant path every time and no notification is required -- the
injectargs command just can't know that.
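
As a side note, on releases with the centralized config database (Mimic and
newer) the change can also be made persistent; a sketch, with the read-back
syntax as found on Nautilus:

ceph config set osd osd_recovery_max_active 4
ceph config get osd.1 osd_recovery_max_active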

Paul

On Mon, Apr 15, 2019 at 11:38 PM solarflow99  wrote:
>
> Then why doesn't this work?
>
> # ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> osd.0: osd_recovery_max_active = '4' (not observed, change may require 
> restart)
> osd.1: osd_recovery_max_active = '4' (not observed, change may require 
> restart)
> osd.2: osd_recovery_max_active = '4' (not observed, change may require 
> restart)
> osd.3: osd_recovery_max_active = '4' (not observed, change may require 
> restart)
> osd.4: osd_recovery_max_active = '4' (not observed, change may require 
> restart)
>
> # ceph -n osd.1 --show-config | grep osd_recovery_max_active
> osd_recovery_max_active = 3
>
>
>
> On Wed, Apr 10, 2019 at 7:21 AM Eugen Block  wrote:
>>
>> > I always end up using "ceph --admin-daemon
>> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
>> > is in effect now for a certain daemon.
>> > Needs you to be on the host of the daemon of course.
>>
>> Me too, I just wanted to try what OP reported. And after trying that,
>> I'll keep it that way. ;-)
>>
>>
>> Zitat von Janne Johansson :
>>
>> > Den ons 10 apr. 2019 kl 13:37 skrev Eugen Block :
>> >
>> >> > If you don't specify which daemon to talk to, it tells you what the
>> >> > defaults would be for a random daemon started just now using the same
>> >> > config as you have in /etc/ceph/ceph.conf.
>> >>
>> >> I tried that, too, but the result is not correct:
>> >>
>> >> host1:~ # ceph -n osd.1 --show-config | grep osd_recovery_max_active
>> >> osd_recovery_max_active = 3
>> >>
>> >
>> > I always end up using "ceph --admin-daemon
>> > /var/run/ceph/name-of-socket-here.asok config show | grep ..." to get what
>> > is in effect now for a certain daemon.
>> > Needs you to be on the host of the daemon of course.
>> >
>> > --
>> > May the most significant bit of your life be positive.
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to run a standalone Bluestore instance?

2019-04-16 Thread Paul Emmerich
Depending on how low-level you are willing to go, a good place to
start would be the unit test for the various object store
implementations:
https://github.com/ceph/ceph/blob/master/src/test/objectstore/store_test.cc
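
A rough sketch of running those tests against a local build tree (the make
target and binary name below follow the usual cmake layout but are
assumptions, so double-check them in your tree):

cd ceph/build
make ceph_test_objectstore
./bin/ceph_test_objectstore --gtest_list_tests   # list the backends/cases
./bin/ceph_test_objectstore                      # run them all, incl. bluestore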


Paul


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Apr 16, 2019 at 6:58 AM Can ZHANG  wrote:
>
> Hi,
>
> I'd like to run a standalone Bluestore instance so as to test and tune its 
> performance. Are there any tools about it, or any suggestions?
>
>
>
> Best,
> Can Zhang
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does "ceph df" use "bogus" copies factor instead of (k, m) for erasure coded pool?

2019-04-16 Thread Paul Emmerich
Sorry, I just realized I didn't answer your original question.

ceph df does take erasure coding settings into account and shows the
correct free space.
However, it also takes the current data distribution into account, i.e.,
it reports the amount of data you can write until the first OSD is full,
assuming you don't do any rebalancing.
That's why you sometimes see lower-than-expected values there.


Regarding min_size: yes, you are right about a 2+1 pool being created
with min_size 2 by default in the latest Nautilus release.
This seems like a bug to me, I've opened a ticket here:
http://tracker.ceph.com/issues/39307
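
To see what a given pool actually ended up with (a sketch; the pool and
profile names are placeholders):

ceph osd pool ls detail | grep ecpool        # size, min_size, ec profile
ceph osd erasure-code-profile get myprofile  # k, m, plugin, failure domain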


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Apr 13, 2019 at 5:18 AM Igor Podlesny  wrote:
>
> And as to min_size choice -- since you've replied exactly to that part
> of mine message only.
>
> On Sat, 13 Apr 2019 at 06:54, Paul Emmerich  wrote:
> > On Fri, Apr 12, 2019 at 9:30 PM Igor Podlesny  wrote:
> > > For e. g., an EC pool with default profile (2, 1) has bogus "sizing"
> > > params (size=3, min_size=3).
>
> {{
> > > Min. size 3 is wrong as far as I know and it's been fixed in fresh
> > > releases (but not in Luminous).
> }}
>
> I didn't give any proof when writing this due being more focused on EC
> Pool usage calculation.
> Take a look at:
>
>   https://github.com/ceph/ceph/pull/8008
>
> As it can be seen formula for min_size became min_size = k + min(1, m
> - 1) effectively on March 2019.
> -- That's why I've said "fixed in fresh releases but not in Luminous".
>
> Let's see what does this new formula produce for k=2, m=1 (the default
> and documented EC profile):
>
> min_size = 2 + min(1, 1 - 1) = 2 + 0 = 2.
>
> Before that change it would be 3 instead, thus giving that 3/3 for EC (2, 1).
>
> [...]
> > min_size 3 is the default for that pool, yes. That means your data
> > will be unavailable if any OSD is offline.
> > Reducing min_size to 2 means you are accepting writes when you cannot
> > guarantee durability which will cause problems in the long run.
> > See older discussions about min_size here
>
> Would be glad doing so, but It's not a forum (here), but mail list
> instead, right(?) -- so the only way
> to "see here" is to rely on search engine that might have indexed mail
> list archive. If you have
> specific URL or at least exact keywords allowing to find what you're
> referring to, I'd gladly see
> what you're talking about.
>
> And of course I did search before writing and the fact I wrote it
> anyways means I didn't find
> anything answering my question "here or there".
>
> --
> End of message. Next message?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HW failure cause client IO drops

2019-04-16 Thread Marco Gaiarin
Mandi! M Ranga Swami Reddy
  In chel di` si favelave...

> Hello - Recently we had an issue with storage node's battery failure, which
> cause ceph client IO dropped to '0' bytes. Means ceph cluster couldn't perform
> IO operations on the cluster till the node takes out. This is not expected 
> from
> Ceph, as some HW fails, those respective OSDs should mark as out/down and IO
> should go as is..
> Please let me know if anyone seen the similar behavior and is this issue
> resolved?

Does 'battery' mean 'CMOS battery'?


OSDs and MONs need accurate clock sync between them. So, if a node
reboots with a clock skew of more than (if I remember correctly) 5 seconds,
the OSD does not start.

Provide a stable NTP server for all your OSDs and MONs, and restart the
OSDs after the clocks are in sync.
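
A quick way to confirm whether clock skew is the problem (a sketch; chrony
is assumed here, "ntpq -p" gives the same picture for ntpd):

ceph health detail | grep -i clock   # monitors report clock skew warnings here
chronyc tracking                     # local clock offset versus the NTP source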

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 'Missing' capacity

2019-04-16 Thread Mark Schouten


root@proxmox01:~# ceph osd df tree | sort -n -k8 | tail -1
  1   ssd  0.87000  1.0  889GiB  721GiB  168GiB 81.14 1.50  82  osd.1


root@proxmox01:~# ceph osd df tree | grep -c osd
68


68*168=11424

That is closer, thanks. I thought that the available figure was the same as the
cluster-wide available space. But apparently it is based on the available space
on the fullest OSD. Thanks, learned something again!
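
For a quick re-check later, the same estimate can be pulled out with something
like this (a sketch; the field numbers assume the "ceph osd df tree" column
layout shown above):

ceph osd df tree | grep 'osd\.' | sort -n -k8 | tail -1 | awk '{print $7}'
ceph osd df tree | grep -c 'osd\.'

and then multiply the two numbers by hand, as above.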
--

Mark Schouten 

Tuxis, Ede, https://www.tuxis.nl

T: +31 318 200208 
 



- Original message -


From: Sinan Polat (si...@turka.nl)
Date: 16-04-2019 06:43
To: Igor Podlesny (ceph-u...@poige.ru)
Cc: Mark Schouten (m...@tuxis.nl), Ceph Users (ceph-users@lists.ceph.com)
Subject: Re: [ceph-users] 'Missing' capacity


Probably an imbalance of data across your OSDs.

Could you show ceph osd df?

From there take the disk with the lowest available space. Multiply that number
by the number of OSDs. How much is it?

Kind regards,
Sinan Polat

> On 16 Apr 2019 at 05:21, Igor Podlesny  wrote the following:
>
>> On Tue, 16 Apr 2019 at 06:43, Mark Schouten  wrote:
>> [...]
>> So where is the rest of the free space? :X
>
> Makes sense to see:
>
> sudo ceph osd df tree
>
> --
> End of message. Next message?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: HW failure cause client IO drops

2019-04-16 Thread Eugen Block

Good morning,

the OSDs are usually marked out after 10 minutes; that's when
rebalancing starts. But the I/O should not drop during that time, so this
could be related to your pool configuration. If you have a replicated
pool of size 3 and also set min_size to 3, the I/O would pause if a
node or OSD fails. So more information about the cluster would help,
can you share that?


ceph osd tree
ceph osd pool ls detail

Were all pools affected or just specific pools?
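
For reference, the 10 minute mark-out default comes from
mon_osd_down_out_interval; a sketch for checking it on a mon host, assuming
the mon id is the short hostname:

ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval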

Regards,
Eugen


Zitat von M Ranga Swami Reddy :


Hello - Recently we had an issue with storage node's battery failure,
which cause ceph client IO dropped to '0' bytes. Means ceph cluster
couldn't perform IO operations on the cluster till the node takes out. This
is not expected from Ceph, as some HW fails, those respective OSDs should
mark as out/down and IO should go as is..

Please let me know if anyone seen the similar behavior and is this issue
resolved?

Thanks
Swami




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com