Re: [ceph-users] osd_recovery_max_chunk value

2018-02-05 Thread Karun Josy
 Hi Christian,

Thank you for your help.

Ceph version is 12.2.2. So is this value bad? Do you have any suggestions?


So to reduce the max chunk, I assume I can choose something like
7 << 20, i.e. 7340032?
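
For reference, the arithmetic and the way to apply it can be checked like this (a hedged sketch, not a claim that a smaller chunk will actually help):

--
echo $((8 << 20))   # 8388608, the current default
echo $((7 << 20))   # 7340032, the value being considered
# applied at runtime, same pattern as the other injectargs calls above
ceph tell osd.* injectargs '--osd_recovery_max_chunk 7340032'
--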

Karun Josy

On Tue, Feb 6, 2018 at 1:15 PM, Christian Balzer  wrote:

> On Tue, 6 Feb 2018 13:01:12 +0530 Karun Josy wrote:
>
> > Hello,
> >
> > We are seeing slow requests while recovery process going on.
> >
> > I am trying to slow down the recovery process. I set
> osd_recovery_max_active
> > and  osd_recovery_sleep as below :
> > --
> > ceph tell osd.* injectargs '--osd_recovery_max_active 1'
> > ceph tell osd.* injectargs '--osd_recovery_sleep .1'
> > --
> What version of Ceph? In some versions, "sleep" values will make things _worse_!
> Would be nice if that was documented in like, the documentation...
>
> >
> > But I am confused with the  osd_recovery_max_chunk. Currently, it shows
> > 8388608.
> >
> > # ceph daemon osd.4 config get osd_recovery_max_chunk
> > {
> > "osd_recovery_max_chunk": "8388608"
> >
> >
> > In ceph documentation, it shows
> >
> > ---
> > osd recovery max chunk
> > Description: The maximum size of a recovered chunk of data to push.
> > Type: 64-bit Unsigned Integer
> > Default: 8 << 20
> > 
> >
> > I am confused. Can anyone let me know what is the value that I have to
> give
> > to reduce this parameter ?
> >
> This is what you get when programmers write docs.
>
> The above is a left-shift operation, see for example:
> http://bit-calculator.com/bit-shift-calculator
>
> Now if shrinking that value is beneficial for reducing recovery load,
> that's for you to find out.
>
> Christian
>
> >
> >
> > Karun Josy
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Rakuten Communications
>


Re: [ceph-users] osd_recovery_max_chunk value

2018-02-05 Thread Christian Balzer
On Tue, 6 Feb 2018 13:01:12 +0530 Karun Josy wrote:

> Hello,
> 
> We are seeing slow requests while recovery process going on.
> 
> I am trying to slow down the recovery process. I set  osd_recovery_max_active
> and  osd_recovery_sleep as below :
> --
> ceph tell osd.* injectargs '--osd_recovery_max_active 1'
> ceph tell osd.* injectargs '--osd_recovery_sleep .1'
> --
What version of Ceph? In some versions, "sleep" values will make things _worse_!
Would be nice if that was documented in like, the documentation...

> 
> But I am confused with the  osd_recovery_max_chunk. Currently, it shows
> 8388608.
> 
> # ceph daemon osd.4 config get osd_recovery_max_chunk
> {
> "osd_recovery_max_chunk": "8388608"
> 
> 
> In ceph documentation, it shows
> 
> ---
> osd recovery max chunk
> Description: The maximum size of a recovered chunk of data to push.
> Type: 64-bit Unsigned Integer
> Default: 8 << 20
> 
> 
> I am confused. Can anyone let me know what is the value that I have to give
> to reduce this parameter ?
>
This is what you get when programmers write docs.

The above is a left-shift operation, see for example:
http://bit-calculator.com/bit-shift-calculator
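
Or check it in any shell (a hedged aside; left-shifting by 20 is the same as multiplying by 2^20, i.e. 1048576):

--
echo $((8 << 20))       # 8388608, the default
echo $((8 * 1048576))   # same value, written as a multiplication
--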

Now if shrinking that value is beneficial for reducing recovery load,
that's for you to find out.

Christian

> 
> 
> Karun Josy


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


[ceph-users] osd_recovery_max_chunk value

2018-02-05 Thread Karun Josy
Hello,

We are seeing slow requests while recovery process going on.

I am trying to slow down the recovery process. I set  osd_recovery_max_active
and  osd_recovery_sleep as below :
--
ceph tell osd.* injectargs '--osd_recovery_max_active 1'
ceph tell osd.* injectargs '--osd_recovery_sleep .1'
--

But I am confused with the  osd_recovery_max_chunk. Currently, it shows
8388608.

# ceph daemon osd.4 config get osd_recovery_max_chunk
{
"osd_recovery_max_chunk": "8388608"


In ceph documentation, it shows

---
osd recovery max chunk
Description: The maximum size of a recovered chunk of data to push.
Type: 64-bit Unsigned Integer
Default: 8 << 20


I am confused. Can anyone let me know what is the value that I have to give
to reduce this parameter ?



Karun Josy


Re: [ceph-users] Broken Buckets after Jewel->Luminous Upgrade

2018-02-05 Thread Robin H. Johnson
On Tue, Jan 30, 2018 at 10:32:04AM +0100, Ingo Reimann wrote:
> What could be the problem,and how may I solve that?
For anybody else tracking this, the logs & debugging info are filed at
http://tracker.ceph.com/issues/22928

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136




Re: [ceph-users] New Ceph-cluster and performance "questions"

2018-02-05 Thread Konstantin Shalygin

/offtopic



When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.


I have seen the P3700 available in Russia since December 2017 with real
quantities in stock, not just a price listed as "out of stock".


https://market.yandex.ru/catalog/55316/list?text=intel%20p3700=3=srch_ddl=7893318%3A453797=0=0=0=aprice



k



[ceph-users] MGR and RGW cannot start after logrotate

2018-02-05 Thread blackpiglet J.
Hi,

We have a 5-node Ceph cluster. Four of them are OSD servers; one runs the
monitor, manager and RGW. At first we used the default logrotate settings, so
all Ceph processes were restarted every day, but the RGW and manager went
down roughly once a week. To prevent this, we set logrotate to monthly.
After a month, when the logs rotated, the RGW and manager went down again. By
"went down" I mean the processes are still there, but they no longer listen on
the ports they should. Not much is logged about it.
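
For reference, a hedged sketch of how to confirm the listeners are gone and bounce the daemons (unit names are assumptions based on a standard systemd deployment, adjust to yours):

--
ss -tlnp | grep -E 'radosgw|ceph-mgr'                  # is anything still listening?
journalctl -u ceph-radosgw@rgw.$(hostname -s) -n 100   # recent RGW log, if the unit name matches
systemctl restart ceph-mgr.target ceph-radosgw.target  # bring the listeners back
--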

Have you guys met something similar before?

BR,
Bruce J.


Re: [ceph-users] Latency for the Public Network

2018-02-05 Thread Christian Balzer

Hello,

On Mon, 5 Feb 2018 22:04:00 +0100 Tobias Kropf wrote:

> Hi ceph list,
> 
> we have a hyperconverged ceph cluster with kvm on 8 nodes running ceph
> hammer 0.94.10. 
Do I smell Proxmox?

> The cluster is now 3 years old and we are planning a new
> cluster for a high-IOPS project. We use replicated pools 3/2 and do
> not have the best latency on our switch backend.
> 
> 
> ping -s 8192 10.10.10.40 
> 
> 8200 bytes from 10.10.10.40: icmp_seq=1 ttl=64 time=0.153 ms
> 
Not particularly great, yes.
However, your network latency is only one factor; the Ceph OSDs add quite
another layer on top and usually affect IOPS even more.
For high IOPS you of course need fast storage, network AND CPUs.
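
If you want to compare latency at a smaller, more Ceph-like payload, or see what the OSDs themselves add, something like this is a reasonable starting point (a hedged sketch; adjust address and counts):

--
ping -c 100 -i 0.2 -s 4096 10.10.10.40   # RTT at a 4 KB payload
ceph osd perf                            # per-OSD commit/apply latency
--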

> 
> We plan to split the hyperconverged setup into storage and compute nodes
> and want to split the ceph cluster and public networks: a cluster network with
> 40 Gbit Mellanox switches and a public network with the existing 10 Gbit
> switches.
> 
You'd do a lot better if you were to go all 40Gb/s and forget about
splitting networks. 

The faster replication network will:
a) be underutilized all of the time in terms of bandwidth 
b) not help with read IOPS at all
c) still be hobbled by the public network latency when it comes to write
IOPS (but of course help in regards to replication latency). 

> Now my question... are 0.153 ms - 0.170 ms fast enough for the public
> network? We must deploy a setup with 1500 - 2000 terminal servers.
>
Define terminal server, are we talking Windows Virtual Desktops with RDP?
Windows is quite the hog when it comes to I/O.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] radosgw not listening after installation

2018-02-05 Thread Jean-Charles Lopez
Hi

see inline

JC

> On Feb 5, 2018, at 18:14, Piers Haken  wrote:
> 
> Thanks, JC,
>  
> You’re right I didn’t deploy any OSDs at that point. I didn’t think that 
> would be a problem since the last `ceph-deploy` command completed without 
> error and its log ended with:
>  
> The Ceph Object Gateway (RGW) is now running on host storage-test01 and 
> default port 7480
>  
> Maybe that’s a bug?
What version is this (ceph -v and ceph-deploy version)?
>  
>  
> Anyway, I purged the cluster, rebuilt it with some OSDs, but I still don’t 
> see radosgw listening on port 7480.
>  
> Here’s my ceph.conf (it’s just the default, I haven’t touched it):
>  
> [global]
> fsid = 849f7b15-1e31-450b-b17c-6599fb6ff94d
> mon_initial_members = storage-test01
> mon_host = 10.0.4.127
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
Maybe some minimum pieces are missing from your config file, assuming the 
output above is complete. It’s been a long time since I last deployed with 
ceph-deploy. Just to be sure:
[client.rgw.storage-test01]
rgw_frontends = "civetweb port=7480"
>  
> here’s my netstat:
>  
> # netstat -alp | grep rados
>  
> tcp0  0 10.0.4.127:5750410.0.4.127:6816 
> ESTABLISHED 19833/radosgw
> tcp0  0 10.0.4.127:4346210.0.4.127:6832 
> ESTABLISHED 19833/radosgw
> tcp0  0 10.0.4.127:4984810.0.4.127:6789 
> ESTABLISHED 19833/radosgw
Here you can see that the RGW is connected to the MON (6789), not listening on 
6789. The second column is the remote address; the first one is the local address.
> unix  2  [ ACC ] STREAM LISTENING 262758   19833/radosgw  
>   /var/run/ceph/ceph-client.rgw.storage-test01.asok
> unix  3  [ ] STREAM CONNECTED 264433   19833/radosgw
>  
> ps:
>  
> 20243 ?Ssl0:00 /usr/bin/radosgw -f --cluster ceph --name 
> client.rgw.storage-test01 --setuser ceph --setgroup ceph
>  
>  
> Sometimes there’s an intermittent  “Initialization timeout, failed to 
> initialize” in the rgw log, but it doesn’t occur when I restart the services.
I suspect that the RGW doesn’t fully initialize because it can’t create the 
necessary pools. The only way you can trace that is by bumping up the logging 
with debug_rgw = 20 in the configuration file; from there you should be able to 
see where it sits and what it did.
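
A hedged sketch of those two suggestions combined on the RGW host (section and unit names assume the rgw.storage-test01 instance shown in this thread; the log path is a guess based on the asok name above):

--
# ceph.conf additions
[client.rgw.storage-test01]
rgw_frontends = "civetweb port=7480"
debug_rgw = 20

# then restart and watch the log
systemctl restart ceph-radosgw@rgw.storage-test01
tail -f /var/log/ceph/ceph-client.rgw.storage-test01.log
--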
>  
> Let me know if I can send anything else, I’d really like to get this up and 
> running!
> 
> Thanks
> Piers.
>  
>  
> From: Jean-Charles Lopez [mailto:jelo...@redhat.com] 
> Sent: Monday, February 05, 2018 5:23 PM
> To: Piers Haken 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] radosgw not listening after installation
>  
> Hi,
>  
> first of all just in case, it looks like your script does not deploy any OSDs 
> as you go straight from MON to RGW.
>  
> then, RGW does listen by default on 7480 and what you see on 6789 is the MON 
> listening.
>  
> Investigation:
> - Make sure your ceph-radosgw process is running first.
> - If not running, have a look at the log to see why it may have failed.
> - Paste some more information in this mailing list so we can help you find 
> the problem (e.g. output of ceph-deploy, log of your RGW, ...)
>  
> My bet is that given that you haven’t deployed any OSDs the RGW can’t create 
> the pools it needs to store data. Maybe not, but just guessing from what you 
> showed us.
>  
> Regards
> JC
>  
> On Feb 5, 2018, at 16:51, Piers Haken  > wrote:
>  
> I'm trying to set up radosgw on a brand new cluster, but I'm running into an 
> issue where it's not listening on the default port (7480)
>  
> here's my install script:
>  
>ceph-deploy new $NODE
>ceph-deploy install --release luminous $NODE
>ceph-deploy install --release luminous --rgw $NODE
>ceph-deploy mon create-initial
>ceph-deploy admin $NODE
>ceph-deploy rgw create $NODE
>  
> this is on debian 9.3 (stretch) on a clean machine.
>  
> the /usr/bin/radosgw process is running, and it's listening on port 6789 
> (this is not an HTTP server, but some internal binary protocol), but the 
> docs say it should be listening for HTTP requests on port 7480.
>  
> what am i missing here?


Re: [ceph-users] radosgw not listening after installation

2018-02-05 Thread Piers Haken
Thanks, JC,

You’re right I didn’t deploy any OSDs at that point. I didn’t think that would 
be a problem since the last `ceph-deploy` command completed without error and 
its log ended with:

The Ceph Object Gateway (RGW) is now running on host storage-test01 and default 
port 7480

Maybe that’s a bug?


Anyway, I purged the cluster, rebuilt it with some OSDs, but I still don’t see 
radosgw listening on port 7480.

Here’s my ceph.conf (it’s just the default, I haven’t touched it):

[global]
fsid = 849f7b15-1e31-450b-b17c-6599fb6ff94d
mon_initial_members = storage-test01
mon_host = 10.0.4.127
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx

here’s my netstat:

# netstat -alp | grep rados

tcp0  0 10.0.4.127:5750410.0.4.127:6816 ESTABLISHED 
19833/radosgw
tcp0  0 10.0.4.127:4346210.0.4.127:6832 ESTABLISHED 
19833/radosgw
tcp0  0 10.0.4.127:4984810.0.4.127:6789 ESTABLISHED 
19833/radosgw
unix  2  [ ACC ] STREAM LISTENING 262758   19833/radosgw
/var/run/ceph/ceph-client.rgw.storage-test01.asok
unix  3  [ ] STREAM CONNECTED 264433   19833/radosgw

ps:

20243 ?Ssl0:00 /usr/bin/radosgw -f --cluster ceph --name 
client.rgw.storage-test01 --setuser ceph --setgroup ceph


Sometimes there’s an intermittent  “Initialization timeout, failed to 
initialize” in the rgw log, but it doesn’t occur when I restart the services.

Let me know if I can send anything else, I’d really like to get this up and 
running!

Thanks
Piers.


From: Jean-Charles Lopez [mailto:jelo...@redhat.com]
Sent: Monday, February 05, 2018 5:23 PM
To: Piers Haken 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] radosgw not listening after installation

Hi,

first of all just in case, it looks like your script does not deploy any OSDs 
as you go straight from MON to RGW.

then, RGW does listen by default on 7480 and what you see on 6789 is the MON 
listening.

Investigation:
- Make sure your ceph-radosgw process is running first.
- If not running, have a look at the log to see why it may have failed.
- Paste some more information in this mailing list so we can help you find the 
problem (e.g. output of ceph-deploy, log of your RGW, ...)

My bet is that given that you haven’t deployed any OSDs the RGW can’t create 
the pools it needs to store data. Maybe not, but just guessing from what you 
showed us.

Regards
JC

On Feb 5, 2018, at 16:51, Piers Haken 
> wrote:

I'm trying to set up radosgw on a brand new cluster, but I'm running into an 
issue where it's not listening on the default port (7480)

here's my install script:

   ceph-deploy new $NODE
   ceph-deploy install --release luminous $NODE
   ceph-deploy install --release luminous --rgw $NODE
   ceph-deploy mon create-initial
   ceph-deploy admin $NODE
   ceph-deploy rgw create $NODE

this is on debian 9.3 (stretch) on a clean machine.

the /usr/bin/radosgw process is running, and it's listening on port 6789 (this 
is not an HTTP server, but some internal binary protocol), but the docs say it 
should be listening for HTTP requests on port 7480.

what am i missing here?


Re: [ceph-users] New Ceph-cluster and performance "questions"

2018-02-05 Thread Christian Balzer

Hello,

> I'm not a "storage-guy" so please excuse me if I'm missing /
> overlooking something obvious. 
> 
> My question is in the area "what kind of performance am I to expect
> with this setup". We have bought servers, disks and networking for our
> future ceph-cluster and are now in our "testing-phase" and I simply
> want to understand if our numbers line up, or if we are missing
> something obvious. 
> 
A myriad of variables will make for a myriad of results, expected and
otherwise.

For example, you say nothing about the Ceph version, how the OSDs are
created (filestore, bluestore, details), OS and kernel (PTI!!) version.

> Background, 
> - cephmon1, DELL R730, 1 x E5-2643, 64 GB 
> - cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
Unless you're planning on having 16 SSDs per node, a CPU with fewer but
faster cores would be better (see archives). 

In general, you will want to run atop or something similar on your ceph
and client nodes during these tests to see where and if any resources
(CPU, DISK, NET) are getting stressed.

> - each server is connected to a dedicated 50 Gbe network, with
> Mellanox-4 Lx cards (teamed into one interface, team0).  
> 
> In our test we only have one monitor. This will of course not be the
> case later on. 
> 
> Each OSD, has the following SSD's configured as pass-through (not raid
> 0 through the raid-controller),
> 
> - 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
> can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
> - 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SS
> D-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
When and where did you get those?
I wonder if they're available again, had 0 luck getting any last year.

> - 3 HDD's, which is uninteresting here. At the moment I'm only
> interested in the performance of the SSD-pool.
> 
> Ceph-cluster is created with ceph-ansible with "default params" (ie.
> have not added / changed anything except the necessary). 
> 
> When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD). 
> The min_size is 3 on the pool. 
Any reason for that?
It will make any OSD failure result in a cluster lockup with a size of 3.
Unless you did set your size to 4, in which case you wrecked performance.

> Rules are created as follows, 
> 
> $ > ceph osd crush rule create-replicated ssd-rule default host ssd
> $ > ceph osd crush rule create-replicated hdd-rule default host hdd
> 
> Testing is done on a separate node (same nic and network though), 
> 
> $ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule
> 
> $ > ceph osd pool application enable ssd-bench rbd
> 
> $ > rbd create ssd-image --size 1T --pool ssd-pool
> 
> $ > rbd map ssd-image --pool ssd-bench
> 
> $ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image
> 
> $ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench
> 
Unless you're planning on using the Ceph cluster in this fashion (kernel
mounted images), you'd be better off testing in an environment that
matches the use case, i.e. from a VM.
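
If a VM isn't handy, fio can also drive the image through librbd directly, which at least takes the kernel client out of the picture (a hedged sketch; pool/image names are the ones from your test, and the rbd ioengine must be compiled into your fio build):

--
fio --ioengine=rbd --clientname=admin --pool=ssd-bench --rbdname=ssd-image \
    --direct=1 --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 \
    --time_based --runtime=120 --name=rbd-bench
--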

> Fio is then run like this, 
> $ > 
> actions="read randread write randwrite"
> blocksizes="4k 128k 8m"
> tmp_dir="/tmp/"
> 
> for blocksize in ${blocksizes}; do
>   for action in ${actions}; do
> rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
> fio --directory=/ssd-bench \
> --time_based \ 
> --direct=1 \
> --rw=${action} \
> --bs=$blocksize \
> --size=1G \
> --numjobs=100 \
> --runtime=120 \
> --group_reporting \
> --name=testfile \
> --output=${tmp_dir}${action}_${blocksize}_${suffix}
>   done
> done
> 
> After running this, we end up with these numbers 
> 
> read_4k iops : 159266 throughput : 622MB / sec
> randread_4k iops : 151887 throughput : 593MB / sec
> 
These are very nice numbers. 
Too nice, in my book.
I have a test cluster with a cache-tier based on 2 nodes with 3 DC S3610s
400GB each, obviously with size 2 and min_size=1. So just based on that,
it will be faster than a size 3 pool, Jewel with Filestore.
Network is IPoIB (40Gb), so in that aspect similar to yours, 
64k MTU though.
Ceph nodes have E5-2620 v3 @ 2.40GHz CPUs and 32GB RAM.
I've run the following fio (with different rw actions of course) from a
KVM/qemu VM and am also showing how busy the SSDs, OSD processes, qemu
process on the comp node and the fio inside the VM are:
"fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
--rw=read --name=fiojob --blocksize=4K --iodepth=64"

READ
  read : io=4096.0MB, bw=81361KB/s, iops=20340, runt= 51552msec
SSDs: 0% (pagecache on the nodes), OSDs: 45%, qemu: 330%, fio_in_VM: 19%

RANDREAD
  read : io=4096.0MB, bw=62760KB/s, iops=15689, runt= 66831msec
SSDs: 0% (pagecache on the nodes), OSDs: 50%, qemu: 550%!!, fio_in_VM: 23%

WRITE
  write: io=4096.0MB, bw=256972KB/s, iops=64243, runt= 16322msec
SSDs: 40%, OSDs: 20%, qemu: 150%, fio_in_VM: 45%

RANDWRITE
  write: io=4096.0MB, bw=43981KB/s, iops=10995, 

Re: [ceph-users] radosgw not listening after installation

2018-02-05 Thread Jean-Charles Lopez
Hi,

first of all just in case, it looks like your script does not deploy any OSDs 
as you go straight from MON to RGW.

then, RGW does listen by default on 7480 and what you see on 6789 is the MON 
listening.

Investigation:
- Make sure your ceph-radosgw process is running first.
- If not running, have a look at the log to see why it may have failed.
- Paste some more information in this mailing list so we can help you find the 
problem (e.g. output of ceph-deploy, log of your RGW, ...)

My bet is that given that you haven’t deployed any OSDs the RGW can’t create 
the pools it needs to store data. Maybe not, but just guessing from what you 
showed us.

Regards
JC

> On Feb 5, 2018, at 16:51, Piers Haken  wrote:
> 
> I'm trying to set up radosgw on a brand new cluster, but I'm running into an 
> issue where it's not listening on the default port (7480)
> 
> here's my install script:
> 
>ceph-deploy new $NODE
>ceph-deploy install --release luminous $NODE
>ceph-deploy install --release luminous --rgw $NODE
>ceph-deploy mon create-initial
>ceph-deploy admin $NODE
>ceph-deploy rgw create $NODE
> 
> this is on debian 9.3 (stretch) on a clean machine.
> 
> the /usr/bin/radosgw process is running, and it's listening on port 6789 
> (this is not an HTTP server, but some internal binary protocol), but the 
> docs say it should be listening for HTTP requests on port 7480.
> 
> what am i missing here?


[ceph-users] radosgw not listening after installation

2018-02-05 Thread Piers Haken
I'm trying to set up radosgw on a brand new cluster, but I'm running into an 
issue where it's not listening on the default port (7480)

here's my install script:

   ceph-deploy new $NODE
   ceph-deploy install --release luminous $NODE
   ceph-deploy install --release luminous --rgw $NODE
   ceph-deploy mon create-initial
   ceph-deploy admin $NODE
   ceph-deploy rgw create $NODE

this is on debian 9.3 (stretch) on a clean machine.

the /usr/bin/radosgw process is running, and it's listening on port 6789 (this 
is not an HTTP server, but some internal binary protocol), but the docs say it 
should be listening for HTTP requests on port 7480.

what am i missing here?


[ceph-users] Retrieving ceph health from restful manager plugin

2018-02-05 Thread Hans van den Bogert
Hi All,

I might really be bad at searching, but I can't seem to find the ceph
health status through the new(ish) restful api. Is that right? I know
how I could retrieve it through a Python script, however I'm trying to
keep our monitoring application as layer cake free as possible -- as
such a restful API call would be preferred.
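
(Until the module exposes health directly, a stopgap that still avoids extra layers is to let the CLI emit JSON and parse that; a hedged sketch, not a restful-module call:)

--
ceph health detail --format json
ceph status --format json-pretty
--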

Regards,

Hans


[ceph-users] Latency for the Public Network

2018-02-05 Thread Tobias Kropf
Hi ceph list,

we have a hyperconverged ceph cluster with kvm on 8 nodes running ceph
hammer 0.94.10. The cluster is now 3 years old and we are planning a new
cluster for a high-IOPS project. We use replicated pools 3/2 and do not
have the best latency on our switch backend.


ping -s 8192 10.10.10.40

8200 bytes from 10.10.10.40: icmp_seq=1 ttl=64 time=0.153 ms


We plan to split the hyperconverged setup into storage and compute nodes
and want to split the ceph cluster and public networks: a cluster network
with 40 Gbit Mellanox switches and a public network with the existing
10 Gbit switches.

Now my question... are 0.153 ms - 0.170 ms fast enough for the public
network? We must deploy a setup with 1500 - 2000 terminal servers.


Does anyone have experience with a lot of terminal servers on a Ceph backend?


Thanks for the replies...


-- 
Tobias Kropf

 

Technik

 

 

--


inett GmbH » Ihr IT Systemhaus in Saarbrücken

Mainzerstrasse 183
66121 Saarbrücken
Geschäftsführer: Marco Gabriel
Handelsregister Saarbrücken
HRB 16588


Telefon: 0681 / 41 09 93 – 0
Telefax: 0681 / 41 09 93 – 99
E-Mail: i...@inett.de
Web: www.inett.de

Cyberoam Gold Partner - Zarafa Gold Partner - Proxmox Authorized Reseller - 
Proxmox Training Center - SEP sesam Certified Partner – Open-E Partner - Endian 
Certified Partner - Kaspersky Silver Partner – ESET Silver Partner - Mitglied 
im iTeam Systemhausverbund für den Mittelstand 



Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-05 Thread Wido den Hollander



On 02/05/2018 04:54 PM, Wes Dillingham wrote:
Good data point on not trimming when non active+clean PGs are present. 
So am I reading this correctly? It grew to 32GB? Did it end up growing 
beyond that, and what was the max? Also, is only ~18 PGs per OSD a reasonable
amount of PGs per OSD? I would think about quadruple that would be 
ideal. Is this an artifact of a steadily growing cluster or a design choice?




The backfills are still busy and the MONs are at 39GB right now. Still 
have plenty of space left.
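
For anyone watching their own MON store during a long backfill, a hedged sketch of two useful checks (the path assumes a default mon data dir and mon id; compaction is only worthwhile once the cluster is HEALTH_OK again):

--
du -sh /var/lib/ceph/mon/ceph-$(hostname -s)/store.db   # current store size
ceph tell mon.$(hostname -s) compact                    # manual compaction after HEALTH_OK
--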


Regarding the PGs it's a long story, but two sided.

1. This is an archive running on Atom 8-core CPUs to keep power 
consumption low, so we went low on amount of PGs
2. The system is still growing and after adding OSDs recently we didn't 
increase the amount of PGs yet


On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander > wrote:


Hi,

I just wanted to inform people about the fact that Monitor databases
can grow quite big when you have a large cluster which is performing
a very long rebalance.

I'm posting this on ceph-users and ceph-large as it applies to both,
but you'll see this sooner on a cluster with a lof of OSDs.

Some information:

- Version: Luminous 12.2.2
- Number of OSDs: 2175
- Data used: ~2PB

We are in the middle of migrating from FileStore to BlueStore and
this is causing a lot of PGs to backfill at the moment:

              33488 active+clean
              4802  active+undersized+degraded+remapped+backfill_wait
              1670  active+remapped+backfill_wait
              263   active+undersized+degraded+remapped+backfilling
              250   active+recovery_wait+degraded
              54    active+recovery_wait+degraded+remapped
              27    active+remapped+backfilling
              13    active+recovery_wait+undersized+degraded+remapped
              2     active+recovering+degraded

This has been running for a few days now and it has caused this warning:

MON_DISK_BIG mons
srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are
using a lot of disk space
     mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
     mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
     mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
     mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
     mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)

This is to be expected as MONs do not trim their store if one or
more PGs is not active+clean.

In this case we expected this and the MONs are each running on a 1TB
Intel DC-series SSD to make sure we do not run out of space before
the backfill finishes.

The cluster is spread out over racks and in CRUSH we replicate over
racks. Rack by rack we are wiping/destroying the OSDs and bringing
them back as BlueStore OSDs and letting the backfill handle everything.

In between we wait for the cluster to become HEALTH_OK (all PGs
active+clean) so that the Monitors can trim their database before we
start with the next rack.

I just want to warn and inform people about this. Under normal
circumstances a MON database isn't that big, but if you have a very
long period of backfills/recoveries and also have a large number of
OSDs you'll see the DB grow quite big.

This has improved significantly going to Jewel and Luminous, but it
is still something to watch out for.

Make sure your MONs have enough free space to handle this!

Wido








--
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu 
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204



[ceph-users] New Ceph-cluster and performance "questions"

2018-02-05 Thread Patrik Martinsson
Hello, 

I'm not a "storage-guy" so please excuse me if I'm missing /
overlooking something obvious. 

My question is in the area "what kind of performance am I to expect
with this setup". We have bought servers, disks and networking for our
future ceph-cluster and are now in our "testing-phase" and I simply
want to understand if our numbers line up, or if we are missing
something obvious. 

Background, 
- cephmon1, DELL R730, 1 x E5-2643, 64 GB 
- cephosd1-6, DELL R730, 1 x E5-2697, 64 GB
- each server is connected to a dedicated 50 Gbe network, with
Mellanox-4 Lx cards (teamed into one interface, team0).  

In our test we only have one monitor. This will of course not be the
case later on. 

Each OSD, has the following SSD's configured as pass-through (not raid
0 through the raid-controller),

- 2 x Dell 1.6TB 2.5" SATA MLC MU 6Gbs SSD (THNSF81D60CSE), only spec I
can find on Dell's homepage says "Data Transfer Rate 600 Mbps"
- 4 x Intel SSD DC S3700 (https://ark.intel.com/products/71916/Intel-SS
D-DC-S3700-Series-800GB-2_5in-SATA-6Gbs-25nm-MLC)
- 3 HDD's, which is uninteresting here. At the moment I'm only
interested in the performance of the SSD-pool.

Ceph-cluster is created with ceph-ansible with "default params" (ie.
have not added / changed anything except the necessary). 

When ceph-cluster is up, we have 54 OSD's (36 SSD, 18HDD). 
The min_size is 3 on the pool.

Rules are created as follows, 

$ > ceph osd crush rule create-replicated ssd-rule default host ssd
$ > ceph osd crush rule create-replicated hdd-rule default host hdd

Testing is done on a separate node (same nic and network though), 

$ > ceph osd pool create ssd-bench 512 512 replicated ssd-rule

$ > ceph osd pool application enable ssd-bench rbd

$ > rbd create ssd-image --size 1T --pool ssd-pool

$ > rbd map ssd-image --pool ssd-bench

$ > mkfs.xfs /dev/rbd/ssd-bench/ssd-image

$ > mount /dev/rbd/ssd-bench/ssd-image /ssd-bench

Fio is then run like this, 
$ > 
actions="read randread write randwrite"
blocksizes="4k 128k 8m"
tmp_dir="/tmp/"

for blocksize in ${blocksizes}; do
  for action in ${actions}; do
rm -f ${tmp_dir}${action}_${blocksize}_${suffix}
fio --directory=/ssd-bench \
--time_based \ 
--direct=1 \
--rw=${action} \
--bs=$blocksize \
--size=1G \
--numjobs=100 \
--runtime=120 \
--group_reporting \
--name=testfile \
--output=${tmp_dir}${action}_${blocksize}_${suffix}
  done
done

After running this, we end up with these numbers 

read_4k         iops : 159266   throughput : 622    MB/sec
randread_4k     iops : 151887   throughput : 593    MB/sec

read_128k       iops : 31705    throughput : 3963.3 MB/sec
randread_128k   iops : 31664    throughput : 3958.5 MB/sec

read_8m         iops : 470      throughput : 3765.5 MB/sec
randread_8m     iops : 463      throughput : 3705.4 MB/sec

write_4k        iops : 50486    throughput : 197    MB/sec
randwrite_4k    iops : 42491    throughput : 165    MB/sec

write_128k      iops : 15907    throughput : 1988.5 MB/sec
randwrite_128k  iops : 15558    throughput : 1944.9 MB/sec

write_8m        iops : 347      throughput : 2781.2 MB/sec
randwrite_8m    iops : 347      throughput : 2777.2 MB/sec


OK, if you read all the way here, the million-dollar question is of course
whether the numbers above are in the ballpark of what to expect, or whether
they are low. 

The main reason I'm a bit uncertain about the numbers above is, and this
may sound fuzzy, that we did a POC a couple of months ago (if I remember
the configuration correctly; unfortunately we only saved the numbers, not
the *exact* configuration, *sigh*; networking was still the same though)
with fewer OSDs, and those numbers were

read 4k         iops : 282303   throughput : 1102.8 MB/sec  (b)
randread 4k     iops : 253453   throughput : 990.52 MB/sec  (b)

read 128k       iops : 31298    throughput : 3912   MB/sec  (w)
randread 128k   iops : 9013     throughput : 1126.8 MB/sec  (w)

read 8m         iops : 405      throughput : 3241.4 MB/sec  (w)
randread 8m     iops : 369      throughput : 2957.8 MB/sec  (w)

write 4k        iops : 80644    throughput : 315    MB/sec  (b)
randwrite 4k    iops : 53178    throughput : 207    MB/sec  (b)

write 128k      iops : 17126    throughput : 2140.8 MB/sec  (b)
randwrite 128k  iops : 11654    throughput : 2015.9 MB/sec  (b)

write 8m        iops : 258      throughput : 2067.1 MB/sec  (w)
randwrite 8m    iops : 251      throughput : 1456.9 MB/sec  (w)

Where (b) is a higher number and (w) is lower. What I would expect after
adding more OSDs is an increase in *all* numbers. The read_4k throughput
and IOPS numbers in the current setup are not even close to the POC, which
makes me wonder if these "new" numbers are what they are supposed to be,
or if I'm missing something obvious. 

Ehm, in this new setup we are running with 

Re: [ceph-users] High apply latency

2018-02-05 Thread Frédéric Nass

Hi Jakub,

On 05/02/2018 at 12:26, Jakub Jaszewski wrote:

Hi Frederic,

Many thanks for your contribution to the topic!

I've just set logging level 20 for filestore via

ceph tell osd.0 config set debug_filestore 20

but so far found nothing by keyword 'split' in /var/log/ceph/ceph-osd.0.log



So, if you're running ceph > 12.2.1, that means splitting is not 
happening. Did you check during writes? Did you check the other OSDs' logs?


Actually, splitting should not happen now that you've increased the 
filestore_merge_threshold and filestore_split_multiple values.


I've also run your script across the cluster nodes, results as follows:

id=3, pool=volumes, objects=10454548, avg=160.28
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.2344
id=3, pool=volumes, objects=10454548, avg=159.22
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.9994
id=3, pool=volumes, objects=10454548, avg=159.843
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7435
id=3, pool=volumes, objects=10454548, avg=159.695
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.0579
id=3, pool=volumes, objects=10454548, avg=160.594
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=34.7757
id=3, pool=volumes, objects=10454548, avg=160.099
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=33.8517
id=3, pool=volumes, objects=10454548, avg=159.912
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=37.5698
id=3, pool=volumes, objects=10454548, avg=159.407
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.4991
id=3, pool=volumes, objects=10454548, avg=160.075
id=20, pool=default.rgw.buckets.data, objects=22419862, avg=35.481

Looks like there is nothing to be handled by split, am I right? But 
what about merging? The avg is less than 40, so should the directory 
structure be reduced now?


It should, I guess. But then you'd see blocked requests on every object 
deletion. If you do, you might want to set filestore_merge_threshold 
to -40 (negative value) so merging does not happen anymore.

Splitting would still happen over 5120 files per subdirectory.
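
For reference, the 5120 figure follows the usual filestore split formula (a hedged note; double-check against your version's HashIndex code):

--
# objects per subdirectory that trigger a split:
#   filestore_split_multiple * abs(filestore_merge_threshold) * 16
echo $((8 * 40 * 16))   # = 5120
--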



    "
​​
"filestore_merge_threshold": "40",
"filestore_split_multiple": "8",
"filestore_split_rand_factor": "20",

May I ask for a link to documentation where I can read more about the 
OSD's underlying directory structure?


I'm not aware of any related documentation.

Do you still observe slow or blocked requests now that you've increased 
filestore_merge_threshold and filestore_split_multiple?


Regards,

Frédéric.


And I just noticed log entries in /var/log/ceph/ceph-osd.0.log:

2018-02-05 11:22:03.346400 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 27 vs existing csq=27 existing_state=STATE_STANDBY
2018-02-05 11:22:03.346583 7f3cc94fe700  0 -- 10.212.14.11:6818/4702 >> 10.212.14.17:6802/82845 conn(0xe254cca800 :6818 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 28 vs existing csq=27 existing_state=STATE_STANDBY

Many thanks!
Jakub

On Mon, Feb 5, 2018 at 9:56 AM, Frédéric Nass 
> wrote:


Hi,

In addition, starting with Luminous 12.2.1 (RHCS 3), splitting ops
should be logged with the default debug level settings:
https://github.com/ceph/ceph/blob/v12.2.1/src/os/filestore/HashIndex.cc#L320


There's also an RFE for merging to be logged as well as splitting:
https://bugzilla.redhat.com/show_bug.cgi?id=1523532


Regards,

Frédéric.


On 02/02/2018 at 17:00, Frédéric Nass wrote:


Hi,

Split and merge operations happen during writes only, splitting
on file creation and merging on file deletion.

As you don't see any blocked requests during reads, I would guess
your issue happens during splitting. Now that you've increased
filestore_merge_threshold and filestore_split_multiple, you
shouldn't expect any splitting operations to happen any time soon, nor
any merging operations, unless your workload consists of writing
a huge number of files and removing them.

You should check how many files are in each of the lower directories of
pool 20's PGs. This would help to confirm that the blocked
requests come from the splitting.

We now use the below script (on one of the OSD nodes) to get an
average value of the number of files in some PGs of each pool and
run this script every 5 minutes with Munin to get a graph out of
the values.
This way, we can anticipate the 

Re: [ceph-users] pgs down after adding 260 OSDs & increasing PGs

2018-02-05 Thread Jake Grimmett

Dear Nick & Wido,

Many thanks for your helpful advice; our cluster has returned to HEALTH_OK

One caveat is that a small number of pgs remained at "activating".

By increasing mon_max_pg_per_osd from 500 to 1000 these few PGs 
activated, allowing the cluster to rebalance fully.


i.e. this was needed
mon_max_pg_per_osd = 1000

Once the cluster returned to HEALTH_OK, the mon_max_pg_per_osd setting 
was removed.
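
For anyone hitting the same limit, a hedged sketch of the temporary override described above (applied in ceph.conf on the mons and removed again once the cluster is HEALTH_OK):

--
[global]
mon_max_pg_per_osd = 1000
--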


again, many thanks...

Jake

On 29/01/18 13:07, Nick Fisk wrote:

Hi Jake,

I suspect you have hit an issue that me and a few others have hit in
Luminous. By increasing the number of PG's before all the data has
re-balanced, you have probably exceeded hard PG per OSD limit.

See this thread
https://www.spinics.net/lists/ceph-users/msg41231.html

Nick






Re: [ceph-users] client with uid

2018-02-05 Thread Keane Wolter
Hi Patrick,

Thanks for the info. Looking at the fuse options in the man page, I should
be able to pass "-o uid=$(id -u)" at the end of the ceph-fuse command.
However, when I do, it returns with an unknown option for fuse and
segfaults. Any pointers would be greatly appreciated. This is the result I
get:

daemoneye@wolterk:~$ ceph-fuse --id=kwolter_test1 -r /user/kwolter/
/home/daemoneye/ceph/ --client-die-on-failed-remount=false -o uid=$(id -u)
ceph-fuse[25156]: starting ceph client
fuse: unknown option `uid=1000'
ceph-fuse[25156]: fuse failed to start
*** Caught signal (Segmentation fault) **
 in thread 7efc7da86100 thread_name:ceph-fuse
 ceph version 12.2.2 (cf0baba3b47f9427c6c97e2144b094b7e5ba) luminous
(stable)
 1: (()+0x6a8784) [0x5583372d8784]
 2: (()+0x12180) [0x7efc7bb4f180]
 3: (Client::_ll_drop_pins()+0x67) [0x558336e5dea7]
 4: (Client::unmount()+0x943) [0x558336e67323]
 5: (main()+0x7ed) [0x558336e02b0d]
 6: (__libc_start_main()+0xea) [0x7efc7a892f2a]
 7: (_start()+0x2a) [0x558336e0b73a]
ceph-fuse [25154]: (33) Numerical argument out of domain
daemoneye@wolterk:~$

Thanks,
Keane

On Thu, Jan 25, 2018 at 5:50 PM, Patrick Donnelly 
wrote:

> On Wed, Jan 24, 2018 at 7:47 AM, Keane Wolter  wrote:
> > Hello all,
> >
> > I was looking at the Client Config Reference page
> > (http://docs.ceph.com/docs/master/cephfs/client-config-ref/) and there
> was
> > mention of a flag --client_with_uid. The way I read it is that you can
> > specify the UID of a user on a cephfs and the user mounting the
> filesystem
> > will act as the same UID. I am using the flags --client_mount_uid and
> > --client_mount_gid set equal to my UID and GID values on the cephfs when
> > running ceph-fuse. Is this the correct action for the flags or am I
> > misunderstanding the flags?
>
> These options are no longer used (with the exception of some bugs
> [1,2]). The uid/gid should be provided by FUSE so you don't need to do
> anything. If you're using the client library, you provide the uid/gid
> via the UserPerm struct to each operation.
>
> [1] http://tracker.ceph.com/issues/22802
> [2] http://tracker.ceph.com/issues/22801
>
>
> --
> Patrick Donnelly
>


Re: [ceph-users] Erasure code ruleset for small cluster

2018-02-05 Thread Sage Weil
On Mon, 5 Feb 2018, Gregory Farnum wrote:
> On Mon, Feb 5, 2018 at 3:23 AM Caspar Smit  wrote:
> 
> > Hi Gregory,
> >
> > Thanks for your answer.
> >
> > I had to add another step emit to your suggestion to make it work:
> >
> > step take default
> > step chooseleaf indep 4 type host
> > step emit
> > step take default
> > step chooseleaf indep 4 type host
> > step emit
> >
> > However, now the same OSD is chosen twice for every PG:
> >
> > # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
> > --num-rep 8
> > CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
> >
> 
> Oh, that must be because it has the exact same inputs on every run.
> Hrmmm...Sage, is there a way to seed them differently? Or do you have any
> other ideas? :/

Nope.  The CRUSH rule isn't meant to work like that..

> > I'm wondering why something like this won't work (crushtool test ends up
> > empty):
> >
> > step take default
> > step chooseleaf indep 4 type host

Yeah, s/chooseleaf/choose/ and it should work!
s
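
For clarity, the rule with that substitution applied looks like this (a sketch of just the steps; min_size/max_size etc. as in the earlier rule):

--
step take default
step choose indep 4 type host
step choose indep 2 type osd
step emit
--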

> > step choose indep 2 type osd
> > step emit
> >
> 
> Chooseleaf is telling crush to go all the way down to individual OSDs. I’m
> not quite sure what happens when you then tell it to pick OSDs again but
> obviously it’s failing (as the instruction is nonsense) and emitting an
> empty list.
> 
> 
> 
> >
> > # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
> > --num-rep 8
> > CRUSH rule 1 x 1 []
> >
> > Kind regards,
> > Caspar Smit
> >
> > 2018-02-02 19:09 GMT+01:00 Gregory Farnum :
> >
> >> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit 
> >> wrote:
> >> > Hi all,
> >> >
> >> > I'd like to setup a small cluster (5 nodes) using erasure coding. I
> >> would
> >> > like to use k=5 and m=3.
> >> > Normally you would need a minimum of 8 nodes (preferably 9 or more) for
> >> > this.
> >> >
> >> > Then i found this blog:
> >> > https://ceph.com/planet/erasure-code-on-small-clusters/
> >> >
> >> > This sounded ideal to me so i started building a test setup using the
> >> 5+3
> >> > profile
> >> >
> >> > Changed the erasure ruleset to:
> >> >
> >> > rule erasure_ruleset {
> >> >   ruleset X
> >> >   type erasure
> >> >   min_size 8
> >> >   max_size 8
> >> >   step take default
> >> >   step choose indep 4 type host
> >> >   step choose indep 2 type osd
> >> >   step emit
> >> > }
> >> >
> >> > Created a pool and now every PG has 8 shards in 4 hosts with 2 shards
> >> each,
> >> > perfect.
> >> >
> >> > But then i tested a node failure, no problem again, all PG's stay active
> >> > (most undersized+degraded, but still active). Then after 10 minutes the
> >> > OSD's on the failed node were all marked as out, as expected.
> >> >
> >> > I waited for the data to be recovered to the other (fifth) node but that
> >> > doesn't happen, there is no recovery whatsoever.
> >> >
> >> > Only when i completely remove the down+out OSD's from the cluster the
> >> data
> >> > is recovered.
> >> >
> >> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
> >> > beforehand to store data on.
> >>
> >> Hmm, basically, yes. The basic process is:
> >>
> >> >   step take default
> >>
> >> take the default root.
> >>
> >> >   step choose indep 4 type host
> >>
> >> Choose four hosts that exist under the root. *Note that at this layer,
> >> it has no idea what OSDs exist under the hosts.*
> >>
> >> >   step choose indep 2 type osd
> >>
> >> Within the host chosen above, choose two OSDs.
> >>
> >>
> >> Marking out an OSD does not change the weight of its host, because
> >> that causes massive data movement across the whole cluster on a single
> >> disk failure. The "chooseleaf" commands deal with this (because if
> >> they fail to pick an OSD within the host, they will back out and go
> >> for a different host), but that doesn't work when you're doing
> >> independent "choose" steps.
> >>
> >> I don't remember the implementation details well enough to be sure,
> >> but you *might* be able to do something like
> >>
> >> step take default
> >> step chooseleaf indep 4 type host
> >> step take default
> >> step chooseleaf indep 4 type host
> >> step emit
> >>
> >> And that will make sure you get at least 4 OSDs involved?
> >> -Greg
> >>
> >> >
> >> > Would it be possible to do something like this:
> >> >
> >> > Create a 5+3 EC profile, every hosts has a maximum of 2 shards (so 4
> >> hosts
> >> > are needed), in case of node failure -> recover data from failed node to
> >> > fifth node.
> >> >
> >> > Thank you in advance,
> >> > Caspar
> >> >
> >> >
> >> >
> 

Re: [ceph-users] Erasure code ruleset for small cluster

2018-02-05 Thread Gregory Farnum
On Mon, Feb 5, 2018 at 3:23 AM Caspar Smit  wrote:

> Hi Gregory,
>
> Thanks for your answer.
>
> I had to add another step emit to your suggestion to make it work:
>
> step take default
> step chooseleaf indep 4 type host
> step emit
> step take default
> step chooseleaf indep 4 type host
> step emit
>
> However, now the same OSD is chosen twice for every PG:
>
> # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
> --num-rep 8
> CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]
>

Oh, that must be because it has the exact same inputs on every run.
Hrmmm...Sage, is there a way to seed them differently? Or do you have any
other ideas? :/




> I'm wondering why something like this won't work (crushtool test ends up
> empty):
>
> step take default
> step chooseleaf indep 4 type host
> step choose indep 2 type osd
> step emit
>

Chooseleaf is telling crush to go all the way down to individual OSDs. I’m
not quite sure what happens when you then tell it to pick OSDs again but
obviously it’s failing (as the instruction is nonsense) and emitting an
empty list.



>
> # crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
> --num-rep 8
> CRUSH rule 1 x 1 []
>
> Kind regards,
> Caspar Smit
>
> 2018-02-02 19:09 GMT+01:00 Gregory Farnum :
>
>> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit 
>> wrote:
>> > Hi all,
>> >
>> > I'd like to setup a small cluster (5 nodes) using erasure coding. I
>> would
>> > like to use k=5 and m=3.
>> > Normally you would need a minimum of 8 nodes (preferably 9 or more) for
>> > this.
>> >
>> > Then i found this blog:
>> > https://ceph.com/planet/erasure-code-on-small-clusters/
>> >
>> > This sounded ideal to me so i started building a test setup using the
>> 5+3
>> > profile
>> >
>> > Changed the erasure ruleset to:
>> >
>> > rule erasure_ruleset {
>> >   ruleset X
>> >   type erasure
>> >   min_size 8
>> >   max_size 8
>> >   step take default
>> >   step choose indep 4 type host
>> >   step choose indep 2 type osd
>> >   step emit
>> > }
>> >
>> > Created a pool and now every PG has 8 shards in 4 hosts with 2 shards
>> each,
>> > perfect.
>> >
>> > But then i tested a node failure, no problem again, all PG's stay active
>> > (most undersized+degraded, but still active). Then after 10 minutes the
>> > OSD's on the failed node were all marked as out, as expected.
>> >
>> > I waited for the data to be recovered to the other (fifth) node but that
>> > doesn't happen, there is no recovery whatsoever.
>> >
>> > Only when i completely remove the down+out OSD's from the cluster the
>> data
>> > is recovered.
>> >
>> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
>> > beforehand to store data on.
>>
>> Hmm, basically, yes. The basic process is:
>>
>> >   step take default
>>
>> take the default root.
>>
>> >   step choose indep 4 type host
>>
>> Choose four hosts that exist under the root. *Note that at this layer,
>> it has no idea what OSDs exist under the hosts.*
>>
>> >   step choose indep 2 type osd
>>
>> Within the host chosen above, choose two OSDs.
>>
>>
>> Marking out an OSD does not change the weight of its host, because
>> that causes massive data movement across the whole cluster on a single
>> disk failure. The "chooseleaf" commands deal with this (because if
>> they fail to pick an OSD within the host, they will back out and go
>> for a different host), but that doesn't work when you're doing
>> independent "choose" steps.
>>
>> I don't remember the implementation details well enough to be sure,
>> but you *might* be able to do something like
>>
>> step take default
>> step chooseleaf indep 4 type host
>> step take default
>> step chooseleaf indep 4 type host
>> step emit
>>
>> And that will make sure you get at least 4 OSDs involved?
>> -Greg
>>
>> >
>> > Would it be possible to do something like this:
>> >
>> > Create a 5+3 EC profile, every hosts has a maximum of 2 shards (so 4
>> hosts
>> > are needed), in case of node failure -> recover data from failed node to
>> > fifth node.
>> >
>> > Thank you in advance,
>> > Caspar
>> >
>> >
>> >


Re: [ceph-users] ceph luminous - performance IOPS vs throughput

2018-02-05 Thread Gregory Farnum
The tests are pretty clearly using different op sizes there. I believe the
default is 16*4MB, but the first one is using 32*4KB. So obviously the
curves are very different!
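
One way to make the two runs comparable is to pin the object size explicitly so that only the concurrency changes (a hedged sketch; pool name and duration are placeholders):

--
rados bench -p testpool 60 write -t 16 -b 4194304 --no-cleanup
rados bench -p testpool 60 write -t 32 -b 4194304 --no-cleanup
rados -p testpool cleanup
--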
On Mon, Feb 5, 2018 at 6:47 AM Steven Vacaroaia  wrote:

> Hi,
>
> I noticed a severe inverse correlation between IOPS and throughput
>
> For example:
>
>  running rados bench write with t=32 shows an average IOPS of 1426
>  and a bandwidth of 5.5 MB/sec
> 
> running it with the default (t = 16), the average IOPS is 49 and the bandwidth
> is 200 MB/s
>
> Is this expected behavior ?
> How do I increase bandwidth without impacting IOPS ?
>
> Note
>
> Nothing "red" on atop
> No customization in ceph.conf or sysctl.conf
>
> The test I ran using replication = 2 and only 2 OSD s
>
> Steven
>
>


Re: [ceph-users] Redirect for restful API in manager

2018-02-05 Thread John Spray
On Mon, Feb 5, 2018 at 5:06 PM, Hans van den Bogert
 wrote:
> Hi all,
>
> In the release notes of 12.2.2 the following is stated:
>
> > Standby ceph-mgr daemons now redirect requests to the active
> messenger, easing configuration for tools & users accessing the web
> dashboard, restful API, or other ceph-mgr module services.
>
> However, it doesn't seem to be the case that the restful API redirects
> the client. Can anybody verify that? If it doesn't redirect, will this
> be added in the near future?

No plans personally, but it would be really easy for someone to look
at how the dashboard does it and do the same thing for the restful
module.

John



> Regards,
>
> Hans


[ceph-users] Redirect for restful API in manager

2018-02-05 Thread Hans van den Bogert
Hi all,

In the release notes of 12.2.2 the following is stated:

> Standby ceph-mgr daemons now redirect requests to the active
messenger, easing configuration for tools & users accessing the web
dashboard, restful API, or other ceph-mgr module services.

However, it doesn't seem to be the case that the restful API redirects
the client. Can anybody verify that? If it doesn't redirect, will this
be added in the near future?

Regards,

Hans


Re: [ceph-users] restrict user access to certain rbd image

2018-02-05 Thread knawnd
Thanks a lot to everyone who shared their thoughts and experience on this topic! It seems that 
Frédéric's input is exactly what I've been looking for. Thanks, Frédéric!
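
For reference, the kind of prefix-restricted cap discussed in the quoted replies below looks roughly like this (a hedged sketch; the user, pool, image name and the image id used in the rbd_data/rbd_header prefixes are made up, and the real id has to be looked up with 'rbd info <image>'):

--
ceph auth get-or-create client.alice mon 'allow r' \
  osd 'allow rwx pool=rbd object_prefix rbd_data.123456789abc; allow rwx pool=rbd object_prefix rbd_header.123456789abc; allow rx pool=rbd object_prefix rbd_id.myimage'
--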


Jason Dillaman wrote on 02/02/18 19:24:

Concur that it's technically feasible by restricting access to
"rbd_id.", "rbd_header..",
"rbd_object_map..", and "rbd_data.." objects using
the prefix restriction in the OSD caps. However, this really won't
scale beyond a small number of images per user since every IO will
need to traverse the list of caps to verify the user can touch the
object.

On Fri, Feb 2, 2018 at 11:05 AM, Gregory Farnum  wrote:

I don't think it's well-integrated with the tooling, but check out the cephx
docs for the "prefix" level of access. It lets you grant access only to
objects whose name matches a prefix, which for rbd would be the rbd volume
ID (or name? Something easy to identify).
-Greg


On Fri, Feb 2, 2018 at 7:42 AM  wrote:


Hello!

I wonder if it's possible in ceph Luminous to manage user access to rbd
images on per image (but not
the whole rbd pool) basis?
I need to provide rbd images for my users but would like to disable their
ability to list all images
in a pool as well as to somehow access/use ones if a ceph admin didn't
authorize that.











Re: [ceph-users] Sizing your MON storage with a large cluster

2018-02-05 Thread Wes Dillingham
Good data point on not trimming when non active+clean PGs are present. So
am I reading this correct? It grew to 32GB? Did it end up growing beyond
that, what was the max? Also is only ~18PGs per OSD a reasonable amount of
PGs per OSD? I would think about quadruple that would be ideal. Is this an
artifact of a steadily growing cluster or a design choice?

On Sat, Feb 3, 2018 at 10:50 AM, Wido den Hollander  wrote:

> Hi,
>
> I just wanted to inform people about the fact that Monitor databases can
> grow quite big when you have a large cluster which is performing a very
> long rebalance.
>
> I'm posting this on ceph-users and ceph-large as it applies to both, but
> you'll see this sooner on a cluster with a lof of OSDs.
>
> Some information:
>
> - Version: Luminous 12.2.2
> - Number of OSDs: 2175
> - Data used: ~2PB
>
> We are in the middle of migrating from FileStore to BlueStore and this is
> causing a lot of PGs to backfill at the moment:
>
>  33488 active+clean
>  4802  active+undersized+degraded+remapped+backfill_wait
>  1670  active+remapped+backfill_wait
>  263   active+undersized+degraded+remapped+backfilling
>  250   active+recovery_wait+degraded
>  54active+recovery_wait+degraded+remapped
>  27active+remapped+backfilling
>  13active+recovery_wait+undersized+degraded+remapped
>  2 active+recovering+degraded
>
> This has been running for a few days now and it has caused this warning:
>
> MON_DISK_BIG mons srv-zmb03-05,srv-zmb04-05,srv-zmb05-05,srv-zmb06-05,srv-zmb07-05 are using a lot of disk space
> mon.srv-zmb03-05 is 31666 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb04-05 is 31670 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb05-05 is 31670 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb06-05 is 31897 MB >= mon_data_size_warn (15360 MB)
> mon.srv-zmb07-05 is 31891 MB >= mon_data_size_warn (15360 MB)
>
> This is to be expected as MONs do not trim their store if one or more PGs
> is not active+clean.
>
> In this case we expected this and the MONs are each running on a 1TB Intel
> DC-series SSD to make sure we do not run out of space before the backfill
> finishes.
>
> The cluster is spread out over racks and in CRUSH we replicate over racks.
> Rack by rack we are wiping/destroying the OSDs and bringing them back as
> BlueStore OSDs and letting the backfill handle everything.
>
> In between we wait for the cluster to become HEALTH_OK (all PGs
> active+clean) so that the Monitors can trim their database before we start
> with the next rack.
>
> I just want to warn and inform people about this. Under normal
> circumstances a MON database isn't that big, but if you have a very long
> period of backfills/recoveries and also have a large number of OSDs you'll
> see the DB grow quite big.
>
> This has improved significantly going to Jewel and Luminous, but it is
> still something to watch out for.
>
> Make sure your MONs have enough free space to handle this!
>
> Wido
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 204
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph luminous - performance IOPS vs throughput

2018-02-05 Thread Steven Vacaroaia
Hi,

I noticed a severe inverse correlation between IOPS and throughput

For example:

 running rados bench write with t=32 shows an average IOPS of 1426
 and a bandwidth of 5.5 MB/sec

running it with the default (t = 16), average IOPS is 49 and bandwidth is
200 MB/s

Is this expected behavior ?
How do I increase bandwidth without impacting IOPS ?

Note

Nothing "red" on atop
No customization in ceph.conf or sysctl.conf

The test was run using replication = 2 and only 2 OSDs
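
For reference, the kind of runs being compared, plus the knob that usually
matters for the IOPS/bandwidth trade-off; the pool name and sizes are just
placeholders:

  rados bench -p testpool 60 write -t 32 --no-cleanup
  rados bench -p testpool 60 write -t 16 --no-cleanup
  # rados bench writes 4 MiB objects by default; -b changes the object size
  rados bench -p testpool 60 write -t 16 -b 4096 --no-cleanup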

Steven
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW default.rgw.meta pool

2018-02-05 Thread Thomas Bennett
Hi Orit,

Thanks for the reply, much appreciated.

> You cannot see the omap size using rados ls but need to use rados omap
> commands.

> You can use this script to calculate the bucket index size:
> https://github.com/mkogan1/ceph-utils/blob/master/scripts/get_omap_kv_size.sh


Great. I had not even thought of that - thanks for the script!


> you probably meant default.rgw.meta.
> It is a namespace, not a pool; try using:
> rados ls -p default.rgw.meta --all
>

I see that the '--all' switch does the trick. (Sorry, I meant
default.rgw.meta.)

I now see the 'POOL NAMESPACES' in the rgw docs and I get what it's trying
to describe. It's all starting to fall in place.
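
For anyone else poking at this, per-namespace listing looks roughly like
this (the namespace names are the usual RGW ones, so treat them as an
assumption):

  rados ls -p default.rgw.meta --all
  rados ls -p default.rgw.meta -N root
  rados ls -p default.rgw.meta -N users.uid
  rados ls -p default.rgw.meta -N users.keys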

Thanks for the info :)

Regards,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW default.rgw.meta pool

2018-02-05 Thread Orit Wasserman
On Mon, Feb 5, 2018 at 12:45 PM, Thomas Bennett  wrote:
> Hi,
>
> In trying to understand RGW pool usage I've noticed the pool called
> default.rgw.meta pool has a large number of objects in it. Suspiciously
> about twice as many objects in my default.rgw.buckets.index pool.
>
> As I delete and add buckets, the number of objects in both pools decrease
> and increase proportionally.
>
> However when I try to list the objects in the default.rgw.meta pool, it
> returns nothing.
>
> I.e  'rados -p default.rgw.buckets.index ls' returns nothing.
>
> Is this expected behaviour for this pool?
>
> What are all those objects and why can I not list them?
>

default.rgw.buckets.index stores the bucket index objects, which are omap objects.
You cannot see the omap size using rados ls but need to use rados omap commands.
You can use this script to calculate the bucket index size:
https://github.com/mkogan1/ceph-utils/blob/master/scripts/get_omap_kv_size.sh
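
Or, to poke at a single index object by hand; the .dir.<bucket-marker> name
below is a placeholder for a real object from the ls output:

  rados -p default.rgw.buckets.index ls
  rados -p default.rgw.buckets.index listomapkeys .dir.<bucket-marker> | wc -l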

> From my understanding default.rgw.buckets.index should contain things like:
you probably meant default.rgw.meta.
It is a namespace, not a pool; try using:
rados ls -p default.rgw.meta --all

Regards,
Orit

> domain_root, user_keys_pool, user_email_pool, user_swift_pool,
> user_uid_pool.
>


> Cheers,
> Tom
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Inactive PGs rebuild is not priorized

2018-02-05 Thread Bartlomiej Swiecki
Hi Nico,

What Ceph version are you running? There were changes in recovery priorities 
merged into jewel 10.2.7+ and luminous which should cover exactly this case.
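
As a side note, on luminous you can also push specific stuck PGs to the
front of the queue by hand; the PG ids below are placeholders:

  ceph pg dump_stuck inactive
  ceph pg force-recovery 2.1a 2.3f
  ceph pg force-backfill 2.1a 2.3f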

Regards,
Bartek


> Wiadomość napisana przez Nico Schottelius  w 
> dniu 03.02.2018, o godz. 12:55:
> 
> 
> Good morning,
> 
> after another disk failure, we currently have 7 inactive pgs [1], which
> are stalling IO from the affected VMs.
> 
> It seems that ceph, when rebuilding, does not focus on repairing
> the inactive PGs first, which surprised us quite a lot:
> 
> It does not repair the inactive first, but mixes inactive with
> active+undersized+degraded+remapped+backfill_wait.
> 
> Is this a misconfiguration on our side or a design aspect of ceph?
> 
> I have attached ceph -s from three times while rebuilding below.
> 
> First the number of active+undersized+degraded+remapped+backfill_wait
> decreases, and only much later does the number of
> undersized+degraded+remapped+backfill_wait+peered decrease.
> 
> If anyone could comment on this, I would be very thankful to know how to
> progress here, as we had 6 disk failures this week and each time we had
> inactive pgs that stalled the VM i/o.
> 
> Best,
> 
> Nico
> 
> 
> [1]
>  cluster:
>id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>health: HEALTH_WARN
>108752/3920931 objects misplaced (2.774%)
>Reduced data availability: 7 pgs inactive
>Degraded data redundancy: 419786/3920931 objects degraded 
> (10.706%), 147 pgs unclean, 140 pgs degraded, 140 pgs undersized
> 
>  services:
>mon: 3 daemons, quorum server5,server3,server2
>mgr: server5(active), standbys: server3, server2
>osd: 53 osds: 52 up, 52 in; 147 remapped pgs
> 
>  data:
>pools:   2 pools, 1280 pgs
>objects: 1276k objects, 4997 GB
>usage:   13481 GB used, 26853 GB / 40334 GB avail
>pgs: 0.547% pgs not active
> 419786/3920931 objects degraded (10.706%)
> 108752/3920931 objects misplaced (2.774%)
> 1133 active+clean
> 108  active+undersized+degraded+remapped+backfill_wait
> 25   active+undersized+degraded+remapped+backfilling
> 7active+remapped+backfill_wait
> 6undersized+degraded+remapped+backfilling+peered
> 1undersized+degraded+remapped+backfill_wait+peered
> 
>  io:
>client:   29980 B/s rd,  kB/s wr, 17 op/s rd, 74 op/s wr
>recovery: 71727 kB/s, 17 objects/s
> 
> [2]
> 
> [11:20:15] server3:~# ceph -s
>  cluster:
>id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>health: HEALTH_WARN
>103908/3920967 objects misplaced (2.650%)
>Reduced data availability: 7 pgs inactive
>Degraded data redundancy: 380860/3920967 objects degraded 
> (9.713%), 144 pgs unclean, 137 pgs degraded, 137 pgs undersized
> 
>  services:
>mon: 3 daemons, quorum server5,server3,server2
>mgr: server5(active), standbys: server3, server2
>osd: 53 osds: 52 up, 52 in; 144 remapped pgs
> 
>  data:
>pools:   2 pools, 1280 pgs
>objects: 1276k objects, 4997 GB
>usage:   13630 GB used, 26704 GB / 40334 GB avail
>pgs: 0.547% pgs not active
> 380860/3920967 objects degraded (9.713%)
> 103908/3920967 objects misplaced (2.650%)
> 1136 active+clean
> 105  active+undersized+degraded+remapped+backfill_wait
> 25   active+undersized+degraded+remapped+backfilling
> 7active+remapped+backfill_wait
> 6undersized+degraded+remapped+backfilling+peered
> 1undersized+degraded+remapped+backfill_wait+peered
> 
>  io:
>client:   40201 B/s rd, 1189 kB/s wr, 16 op/s rd, 74 op/s wr
>recovery: 54519 kB/s, 13 objects/s
> 
> 
> [3]
> 
> 
>  cluster:
>id: 26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab
>health: HEALTH_WARN
>88382/3921066 objects misplaced (2.254%)
>Reduced data availability: 4 pgs inactive
>Degraded data redundancy: 285528/3921066 objects degraded 
> (7.282%), 127 pgs unclean, 121 pgs degraded, 115 pgs undersized
>14 slow requests are blocked > 32 sec
> 
>  services:
>mon: 3 daemons, quorum server5,server3,server2
>mgr: server5(active), standbys: server3, server2
>osd: 53 osds: 52 up, 52 in; 121 remapped pgs
> 
>  data:
>pools:   2 pools, 1280 pgs
>objects: 1276k objects, 4997 GB
>usage:   14014 GB used, 26320 GB / 40334 GB avail
>pgs: 0.313% pgs not active
> 285528/3921066 objects degraded (7.282%)
> 88382/3921066 objects misplaced (2.254%)
> 1153 active+clean
> 78   active+undersized+degraded+remapped+backfill_wait
> 33   active+undersized+degraded+remapped+backfilling
> 6active+recovery_wait+degraded
> 6active+remapped+backfill_wait
> 2undersized+degraded+remapped+backfill_wait+peered
> 2 

Re: [ceph-users] Erasure code ruleset for small cluster

2018-02-05 Thread Caspar Smit
Hi Gregory,

Thanks for your answer.

I had to add another step emit to your suggestion to make it work:

step take default
step chooseleaf indep 4 type host
step emit
step take default
step chooseleaf indep 4 type host
step emit

However, now the same OSDs are chosen twice for every PG:

# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 [5,9,3,12,5,9,3,12]

I'm wondering why something like this won't work (crushtool test ends up
empty):

step take default
step chooseleaf indep 4 type host
step choose indep 2 type osd
step emit


# crushtool --test -i compiled-crushmap-new --rule 1 --show-mappings --x 1
--num-rep 8
CRUSH rule 1 x 1 []
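
As an aside, testing a single x value can be misleading; a sketch of a
broader run against the same compiled map (flag names as in current
crushtool builds):

  crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 \
      --show-mappings --min-x 0 --max-x 9
  crushtool --test -i compiled-crushmap-new --rule 1 --num-rep 8 --show-bad-mappings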

Kind regards,
Caspar Smit

2018-02-02 19:09 GMT+01:00 Gregory Farnum :

> On Fri, Feb 2, 2018 at 8:13 AM, Caspar Smit 
> wrote:
> > Hi all,
> >
> > I'd like to setup a small cluster (5 nodes) using erasure coding. I would
> > like to use k=5 and m=3.
> > Normally you would need a minimum of 8 nodes (preferably 9 or more) for
> > this.
> >
> > Then i found this blog:
> > https://ceph.com/planet/erasure-code-on-small-clusters/
> >
> > This sounded ideal to me so i started building a test setup using the 5+3
> > profile
> >
> > Changed the erasure ruleset to:
> >
> > rule erasure_ruleset {
> >   ruleset X
> >   type erasure
> >   min_size 8
> >   max_size 8
> >   step take default
> >   step choose indep 4 type host
> >   step choose indep 2 type osd
> >   step emit
> > }
> >
> > Created a pool and now every PG has 8 shards in 4 hosts with 2 shards each,
> > perfect.
> >
> > But then I tested a node failure, no problem again, all PGs stay active
> > (most undersized+degraded, but still active). Then after 10 minutes the
> > OSDs on the failed node were all marked as out, as expected.
> >
> > I waited for the data to be recovered to the other (fifth) node but that
> > doesn't happen, there is no recovery whatsoever.
> >
> > Only when I completely remove the down+out OSDs from the cluster the data
> > is recovered.
> >
> > My guess is that the "step choose indep 4 type host" chooses 4 hosts
> > beforehand to store data on.
>
> Hmm, basically, yes. The basic process is:
>
> >   step take default
>
> take the default root.
>
> >   step choose indep 4 type host
>
> Choose four hosts that exist under the root. *Note that at this layer,
> it has no idea what OSDs exist under the hosts.*
>
> >   step choose indep 2 type osd
>
> Within the host chosen above, choose two OSDs.
>
>
> Marking out an OSD does not change the weight of its host, because
> that causes massive data movement across the whole cluster on a single
> disk failure. The "chooseleaf" commands deal with this (because if
> they fail to pick an OSD within the host, they will back out and go
> for a different host), but that doesn't work when you're doing
> independent "choose" steps.
>
> I don't remember the implementation details well enough to be sure,
> but you *might* be able to do something like
>
> step take default
> step chooseleaf indep 4 type host
> step take default
> step chooseleaf indep 4 type host
> step emit
>
> And that will make sure you get at least 4 OSDs involved?
> -Greg
>
> >
> > Would it be possible to do something like this:
> >
> > Create a 5+3 EC profile, every host has a maximum of 2 shards (so 4 hosts
> > are needed), in case of node failure -> recover data from failed node to
> > fifth node.
> >
> > Thank you in advance,
> > Caspar
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW default.rgw.meta pool

2018-02-05 Thread Thomas Bennett
Hi,

In trying to understand RGW pool usage I've noticed that the default.rgw.meta
pool has a large number of objects in it. Suspiciously, it has about twice as
many objects as my default.rgw.buckets.index pool.

As I delete and add buckets, the number of objects in both pools decrease
and increase proportionally.

However when I try to list the objects in the default.rgw.meta pool, it
returns nothing.

I.e. 'rados -p default.rgw.buckets.index ls' returns nothing.

Is this expected behaviour for this pool?

What are all those objects and why can I not list them?

From my understanding default.rgw.buckets.index should contain things
like: domain_root, user_keys_pool, user_email_pool, user_swift_pool,
user_uid_pool.

Cheers,
Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com