[ceph-users] Ceph cluster works UNTIL the OSDs are rebooted

2019-11-14 Thread Richard Geoffrion
I had a working Ceph cluster running Nautilus in a test lab just a few 
months ago.  Now that I'm trying to take Ceph live on production 
hardware, I can't get the cluster to stay up and available even 
though all three OSDs are UP and IN.


I believe the problem is that the OSDs don't mount their volumes after a 
reboot.  The ceph-deploy routine can install an OSD node, format the 
disk and bring it online, and it can get all the OSD nodes UP and IN and 
reach quorum, but once an OSD gets rebooted, all the PGs related to 
that OSD go "stuck inactive...current state unknown, last acting".


I've found and resolved all my hostname and firewall errors, and I'm 
confident that I've ruled out network issues. For grins and giggles, I 
reconfigured the OSDs to be on the same 'public' network as the MON 
servers, and the OSDs still drop their disks from the cluster after a reboot.


What do I need to do next?

Below is a pastebin link to some log file data where you can see some 
traceback errors.



[2019-10-30 14:52:10,201][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", 
line 59, in newfunc

return f(*a, **kw)


Some of these errors might be due to the system seeing the three other 
setup attempts that are no longer available. 'ceph-deploy purge' and 
'ceph-deploy purgedata' don't seem to get rid of EVERYTHING. I've 
learned since that /var/lib/ceph retains some data.  I'll be sure to 
remove the data from that directory when I next attempt to start fresh.
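
For the record, the teardown I plan to run before the next fresh attempt 
looks roughly like this (the hostname and device are placeholders for my 
own nodes):

    ceph-deploy purge osd-node1
    ceph-deploy purgedata osd-node1
    ceph-deploy forgetkeys
    # then, on the OSD node itself, clear leftover state and wipe the old volume
    rm -rf /var/lib/ceph/*
    ceph-volume lvm zap /dev/sdb --destroy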


What do I need to be looking at to correct this "OSD not remounting its 
disk" issue?


https://pastebin.com/NMXvYBcZ
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cluster Replication / Disaster Recovery

2019-06-12 Thread DHilsbos
All;

I'm testing and evaluating Ceph for the next generation of storage architecture 
for our company, and so far I'm fairly impressed, but I've got a couple of 
questions around cluster replication and disaster recovery.

First; intended uses.
Ceph Object Gateway will be used to support new software projects presently in 
the works.
CephFS behind Samba will be used for Windows file shares both during 
transition, and to support long term needs.
The iSCSI gateway and RADOS Block Devices will be used to support 
virtualization.

My research suggests that native whole-cluster replication isn't available in 
Ceph (i.e. having one cluster replicate all of its objects to a second cluster).  
RADOSGW supports replicating objects into more than one Ceph cluster, but I 
can't find information on multi-site / replication for the iSCSI gateway or CephFS.
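
(For RBD at least, the options I've seen mentioned are the rbd-mirror daemon 
and plain snapshot shipping between clusters; a rough sketch of the latter is 
below, where the pool/image names and the 'drsite' cluster name are 
placeholders, and the destination image must already exist before the 
incremental step -- please correct me if that's the wrong approach.)

    # initial full copy of an image to the DR cluster
    rbd export rbd/vm-disk-1 - | rbd --cluster drsite import - rbd/vm-disk-1

    # later, ship periodic snapshots as incremental diffs
    rbd snap create rbd/vm-disk-1@daily-20190612
    rbd export-diff --from-snap daily-20190611 rbd/vm-disk-1@daily-20190612 - | \
        rbd --cluster drsite import-diff - rbd/vm-disk-1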

So... How do you plan / manage major event disaster recovery with your Ceph 
Clusters (i.e. loss of the entire cluster)?
What backup solutions do you use / recommend with your Ceph clusters?  Are you 
doing any off-site backup?
Anyone backing up to the cloud?  What kind of bandwidth are you using for this?

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster available to clients with 2 different VLANs ?

2019-05-03 Thread solarflow99
How is this better than using a single public network, routing through an L3
switch?

If I understand the scenario right, this way would require the switch port to
be a trunk carrying all the public VLANs, so you could bridge directly
through the switch and L3 wouldn't be necessary?



On Fri, May 3, 2019 at 11:43 AM EDH - Manuel Rios Fernandez <
mrios...@easydatahost.com> wrote:

> You can put multiple networks in ceph.conf with commas
>
>
>
> public network = 172.16.2.0/24, 192.168.0.0/22
>
>
>
> But remember your servers must be able to reach it. L3 , FW needed.
>
>
>
> Regards
>
> Manuel
>
>
>
>
>
> *From:* ceph-users  *On behalf of *Martin
> Verges
> *Sent:* Friday, May 3, 2019 11:36
> *To:* Hervé Ballans 
> *CC:* ceph-users 
> *Subject:* Re: [ceph-users] Ceph cluster available to clients with 2
> different VLANs ?
>
>
>
> Hello,
>
>
>
> configure a gateway on your router or use a good rack switch that can
> provide such features and use layer3 routing to connect different vlans /
> ip zones.
>
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
>
>
>
> On Fri., May 3, 2019 at 10:21 AM Hervé Ballans <
> herve.ball...@ias.u-psud.fr> wrote:
>
> Hi all,
>
> I have a Ceph cluster on Luminous 12.2.10 with 3 mon and 6 osd servers.
> My current network settings is a separated public and cluster (private
> IP) network.
>
> I would like my cluster available to clients on another VLAN than the
> default one (which is the public network on ceph.conf)
>
> Is it possible ? How can I achieve that ?
> For information, each node still has two unused network cards.
>
> Thanks for any suggestions,
>
> Hervé
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster available to clients with 2 different VLANs ?

2019-05-03 Thread EDH - Manuel Rios Fernandez
You can put multiple networks in ceph.conf with commas

 

public network = 172.16.2.0/24, 192.168.0.0/22

 

But remember your servers must be able to reach it: L3 routing / firewall rules will be needed.
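
A slightly fuller sketch of how that ends up looking in ceph.conf (the 
addresses below are examples only; the mons need to be reachable from every 
client network, either directly or via routing):

    [global]
    public network  = 172.16.2.0/24, 192.168.0.0/22
    cluster network = 10.10.10.0/24
    mon host        = 172.16.2.10, 172.16.2.11, 172.16.2.12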

 

Regards

Manuel

 

 

From: ceph-users  On behalf of Martin Verges
Sent: Friday, May 3, 2019 11:36
To: Hervé Ballans 
CC: ceph-users 
Subject: Re: [ceph-users] Ceph cluster available to clients with 2 different 
VLANs ?

 

Hello,

 

configure a gateway on your router or use a good rack switch that can provide 
such features and use layer3 routing to connect different vlans / ip zones.




--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

 

 

On Fri., May 3, 2019 at 10:21 AM Hervé Ballans 
<herve.ball...@ias.u-psud.fr> wrote:

Hi all,

I have a Ceph cluster on Luminous 12.2.10 with 3 mon and 6 osd servers.
My current network settings is a separated public and cluster (private 
IP) network.

I would like my cluster available to clients on another VLAN than the 
default one (which is the public network on ceph.conf)

Is it possible ? How can I achieve that ?
For information, each node still has two unused network cards.

Thanks for any suggestions,

Hervé

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster available to clients with 2 different VLANs ?

2019-05-03 Thread Martin Verges
Hello,

configure a gateway on your router or use a good rack switch that can
provide such features and use layer3 routing to connect different vlans /
ip zones.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri., May 3, 2019 at 10:21 AM Hervé Ballans <
herve.ball...@ias.u-psud.fr> wrote:

> Hi all,
>
> I have a Ceph cluster on Luminous 12.2.10 with 3 mon and 6 osd servers.
> My current network settings is a separated public and cluster (private
> IP) network.
>
> I would like my cluster available to clients on another VLAN than the
> default one (which is the public network on ceph.conf)
>
> Is it possible ? How can I achieve that ?
> For information, each node still has two unused network cards.
>
> Thanks for any suggestions,
>
> Hervé
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster available to clients with 2 different VLANs ?

2019-05-03 Thread Hervé Ballans

Hi all,

I have a Ceph cluster on Luminous 12.2.10 with 3 mon and 6 osd servers.
My current network setup is a separate public and cluster (private 
IP) network.


I would like my cluster to be available to clients on another VLAN than the 
default one (which is the public network in ceph.conf).


Is it possible? How can I achieve that?
For information, each node still has two unused network cards.

Thanks for any suggestions,

Hervé

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Christian Balzer
On Tue, 5 Mar 2019 10:39:14 -0600 Mark Nelson wrote:

> On 3/5/19 10:20 AM, Darius Kasparavičius wrote:
> > Thank you for your response.
> >
> > I was planning to use a 100GbE or 45GbE bond for this cluster. It was
> > acceptable for our use case to lose sequential/larger I/O speed for
> > it.  Dual socket would be and option, but I do not want to touch numa,
> > cgroups and the rest settings. Most of the time is just easier to add
> > a higher clock CPU or more cores. The plan is currently for 2xosd per
> > nvme device, but if testing shows that it’s better to use one. We will
> > stick with one. Which RocksDB settings would you recommend tweaking? I
> > haven’t had the chance to test them yet. Most of the clusters I have
> > access to are using leveldb and are still running filestore.  
> 
> 
> Yeah, numa makes everything more complicated.  I'd just consider jumping 
> up to the 7601 then if IOPS is a concern and know that you might still 
> be CPU bound (though it's also possible you could also hit some other 
> bottleneck before it becomes an issue).  Given that the cores aren't 
> clocked super high it's possible that you might see a benefit to 2x 
> OSDs/device.
> 
With EPYC CPUs and their rather studly interconnect, NUMA feels less of an
issue than in previous generations. 
Of course pinning would still be beneficial.

That said, avoiding it altogether if you can (afford it) is of course the
easiest thing to do.

Christian

> 
> RocksDB is tough.  Right now we are heavily tuned to favor reducing 
> write amplification but eat CPU to do it.  That can help performance 
> when write throughput is a bottleneck and also reduces wear on the drive 
> (which is always good, but especially with low write endurance drives).  
> Reducing the size of the WAL buffers will (probably) reduce CPU usage 
> and also reduce the amount of memory used by the OSD, but we've observed 
> higher write-amplification on our test nodes.  I suspect that might be a 
> worthwhile trade-off for nvdimms or optane, but I'm not sure it's a good 
> idea for typical NVMe drives.
> 
> 
> Mark
> 
> 
> >
> > On Tue, Mar 5, 2019 at 5:35 PM Mark Nelson  wrote:  
> >> Hi,
> >>
> >>
> >> I've got a ryzen7 1700 box that I regularly run tests on along with the
> >> upstream community performance test nodes that have Intel Xeon E5-2650v3
> >> processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons
> >> are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the
> >> tests I've done with Ceph. Typically I see a single OSD using fewer
> >> cores on the Xeon processors vs Ryzen to hit similar performance numbers
> >> despite being clocked lower (though I haven't verified the turbo
> >> frequencies of both under load).  On the other hand, the Ryzen processor
> >> is significantly cheaper per core.  If you only looked at cores you'd
> >> think something like Ryzen would be the way to go, but there are other
> >> things to consider.  The number of PCIE lanes, memory configuration,
> >> cache configuration, and CPU interconnect (in multi-socket
> >> configurations) all start becoming really important if you are targeting
> >> multiple NVMe drives like what you are talking about below.  The EPYC
> >> processors give you more of all of that, but also costs a lot more than
> >> Ryzen.  Ultimately the CPU is only a small part of the price for nodes
> >> like this so I wouldn't skimp if your goal is to maximize IOPS.
> >>
> >>
> >> With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
> >> going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
> >> will be network bound for large IO workloads unless you are sticking
> >> 2x100GbE in.  You might want to consider jumping up to the 7601.  That
> >> would get you closer to where you want to be for 10 NVMe drives
> >> (3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:
> >>
> >> https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm
> >>
> >>
> >> Figure that with sufficient client parallelism/load you'll get about
> >> 3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before
> >> replication) with OSDs typically topping out at a max of about 6-8 cores
> >> each.  Doubling up OSDs on each NVMe drive might improve or hurt
> >> performance depending on what the limitations are (typically it seems to
> >> help most when the kv sync thread is the primary bottleneck in
> >> bluestore, which most likely happens with tons of slow cores and very
> >> fast NVMe drives).  Those are all very rough hand-wavy numbers and
> >> depend on a huge variety of factors so take them with a grain of salt.
> >> Doing things like disabling authentication, disabling logging, forcing
> >> high level P/C states, tweaking RocksDB WAL and compaction settings, the
> >> number of osd shards/threads, and the system numa configuration might
> >> get you higher performance/core, though it's all pretty hard to predict
> >> without outright testing 

Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Mark Nelson


On 3/5/19 10:20 AM, Darius Kasparavičius wrote:

Thank you for your response.

I was planning to use a 100GbE or 45GbE bond for this cluster. It was
acceptable for our use case to lose sequential/larger I/O speed for
it.  Dual socket would be and option, but I do not want to touch numa,
cgroups and the rest settings. Most of the time is just easier to add
a higher clock CPU or more cores. The plan is currently for 2xosd per
nvme device, but if testing shows that it’s better to use one. We will
stick with one. Which RocksDB settings would you recommend tweaking? I
haven’t had the chance to test them yet. Most of the clusters I have
access to are using leveldb and are still running filestore.



Yeah, numa makes everything more complicated.  I'd just consider jumping 
up to the 7601 then if IOPS is a concern and know that you might still 
be CPU bound (though it's also possible you could also hit some other 
bottleneck before it becomes an issue).  Given that the cores aren't 
clocked super high it's possible that you might see a benefit to 2x 
OSDs/device.



RocksDB is tough.  Right now we are heavily tuned to favor reducing 
write amplification but eat CPU to do it.  That can help performance 
when write throughput is a bottleneck and also reduces wear on the drive 
(which is always good, but especially with low write endurance drives).  
Reducing the size of the WAL buffers will (probably) reduce CPU usage 
and also reduce the amount of memory used by the OSD, but we've observed 
higher write-amplification on our test nodes.  I suspect that might be a 
worthwhile trade-off for nvdimms or optane, but I'm not sure it's a good 
idea for typical NVMe drives.



Mark




On Tue, Mar 5, 2019 at 5:35 PM Mark Nelson  wrote:

Hi,


I've got a ryzen7 1700 box that I regularly run tests on along with the
upstream community performance test nodes that have Intel Xeon E5-2650v3
processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons
are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the
tests I've done with Ceph. Typically I see a single OSD using fewer
cores on the Xeon processors vs Ryzen to hit similar performance numbers
despite being clocked lower (though I haven't verified the turbo
frequencies of both under load).  On the other hand, the Ryzen processor
is significantly cheaper per core.  If you only looked at cores you'd
think something like Ryzen would be the way to go, but there are other
things to consider.  The number of PCIE lanes, memory configuration,
cache configuration, and CPU interconnect (in multi-socket
configurations) all start becoming really important if you are targeting
multiple NVMe drives like what you are talking about below.  The EPYC
processors give you more of all of that, but also costs a lot more than
Ryzen.  Ultimately the CPU is only a small part of the price for nodes
like this so I wouldn't skimp if your goal is to maximize IOPS.


With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
will be network bound for large IO workloads unless you are sticking
2x100GbE in.  You might want to consider jumping up to the 7601.  That
would get you closer to where you want to be for 10 NVMe drives
(3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:

https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm


Figure that with sufficient client parallelism/load you'll get about
3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before
replication) with OSDs typically topping out at a max of about 6-8 cores
each.  Doubling up OSDs on each NVMe drive might improve or hurt
performance depending on what the limitations are (typically it seems to
help most when the kv sync thread is the primary bottleneck in
bluestore, which most likely happens with tons of slow cores and very
fast NVMe drives).  Those are all very rough hand-wavy numbers and
depend on a huge variety of factors so take them with a grain of salt.
Doing things like disabling authentication, disabling logging, forcing
high level P/C states, tweaking RocksDB WAL and compaction settings, the
number of osd shards/threads, and the system numa configuration might
get you higher performance/core, though it's all pretty hard to predict
without outright testing it.


Though you didn't ask about it, probably the most important thing you
can spend money on with NVMe drives is getting high write endurance
(DWPD) if you expect even a moderately high write workload.


Mark


On 3/5/19 3:49 AM, Darius Kasparavičius wrote:

Hello,


I was thinking of using AMD based system for my new nvme based
cluster. In particular I'm looking at
https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have
anyone tried running it on this particular hardware?

General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.

Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Darius Kasparavičius
Thank you for your response.

I was planning to use a 100GbE or 45GbE bond for this cluster. It was
acceptable for our use case to lose sequential/larger I/O speed for
it.  Dual socket would be an option, but I do not want to touch numa,
cgroups and the rest of those settings. Most of the time it is just easier to
add a higher-clocked CPU or more cores. The plan is currently for 2 OSDs per
nvme device, but if testing shows that it's better to use one, we will
stick with one. Which RocksDB settings would you recommend tweaking? I
haven't had the chance to test them yet. Most of the clusters I have
access to are using leveldb and are still running filestore.

On Tue, Mar 5, 2019 at 5:35 PM Mark Nelson  wrote:
>
> Hi,
>
>
> I've got a ryzen7 1700 box that I regularly run tests on along with the
> upstream community performance test nodes that have Intel Xeon E5-2650v3
> processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons
> are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the
> tests I've done with Ceph. Typically I see a single OSD using fewer
> cores on the Xeon processors vs Ryzen to hit similar performance numbers
> despite being clocked lower (though I haven't verified the turbo
> frequencies of both under load).  On the other hand, the Ryzen processor
> is significantly cheaper per core.  If you only looked at cores you'd
> think something like Ryzen would be the way to go, but there are other
> things to consider.  The number of PCIE lanes, memory configuration,
> cache configuration, and CPU interconnect (in multi-socket
> configurations) all start becoming really important if you are targeting
> multiple NVMe drives like what you are talking about below.  The EPYC
> processors give you more of all of that, but also costs a lot more than
> Ryzen.  Ultimately the CPU is only a small part of the price for nodes
> like this so I wouldn't skimp if your goal is to maximize IOPS.
>
>
> With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is
> going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but
> will be network bound for large IO workloads unless you are sticking
> 2x100GbE in.  You might want to consider jumping up to the 7601.  That
> would get you closer to where you want to be for 10 NVMe drives
> (3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:
>
> https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm
>
>
> Figure that with sufficient client parallelism/load you'll get about
> 3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before
> replication) with OSDs typically topping out at a max of about 6-8 cores
> each.  Doubling up OSDs on each NVMe drive might improve or hurt
> performance depending on what the limitations are (typically it seems to
> help most when the kv sync thread is the primary bottleneck in
> bluestore, which most likely happens with tons of slow cores and very
> fast NVMe drives).  Those are all very rough hand-wavy numbers and
> depend on a huge variety of factors so take them with a grain of salt.
> Doing things like disabling authentication, disabling logging, forcing
> high level P/C states, tweaking RocksDB WAL and compaction settings, the
> number of osd shards/threads, and the system numa configuration might
> get you higher performance/core, though it's all pretty hard to predict
> without outright testing it.
>
>
> Though you didn't ask about it, probably the most important thing you
> can spend money on with NVMe drives is getting high write endurance
> (DWPD) if you expect even a moderately high write workload.
>
>
> Mark
>
>
> On 3/5/19 3:49 AM, Darius Kasparavičius wrote:
> > Hello,
> >
> >
> > I was thinking of using AMD based system for my new nvme based
> > cluster. In particular I'm looking at
> > https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
> > and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have
> > anyone tried running it on this particular hardware?
> >
> > General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Mark Nelson

Hi,


I've got a Ryzen 7 1700 box that I regularly run tests on along with the 
upstream community performance test nodes that have Intel Xeon E5-2650v3 
processors in them.  The Ryzen is 3.0GHz/3.7GHz turbo while the Xeons 
are 2.3GHz/3.0GHz.  The Xeons are quite a bit faster clock/clock in the 
tests I've done with Ceph. Typically I see a single OSD using fewer 
cores on the Xeon processors vs Ryzen to hit similar performance numbers 
despite being clocked lower (though I haven't verified the turbo 
frequencies of both under load).  On the other hand, the Ryzen processor 
is significantly cheaper per core.  If you only looked at cores you'd 
think something like Ryzen would be the way to go, but there are other 
things to consider.  The number of PCIE lanes, memory configuration, 
cache configuration, and CPU interconnect (in multi-socket 
configurations) all start becoming really important if you are targeting 
multiple NVMe drives like what you are talking about below.  The EPYC 
processors give you more of all of that, but also cost a lot more than 
Ryzen.  Ultimately the CPU is only a small part of the price for nodes 
like this so I wouldn't skimp if your goal is to maximize IOPS.



With 10 NVMe drives per node, I'm guessing that a single EPYC 7451 is 
going to be CPU bound for small IO workloads (2.4c/4.8t per OSD), but 
will be network bound for large IO workloads unless you are sticking 
2x100GbE in.  You might want to consider jumping up to the 7601.  That 
would get you closer to where you want to be for 10 NVMe drives 
(3.2c/6.4t per OSD).  Another option might be dual 7351s in this chassis:


https://www.supermicro.com/Aplus/system/1U/1123/AS-1123US-TN10RT.cfm


Figure that with sufficient client parallelism/load you'll get about 
3000-6000 read IOPS/core and about 1500-3000 write IOPS/core (before 
replication) with OSDs typically topping out at a max of about 6-8 cores 
each.  Doubling up OSDs on each NVMe drive might improve or hurt 
performance depending on what the limitations are (typically it seems to 
help most when the kv sync thread is the primary bottleneck in 
bluestore, which most likely happens with tons of slow cores and very 
fast NVMe drives).  Those are all very rough hand-wavy numbers and 
depend on a huge variety of factors so take them with a grain of salt.  
Doing things like disabling authentication, disabling logging, forcing 
high level P/C states, tweaking RocksDB WAL and compaction settings, the 
number of osd shards/threads, and the system numa configuration might 
get you higher performance/core, though it's all pretty hard to predict 
without outright testing it.
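
Purely as an illustration of the kind of knobs I mean (these are not 
recommended defaults, only things to experiment with on a disposable 
benchmark cluster), in ceph.conf terms it looks roughly like:

    [global]
    # turning cephx off is only sane on throwaway test clusters
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required  = none
    debug_ms        = 0/0
    debug_osd       = 0/0
    debug_bluestore = 0/0
    debug_rocksdb   = 0/0

    [osd]
    osd_op_num_shards            = 8
    osd_op_num_threads_per_shard = 2
    # example of shrinking the RocksDB write buffers (may raise write amplification)
    bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,write_buffer_size=67108864

and on the host side something like 'cpupower frequency-set -g performance' 
to pin the P-states.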



Though you didn't ask about it, probably the most important thing you 
can spend money on with NVMe drives is getting high write endurance 
(DWPD) if you expect even a moderately high write workload.



Mark


On 3/5/19 3:49 AM, Darius Kasparavičius wrote:

Hello,


I was thinking of using AMD based system for my new nvme based
cluster. In particular I'm looking at
https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have
anyone tried running it on this particular hardware?

General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Ashley Merrick
If your crushmap is set to replicate by host, you would only ever have one
copy on a single host, no matter how many OSDs you placed on a single
NVMe/disk.

But yes, you would not want to mix OSD-based rules with multiple OSDs per
physical disk.
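
For example, with a rule created like the one below (the rule and pool names
are just placeholders), each of the size=3 replicas lands on a different
host, so two OSDs sharing one NVMe can never hold two copies of the same PG:

    ceph osd crush rule create-replicated replicated_host default host
    ceph osd pool set mypool crush_rule replicated_host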

On Tue, 5 Mar 2019 at 7:54 PM, Marc Roos  wrote:

>
> I see indeed lately people writing about putting 2 osd on a nvme, but
> does this not undermine the idea of having 3 copies on different
> osds/drives? In theory you could loose 2 copies when one disk fails???
>
>
>
>
> -Original Message-
> From: Darius Kasparaviius [mailto:daz...@gmail.com]
> Sent: 05 March 2019 10:50
> To: ceph-users
> Subject: [ceph-users] Ceph cluster on AMD based system.
>
> Hello,
>
>
> I was thinking of using AMD based system for my new nvme based cluster.
> In particular I'm looking at
> https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
> and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have anyone
> tried running it on this particular hardware?
>
> General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Marc Roos
 
I have indeed seen people lately writing about putting 2 OSDs on one NVMe, but 
doesn't this undermine the idea of having 3 copies on different 
OSDs/drives? In theory you could lose 2 copies when one disk fails.




-Original Message-
From: Darius Kasparaviius [mailto:daz...@gmail.com] 
Sent: 05 March 2019 10:50
To: ceph-users
Subject: [ceph-users] Ceph cluster on AMD based system.

Hello,


I was thinking of using AMD based system for my new nvme based cluster. 
In particular I'm looking at 
https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have anyone 
tried running it on this particular hardware?

General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Paul Emmerich
Not with this particular server, but we've played around with two
EPYC systems with 10 NVMe drives in each and 100 Gbit/s network between them.
Make sure to use a recent Linux kernel, but other than that it works fine.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, Mar 5, 2019 at 10:50 AM Darius Kasparavičius  wrote:
>
> Hello,
>
>
> I was thinking of using AMD based system for my new nvme based
> cluster. In particular I'm looking at
> https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
> and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPU's. Have
> anyone tried running it on this particular hardware?
>
> General idea is 6 nodes with 10 nvme drives and 2 osds per nvme drive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster on AMD based system.

2019-03-05 Thread Darius Kasparavičius
Hello,


I was thinking of using an AMD-based system for my new NVMe-based
cluster. In particular I'm looking at
https://www.supermicro.com/Aplus/system/1U/1113/AS-1113S-WN10RT.cfm
and https://www.amd.com/en/products/cpu/amd-epyc-7451 CPUs. Has
anyone tried running it on this particular hardware?

The general idea is 6 nodes with 10 NVMe drives and 2 OSDs per NVMe drive.
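
For reference, the rough plan for creating them would be something like this
(just a sketch, device names are examples):

    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1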
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-25 Thread Darius Kasparavičius
I think this should give you a bit of insight into running large scale clusters:
https://www.youtube.com/watch?v=NdGHE-yq1gU and
https://www.youtube.com/watch?v=WpMzAFH6Mc4 . Watch the second video; I
think it relates more closely to your problem.


On Mon, Feb 25, 2019, 11:33 M Ranga Swami Reddy 
wrote:

> We have taken care all HW recommendations, but missing that ceph mons
> are VMs with good configuration (4 core, 64G RAM + 500G disk)...
> Is this ceph-mon configuration might cause issues?
>
> On Sat, Feb 23, 2019 at 6:31 AM Anthony D'Atri  wrote:
> >
> >
> > ? Did we start recommending that production mons run on a VM?  I'd be
> very hesitant to do that, though probably some folks do.
> >
> > I can say for sure that in the past (Firefly) I experienced outages
> related to mons running on HDDs.  That was a cluster of 450 HDD OSDs with
> colo journals and hundreds of RBD clients.  Something obscure about running
> out of "global IDs" and not being able to create new ones fast enough.  We
> had to work around with a combo of lease settings on the mons and clients,
> though with Hammer and later I would not expect that exact situation to
> arise.  Still it left me paranoid about mon DBs and HDDs.
> >
> > -- aad
> >
> >
> > >
> > > But ceph recommendation is to use VM (not even the  HW node
> > > recommended). will try to change the mon disk as SSD and HW node.
> > >
> > > On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius 
> wrote:
> > >>
> > >> If your using hdd for monitor servers. Check their load. It might be
> > >> the issue there.
> > >>
> > >> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
> > >>  wrote:
> > >>>
> > >>> ceph-mon disk with 500G with HDD (not journals/SSDs).  Yes, mon use
> > >>> folder on FS on a disk
> > >>>
> > >>> On Fri, Feb 22, 2019 at 5:13 PM David Turner 
> wrote:
> > 
> >  Mon disks don't have journals, they're just a folder on a
> filesystem on a disk.
> > 
> >  On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> > >
> > > ceph mons looks fine during the recovery.  Using  HDD with SSD
> > > journals. with recommeded CPU and RAM numbers.
> > >
> > > On Fri, Feb 22, 2019 at 4:40 PM David Turner <
> drakonst...@gmail.com> wrote:
> > >>
> > >> What about the system stats on your mons during recovery? If they
> are having a hard time keeping up with requests during a recovery, I could
> see that impacting client io. What disks are they running on? CPU? Etc.
> > >>
> > >> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> > >>>
> > >>> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> > >>> Shall I try with 0 for all debug settings?
> > >>>
> > >>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius <
> daz...@gmail.com> wrote:
> > 
> >  Hello,
> > 
> > 
> >  Check your CPU usage when you are doing those kind of
> operations. We
> >  had a similar issue where our CPU monitoring was reporting fine
> < 40%
> >  usage, but our load on the nodes was high mid 60-80. If it's
> possible
> >  try disabling ht and see the actual cpu usage.
> >  If you are hitting CPU limits you can try disabling crc on
> messages.
> >  ms_nocrc
> >  ms_crc_data
> >  ms_crc_header
> > 
> >  And setting all your debug messages to 0.
> >  If you haven't done you can also lower your recovery settings a
> little.
> >  osd recovery max active
> >  osd max backfills
> > 
> >  You can also lower your file store threads.
> >  filestore op threads
> > 
> > 
> >  If you can also switch to bluestore from filestore. This will
> also
> >  lower your CPU usage. I'm not sure that this is bluestore that
> does
> >  it, but I'm seeing lower cpu usage when moving to bluestore +
> rocksdb
> >  compared to filestore + leveldb .
> > 
> > 
> >  On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >   wrote:
> > >
> > > Thats expected from Ceph by design. But in our case, we are
> using all
> > > recommendation like rack failure domain, replication n/w,etc,
> still
> > > face client IO performance issues during one OSD down..
> > >
> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner <
> drakonst...@gmail.com> wrote:
> > >>
> > >> With a RACK failure domain, you should be able to have an
> entire rack powered down without noticing any major impact on the clients.
> I regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> > >>
> > >> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> > >>>
> > >>> Hello - I have a couple of questions on ceph cluster
> stability, 

Re: [ceph-users] Ceph cluster stability

2019-02-25 Thread M Ranga Swami Reddy
We have taken care of all the HW recommendations, but missed that the ceph mons
are VMs with a good configuration (4 cores, 64G RAM + 500G disk)...
Could this ceph-mon configuration be causing issues?

On Sat, Feb 23, 2019 at 6:31 AM Anthony D'Atri  wrote:
>
>
> ? Did we start recommending that production mons run on a VM?  I'd be very 
> hesitant to do that, though probably some folks do.
>
> I can say for sure that in the past (Firefly) I experienced outages related 
> to mons running on HDDs.  That was a cluster of 450 HDD OSDs with colo 
> journals and hundreds of RBD clients.  Something obscure about running out of 
> "global IDs" and not being able to create new ones fast enough.  We had to 
> work around with a combo of lease settings on the mons and clients, though 
> with Hammer and later I would not expect that exact situation to arise.  
> Still it left me paranoid about mon DBs and HDDs.
>
> -- aad
>
>
> >
> > But ceph recommendation is to use VM (not even the  HW node
> > recommended). will try to change the mon disk as SSD and HW node.
> >
> > On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius  
> > wrote:
> >>
> >> If your using hdd for monitor servers. Check their load. It might be
> >> the issue there.
> >>
> >> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
> >>  wrote:
> >>>
> >>> ceph-mon disk with 500G with HDD (not journals/SSDs).  Yes, mon use
> >>> folder on FS on a disk
> >>>
> >>> On Fri, Feb 22, 2019 at 5:13 PM David Turner  
> >>> wrote:
> 
>  Mon disks don't have journals, they're just a folder on a filesystem on 
>  a disk.
> 
>  On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy  
>  wrote:
> >
> > ceph mons looks fine during the recovery.  Using  HDD with SSD
> > journals. with recommeded CPU and RAM numbers.
> >
> > On Fri, Feb 22, 2019 at 4:40 PM David Turner  
> > wrote:
> >>
> >> What about the system stats on your mons during recovery? If they are 
> >> having a hard time keeping up with requests during a recovery, I could 
> >> see that impacting client io. What disks are they running on? CPU? Etc.
> >>
> >> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> >>  wrote:
> >>>
> >>> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> >>> Shall I try with 0 for all debug settings?
> >>>
> >>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> >>>  wrote:
> 
>  Hello,
> 
> 
>  Check your CPU usage when you are doing those kind of operations. We
>  had a similar issue where our CPU monitoring was reporting fine < 40%
>  usage, but our load on the nodes was high mid 60-80. If it's possible
>  try disabling ht and see the actual cpu usage.
>  If you are hitting CPU limits you can try disabling crc on messages.
>  ms_nocrc
>  ms_crc_data
>  ms_crc_header
> 
>  And setting all your debug messages to 0.
>  If you haven't done you can also lower your recovery settings a 
>  little.
>  osd recovery max active
>  osd max backfills
> 
>  You can also lower your file store threads.
>  filestore op threads
> 
> 
>  If you can also switch to bluestore from filestore. This will also
>  lower your CPU usage. I'm not sure that this is bluestore that does
>  it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
>  compared to filestore + leveldb .
> 
> 
>  On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>   wrote:
> >
> > Thats expected from Ceph by design. But in our case, we are using 
> > all
> > recommendation like rack failure domain, replication n/w,etc, still
> > face client IO performance issues during one OSD down..
> >
> > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> >  wrote:
> >>
> >> With a RACK failure domain, you should be able to have an entire 
> >> rack powered down without noticing any major impact on the 
> >> clients.  I regularly take down OSDs and nodes for maintenance and 
> >> upgrades without seeing any problems with client IO.
> >>
> >> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> >>  wrote:
> >>>
> >>> Hello - I have a couple of questions on ceph cluster stability, 
> >>> even
> >>> we follow all recommendations as below:
> >>> - Having separate replication n/w and data n/w
> >>> - RACK is the failure domain
> >>> - Using SSDs for journals (1:4ratio)
> >>>
> >>> Q1 - If one OSD down, cluster IO down drastically and customer 
> >>> Apps impacted.
> >>> Q2 - what is stability ratio, like with above, is ceph cluster
> >>> workable condition, if one osd down 

Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread Anthony D'Atri

? Did we start recommending that production mons run on a VM?  I'd be very 
hesitant to do that, though probably some folks do.

I can say for sure that in the past (Firefly) I experienced outages related to 
mons running on HDDs.  That was a cluster of 450 HDD OSDs with colo journals 
and hundreds of RBD clients.  Something obscure about running out of "global 
IDs" and not being able to create new ones fast enough.  We had to work around 
with a combo of lease settings on the mons and clients, though with Hammer and 
later I would not expect that exact situation to arise.  Still it left me 
paranoid about mon DBs and HDDs. 

-- aad


> 
> But ceph recommendation is to use VM (not even the  HW node
> recommended). will try to change the mon disk as SSD and HW node.
> 
> On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius  wrote:
>> 
>> If your using hdd for monitor servers. Check their load. It might be
>> the issue there.
>> 
>> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
>>  wrote:
>>> 
>>> ceph-mon disk with 500G with HDD (not journals/SSDs).  Yes, mon use
>>> folder on FS on a disk
>>> 
>>> On Fri, Feb 22, 2019 at 5:13 PM David Turner  wrote:
 
 Mon disks don't have journals, they're just a folder on a filesystem on a 
 disk.
 
 On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy  
 wrote:
> 
> ceph mons looks fine during the recovery.  Using  HDD with SSD
> journals. with recommeded CPU and RAM numbers.
> 
> On Fri, Feb 22, 2019 at 4:40 PM David Turner  
> wrote:
>> 
>> What about the system stats on your mons during recovery? If they are 
>> having a hard time keeping up with requests during a recovery, I could 
>> see that impacting client io. What disks are they running on? CPU? Etc.
>> 
>> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy  
>> wrote:
>>> 
>>> Debug setting defaults are using..like 1/5 and 0/5 for almost..
>>> Shall I try with 0 for all debug settings?
>>> 
>>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  
>>> wrote:
 
 Hello,
 
 
 Check your CPU usage when you are doing those kind of operations. We
 had a similar issue where our CPU monitoring was reporting fine < 40%
 usage, but our load on the nodes was high mid 60-80. If it's possible
 try disabling ht and see the actual cpu usage.
 If you are hitting CPU limits you can try disabling crc on messages.
 ms_nocrc
 ms_crc_data
 ms_crc_header
 
 And setting all your debug messages to 0.
 If you haven't done you can also lower your recovery settings a little.
 osd recovery max active
 osd max backfills
 
 You can also lower your file store threads.
 filestore op threads
 
 
 If you can also switch to bluestore from filestore. This will also
 lower your CPU usage. I'm not sure that this is bluestore that does
 it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
 compared to filestore + leveldb .
 
 
 On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
  wrote:
> 
> Thats expected from Ceph by design. But in our case, we are using all
> recommendation like rack failure domain, replication n/w,etc, still
> face client IO performance issues during one OSD down..
> 
> On Tue, Feb 19, 2019 at 10:56 PM David Turner  
> wrote:
>> 
>> With a RACK failure domain, you should be able to have an entire 
>> rack powered down without noticing any major impact on the clients.  
>> I regularly take down OSDs and nodes for maintenance and upgrades 
>> without seeing any problems with client IO.
>> 
>> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
>>  wrote:
>>> 
>>> Hello - I have a couple of questions on ceph cluster stability, even
>>> we follow all recommendations as below:
>>> - Having separate replication n/w and data n/w
>>> - RACK is the failure domain
>>> - Using SSDs for journals (1:4ratio)
>>> 
>>> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
>>> impacted.
>>> Q2 - what is stability ratio, like with above, is ceph cluster
>>> workable condition, if one osd down or one node down,etc.
>>> 
>>> Thanks
>>> Swami
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
Oops... does this really have an impact? I will change this right away and test it.

On Fri, Feb 22, 2019 at 5:29 PM Janne Johansson  wrote:
>
> Den fre 22 feb. 2019 kl 12:35 skrev M Ranga Swami Reddy 
> :
>>
>> No seen the CPU limitation because we are using the 4 cores per osd daemon.
>> But still using "ms_crc_data = true and ms_crc_header = true". Will
>> disable these and try the performance.
>
>
> I am a bit sceptical to crc being so heavy that it would impact a CPU made 
> after 1990..
>
> --
> May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
But the Ceph recommendation is to use a VM (a dedicated HW node isn't even
recommended). I will try changing the mon disk to SSD and moving to a HW node.

On Fri, Feb 22, 2019 at 5:25 PM Darius Kasparavičius  wrote:
>
> If your using hdd for monitor servers. Check their load. It might be
> the issue there.
>
> On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
>  wrote:
> >
> > ceph-mon disk with 500G with HDD (not journals/SSDs).  Yes, mon use
> > folder on FS on a disk
> >
> > On Fri, Feb 22, 2019 at 5:13 PM David Turner  wrote:
> > >
> > > Mon disks don't have journals, they're just a folder on a filesystem on a 
> > > disk.
> > >
> > > On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy  
> > > wrote:
> > >>
> > >> ceph mons looks fine during the recovery.  Using  HDD with SSD
> > >> journals. with recommeded CPU and RAM numbers.
> > >>
> > >> On Fri, Feb 22, 2019 at 4:40 PM David Turner  
> > >> wrote:
> > >> >
> > >> > What about the system stats on your mons during recovery? If they are 
> > >> > having a hard time keeping up with requests during a recovery, I could 
> > >> > see that impacting client io. What disks are they running on? CPU? Etc.
> > >> >
> > >> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> > >> >  wrote:
> > >> >>
> > >> >> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> > >> >> Shall I try with 0 for all debug settings?
> > >> >>
> > >> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> > >> >>  wrote:
> > >> >> >
> > >> >> > Hello,
> > >> >> >
> > >> >> >
> > >> >> > Check your CPU usage when you are doing those kind of operations. We
> > >> >> > had a similar issue where our CPU monitoring was reporting fine < 
> > >> >> > 40%
> > >> >> > usage, but our load on the nodes was high mid 60-80. If it's 
> > >> >> > possible
> > >> >> > try disabling ht and see the actual cpu usage.
> > >> >> > If you are hitting CPU limits you can try disabling crc on messages.
> > >> >> > ms_nocrc
> > >> >> > ms_crc_data
> > >> >> > ms_crc_header
> > >> >> >
> > >> >> > And setting all your debug messages to 0.
> > >> >> > If you haven't done you can also lower your recovery settings a 
> > >> >> > little.
> > >> >> > osd recovery max active
> > >> >> > osd max backfills
> > >> >> >
> > >> >> > You can also lower your file store threads.
> > >> >> > filestore op threads
> > >> >> >
> > >> >> >
> > >> >> > If you can also switch to bluestore from filestore. This will also
> > >> >> > lower your CPU usage. I'm not sure that this is bluestore that does
> > >> >> > it, but I'm seeing lower cpu usage when moving to bluestore + 
> > >> >> > rocksdb
> > >> >> > compared to filestore + leveldb .
> > >> >> >
> > >> >> >
> > >> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> > >> >> >  wrote:
> > >> >> > >
> > >> >> > > Thats expected from Ceph by design. But in our case, we are using 
> > >> >> > > all
> > >> >> > > recommendation like rack failure domain, replication n/w,etc, 
> > >> >> > > still
> > >> >> > > face client IO performance issues during one OSD down..
> > >> >> > >
> > >> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> > >> >> > >  wrote:
> > >> >> > > >
> > >> >> > > > With a RACK failure domain, you should be able to have an 
> > >> >> > > > entire rack powered down without noticing any major impact on 
> > >> >> > > > the clients.  I regularly take down OSDs and nodes for 
> > >> >> > > > maintenance and upgrades without seeing any problems with 
> > >> >> > > > client IO.
> > >> >> > > >
> > >> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > >> >> > > >  wrote:
> > >> >> > > >>
> > >> >> > > >> Hello - I have a couple of questions on ceph cluster 
> > >> >> > > >> stability, even
> > >> >> > > >> we follow all recommendations as below:
> > >> >> > > >> - Having separate replication n/w and data n/w
> > >> >> > > >> - RACK is the failure domain
> > >> >> > > >> - Using SSDs for journals (1:4ratio)
> > >> >> > > >>
> > >> >> > > >> Q1 - If one OSD down, cluster IO down drastically and customer 
> > >> >> > > >> Apps impacted.
> > >> >> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > >> >> > > >> workable condition, if one osd down or one node down,etc.
> > >> >> > > >>
> > >> >> > > >> Thanks
> > >> >> > > >> Swami
> > >> >> > > >> ___
> > >> >> > > >> ceph-users mailing list
> > >> >> > > >> ceph-users@lists.ceph.com
> > >> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> >> > > ___
> > >> >> > > ceph-users mailing list
> > >> >> > > ceph-users@lists.ceph.com
> > >> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread Janne Johansson
Den fre 22 feb. 2019 kl 12:35 skrev M Ranga Swami Reddy <
swamire...@gmail.com>:

> No seen the CPU limitation because we are using the 4 cores per osd daemon.
> But still using "ms_crc_data = true and ms_crc_header = true". Will
> disable these and try the performance.
>

I am a bit sceptical that CRC is so heavy that it would impact a CPU made
after 1990.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread Darius Kasparavičius
If you're using HDDs for the monitor servers, check their load. It might be
the issue there.
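
For example, something as simple as this on each mon host during recovery
will show whether the mon's disk is saturated (assumes the sysstat package
is installed):

    iostat -x 1     # watch %util and await on the mon's data disk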

On Fri, Feb 22, 2019 at 1:50 PM M Ranga Swami Reddy
 wrote:
>
> ceph-mon disk with 500G with HDD (not journals/SSDs).  Yes, mon use
> folder on FS on a disk
>
> On Fri, Feb 22, 2019 at 5:13 PM David Turner  wrote:
> >
> > Mon disks don't have journals, they're just a folder on a filesystem on a 
> > disk.
> >
> > On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy  
> > wrote:
> >>
> >> ceph mons looks fine during the recovery.  Using  HDD with SSD
> >> journals. with recommeded CPU and RAM numbers.
> >>
> >> On Fri, Feb 22, 2019 at 4:40 PM David Turner  wrote:
> >> >
> >> > What about the system stats on your mons during recovery? If they are 
> >> > having a hard time keeping up with requests during a recovery, I could 
> >> > see that impacting client io. What disks are they running on? CPU? Etc.
> >> >
> >> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy  
> >> > wrote:
> >> >>
> >> >> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> >> >> Shall I try with 0 for all debug settings?
> >> >>
> >> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  
> >> >> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> >
> >> >> > Check your CPU usage when you are doing those kind of operations. We
> >> >> > had a similar issue where our CPU monitoring was reporting fine < 40%
> >> >> > usage, but our load on the nodes was high mid 60-80. If it's possible
> >> >> > try disabling ht and see the actual cpu usage.
> >> >> > If you are hitting CPU limits you can try disabling crc on messages.
> >> >> > ms_nocrc
> >> >> > ms_crc_data
> >> >> > ms_crc_header
> >> >> >
> >> >> > And setting all your debug messages to 0.
> >> >> > If you haven't done you can also lower your recovery settings a 
> >> >> > little.
> >> >> > osd recovery max active
> >> >> > osd max backfills
> >> >> >
> >> >> > You can also lower your file store threads.
> >> >> > filestore op threads
> >> >> >
> >> >> >
> >> >> > If you can also switch to bluestore from filestore. This will also
> >> >> > lower your CPU usage. I'm not sure that this is bluestore that does
> >> >> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> >> >> > compared to filestore + leveldb .
> >> >> >
> >> >> >
> >> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >> >> >  wrote:
> >> >> > >
> >> >> > > Thats expected from Ceph by design. But in our case, we are using 
> >> >> > > all
> >> >> > > recommendation like rack failure domain, replication n/w,etc, still
> >> >> > > face client IO performance issues during one OSD down..
> >> >> > >
> >> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> >> >> > >  wrote:
> >> >> > > >
> >> >> > > > With a RACK failure domain, you should be able to have an entire 
> >> >> > > > rack powered down without noticing any major impact on the 
> >> >> > > > clients.  I regularly take down OSDs and nodes for maintenance 
> >> >> > > > and upgrades without seeing any problems with client IO.
> >> >> > > >
> >> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> >> >> > > >  wrote:
> >> >> > > >>
> >> >> > > >> Hello - I have a couple of questions on ceph cluster stability, 
> >> >> > > >> even
> >> >> > > >> we follow all recommendations as below:
> >> >> > > >> - Having separate replication n/w and data n/w
> >> >> > > >> - RACK is the failure domain
> >> >> > > >> - Using SSDs for journals (1:4ratio)
> >> >> > > >>
> >> >> > > >> Q1 - If one OSD down, cluster IO down drastically and customer 
> >> >> > > >> Apps impacted.
> >> >> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> >> >> > > >> workable condition, if one osd down or one node down,etc.
> >> >> > > >>
> >> >> > > >> Thanks
> >> >> > > >> Swami
> >> >> > > >> ___
> >> >> > > >> ceph-users mailing list
> >> >> > > >> ceph-users@lists.ceph.com
> >> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> > > ___
> >> >> > > ceph-users mailing list
> >> >> > > ceph-users@lists.ceph.com
> >> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
The ceph-mon disk is 500G on HDD (no journals/SSDs).  Yes, the mon uses a
folder on a filesystem on a disk.

On Fri, Feb 22, 2019 at 5:13 PM David Turner  wrote:
>
> Mon disks don't have journals, they're just a folder on a filesystem on a 
> disk.
>
> On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy  
> wrote:
>>
>> ceph mons looks fine during the recovery.  Using  HDD with SSD
>> journals. with recommeded CPU and RAM numbers.
>>
>> On Fri, Feb 22, 2019 at 4:40 PM David Turner  wrote:
>> >
>> > What about the system stats on your mons during recovery? If they are 
>> > having a hard time keeping up with requests during a recovery, I could see 
>> > that impacting client io. What disks are they running on? CPU? Etc.
>> >
>> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy  
>> > wrote:
>> >>
>> >> Debug setting defaults are using..like 1/5 and 0/5 for almost..
>> >> Shall I try with 0 for all debug settings?
>> >>
>> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  
>> >> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> >
>> >> > Check your CPU usage when you are doing those kind of operations. We
>> >> > had a similar issue where our CPU monitoring was reporting fine < 40%
>> >> > usage, but our load on the nodes was high mid 60-80. If it's possible
>> >> > try disabling ht and see the actual cpu usage.
>> >> > If you are hitting CPU limits you can try disabling crc on messages.
>> >> > ms_nocrc
>> >> > ms_crc_data
>> >> > ms_crc_header
>> >> >
>> >> > And setting all your debug messages to 0.
>> >> > If you haven't done you can also lower your recovery settings a little.
>> >> > osd recovery max active
>> >> > osd max backfills
>> >> >
>> >> > You can also lower your file store threads.
>> >> > filestore op threads
>> >> >
>> >> >
>> >> > If you can also switch to bluestore from filestore. This will also
>> >> > lower your CPU usage. I'm not sure that this is bluestore that does
>> >> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
>> >> > compared to filestore + leveldb .
>> >> >
>> >> >
>> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>> >> >  wrote:
>> >> > >
>> >> > > Thats expected from Ceph by design. But in our case, we are using all
>> >> > > recommendation like rack failure domain, replication n/w,etc, still
>> >> > > face client IO performance issues during one OSD down..
>> >> > >
>> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner  
>> >> > > wrote:
>> >> > > >
>> >> > > > With a RACK failure domain, you should be able to have an entire 
>> >> > > > rack powered down without noticing any major impact on the clients. 
>> >> > > >  I regularly take down OSDs and nodes for maintenance and upgrades 
>> >> > > > without seeing any problems with client IO.
>> >> > > >
>> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
>> >> > > >  wrote:
>> >> > > >>
>> >> > > >> Hello - I have a couple of questions on ceph cluster stability, 
>> >> > > >> even
>> >> > > >> we follow all recommendations as below:
>> >> > > >> - Having separate replication n/w and data n/w
>> >> > > >> - RACK is the failure domain
>> >> > > >> - Using SSDs for journals (1:4ratio)
>> >> > > >>
>> >> > > >> Q1 - If one OSD down, cluster IO down drastically and customer 
>> >> > > >> Apps impacted.
>> >> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
>> >> > > >> workable condition, if one osd down or one node down,etc.
>> >> > > >>
>> >> > > >> Thanks
>> >> > > >> Swami
>> >> > > >> ___
>> >> > > >> ceph-users mailing list
>> >> > > >> ceph-users@lists.ceph.com
>> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> > > ___
>> >> > > ceph-users mailing list
>> >> > > ceph-users@lists.ceph.com
>> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread David Turner
Mon disks don't have journals, they're just a folder on a filesystem on a
disk.

On Fri, Feb 22, 2019, 6:40 AM M Ranga Swami Reddy 
wrote:

> ceph mons looks fine during the recovery.  Using  HDD with SSD
> journals. with recommeded CPU and RAM numbers.
>
> On Fri, Feb 22, 2019 at 4:40 PM David Turner 
> wrote:
> >
> > What about the system stats on your mons during recovery? If they are
> having a hard time keeping up with requests during a recovery, I could see
> that impacting client io. What disks are they running on? CPU? Etc.
> >
> > On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
> wrote:
> >>
> >> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> >> Shall I try with 0 for all debug settings?
> >>
> >> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> wrote:
> >> >
> >> > Hello,
> >> >
> >> >
> >> > Check your CPU usage when you are doing those kind of operations. We
> >> > had a similar issue where our CPU monitoring was reporting fine < 40%
> >> > usage, but our load on the nodes was high mid 60-80. If it's possible
> >> > try disabling ht and see the actual cpu usage.
> >> > If you are hitting CPU limits you can try disabling crc on messages.
> >> > ms_nocrc
> >> > ms_crc_data
> >> > ms_crc_header
> >> >
> >> > And setting all your debug messages to 0.
> >> > If you haven't done you can also lower your recovery settings a
> little.
> >> > osd recovery max active
> >> > osd max backfills
> >> >
> >> > You can also lower your file store threads.
> >> > filestore op threads
> >> >
> >> >
> >> > If you can also switch to bluestore from filestore. This will also
> >> > lower your CPU usage. I'm not sure that this is bluestore that does
> >> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> >> > compared to filestore + leveldb .
> >> >
> >> >
> >> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >> >  wrote:
> >> > >
> >> > > Thats expected from Ceph by design. But in our case, we are using
> all
> >> > > recommendation like rack failure domain, replication n/w,etc, still
> >> > > face client IO performance issues during one OSD down..
> >> > >
> >> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner <
> drakonst...@gmail.com> wrote:
> >> > > >
> >> > > > With a RACK failure domain, you should be able to have an entire
> rack powered down without noticing any major impact on the clients.  I
> regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> >> > > >
> >> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> >> > > >>
> >> > > >> Hello - I have a couple of questions on ceph cluster stability,
> even
> >> > > >> we follow all recommendations as below:
> >> > > >> - Having separate replication n/w and data n/w
> >> > > >> - RACK is the failure domain
> >> > > >> - Using SSDs for journals (1:4ratio)
> >> > > >>
> >> > > >> Q1 - If one OSD down, cluster IO down drastically and customer
> Apps impacted.
> >> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> >> > > >> workable condition, if one osd down or one node down,etc.
> >> > > >>
> >> > > >> Thanks
> >> > > >> Swami
> >> > > >> ___
> >> > > >> ceph-users mailing list
> >> > > >> ceph-users@lists.ceph.com
> >> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> > > ___
> >> > > ceph-users mailing list
> >> > > ceph-users@lists.ceph.com
> >> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
The ceph mons look fine during the recovery.  We use HDDs with SSD
journals, with the recommended CPU and RAM numbers.

On Fri, Feb 22, 2019 at 4:40 PM David Turner  wrote:
>
> What about the system stats on your mons during recovery? If they are having 
> a hard time keeping up with requests during a recovery, I could see that 
> impacting client io. What disks are they running on? CPU? Etc.
>
> On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy  
> wrote:
>>
>> Debug setting defaults are using..like 1/5 and 0/5 for almost..
>> Shall I try with 0 for all debug settings?
>>
>> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  
>> wrote:
>> >
>> > Hello,
>> >
>> >
>> > Check your CPU usage when you are doing those kind of operations. We
>> > had a similar issue where our CPU monitoring was reporting fine < 40%
>> > usage, but our load on the nodes was high mid 60-80. If it's possible
>> > try disabling ht and see the actual cpu usage.
>> > If you are hitting CPU limits you can try disabling crc on messages.
>> > ms_nocrc
>> > ms_crc_data
>> > ms_crc_header
>> >
>> > And setting all your debug messages to 0.
>> > If you haven't done you can also lower your recovery settings a little.
>> > osd recovery max active
>> > osd max backfills
>> >
>> > You can also lower your file store threads.
>> > filestore op threads
>> >
>> >
>> > If you can also switch to bluestore from filestore. This will also
>> > lower your CPU usage. I'm not sure that this is bluestore that does
>> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
>> > compared to filestore + leveldb .
>> >
>> >
>> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>> >  wrote:
>> > >
>> > > Thats expected from Ceph by design. But in our case, we are using all
>> > > recommendation like rack failure domain, replication n/w,etc, still
>> > > face client IO performance issues during one OSD down..
>> > >
>> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner  
>> > > wrote:
>> > > >
>> > > > With a RACK failure domain, you should be able to have an entire rack 
>> > > > powered down without noticing any major impact on the clients.  I 
>> > > > regularly take down OSDs and nodes for maintenance and upgrades 
>> > > > without seeing any problems with client IO.
>> > > >
>> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
>> > > >  wrote:
>> > > >>
>> > > >> Hello - I have a couple of questions on ceph cluster stability, even
>> > > >> we follow all recommendations as below:
>> > > >> - Having separate replication n/w and data n/w
>> > > >> - RACK is the failure domain
>> > > >> - Using SSDs for journals (1:4ratio)
>> > > >>
>> > > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
>> > > >> impacted.
>> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
>> > > >> workable condition, if one osd down or one node down,etc.
>> > > >>
>> > > >> Thanks
>> > > >> Swami
>> > > >> ___
>> > > >> ceph-users mailing list
>> > > >> ceph-users@lists.ceph.com
>> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > ___
>> > > ceph-users mailing list
>> > > ceph-users@lists.ceph.com
>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread David Turner
What about the system stats on your mons during recovery? If they are
having a hard time keeping up with requests during a recovery, I could see
that impacting client io. What disks are they running on? CPU? Etc.

On Fri, Feb 22, 2019, 6:01 AM M Ranga Swami Reddy 
wrote:

> Debug setting defaults are using..like 1/5 and 0/5 for almost..
> Shall I try with 0 for all debug settings?
>
> On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius 
> wrote:
> >
> > Hello,
> >
> >
> > Check your CPU usage when you are doing those kind of operations. We
> > had a similar issue where our CPU monitoring was reporting fine < 40%
> > usage, but our load on the nodes was high mid 60-80. If it's possible
> > try disabling ht and see the actual cpu usage.
> > If you are hitting CPU limits you can try disabling crc on messages.
> > ms_nocrc
> > ms_crc_data
> > ms_crc_header
> >
> > And setting all your debug messages to 0.
> > If you haven't done you can also lower your recovery settings a little.
> > osd recovery max active
> > osd max backfills
> >
> > You can also lower your file store threads.
> > filestore op threads
> >
> >
> > If you can also switch to bluestore from filestore. This will also
> > lower your CPU usage. I'm not sure that this is bluestore that does
> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> > compared to filestore + leveldb .
> >
> >
> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Thats expected from Ceph by design. But in our case, we are using all
> > > recommendation like rack failure domain, replication n/w,etc, still
> > > face client IO performance issues during one OSD down..
> > >
> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner 
> wrote:
> > > >
> > > > With a RACK failure domain, you should be able to have an entire
> rack powered down without noticing any major impact on the clients.  I
> regularly take down OSDs and nodes for maintenance and upgrades without
> seeing any problems with client IO.
> > > >
> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy <
> swamire...@gmail.com> wrote:
> > > >>
> > > >> Hello - I have a couple of questions on ceph cluster stability, even
> > > >> we follow all recommendations as below:
> > > >> - Having separate replication n/w and data n/w
> > > >> - RACK is the failure domain
> > > >> - Using SSDs for journals (1:4ratio)
> > > >>
> > > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps
> impacted.
> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > > >> workable condition, if one osd down or one node down,etc.
> > > >>
> > > >> Thanks
> > > >> Swami
> > > >> ___
> > > >> ceph-users mailing list
> > > >> ceph-users@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
We are using the default debug settings, like 1/5 and 0/5 for almost everything.
Shall I try 0 for all debug settings?

On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  wrote:
>
> Hello,
>
>
> Check your CPU usage when you are doing those kind of operations. We
> had a similar issue where our CPU monitoring was reporting fine < 40%
> usage, but our load on the nodes was high mid 60-80. If it's possible
> try disabling ht and see the actual cpu usage.
> If you are hitting CPU limits you can try disabling crc on messages.
> ms_nocrc
> ms_crc_data
> ms_crc_header
>
> And setting all your debug messages to 0.
> If you haven't done you can also lower your recovery settings a little.
> osd recovery max active
> osd max backfills
>
> You can also lower your file store threads.
> filestore op threads
>
>
> If you can also switch to bluestore from filestore. This will also
> lower your CPU usage. I'm not sure that this is bluestore that does
> it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> compared to filestore + leveldb .
>
>
> On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>  wrote:
> >
> > Thats expected from Ceph by design. But in our case, we are using all
> > recommendation like rack failure domain, replication n/w,etc, still
> > face client IO performance issues during one OSD down..
> >
> > On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
> > >
> > > With a RACK failure domain, you should be able to have an entire rack 
> > > powered down without noticing any major impact on the clients.  I 
> > > regularly take down OSDs and nodes for maintenance and upgrades without 
> > > seeing any problems with client IO.
> > >
> > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > >  wrote:
> > >>
> > >> Hello - I have a couple of questions on ceph cluster stability, even
> > >> we follow all recommendations as below:
> > >> - Having separate replication n/w and data n/w
> > >> - RACK is the failure domain
> > >> - Using SSDs for journals (1:4ratio)
> > >>
> > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> > >> impacted.
> > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > >> workable condition, if one osd down or one node down,etc.
> > >>
> > >> Thanks
> > >> Swami
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
We are not seeing a CPU limitation, because we are using 4 cores per OSD daemon.
But we are still running with "ms_crc_data = true" and "ms_crc_header = true". We will
disable these and test the performance.

And we are using filestore + leveldb only, with filestore_op_threads = 2.

The rest of the recovery and backfill settings are already at their minimums.
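
For reference, a minimal ceph.conf fragment for switching the messenger checksums
off might look like this (illustrative only; the daemons need a restart for it to
take effect everywhere):

[global]
ms_crc_data = false
ms_crc_header = false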

Thanks
Swami

On Wed, Feb 20, 2019 at 9:17 PM Darius Kasparavičius  wrote:
>
> Hello,
>
>
> Check your CPU usage when you are doing those kind of operations. We
> had a similar issue where our CPU monitoring was reporting fine < 40%
> usage, but our load on the nodes was high mid 60-80. If it's possible
> try disabling ht and see the actual cpu usage.
> If you are hitting CPU limits you can try disabling crc on messages.
> ms_nocrc
> ms_crc_data
> ms_crc_header
>
> And setting all your debug messages to 0.
> If you haven't done you can also lower your recovery settings a little.
> osd recovery max active
> osd max backfills
>
> You can also lower your file store threads.
> filestore op threads
>
>
> If you can also switch to bluestore from filestore. This will also
> lower your CPU usage. I'm not sure that this is bluestore that does
> it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> compared to filestore + leveldb .
>
>
> On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>  wrote:
> >
> > Thats expected from Ceph by design. But in our case, we are using all
> > recommendation like rack failure domain, replication n/w,etc, still
> > face client IO performance issues during one OSD down..
> >
> > On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
> > >
> > > With a RACK failure domain, you should be able to have an entire rack 
> > > powered down without noticing any major impact on the clients.  I 
> > > regularly take down OSDs and nodes for maintenance and upgrades without 
> > > seeing any problems with client IO.
> > >
> > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > >  wrote:
> > >>
> > >> Hello - I have a couple of questions on ceph cluster stability, even
> > >> we follow all recommendations as below:
> > >> - Having separate replication n/w and data n/w
> > >> - RACK is the failure domain
> > >> - Using SSDs for journals (1:4ratio)
> > >>
> > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> > >> impacted.
> > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > >> workable condition, if one osd down or one node down,etc.
> > >>
> > >> Thanks
> > >> Swami
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-22 Thread M Ranga Swami Reddy
Yep... these settings are already in place. We have also followed all
recommendations to get good performance, but an OSD going down still has an
impact, even though we have 2000+ OSDs.
We are also using 3 pools with different HW nodes for each pool. When one
pool's OSD goes down, it also impacts the other pools' performance,
which is not expected with Ceph (we are using separate NICs for
data and replication).

On Wed, Feb 20, 2019 at 9:25 PM Alexandru Cucu  wrote:
>
> Hi,
>
> I would decrese max active recovery processes per osd and increase
> recovery sleep.
> osd recovery max active = 1 (default is 3)
> osd recovery sleep = 1 (default is 0 or 0.1)
>
> osd max backfills defaults to 1 so that should be OK if he's using the
> default :D
>
> Disabling scrubbing during recovery should also help:
> osd scrub during recovery = false
>
> On Wed, Feb 20, 2019 at 5:47 PM Darius Kasparavičius  wrote:
> >
> > Hello,
> >
> >
> > Check your CPU usage when you are doing those kind of operations. We
> > had a similar issue where our CPU monitoring was reporting fine < 40%
> > usage, but our load on the nodes was high mid 60-80. If it's possible
> > try disabling ht and see the actual cpu usage.
> > If you are hitting CPU limits you can try disabling crc on messages.
> > ms_nocrc
> > ms_crc_data
> > ms_crc_header
> >
> > And setting all your debug messages to 0.
> > If you haven't done you can also lower your recovery settings a little.
> > osd recovery max active
> > osd max backfills
> >
> > You can also lower your file store threads.
> > filestore op threads
> >
> >
> > If you can also switch to bluestore from filestore. This will also
> > lower your CPU usage. I'm not sure that this is bluestore that does
> > it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> > compared to filestore + leveldb .
> >
> >
> > On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
> >  wrote:
> > >
> > > Thats expected from Ceph by design. But in our case, we are using all
> > > recommendation like rack failure domain, replication n/w,etc, still
> > > face client IO performance issues during one OSD down..
> > >
> > > On Tue, Feb 19, 2019 at 10:56 PM David Turner  
> > > wrote:
> > > >
> > > > With a RACK failure domain, you should be able to have an entire rack 
> > > > powered down without noticing any major impact on the clients.  I 
> > > > regularly take down OSDs and nodes for maintenance and upgrades without 
> > > > seeing any problems with client IO.
> > > >
> > > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > > >  wrote:
> > > >>
> > > >> Hello - I have a couple of questions on ceph cluster stability, even
> > > >> we follow all recommendations as below:
> > > >> - Having separate replication n/w and data n/w
> > > >> - RACK is the failure domain
> > > >> - Using SSDs for journals (1:4ratio)
> > > >>
> > > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> > > >> impacted.
> > > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > > >> workable condition, if one osd down or one node down,etc.
> > > >>
> > > >> Thanks
> > > >> Swami
> > > >> ___
> > > >> ceph-users mailing list
> > > >> ceph-users@lists.ceph.com
> > > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-20 Thread Alexandru Cucu
Hi,

I would decrease max active recovery processes per OSD and increase
recovery sleep.
osd recovery max active = 1 (default is 3)
osd recovery sleep = 1 (default is 0 or 0.1)

osd max backfills defaults to 1 so that should be OK if he's using the
default :D

Disabling scrubbing during recovery should also help:
osd scrub during recovery = false
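
These can also be injected at runtime without restarting the OSDs, for example
(a sketch using the option names above; some releases will warn that a change
may not be observed until restart):

ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-recovery-sleep 1'
ceph tell osd.* injectargs '--osd-scrub-during-recovery false'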

On Wed, Feb 20, 2019 at 5:47 PM Darius Kasparavičius  wrote:
>
> Hello,
>
>
> Check your CPU usage when you are doing those kind of operations. We
> had a similar issue where our CPU monitoring was reporting fine < 40%
> usage, but our load on the nodes was high mid 60-80. If it's possible
> try disabling ht and see the actual cpu usage.
> If you are hitting CPU limits you can try disabling crc on messages.
> ms_nocrc
> ms_crc_data
> ms_crc_header
>
> And setting all your debug messages to 0.
> If you haven't done you can also lower your recovery settings a little.
> osd recovery max active
> osd max backfills
>
> You can also lower your file store threads.
> filestore op threads
>
>
> If you can also switch to bluestore from filestore. This will also
> lower your CPU usage. I'm not sure that this is bluestore that does
> it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> compared to filestore + leveldb .
>
>
> On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>  wrote:
> >
> > Thats expected from Ceph by design. But in our case, we are using all
> > recommendation like rack failure domain, replication n/w,etc, still
> > face client IO performance issues during one OSD down..
> >
> > On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
> > >
> > > With a RACK failure domain, you should be able to have an entire rack 
> > > powered down without noticing any major impact on the clients.  I 
> > > regularly take down OSDs and nodes for maintenance and upgrades without 
> > > seeing any problems with client IO.
> > >
> > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > >  wrote:
> > >>
> > >> Hello - I have a couple of questions on ceph cluster stability, even
> > >> we follow all recommendations as below:
> > >> - Having separate replication n/w and data n/w
> > >> - RACK is the failure domain
> > >> - Using SSDs for journals (1:4ratio)
> > >>
> > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> > >> impacted.
> > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > >> workable condition, if one osd down or one node down,etc.
> > >>
> > >> Thanks
> > >> Swami
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-20 Thread Darius Kasparavičius
Hello,


Check your CPU usage when you are doing those kinds of operations. We
had a similar issue where our CPU monitoring was reporting a fine < 40%
usage, but the load on the nodes was high, mid 60-80. If it's possible,
try disabling HT and check the actual CPU usage.
If you are hitting CPU limits you can try disabling crc on messages.
ms_nocrc
ms_crc_data
ms_crc_header

And setting all your debug messages to 0.
If you haven't done so already, you can also lower your recovery settings a little.
osd recovery max active
osd max backfills

You can also lower your file store threads.
filestore op threads


If you can, also switch from filestore to bluestore. This will also
lower your CPU usage. I'm not sure it's bluestore itself that does
it, but I'm seeing lower CPU usage after moving to bluestore + rocksdb
compared to filestore + leveldb.
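
Put together, a conservative [osd] section along the lines described above might
look roughly like this (values are illustrative starting points, not universal
recommendations):

[osd]
debug_osd = 0
debug_filestore = 0
debug_ms = 0
osd_recovery_max_active = 1
osd_max_backfills = 1
filestore_op_threads = 2
ms_crc_data = false
ms_crc_header = false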


On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
 wrote:
>
> Thats expected from Ceph by design. But in our case, we are using all
> recommendation like rack failure domain, replication n/w,etc, still
> face client IO performance issues during one OSD down..
>
> On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
> >
> > With a RACK failure domain, you should be able to have an entire rack 
> > powered down without noticing any major impact on the clients.  I regularly 
> > take down OSDs and nodes for maintenance and upgrades without seeing any 
> > problems with client IO.
> >
> > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy  
> > wrote:
> >>
> >> Hello - I have a couple of questions on ceph cluster stability, even
> >> we follow all recommendations as below:
> >> - Having separate replication n/w and data n/w
> >> - RACK is the failure domain
> >> - Using SSDs for journals (1:4ratio)
> >>
> >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> >> impacted.
> >> Q2 - what is stability ratio, like with above, is ceph cluster
> >> workable condition, if one osd down or one node down,etc.
> >>
> >> Thanks
> >> Swami
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-20 Thread M Ranga Swami Reddy
That's expected from Ceph by design. But in our case, even though we follow all
recommendations like a rack failure domain, a separate replication n/w, etc., we
still see client IO performance issues when one OSD goes down.

On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
>
> With a RACK failure domain, you should be able to have an entire rack powered 
> down without noticing any major impact on the clients.  I regularly take down 
> OSDs and nodes for maintenance and upgrades without seeing any problems with 
> client IO.
>
> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy  
> wrote:
>>
>> Hello - I have a couple of questions on ceph cluster stability, even
>> we follow all recommendations as below:
>> - Having separate replication n/w and data n/w
>> - RACK is the failure domain
>> - Using SSDs for journals (1:4ratio)
>>
>> Q1 - If one OSD down, cluster IO down drastically and customer Apps impacted.
>> Q2 - what is stability ratio, like with above, is ceph cluster
>> workable condition, if one osd down or one node down,etc.
>>
>> Thanks
>> Swami
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-19 Thread David Turner
With a RACK failure domain, you should be able to have an entire rack
powered down without noticing any major impact on the clients.  I regularly
take down OSDs and nodes for maintenance and upgrades without seeing any
problems with client IO.

On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
wrote:

> Hello - I have a couple of questions on ceph cluster stability, even
> we follow all recommendations as below:
> - Having separate replication n/w and data n/w
> - RACK is the failure domain
> - Using SSDs for journals (1:4ratio)
>
> Q1 - If one OSD down, cluster IO down drastically and customer Apps
> impacted.
> Q2 - what is stability ratio, like with above, is ceph cluster
> workable condition, if one osd down or one node down,etc.
>
> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster stability

2019-02-12 Thread M Ranga Swami Reddy
Hello - I have a couple of questions on ceph cluster stability, even though
we follow all recommendations as below:
- Having a separate replication n/w and data n/w
- RACK is the failure domain
- Using SSDs for journals (1:4 ratio)

Q1 - If one OSD goes down, cluster IO drops drastically and customer apps are impacted.
Q2 - What is the stability ratio: with the above, is the ceph cluster still in a
workable condition if one OSD or one node goes down, etc.?

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster to OSD Utilization not in Sync

2018-12-21 Thread Pardhiv Karri
Thank you Dyweni for the quick response. We have 2 Hammer clusters which are due
for upgrade to Luminous next month and 1 Luminous 12.2.8 cluster. We will try this
on Luminous, and if it works we will apply the same once the Hammer clusters
are upgraded, rather than adjusting the weights.

Thanks,
Pardhiv Karri

On Fri, Dec 21, 2018 at 1:05 PM Dyweni - Ceph-Users <6exbab4fy...@dyweni.com>
wrote:

> Hi,
>
>
> If you are running Ceph Luminous or later, use the Ceph Manager Daemon's
> Balancer module.  (http://docs.ceph.com/docs/luminous/mgr/balancer/).
>
>
> Otherwise, tweak the OSD weights (not the OSD CRUSH weights) until you
> achieve uniformity.  (You should be able to get under 1 STDDEV).  I would
> adjust in small amounts to not overload your cluster.
>
>
> Example:
>
> ceph osd reweight osd.X  y.yyy
>
>
>
>
> On 2018-12-21 14:56, Pardhiv Karri wrote:
>
> Hi,
>
> We have Ceph clusters which are greater than 1PB. We are using tree
> algorithm. The issue is with the data placement. If the cluster utilization
> percentage is at 65% then some of the OSDs are already above 87%. We had to
> change the near_full ratio to 0.90 to circumvent warnings and to get back
> the Health to OK state.
>
> How can we keep the OSDs utilization to be in sync with cluster
> utilization (both percentages to be close enough) as we want to utilize the
> cluster to the max (above 80%) without unnecessarily adding too many
> nodes/osd's. Right now we are losing close to 400TB of the disk space
> unused as some OSDs are above 87% and some are below 50%. If the above 87%
> OSDs reach 95% then the cluster will have issues. What is the best way to
> mitigate this issue?
>
> Thanks,
>
> *Pardhiv Karri*
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>

-- 
*Pardhiv Karri*
"Rise and Rise again until LAMBS become LIONS"
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster to OSD Utilization not in Sync

2018-12-21 Thread Dyweni - Ceph-Users
Hi, 

If you are running Ceph Luminous or later, use the Ceph Manager Daemon's
Balancer module.  (http://docs.ceph.com/docs/luminous/mgr/balancer/). 

Otherwise, tweak the OSD weights (not the OSD CRUSH weights) until you
achieve uniformity.  (You should be able to get under 1 STDDEV).  I
would adjust in small amounts to not overload your cluster. 

Example: 

ceph osd reweight osd.X  y.yyy 
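
If you go the balancer route instead, enabling it looks roughly like this (assumes
a Luminous or later mgr; the mode choice depends on your client compatibility):

ceph mgr module enable balancer
ceph balancer mode crush-compat
ceph balancer on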

On 2018-12-21 14:56, Pardhiv Karri wrote:

> Hi, 
> 
> We have Ceph clusters which are greater than 1PB. We are using tree 
> algorithm. The issue is with the data placement. If the cluster utilization 
> percentage is at 65% then some of the OSDs are already above 87%. We had to 
> change the near_full ratio to 0.90 to circumvent warnings and to get back the 
> Health to OK state. 
> 
> How can we keep the OSDs utilization to be in sync with cluster utilization 
> (both percentages to be close enough) as we want to utilize the cluster to 
> the max (above 80%) without unnecessarily adding too many nodes/osd's. Right 
> now we are losing close to 400TB of the disk space unused as some OSDs are 
> above 87% and some are below 50%. If the above 87% OSDs reach 95% then the 
> cluster will have issues. What is the best way to mitigate this issue? 
> 
> Thanks, 
> Pardhiv Karri
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cluster to OSD Utilization not in Sync

2018-12-21 Thread Pardhiv Karri
Hi,

We have Ceph clusters which are greater than 1PB. We are using tree
algorithm. The issue is with the data placement. If the cluster utilization
percentage is at 65% then some of the OSDs are already above 87%. We had to
change the near_full ratio to 0.90 to circumvent warnings and to get back
the Health to OK state.

How can we keep the OSDs utilization to be in sync with cluster utilization
(both percentages to be close enough) as we want to utilize the cluster to
the max (above 80%) without unnecessarily adding too many nodes/osd's.
Right now we are losing close to 400TB of the disk space unused as some
OSDs are above 87% and some are below 50%. If the above 87% OSDs reach 95%
then the cluster will have issues. What is the best way to mitigate this
issue?

Thanks,

*Pardhiv Karri*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread vitalif

If you simply multiply number of objects and rbd object size
you will get 7611672*4M ~= 29T and that is what you should see in USED
field, and 29/2*3=43.5T of raw space.
Unfortunately no idea why they consume less; probably because not all
objects are fully written.


It seems some objects correspond to snapshots and Bluestore is smart and 
uses copy-on-write (virtual clone) on them, so they aren't provisioned 
at all...


...UNTIL REBALANCE


What ceph version?


Mimic 13.2.2


Can you show osd config, "ceph daemon osd.0 config show"?


See the attachment. But it mostly contains defaults, only the following 
variables are overridden in /etc/ceph/ceph.conf:


[osd]
rbd_op_threads = 4
osd_op_queue = mclock_opclass
osd_max_backfills = 2
bluestore_prefer_deferred_size_ssd = 1
bdev_enable_discard = true


Can you show some "rbd info ecpool_hdd/rbd_name"?


[root@sill-01 ~]# rbd info rpool_hdd/rms-201807-golden
rbd image 'rms-201807-golden':
size 14 TiB in 3670016 objects
order 22 (4 MiB objects)
id: 3d3e1d6b8b4567
data_pool: ecpool_hdd
block_name_prefix: rbd_data.15.3d3e1d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool

op_features:
flags:
create_timestamp: Tue Aug  7 13:00:10 2018
{
"name": "osd.0",
"cluster": "ceph",
"admin_socket": "/var/run/ceph/ceph-osd.0.asok",
"admin_socket_mode": "",
"auth_client_required": "cephx",
"auth_cluster_required": "cephx",
"auth_debug": "false",
"auth_mon_ticket_ttl": "43200.00",
"auth_service_required": "cephx",
"auth_service_ticket_ttl": "3600.00",
"auth_supported": "",
"bdev_aio": "true",
"bdev_aio_max_queue_depth": "1024",
"bdev_aio_poll_ms": "250",
"bdev_aio_reap_max": "16",
"bdev_async_discard": "false",
"bdev_block_size": "4096",
"bdev_debug_aio": "false",
"bdev_debug_aio_suicide_timeout": "60.00",
"bdev_debug_inflight_ios": "false",
"bdev_enable_discard": "true",
"bdev_inject_crash": "0",
"bdev_inject_crash_flush_delay": "2",
"bdev_nvme_retry_count": "-1",
"bdev_nvme_unbind_from_kernel": "false",
"bluefs_alloc_size": "1048576",
"bluefs_allocator": "stupid",
"bluefs_buffered_io": "true",
"bluefs_compact_log_sync": "false",
"bluefs_log_compact_min_ratio": "5.00",
"bluefs_log_compact_min_size": "16777216",
"bluefs_max_log_runway": "4194304",
"bluefs_max_prefetch": "1048576",
"bluefs_min_flush_size": "524288",
"bluefs_min_log_runway": "1048576",
"bluefs_preextend_wal_files": "false",
"bluefs_sync_write": "false",
"bluestore_2q_cache_kin_ratio": "0.50",
"bluestore_2q_cache_kout_ratio": "0.50",
"bluestore_allocator": "stupid",
"bluestore_bitmapallocator_blocks_per_zone": "1024",
"bluestore_bitmapallocator_span_size": "1024",
"bluestore_blobid_prealloc": "10240",
"bluestore_block_create": "true",
"bluestore_block_db_create": "false",
"bluestore_block_db_path": "",
"bluestore_block_db_size": "0",
"bluestore_block_path": "",
"bluestore_block_preallocate_file": "false",
"bluestore_block_size": "10737418240",
"bluestore_block_wal_create": "false",
"bluestore_block_wal_path": "",
"bluestore_block_wal_size": "100663296",
"bluestore_bluefs": "true",
"bluestore_bluefs_balance_failure_dump_interval": "0.00",
"bluestore_bluefs_balance_interval": "1.00",
"bluestore_bluefs_env_mirror": "false",
"bluestore_bluefs_gift_ratio": "0.02",
"bluestore_bluefs_max_ratio": "0.90",
"bluestore_bluefs_min": "1073741824",
"bluestore_bluefs_min_free": "1073741824",
"bluestore_bluefs_min_ratio": "0.02",
"bluestore_bluefs_reclaim_ratio": "0.20",
"bluestore_cache_kv_min": "536870912",
"bluestore_cache_kv_ratio": "0.50",
"bluestore_cache_meta_ratio": "0.50",
"bluestore_cache_size": "0",
"bluestore_cache_size_hdd": "1073741824",
"bluestore_cache_size_ssd": "3221225472",
"bluestore_cache_trim_interval": "0.05",
"bluestore_cache_trim_max_skip_pinned": "64",
"bluestore_cache_type": "2q",
"bluestore_clone_cow": "true",
"bluestore_compression_algorithm": "snappy",
"bluestore_compression_max_blob_size": "0",
"bluestore_compression_max_blob_size_hdd": "524288",
"bluestore_compression_max_blob_size_ssd": "65536",
"bluestore_compression_min_blob_size": "0",
"bluestore_compression_min_blob_size_hdd": "131072",
"bluestore_compression_min_blob_size_ssd": "8192",
"bluestore_compression_mode": "none",
"bluestore_compression_required_ratio": "0.875000",
"bluestore_csum_type": "crc32c",
"bluestore_debug_freelist": "false",
"bluestore_debug_fsck_abort": "false",
"bluestore_debug_inject_bug21040": "false",
"bluestore_debug_inject_read_err": 

Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread Aleksei Gutikov

If you simply multiply number of objects and rbd object size
you will get 7611672*4M ~= 29T and that is what you should see in USED 
field, and 29/2*3=43.5T of raw space.
Unfortunately no idea why they consume less; probably because not all 
objects are fully written.

What ceph version?
Can you show osd config, "ceph daemon osd.0 config show"?
Can you show some "rbd info ecpool_hdd/rbd_name"?
For example, if bluestore_min_alloc_size_hdd is greater than the typical
object/write size, it can cause additional space consumption.


On 10/29/2018 11:50 PM, Виталий Филиппов wrote:

Is there a way to force OSDs to remove old data?


Hi

After I recreated one OSD + increased pg count of my erasure-coded 
(2+1) pool (which was way too low, only 100 for 9 osds) the cluster 
started to eat additional disk space.


First I thought that was caused by the moved PGs using additional 
space during unfinished backfills. I pinned most of new PGs to old 
OSDs via `pg-upmap` and indeed it freed some space in the cluster.


Then I reduced osd_max_backfills to 1 and started to remove upmap pins 
in small portions which allowed Ceph to finish backfills for these PGs.


HOWEVER, used capacity still grows! It drops after moving each PG, but 
still grows overall.


It has grown +1.3TB yesterday. In the same period of time clients have 
written only ~200 new objects (~800 MB, there are RBD images only).


Why, what's using such big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were 
created by RBD clients yesterday, but used raw space increased +1.3 TB.


Additional question is why ceph df / rados df tells there is only 16 
TB actual data written, but it uses 29.8 TB (now 31 TB) of raw disk 
space. Shouldn't it be 16 / 2*3 = 24 TB ?


ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
SIZE   AVAIL   RAW USED %RAW USED
38 TiB 6.9 TiB   32 TiB 82.03
POOLS:
NAME   ID USED%USED MAX AVAIL OBJECTS
ecpool_hdd 13  16 TiB 93.94   1.0 TiB 7611672
rpool_hdd  15 9.2 MiB 0   515 GiB  92
fs_meta44  20 KiB 0   515 GiB  23
fs_data45 0 B 0   1.0 TiB   0

How to heal it?





--

Best regards,
Aleksei Gutikov
Software Engineer | synesis.ru | Minsk. BY
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-11-02 Thread vitalif

Hi again.

It seems I've found the problem, although I don't understand the root 
cause.


I looked into OSD datastore using ceph-objectstore-tool and I see that 
for almost every object there are two copies, like:


2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.00361a96:28#
2#13:080008d8:::rbd_data.15.3d3e1d6b8b4567.00361a96:head#

And more interesting is the fact that these two copies don't differ (!).

So the space is taken up by the unneeded snapshot copies.
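
For reference, a listing like the one above can be produced with
ceph-objectstore-tool against a stopped OSD, roughly like this (data path is
illustrative and the output format varies a bit between versions):

systemctl stop ceph-osd@0
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list | grep 3d3e1d6b8b4567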

rbd_data.15.3d3e1d6b8b4567 is the prefix of the biggest (14 TB) base 
image we have. This image has 1 snapshot:


[root@sill-01 ~]# rbd info rpool_hdd/rms-201807-golden
rbd image 'rms-201807-golden':
size 14 TiB in 3670016 objects
order 22 (4 MiB objects)
id: 3d3e1d6b8b4567
data_pool: ecpool_hdd
block_name_prefix: rbd_data.15.3d3e1d6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool

op_features:
flags:
create_timestamp: Tue Aug  7 13:00:10 2018
[root@sill-01 ~]# rbd snap ls rpool_hdd/rms-201807-golden
SNAPID NAME  SIZE TIMESTAMP
37 initial 14 TiB Tue Aug 14 12:42:48 2018

The problem is this image has NEVER been written to after importing it 
to Ceph with RBD. All writes go only to its clones.
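
One way to cross-check whether the head object and the snapshot clone really share
data is rados listsnaps on one of the affected objects (pool and object name
illustrative):

rados -p ecpool_hdd listsnaps rbd_data.15.3d3e1d6b8b4567.00361a96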


So I have 2.. no, 5 questions:

1) Why base image snapshot is "provisioned" while the image isn't 
written to? May it be related to `rbd snap revert`? (i.e. does rbd snap 
revert just copy all snapshot data into the image itself?)


2) If all parent snapshots seem to be forcefully provisioned on write: 
Is there a way to disable this behaviour? Maybe if I make the base image 
readonly its snapshots will stop to be "provisioned"?


3) Even if there is no way to disable it: why does Ceph create extra 
copy of equal snapshot data during rebalance?


4) What's ":28" in rados objects? Snapshot id is 37. Even in hex 0x28 = 
40, not 37. Or does RADOS snapshot id not need to be equal to RBD 
snapshot ID?


5) Am I safe to "unprovision" the snapshot? (for example, by doing `rbd 
snap revert`?)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-10-29 Thread Виталий Филиппов

Is there a way to force OSDs to remove old data?


Hi

After I recreated one OSD + increased pg count of my erasure-coded (2+1)  
pool (which was way too low, only 100 for 9 osds) the cluster started to  
eat additional disk space.


First I thought that was caused by the moved PGs using additional space  
during unfinished backfills. I pinned most of new PGs to old OSDs via  
`pg-upmap` and indeed it freed some space in the cluster.


Then I reduced osd_max_backfills to 1 and started to remove upmap pins  
in small portions which allowed Ceph to finish backfills for these PGs.


HOWEVER, used capacity still grows! It drops after moving each PG, but  
still grows overall.


It has grown +1.3TB yesterday. In the same period of time clients have  
written only ~200 new objects (~800 MB, there are RBD images only).


Why, what's using such big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were created  
by RBD clients yesterday, but used raw space increased +1.3 TB.


Additional question is why ceph df / rados df tells there is only 16 TB  
actual data written, but it uses 29.8 TB (now 31 TB) of raw disk space.  
Shouldn't it be 16 / 2*3 = 24 TB ?


ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
SIZE   AVAIL   RAW USED %RAW USED
38 TiB 6.9 TiB   32 TiB 82.03
POOLS:
NAME   ID USED%USED MAX AVAIL OBJECTS
ecpool_hdd 13  16 TiB 93.94   1.0 TiB 7611672
rpool_hdd  15 9.2 MiB 0   515 GiB  92
fs_meta44  20 KiB 0   515 GiB  23
fs_data45 0 B 0   1.0 TiB   0

How to heal it?



--
With best regards,
  Vitaliy Filippov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster uses substantially more disk space after rebalancing

2018-10-29 Thread Виталий Филиппов
Hi

After I recreated one OSD + increased pg count of my erasure-coded (2+1) pool 
(which was way too low, only 100 for 9 osds) the cluster started to eat 
additional disk space.

First I thought that was caused by the moved PGs using additional space during 
unfinished backfills. I pinned most of new PGs to old OSDs via `pg-upmap` and 
indeed it freed some space in the cluster.
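
Such a pin can be expressed with pg-upmap-items, for example (requires
luminous-or-later clients to be allowed; PG and OSD ids are illustrative):

ceph osd set-require-min-compat-client luminous
ceph osd pg-upmap-items 13.7f 8 3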

Then I reduced osd_max_backfills to 1 and started to remove upmap pins in small 
portions which allowed Ceph to finish backfills for these PGs.

HOWEVER, used capacity still grows! It drops after moving each PG, but still 
grows overall.

It has grown +1.3TB yesterday. In the same period of time clients have written 
only ~200 new objects (~800 MB, there are RBD images only).

Why, what's using such big amount of additional space?

Graphs from our prometheus are attached. Only ~200 objects were created by RBD 
clients yesterday, but used raw space increased +1.3 TB.

Additional question is why ceph df / rados df tells there is only 16 TB actual 
data written, but it uses 29.8 TB (now 31 TB) of raw disk space. Shouldn't it 
be 16 / 2*3 = 24 TB ?

ceph df output:

[root@sill-01 ~]# ceph df
GLOBAL:
SIZE   AVAIL   RAW USED %RAW USED 
38 TiB 6.9 TiB   32 TiB 82.03 
POOLS:
NAME   ID USED%USED MAX AVAIL OBJECTS 
ecpool_hdd 13  16 TiB 93.94   1.0 TiB 7611672 
rpool_hdd  15 9.2 MiB 0   515 GiB  92 
fs_meta44  20 KiB 0   515 GiB  23 
fs_data45 0 B 0   1.0 TiB   0 

How to heal it?
-- 
With best regards,
  Vitaliy Filippov___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-21 Thread Sergey Malinin
It is just a block size and it has no impact on data safety, except that OSDs 
need to be redeployed in order for them to create bluefs with the given block size.
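
If smaller allocation units are wanted for a small-object workload, the relevant
options are bluestore_min_alloc_size_hdd / bluestore_min_alloc_size_ssd; an
illustrative ceph.conf sketch (it only affects OSDs created after the change, and
smaller values increase per-object metadata overhead):

[osd]
bluestore_min_alloc_size_hdd = 4096
bluestore_min_alloc_size_ssd = 4096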


> On 21.10.2018, at 19:04, Waterbly, Dan  wrote:
> 
> Thanks Sergey!
> 
> Do you know where I can find details on the repercussions of adjusting this 
> value? Performance (read/writes), for once, not critical for us, data 
> durability and disaster recovery is our focus.
> 
> -Dan
> 
> Get Outlook for iOS 
> 
> 
> On Sun, Oct 21, 2018 at 8:37 AM -0700, "Sergey Malinin"  > wrote:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024589.html 
> 
> 
> 
>> On 21.10.2018, at 16:12, Waterbly, Dan > > wrote:
>> 
>> Awesome! Thanks Serian!
>> 
>> Do you know where the 64KB comes from? Can that be tuned down for a cluster 
>> holding smaller objects?
>> 
>> Get Outlook for iOS 
>> 
>> 
>> On Sat, Oct 20, 2018 at 10:49 PM -0700, "Serkan Çoban" 
>> mailto:cobanser...@gmail.com>> wrote:
>> 
>> you have 24M objects, not 2.4M.
>> Each object will eat 64KB of storage, so 24M objects uses 1.5TB storage.
>> Add 3x replication to that, it is 4.5TB
>> 
>> On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
>> >
>> > Hi Jakub,
>> >
>> > No, my setup seems to be the same as yours. Our system is mainly for 
>> > archiving loads of data. This data has to be stored forever and allow 
>> > reads, albeit seldom considering the number of objects we will store vs 
>> > the number of objects that ever will be requested.
>> >
>> > It just really seems odd that the metadata surrounding the 25M objects is 
>> > so high.
>> >
>> > We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but 
>> > I’d like to know why we are seeing what we are and how it all adds up.
>> >
>> > Thanks!
>> > Dan
>> >
>> > Get Outlook for iOS
>> >
>> >
>> >
>> > On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski"  wrote:
>> >
>> >> Hi Dan,
>> >>
>> >> Did you configure block.wal/block.db as separate devices/partition 
>> >> (osd_scenario: non-collocated or lvm for clusters installed using 
>> >> ceph-ansbile playbooks )?
>> >>
>> >> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
>> >> situation - the sum of block.db partitions' size is displayed as RAW USED 
>> >> in ceph df.
>> >> Perhaps it is not the case for collocated block.db/wal.
>> >>
>> >> Jakub
>> >>
>> >> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  wrote:
>> >>>
>> >>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These 
>> >>> numbers seem very high to me.
>> >>>
>> >>> Get Outlook for iOS
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban"  wrote:
>> >>>
>>  4.65TiB includes size of wal and db partitions too.
>>  On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>>  >
>>  > Hello,
>>  >
>>  >
>>  >
>>  > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
>>  > replication).
>>  >
>>  >
>>  >
>>  > I am confused by the usage ceph df is reporting and am hoping someone 
>>  > can shed some light on this. Here is what I see when I run ceph df
>>  >
>>  >
>>  >
>>  > GLOBAL:
>>  >
>>  > SIZEAVAIL   RAW USED %RAW USED
>>  >
>>  > 1.02PiB 1.02PiB  4.65TiB  0.44
>>  >
>>  > POOLS:
>>  >
>>  > NAME   ID USED
>>  > %USED MAX AVAIL OBJECTS
>>  >
>>  > .rgw.root  1  3.30KiB 
>>  > 0330TiB   17
>>  >
>>  > .rgw.buckets.data  2  22.9GiB 0330TiB 
>>  > 24550943
>>  >
>>  > default.rgw.control3   0B 
>>  > 0330TiB8
>>  >
>>  > default.rgw.meta   4 373B 
>>  > 0330TiB3
>>  >
>>  > default.rgw.log5   0B 
>>  > 0330TiB0
>>  >
>>  > .rgw.control   6   0B 0330TiB 
>>  >8
>>  >
>>  > .rgw.meta  7  2.18KiB 0330TiB 
>>  >   12
>>  >
>>  > .rgw.log   8   0B 0330TiB 
>>  >  194
>>  >
>>  > .rgw.buckets.index 9   0B 0330TiB 
>>  > 2560
>>  >
>>  >
>>  >
>>  > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
>>  > whole is reporting 4.65TiB? There is nothing else on this cluster as 
>>  > it was just installed and 

Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-21 Thread Waterbly, Dan
Thanks Sergey!

Do you know where I can find details on the repercussions of adjusting this
value? Performance (reads/writes), for once, is not critical for us; data
durability and disaster recovery are our focus.

-Dan

Get Outlook for iOS



On Sun, Oct 21, 2018 at 8:37 AM -0700, "Sergey Malinin" <h...@newmail.com> wrote:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024589.html


On 21.10.2018, at 16:12, Waterbly, Dan <dan.water...@sos.wa.gov> wrote:

Awesome! Thanks Serian!

Do you know where the 64KB comes from? Can that be tuned down for a cluster 
holding smaller objects?

Get Outlook for iOS



On Sat, Oct 20, 2018 at 10:49 PM -0700, "Serkan Çoban" <cobanser...@gmail.com> wrote:


you have 24M objects, not 2.4M.
Each object will eat 64KB of storage, so 24M objects uses 1.5TB storage.
Add 3x replication to that, it is 4.5TB

On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
>
> Hi Jakub,
>
> No, my setup seems to be the same as yours. Our system is mainly for 
> archiving loads of data. This data has to be stored forever and allow reads, 
> albeit seldom considering the number of objects we will store vs the number 
> of objects that ever will be requested.
>
> It just really seems odd that the metadata surrounding the 25M objects is so 
> high.
>
> We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
> like to know why we are seeing what we are and how it all adds up.
>
> Thanks!
> Dan
>
> Get Outlook for iOS
>
>
>
> On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski"  wrote:
>
>> Hi Dan,
>>
>> Did you configure block.wal/block.db as separate devices/partition 
>> (osd_scenario: non-collocated or lvm for clusters installed using 
>> ceph-ansbile playbooks )?
>>
>> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
>> situation - the sum of block.db partitions' size is displayed as RAW USED in 
>> ceph df.
>> Perhaps it is not the case for collocated block.db/wal.
>>
>> Jakub
>>
>> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  wrote:
>>>
>>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers 
>>> seem very high to me.
>>>
>>> Get Outlook for iOS
>>>
>>>
>>>
>>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban"  wrote:
>>>
 4.65TiB includes size of wal and db partitions too.
 On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
 >
 > Hello,
 >
 >
 >
 > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
 > replication).
 >
 >
 >
 > I am confused by the usage ceph df is reporting and am hoping someone 
 > can shed some light on this. Here is what I see when I run ceph df
 >
 >
 >
 > GLOBAL:
 >
 > SIZEAVAIL   RAW USED %RAW USED
 >
 > 1.02PiB 1.02PiB  4.65TiB  0.44
 >
 > POOLS:
 >
 > NAME   ID USED
 > %USED MAX AVAIL OBJECTS
 >
 > .rgw.root  1  3.30KiB
 >  0330TiB   17
 >
 > .rgw.buckets.data  2  22.9GiB 0330TiB 
 > 24550943
 >
 > default.rgw.control3   0B
 >  0330TiB8
 >
 > default.rgw.meta   4 373B
 >  0330TiB3
 >
 > default.rgw.log5   0B
 >  0330TiB0
 >
 > .rgw.control   6   0B 0330TiB
 > 8
 >
 > .rgw.meta  7  2.18KiB 0330TiB
 >12
 >
 > .rgw.log   8   0B 0330TiB
 >   194
 >
 > .rgw.buckets.index 9   0B 0330TiB
 >  2560
 >
 >
 >
 > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
 > whole is reporting 4.65TiB? There is nothing else on this cluster as it 
 > was just installed and configured.
 >
 >
 >
 > Thank you for your help with this.
 >
 >
 >
 > -Dan
 >
 >
 >
 > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
 > dan.water...@sos.wa.gov
 >
 > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
 >
 >
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> 

Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-21 Thread Sergey Malinin
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024589.html 



> On 21.10.2018, at 16:12, Waterbly, Dan  wrote:
> 
> Awesome! Thanks Serian!
> 
> Do you know where the 64KB comes from? Can that be tuned down for a cluster 
> holding smaller objects?
> 
> Get Outlook for iOS 
> 
> 
> On Sat, Oct 20, 2018 at 10:49 PM -0700, "Serkan Çoban"  > wrote:
> 
> you have 24M objects, not 2.4M.
> Each object will eat 64KB of storage, so 24M objects uses 1.5TB storage.
> Add 3x replication to that, it is 4.5TB
> 
> On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
> >
> > Hi Jakub,
> >
> > No, my setup seems to be the same as yours. Our system is mainly for 
> > archiving loads of data. This data has to be stored forever and allow 
> > reads, albeit seldom considering the number of objects we will store vs the 
> > number of objects that ever will be requested.
> >
> > It just really seems odd that the metadata surrounding the 25M objects is 
> > so high.
> >
> > We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
> > like to know why we are seeing what we are and how it all adds up.
> >
> > Thanks!
> > Dan
> >
> > Get Outlook for iOS
> >
> >
> >
> > On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski"  wrote:
> >
> >> Hi Dan,
> >>
> >> Did you configure block.wal/block.db as separate devices/partition 
> >> (osd_scenario: non-collocated or lvm for clusters installed using 
> >> ceph-ansbile playbooks )?
> >>
> >> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
> >> situation - the sum of block.db partitions' size is displayed as RAW USED 
> >> in ceph df.
> >> Perhaps it is not the case for collocated block.db/wal.
> >>
> >> Jakub
> >>
> >> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  wrote:
> >>>
> >>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These 
> >>> numbers seem very high to me.
> >>>
> >>> Get Outlook for iOS
> >>>
> >>>
> >>>
> >>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban"  wrote:
> >>>
>  4.65TiB includes size of wal and db partitions too.
>  On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>  >
>  > Hello,
>  >
>  >
>  >
>  > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
>  > replication).
>  >
>  >
>  >
>  > I am confused by the usage ceph df is reporting and am hoping someone 
>  > can shed some light on this. Here is what I see when I run ceph df
>  >
>  >
>  >
>  > GLOBAL:
>  >
>  > SIZEAVAIL   RAW USED %RAW USED
>  >
>  > 1.02PiB 1.02PiB  4.65TiB  0.44
>  >
>  > POOLS:
>  >
>  > NAME   ID USED
>  > %USED MAX AVAIL OBJECTS
>  >
>  > .rgw.root  1  3.30KiB  
>  >0330TiB   17
>  >
>  > .rgw.buckets.data  2  22.9GiB 0330TiB 
>  > 24550943
>  >
>  > default.rgw.control3   0B  
>  >0330TiB8
>  >
>  > default.rgw.meta   4 373B  
>  >0330TiB3
>  >
>  > default.rgw.log5   0B  
>  >0330TiB0
>  >
>  > .rgw.control   6   0B 0330TiB  
>  >   8
>  >
>  > .rgw.meta  7  2.18KiB 0330TiB  
>  >  12
>  >
>  > .rgw.log   8   0B 0330TiB  
>  > 194
>  >
>  > .rgw.buckets.index 9   0B 0330TiB  
>  >2560
>  >
>  >
>  >
>  > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
>  > whole is reporting 4.65TiB? There is nothing else on this cluster as 
>  > it was just installed and configured.
>  >
>  >
>  >
>  > Thank you for your help with this.
>  >
>  >
>  >
>  > -Dan
>  >
>  >
>  >
>  > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
>  > dan.water...@sos.wa.gov
>  >
>  > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>  >
>  >
>  >
>  > ___
>  > ceph-users mailing list
>  > ceph-users@lists.ceph.com
>  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>> ___
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-21 Thread Waterbly, Dan
Awesome! Thanks Serkan!

Do you know where the 64KB comes from? Can that be tuned down for a cluster 
holding smaller objects?

Get Outlook for iOS



On Sat, Oct 20, 2018 at 10:49 PM -0700, "Serkan Çoban" 
mailto:cobanser...@gmail.com>> wrote:


you have 24M objects, not 2.4M.
Each object will eat 64KB of storage, so 24M objects uses 1.5TB storage.
Add 3x replication to that, it is 4.5TB

On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
>
> Hi Jakub,
>
> No, my setup seems to be the same as yours. Our system is mainly for 
> archiving loads of data. This data has to be stored forever and allow reads, 
> albeit seldom considering the number of objects we will store vs the number 
> of objects that ever will be requested.
>
> It just really seems odd that the metadata surrounding the 25M objects is so 
> high.
>
> We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
> like to know why we are seeing what we are and how it all adds up.
>
> Thanks!
> Dan
>
> Get Outlook for iOS
>
>
>
> On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski"  wrote:
>
>> Hi Dan,
>>
>> Did you configure block.wal/block.db as separate devices/partition 
>> (osd_scenario: non-collocated or lvm for clusters installed using 
>> ceph-ansbile playbooks )?
>>
>> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
>> situation - the sum of block.db partitions' size is displayed as RAW USED in 
>> ceph df.
>> Perhaps it is not the case for collocated block.db/wal.
>>
>> Jakub
>>
>> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  wrote:
>>>
>>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers 
>>> seem very high to me.
>>>
>>> Get Outlook for iOS
>>>
>>>
>>>
>>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban"  wrote:
>>>
 4.65TiB includes size of wal and db partitions too.
 On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
 >
 > Hello,
 >
 >
 >
 > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
 > replication).
 >
 >
 >
 > I am confused by the usage ceph df is reporting and am hoping someone 
 > can shed some light on this. Here is what I see when I run ceph df
 >
 >
 >
 > GLOBAL:
 >
 > SIZEAVAIL   RAW USED %RAW USED
 >
 > 1.02PiB 1.02PiB  4.65TiB  0.44
 >
 > POOLS:
 >
 > NAME   ID USED
 > %USED MAX AVAIL OBJECTS
 >
 > .rgw.root  1  3.30KiB
 >  0330TiB   17
 >
 > .rgw.buckets.data  2  22.9GiB 0330TiB 
 > 24550943
 >
 > default.rgw.control3   0B
 >  0330TiB8
 >
 > default.rgw.meta   4 373B
 >  0330TiB3
 >
 > default.rgw.log5   0B
 >  0330TiB0
 >
 > .rgw.control   6   0B 0330TiB
 > 8
 >
 > .rgw.meta  7  2.18KiB 0330TiB
 >12
 >
 > .rgw.log   8   0B 0330TiB
 >   194
 >
 > .rgw.buckets.index 9   0B 0330TiB
 >  2560
 >
 >
 >
 > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
 > whole is reporting 4.65TiB? There is nothing else on this cluster as it 
 > was just installed and configured.
 >
 >
 >
 > Thank you for your help with this.
 >
 >
 >
 > -Dan
 >
 >
 >
 > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
 > dan.water...@sos.wa.gov
 >
 > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
 >
 >
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Serkan Çoban
you have 24M objects, not 2.4M.
Each object will eat 64KB of storage, so 24M objects use 1.5TB of storage.
Add 3x replication to that, and it is 4.5TB.
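
A rough sketch of that arithmetic, assuming the 64KB figure is the BlueStore
minimum allocation unit for HDD OSDs (the bluestore_min_alloc_size_hdd default
of that era; the thread itself does not name the setting):

$ echo "24550943 * 64 / 1024 / 1024" | bc       # ~1498 GiB, i.e. ~1.5 TiB per copy
$ echo "24550943 * 64 * 3 / 1024^3" | bc -l     # ~4.4 TiB with 3x replication

The remaining gap up to the reported 4.65 TiB is plausibly the WAL/DB space
discussed elsewhere in the thread.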

On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
>
> Hi Jakub,
>
> No, my setup seems to be the same as yours. Our system is mainly for 
> archiving loads of data. This data has to be stored forever and allow reads, 
> albeit seldom considering the number of objects we will store vs the number 
> of objects that ever will be requested.
>
> It just really seems odd that the metadata surrounding the 25M objects is so 
> high.
>
> We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
> like to know why we are seeing what we are and how it all adds up.
>
> Thanks!
> Dan
>
> Get Outlook for iOS
>
>
>
> On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski" 
>  wrote:
>
>> Hi Dan,
>>
>> Did you configure block.wal/block.db as separate devices/partition 
>> (osd_scenario: non-collocated or lvm for clusters installed using 
>> ceph-ansbile playbooks )?
>>
>> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
>> situation - the sum of block.db partitions' size is displayed as RAW USED in 
>> ceph df.
>> Perhaps it is not the case for collocated block.db/wal.
>>
>> Jakub
>>
>> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  
>> wrote:
>>>
>>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers 
>>> seem very high to me.
>>>
>>> Get Outlook for iOS
>>>
>>>
>>>
>>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" 
>>>  wrote:
>>>
 4.65TiB includes size of wal and db partitions too.
 On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
 >
 > Hello,
 >
 >
 >
 > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
 > replication).
 >
 >
 >
 > I am confused by the usage ceph df is reporting and am hoping someone 
 > can shed some light on this. Here is what I see when I run ceph df
 >
 >
 >
 > GLOBAL:
 >
 > SIZEAVAIL   RAW USED %RAW USED
 >
 > 1.02PiB 1.02PiB  4.65TiB  0.44
 >
 > POOLS:
 >
 > NAME   ID USED
 > %USED MAX AVAIL OBJECTS
 >
 > .rgw.root  1  3.30KiB
 >  0330TiB   17
 >
 > .rgw.buckets.data  2  22.9GiB 0330TiB 
 > 24550943
 >
 > default.rgw.control3   0B
 >  0330TiB8
 >
 > default.rgw.meta   4 373B
 >  0330TiB3
 >
 > default.rgw.log5   0B
 >  0330TiB0
 >
 > .rgw.control   6   0B 0330TiB
 > 8
 >
 > .rgw.meta  7  2.18KiB 0330TiB
 >12
 >
 > .rgw.log   8   0B 0330TiB
 >   194
 >
 > .rgw.buckets.index 9   0B 0330TiB
 >  2560
 >
 >
 >
 > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
 > whole is reporting 4.65TiB? There is nothing else on this cluster as it 
 > was just installed and configured.
 >
 >
 >
 > Thank you for your help with this.
 >
 >
 >
 > -Dan
 >
 >
 >
 > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
 > dan.water...@sos.wa.gov
 >
 > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
 >
 >
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Waterbly, Dan
Hi Jakub,

No, my setup seems to be the same as yours. Our system is mainly for archiving 
loads of data. This data has to be stored forever and allow reads, albeit 
seldom, considering the number of objects we will store vs the number of objects 
that will ever be requested.

It just really seems odd that the metadata surrounding the 25M objects is so 
high.

We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
like to know why we are seeing what we are and how it all adds up.

Thanks!
Dan

Get Outlook for iOS



On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski" 
mailto:jaszewski.ja...@gmail.com>> wrote:

Hi Dan,

Did you configure block.wal/block.db as separate devices/partition 
(osd_scenario: non-collocated or lvm for clusters installed using ceph-ansbile 
playbooks )?

I run Ceph version 13.2.1 with non-collocated data.db and have the same 
situation - the sum of block.db partitions' size is displayed as RAW USED in 
ceph df.
Perhaps it is not the case for collocated block.db/wal.

Jakub

On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan 
mailto:dan.water...@sos.wa.gov>> wrote:
I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers seem 
very high to me.

Get Outlook for iOS



On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" 
mailto:cobanser...@gmail.com>> wrote:


4.65TiB includes size of wal and db partitions too.
On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>
> Hello,
>
>
>
> I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
> replication).
>
>
>
> I am confused by the usage ceph df is reporting and am hoping someone can 
> shed some light on this. Here is what I see when I run ceph df
>
>
>
> GLOBAL:
>
> SIZEAVAIL   RAW USED %RAW USED
>
> 1.02PiB 1.02PiB  4.65TiB  0.44
>
> POOLS:
>
> NAME   ID USED%USED   
>   MAX AVAIL OBJECTS
>
> .rgw.root  1  3.30KiB 0   
>  330TiB   17
>
> .rgw.buckets.data  2  22.9GiB 0330TiB 24550943
>
> default.rgw.control3   0B 0   
>  330TiB8
>
> default.rgw.meta   4 373B 0   
>  330TiB3
>
> default.rgw.log5   0B 0   
>  330TiB0
>
> .rgw.control   6   0B 0330TiB8
>
> .rgw.meta  7  2.18KiB 0330TiB   12
>
> .rgw.log   8   0B 0330TiB  194
>
> .rgw.buckets.index 9   0B 0330TiB 2560
>
>
>
> Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
> reporting 4.65TiB? There is nothing else on this cluster as it was just 
> installed and configured.
>
>
>
> Thank you for your help with this.
>
>
>
> -Dan
>
>
>
> Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
> dan.water...@sos.wa.gov
>
> WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Jakub Jaszewski
Hi Dan,

Did you configure block.wal/block.db as separate devices/partitions
(osd_scenario: non-collocated or lvm for clusters installed using
ceph-ansible playbooks)?

I run Ceph version 13.2.1 with non-collocated data.db and have the same
situation - the sum of block.db partitions' size is displayed as RAW USED
in ceph df.
Perhaps it is not the case for collocated block.db/wal.
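
One generic way to confirm where the RAW USED figure comes from is to compare
the ceph df totals with the per-OSD breakdown (a general-purpose check, not
something suggested in the original thread):

$ ceph osd df tree    # the per-OSD USE column should add up to the GLOBAL RAW USED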

Jakub

On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan 
wrote:

> I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers
> seem very high to me.
>
> Get Outlook for iOS 
>
>
>
> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" <
> cobanser...@gmail.com> wrote:
>
> 4.65TiB includes size of wal and db partitions too.
>> On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>> >
>> > Hello,
>> >
>> >
>> >
>> > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
>> > replication).
>> >
>> >
>> >
>> > I am confused by the usage ceph df is reporting and am hoping someone can 
>> > shed some light on this. Here is what I see when I run ceph df
>> >
>> >
>> >
>> > GLOBAL:
>> >
>> > SIZEAVAIL   RAW USED %RAW USED
>> >
>> > 1.02PiB 1.02PiB  4.65TiB  0.44
>> >
>> > POOLS:
>> >
>> > NAME   ID USED
>> > %USED MAX AVAIL OBJECTS
>> >
>> > .rgw.root  1  3.30KiB 
>> > 0330TiB   17
>> >
>> > .rgw.buckets.data  2  22.9GiB 0330TiB 
>> > 24550943
>> >
>> > default.rgw.control3   0B 
>> > 0330TiB8
>> >
>> > default.rgw.meta   4 373B 
>> > 0330TiB3
>> >
>> > default.rgw.log5   0B 
>> > 0330TiB0
>> >
>> > .rgw.control   6   0B 0330TiB  
>> >   8
>> >
>> > .rgw.meta  7  2.18KiB 0330TiB  
>> >  12
>> >
>> > .rgw.log   8   0B 0330TiB  
>> > 194
>> >
>> > .rgw.buckets.index 9   0B 0330TiB 
>> > 2560
>> >
>> >
>> >
>> > Why does my bucket pool report usage of 22.9GiB but my cluster as a whole 
>> > is reporting 4.65TiB? There is nothing else on this cluster as it was just 
>> > installed and configured.
>> >
>> >
>> >
>> > Thank you for your help with this.
>> >
>> >
>> >
>> > -Dan
>> >
>> >
>> >
>> > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
>> > dan.water...@sos.wa.gov
>> >
>> > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Waterbly, Dan
I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers seem 
very high to me.

Get Outlook for iOS



On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" 
mailto:cobanser...@gmail.com>> wrote:


4.65TiB includes size of wal and db partitions too.
On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>
> Hello,
>
>
>
> I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
> replication).
>
>
>
> I am confused by the usage ceph df is reporting and am hoping someone can 
> shed some light on this. Here is what I see when I run ceph df
>
>
>
> GLOBAL:
>
> SIZEAVAIL   RAW USED %RAW USED
>
> 1.02PiB 1.02PiB  4.65TiB  0.44
>
> POOLS:
>
> NAME   ID USED%USED   
>   MAX AVAIL OBJECTS
>
> .rgw.root  1  3.30KiB 0   
>  330TiB   17
>
> .rgw.buckets.data  2  22.9GiB 0330TiB 24550943
>
> default.rgw.control3   0B 0   
>  330TiB8
>
> default.rgw.meta   4 373B 0   
>  330TiB3
>
> default.rgw.log5   0B 0   
>  330TiB0
>
> .rgw.control   6   0B 0330TiB8
>
> .rgw.meta  7  2.18KiB 0330TiB   12
>
> .rgw.log   8   0B 0330TiB  194
>
> .rgw.buckets.index 9   0B 0330TiB 2560
>
>
>
> Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
> reporting 4.65TiB? There is nothing else on this cluster as it was just 
> installed and configured.
>
>
>
> Thank you for your help with this.
>
>
>
> -Dan
>
>
>
> Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
> dan.water...@sos.wa.gov
>
> WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Serkan Çoban
4.65TiB includes size of wal and db partitions too.
On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>
> Hello,
>
>
>
> I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
> replication).
>
>
>
> I am confused by the usage ceph df is reporting and am hoping someone can 
> shed some light on this. Here is what I see when I run ceph df
>
>
>
> GLOBAL:
>
> SIZEAVAIL   RAW USED %RAW USED
>
> 1.02PiB 1.02PiB  4.65TiB  0.44
>
> POOLS:
>
> NAME   ID USED%USED   
>   MAX AVAIL OBJECTS
>
> .rgw.root  1  3.30KiB 0   
>  330TiB   17
>
> .rgw.buckets.data  2  22.9GiB 0330TiB 24550943
>
> default.rgw.control3   0B 0   
>  330TiB8
>
> default.rgw.meta   4 373B 0   
>  330TiB3
>
> default.rgw.log5   0B 0   
>  330TiB0
>
> .rgw.control   6   0B 0330TiB8
>
> .rgw.meta  7  2.18KiB 0330TiB   12
>
> .rgw.log   8   0B 0330TiB  194
>
> .rgw.buckets.index 9   0B 0330TiB 2560
>
>
>
> Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
> reporting 4.65TiB? There is nothing else on this cluster as it was just 
> installed and configured.
>
>
>
> Thank you for your help with this.
>
>
>
> -Dan
>
>
>
> Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
> dan.water...@sos.wa.gov
>
> WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Waterbly, Dan
Hello,

I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
replication).

I am confused by the usage ceph df is reporting and am hoping someone can shed 
some light on this. Here is what I see when I run ceph df

GLOBAL:
SIZEAVAIL   RAW USED %RAW USED
1.02PiB 1.02PiB  4.65TiB  0.44
POOLS:
NAME   ID USED%USED 
MAX AVAIL OBJECTS
.rgw.root  1  3.30KiB 0 
   330TiB   17
.rgw.buckets.data  2  22.9GiB 0330TiB 24550943
default.rgw.control3   0B 0 
   330TiB8
default.rgw.meta   4 373B 0 
   330TiB3
default.rgw.log5   0B 0 
   330TiB0
.rgw.control   6   0B 0330TiB8
.rgw.meta  7  2.18KiB 0330TiB   12
.rgw.log   8   0B 0330TiB  194
.rgw.buckets.index 9   0B 0330TiB 2560

Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
reporting 4.65TiB? There is nothing else on this cluster as it was just 
installed and configured.

Thank you for your help with this.

-Dan

Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
dan.water...@sos.wa.gov
WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster "hung" after node failure

2018-08-29 Thread Brett Chancellor
Hi All. I have a ceph cluster that's partially upgraded to Luminous. Last
night a host died and since then the cluster is failing to recover. It
finished backfilling, but was left with thousands of requests degraded,
inactive, or stale.  In order to move past the issue, I put the cluster in
noout,noscrub,nodeep-scrub and restarted all services one by one.
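
For reference, flags like these are set and later cleared with commands along
these lines (a sketch, not taken from the original post):

$ ceph osd set noout
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
# ...and once the cluster has recovered:
$ ceph osd unset nodeep-scrub && ceph osd unset noscrub && ceph osd unset noout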

Here is the current state of the cluster; any idea how to get past the
stale and stuck PGs? Any help would be very appreciated. Thanks.

-Brett


## ceph -s output
###
$ sudo ceph -s
  cluster:
id:
health: HEALTH_ERR
165 pgs are stuck inactive for more than 60 seconds
243 pgs backfill_wait
144 pgs backfilling
332 pgs degraded
5 pgs peering
1 pgs recovery_wait
22 pgs stale
332 pgs stuck degraded
143 pgs stuck inactive
22 pgs stuck stale
531 pgs stuck unclean
330 pgs stuck undersized
330 pgs undersized
671 requests are blocked > 32 sec
603 requests are blocked > 4096 sec
recovery 3524906/412016682 objects degraded (0.856%)
recovery 2462252/412016682 objects misplaced (0.598%)
noout,noscrub,nodeep-scrub flag(s) set
mon.ceph0rdi-mon1-1-prd store is getting too big! 17612 MB >=
15360 MB
mon.ceph0rdi-mon2-1-prd store is getting too big! 17669 MB >=
15360 MB
mon.ceph0rdi-mon3-1-prd store is getting too big! 17586 MB >=
15360 MB

  services:
mon: 3 daemons, quorum
ceph0rdi-mon1-1-prd,ceph0rdi-mon2-1-prd,ceph0rdi-mon3-1-prd
mgr: ceph0rdi-mon3-1-prd(active), standbys: ceph0rdi-mon2-1-prd,
ceph0rdi-mon1-1-prd
osd: 222 osds: 218 up, 218 in; 428 remapped pgs
 flags noout,noscrub,nodeep-scrub

  data:
pools:   35 pools, 38144 pgs
objects: 130M objects, 172 TB
usage:   538 TB used, 337 TB / 875 TB avail
pgs: 0.375% pgs not active
 3524906/412016682 objects degraded (0.856%)
 2462252/412016682 objects misplaced (0.598%)
 37599 active+clean
 173   active+undersized+degraded+remapped+backfill_wait
 133   active+undersized+degraded+remapped+backfilling
 93activating
 68active+remapped+backfill_wait
 22activating+undersized+degraded+remapped
 13stale+active+clean
 11active+remapped+backfilling
 9 activating+remapped
 5 remapped
 5 stale+activating+remapped
 3 remapped+peering
 2 stale+remapped
 2 stale+remapped+peering
 1 activating+degraded+remapped
 1 active+clean+remapped
 1 active+degraded+remapped+backfill_wait
 1 active+undersized+remapped+backfill_wait
 1 activating+degraded
 1 active+recovery_wait+undersized+degraded+remapped

  io:
client:   187 kB/s rd, 2595 kB/s wr, 99 op/s rd, 343 op/s wr
recovery: 1509 MB/s, 1541 objects/s

## ceph pg dump_stuck stale (this number doesn't seem to decrease)

$ sudo ceph pg dump_stuck stale
ok
PG_STAT STATE UPUP_PRIMARY ACTING
ACTING_PRIMARY
17.6d7 stale+remapped[5,223,96]  5  [223,96,148]
223
2.5c5  stale+active+clean  [224,48,179]224  [224,48,179]
224
17.64e stale+active+clean  [224,84,109]224  [224,84,109]
224
19.5b4  stale+activating+remapped  [124,130,20]124   [124,20,11]
124
17.4c6 stale+active+clean  [224,216,95]224  [224,216,95]
224
73.413  stale+activating+remapped [117,130,189]117 [117,189,137]
117
2.431  stale+remapped+peering   [5,180,142]  5  [180,142,40]
180
69.1dc stale+active+clean[62,36,54] 62[62,36,54]
 62
14.790 stale+active+clean   [81,114,19] 81   [81,114,19]
 81
2.78e  stale+active+clean [224,143,124]224 [224,143,124]
224
73.37a stale+active+clean   [224,84,38]224   [224,84,38]
224
17.42d  stale+activating+remapped  [220,130,25]220  [220,25,137]
220
72.263 stale+active+clean [224,148,117]224 [224,148,117]
224
67.40  stale+active+clean   [62,170,71] 62   [62,170,71]
 62
67.16d stale+remapped+peering[3,147,22]  3   [147,22,29]
147
20.3de stale+active+clean [224,103,126]224 [224,103,126]
224
19.721 stale+remapped[3,34,179]  3  [34,179,128]
 34
19.2f1  stale+activating+remapped [126,130,178]126  [126,178,72]
126
74.28b stale+active+clean   [224,95,56]224 

Re: [ceph-users] ceph cluster monitoring tool

2018-07-24 Thread Lenz Grimmer
On 07/24/2018 07:02 AM, Satish Patel wrote:

> My 5 node ceph cluster is ready for production, now i am looking for
> good monitoring tool (Open source), what majority of folks using in
> their production?

There are several; using Prometheus with the exporter module of the Ceph
Manager is a popular choice for collecting the metrics. The ceph-metrics
project provides an exhaustive collection of dashboards for Grafana that
will help with the visualization and some alerting based on these metrics.
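
A minimal sketch of getting that exporter running, assuming a Luminous or newer
cluster (the port shown is the module's default and may differ in your setup):

$ ceph mgr module enable prometheus
$ curl -s http://<active-mgr-host>:9283/metrics | head   # verify metrics are exposed

Prometheus is then pointed at that endpoint and the Grafana dashboards are
built on top of the resulting time series.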

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF:Felix Imendörffer,Jane Smithard,Graham Norton,HRB 21284 (AG Nürnberg)



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster monitoring tool

2018-07-24 Thread Guilherme Steinmüller
Satish,

I'm currently working on monasca's roles for openstack-ansible.

We have plugins that monitor Ceph as well, and I use them in production. Below
you can see an example:

https://imgur.com/a/6l6Q2K6



On Tue, 24 Jul 2018 at 02:02, Satish Patel 
wrote:

> My 5 node ceph cluster is ready for production, now i am looking for
> good monitoring tool (Open source), what majority of folks using in
> their production?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster monitoring tool

2018-07-24 Thread Matthew Vernon
Hi,

On 24/07/18 06:02, Satish Patel wrote:
> My 5 node ceph cluster is ready for production, now i am looking for
> good monitoring tool (Open source), what majority of folks using in
> their production?

This does come up from time to time, so it's worth checking the list
archives.

We use collectd to collect metrics, graphite to store them (we've found
it much easier to look after than influxdb), and grafana to plot them, e.g.

https://cog.sanger.ac.uk/ceph_dashboard/ceph-dashboard-may2018.png

Regards,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster monitoring tool

2018-07-24 Thread Marc Roos


Just use collectd to start with. That is easiest with influxdb. However, 
do not expect too much of the support on influxdb.


-Original Message-
From: Satish Patel [mailto:satish@gmail.com] 
Sent: dinsdag 24 juli 2018 7:02
To: ceph-users
Subject: [ceph-users] ceph cluster monitoring tool

My 5 node ceph cluster is ready for production, now i am looking for 
good monitoring tool (Open source), what majority of folks using in 
their production?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster monitoring tool

2018-07-24 Thread Robert Sander
On 24.07.2018 07:02, Satish Patel wrote:
> My 5 node ceph cluster is ready for production, now i am looking for
> good monitoring tool (Open source), what majority of folks using in
> their production?

Some people already use Prometheus and the exporter from the Ceph Mgr.

Some use more traditional monitoring systems (like me). I have written a
Ceph plugin for the Check_MK monitoring system:

https://github.com/HeinleinSupport/check_mk/tree/master/ceph

Caution: It will not scale to hundreds of OSDs as it invokes the Ceph
CLI tools to gather monitoring data on every node. This takes some time.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph cluster monitoring tool

2018-07-23 Thread Satish Patel
My 5 node Ceph cluster is ready for production; now I am looking for a
good monitoring tool (open source). What are the majority of folks using in
their production?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster

2018-06-12 Thread Ronny Aasen

On 12. juni 2018 12:17, Muneendra Kumar M wrote:

conf file as shown below.

If I reconfigure my ipaddress from 10.xx.xx.xx to 192.xx.xx.xx and by 
changing the public network and mon_host filed in the ceph.conf


Will my cluster will work as it is ?

Below is my ceph.conf file details.

Any inputs will really help me to understand more on the same.



No, changing the subnet of the cluster is a complex operation.

Since you are using private IP addresses anyway, I would reconsider 
changing, and only change if there was no other way.


This is the documentation for Mimic on how to change it:

http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way
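
For reference, the "messy way" described there boils down to roughly the
following (a sketch only; the monitor name and the 192.xx.xx.xx address are
placeholders taken from the ceph.conf quoted below, and the linked documentation
should be followed for the exact procedure on your release):

$ ceph mon getmap -o /tmp/monmap                      # while quorum still exists
$ monmaptool --rm TestNVMe2 /tmp/monmap               # drop the monitor's old entry
$ monmaptool --add TestNVMe2 192.xx.xx.xx:6789 /tmp/monmap
$ systemctl stop ceph-mon@TestNVMe2
$ ceph-mon -i TestNVMe2 --inject-monmap /tmp/monmap
$ systemctl start ceph-mon@TestNVMe2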

kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph cluster

2018-06-12 Thread Muneendra Kumar M
Hi ,

I have created a ceph cluster with 3 osds and everything is running fine.

And our public network configuration parameter field was set to
10.xx.xx.0/24  in ceph.conf file as shown below.



If I reconfigure my IP address from 10.xx.xx.xx to 192.xx.xx.xx by
changing the public network and mon_host fields in the ceph.conf,

Will my cluster work as it is?



Below is my ceph.conf file details.



Any inputs will really help me to understand more on the same.

Ceph.conf:



[global]

fsid = 74cc4723-7ab9-4cc3-b8c8-182e138da955

mon_initial_members = TestNVMe2

mon_host = 10.xx.xx.xx

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

public network = 10.xx.xx.0/24

osd pool default size = 2

osd_max_object_name_len = 256

osd_max_object_namespace_len = 64



Regards,

Muneendra.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster with 3 Machines

2018-05-29 Thread David Turner
Using the kernel driver to map RBDs to a host with OSDs is known to cause
system locks.  The answer to avoiding this is to use rbd-nbd or rbd-fuse
instead of the kernel driver if you NEED to map the RBD to the same host as
any OSDs.
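
A minimal sketch of the rbd-nbd approach (pool and image names are made up for
illustration):

$ rbd-nbd map rbd/myimage        # prints the attached device, e.g. /dev/nbd0
$ mount /dev/nbd0 /mnt/myimage
# ...and to tear it down again:
$ umount /mnt/myimage
$ rbd-nbd unmap /dev/nbd0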

On Tue, May 29, 2018 at 7:34 AM Joshua Collins 
wrote:

> Hi
>
> I've had a go at setting up a Ceph cluster but I've ran into some issues.
>
> I have 3 physical machines to set up a Ceph cluster, and two of these
> machines will be part of a HA pair using corosync and Pacemaker.
>
> I keep running into filesystem lock issues on unmount when I have a
> machine running an OSD and monitor, while also mounting an RBD pool.
>
> Moving the OSD and monitor to a VM so that I could mount the RBD on the
> host hasn't fixed the issue.
>
> Is there a way to set this up to avoid the filesystem lock issues I'm
> encountering?
>
> Thanks in advance
>
> Josh
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Cluster with 3 Machines

2018-05-29 Thread Joshua Collins

Hi

I've had a go at setting up a Ceph cluster but I've run into some issues.

I have 3 physical machines to set up a Ceph cluster, and two of these 
machines will be part of a HA pair using corosync and Pacemaker.


I keep running into filesystem lock issues on unmount when I have a 
machine running an OSD and monitor, while also mounting an RBD pool.


Moving the OSD and monitor to a VM so that I could mount the RBD on the 
host hasn't fixed the issue.


Is there a way to set this up to avoid the filesystem lock issues I'm 
encountering?


Thanks in advance

Josh
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster network bandwidth?

2017-11-20 Thread Anthony Verevkin


> From: "John Spray" 
> Sent: Thursday, November 16, 2017 11:01:35 AM
> 
> On Thu, Nov 16, 2017 at 3:32 PM, David Turner 
> wrote:
> > That depends on another question.  Does the client write all 3
> > copies or
> > does the client send the copy to the primary OSD and then the
> > primary OSD
> > sends the write to the secondaries?  Someone asked this recently,
> > but I
> > don't recall if an answer was given.  I'm not actually certain
> > which is the
> > case.  If it's the latter then the 10Gb pipe from the client is all
> > you
> > need.
> 
> The client sends the write to the primary OSD (via the public
> network)
> and the primary OSD sends it on to the two replicas (via the cluster
> network).
> 
> John


Thank you John! Would you also know if the same is true for Erasure coding?
Is it the client or an OSD that is splitting the request into k+m chunks?
What about reads? Is it the client assembling the erasures or is the primary
OSD proxying each read request?

Also, for replicated sets people often forget that it's not just writes. When
the client is reading data, it comes from the primary OSD only and does not
generate extra traffic on the cluster network. So in the 50/50 read-write use 
case the public and cluster traffic would actually be balanced:
1x read + 1x write on public / 2x write replication on cluster.
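
Spelled out for a hypothetical 1GB/s-read plus 1GB/s-write workload with 3x
replication:

# public network:  1 GB/s of reads + 1 GB/s of client writes    = 2 GB/s
# cluster network: 1 GB/s of writes * (3 - 1) replica copies    = 2 GB/s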

Regards,
Anthony
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster network bandwidth?

2017-11-16 Thread Blair Bethwaite
What type of SAS disks, spinners or SSD? You really need to specify
the sustained write throughput of your OSD nodes if you want to figure
out whether your network is sufficient/appropriate.

At 3x replication if you want to sustain e.g. 1 GB/s of write traffic
from clients then you will need 2 GB/s of cluster network capacity -
first write hits the primary OSD on the frontend/client network,
second and third replicas are sent from the primary to those other two
OSDs. So then the question is, do you have 2GB/s of cluster network
capacity? It's easy to get confused thinking about this if you are not
accustomed to cluster computing...

E.g. If you have a single 10GbE NIC per host then you can TX & RX at a
max of ~9.8Gb/s (bi-directional), so you might think you'll be limited
to 9.8/8 = 1.2GB/s on the cluster network and thus 1.2/2 = 0.6GB/s
from clients. However luckily your client/s can be writing in parallel
across multiple PGs (and thus to different primary OSDs). So the way
to work out your max Ceph network capacity is to first calculate your
average bisectional bandwidth. Let's assume for simplicity that you
have a single 10GbE ToR switch for this cluster, so bisectional
bandwidth is 10Gb/s between all 6 of your OSD hosts. Take that and
divide by #replica-1:  6x 9.8Gb/s / 2 = 29.4Gb/s / 8 = 3.6GB/s. That
means your theoretical 6 node 10GbE cluster can sustain up to 1.8GB/s
of client write throughput.

If you are talking about 10x HDD based OSDs per node then a single
10GbE network is probably ok: 6 (nodes) x 10 (OSDs) x 100MB/s
(optimistic max write throughput per HDD) / 3 (replica count) =
1.9GB/s
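
The same arithmetic condensed (all figures are the assumptions above: 6 nodes,
~9.8Gb/s usable per 10GbE NIC, 3x replication, ~100MB/s per HDD):

$ echo "6 * 9.8 / (3 - 1) / 8" | bc -l       # ~3.7 GB/s of replica bandwidth (the ~3.6GB/s above)
$ echo "6 * 9.8 / (3 - 1) / 8 / 2" | bc -l   # ~1.8 GB/s sustainable client write throughput
$ echo "6 * 10 * 100 / 3" | bc -l            # ~2000 MB/s ceiling from the disks themselves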

HTH.

On 16 November 2017 at 07:45, Sam Huracan  wrote:
> Hi,
>
> We intend build a new Ceph cluster with 6 Ceph OSD hosts, 10 SAS disks every
> host, using 10Gbps NIC for client network, object is replicated 3.
>
> So, how could I sizing the cluster network for best performance?
> As i have read, 3x replicate means 3x bandwidth client network = 30 Gbps, is
> it true? I think it is too much and make great cost
>
> Do you give me a suggestion?
>
> Thanks in advance.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster network bandwidth?

2017-11-16 Thread John Spray
On Thu, Nov 16, 2017 at 3:32 PM, David Turner  wrote:
> That depends on another question.  Does the client write all 3 copies or
> does the client send the copy to the primary OSD and then the primary OSD
> sends the write to the secondaries?  Someone asked this recently, but I
> don't recall if an answer was given.  I'm not actually certain which is the
> case.  If it's the latter then the 10Gb pipe from the client is all you
> need.

The client sends the write to the primary OSD (via the public network)
and the primary OSD sends it on to the two replicas (via the cluster
network).

John

>
> If I had to guess, the client sends the writes to all OSDs, but that maxing
> the 10Gb pipe for 1 client isn't really your concern.  Few use cases would
> have a single client using 100% of the bandwidth.  For RGW, spin up a few
> more RGW daemons and balance them with an LB.  CephFS the clients
> communicate with the OSDs directly and you probably shouldn't use a network
> FS for a single client.  RBD is the likely place where this could happen,
> but few 6 server deployments are being used by a single client using all of
> the RBDs.  What I'm getting at is 3 clients with 10Gb can come pretty close
> to fully saturating the 10Gb ethernet on the cluster.  Likely at least to
> the point where the network pipe is not the bottleneck (OSD node CPU, OSD
> spindle speeds, etc).
>
> On Thu, Nov 16, 2017 at 9:46 AM Sam Huracan 
> wrote:
>>
>> Hi,
>>
>> We intend build a new Ceph cluster with 6 Ceph OSD hosts, 10 SAS disks
>> every host, using 10Gbps NIC for client network, object is replicated 3.
>>
>> So, how could I sizing the cluster network for best performance?
>> As i have read, 3x replicate means 3x bandwidth client network = 30 Gbps,
>> is it true? I think it is too much and make great cost
>>
>> Do you give me a suggestion?
>>
>> Thanks in advance.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster network bandwidth?

2017-11-16 Thread David Turner
Another ML thread currently happening is "[ceph-users] Cluster network
slower than public network" And It has some good information that might be
useful for you.

On Thu, Nov 16, 2017 at 10:32 AM David Turner  wrote:

> That depends on another question.  Does the client write all 3 copies or
> does the client send the copy to the primary OSD and then the primary OSD
> sends the write to the secondaries?  Someone asked this recently, but I
> don't recall if an answer was given.  I'm not actually certain which is the
> case.  If it's the latter then the 10Gb pipe from the client is all you
> need.
>
> If I had to guess, the client sends the writes to all OSDs, but that
> maxing the 10Gb pipe for 1 client isn't really your concern.  Few use cases
> would have a single client using 100% of the bandwidth.  For RGW, spin up a
> few more RGW daemons and balance them with an LB.  CephFS the clients
> communicate with the OSDs directly and you probably shouldn't use a network
> FS for a single client.  RBD is the likely place where this could happen,
> but few 6 server deployments are being used by a single client using all of
> the RBDs.  What I'm getting at is 3 clients with 10Gb can come pretty close
> to fully saturating the 10Gb ethernet on the cluster.  Likely at least to
> the point where the network pipe is not the bottleneck (OSD node CPU, OSD
> spindle speeds, etc).
>
> On Thu, Nov 16, 2017 at 9:46 AM Sam Huracan 
> wrote:
>
>> Hi,
>>
>> We intend build a new Ceph cluster with 6 Ceph OSD hosts, 10 SAS disks
>> every host, using 10Gbps NIC for client network, object is replicated 3.
>>
>> So, how could I sizing the cluster network for best performance?
>> As i have read, 3x replicate means 3x bandwidth client network = 30 Gbps,
>> is it true? I think it is too much and make great cost
>>
>> Do you give me a suggestion?
>>
>> Thanks in advance.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster network bandwidth?

2017-11-16 Thread David Turner
That depends on another question.  Does the client write all 3 copies or
does the client send the copy to the primary OSD and then the primary OSD
sends the write to the secondaries?  Someone asked this recently, but I
don't recall if an answer was given.  I'm not actually certain which is the
case.  If it's the latter then the 10Gb pipe from the client is all you
need.

If I had to guess, the client sends the writes to all OSDs, but that maxing
the 10Gb pipe for 1 client isn't really your concern.  Few use cases would
have a single client using 100% of the bandwidth.  For RGW, spin up a few
more RGW daemons and balance them with an LB.  CephFS the clients
communicate with the OSDs directly and you probably shouldn't use a network
FS for a single client.  RBD is the likely place where this could happen,
but few 6 server deployments are being used by a single client using all of
the RBDs.  What I'm getting at is 3 clients with 10Gb can come pretty close
to fully saturating the 10Gb ethernet on the cluster.  Likely at least to
the point where the network pipe is not the bottleneck (OSD node CPU, OSD
spindle speeds, etc).

On Thu, Nov 16, 2017 at 9:46 AM Sam Huracan 
wrote:

> Hi,
>
> We intend build a new Ceph cluster with 6 Ceph OSD hosts, 10 SAS disks
> every host, using 10Gbps NIC for client network, object is replicated 3.
>
> So, how could I sizing the cluster network for best performance?
> As i have read, 3x replicate means 3x bandwidth client network = 30 Gbps,
> is it true? I think it is too much and make great cost
>
> Do you give me a suggestion?
>
> Thanks in advance.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster network bandwidth?

2017-11-16 Thread Sam Huracan
Hi,

We intend to build a new Ceph cluster with 6 Ceph OSD hosts, 10 SAS disks
per host, using a 10Gbps NIC for the client network; objects are replicated 3x.

So, how should I size the cluster network for best performance?
As I have read, 3x replication means 3x the client network bandwidth = 30 Gbps;
is that true? I think it is too much and adds great cost.

Could you give me a suggestion?

Thanks in advance.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-09-12 Thread Christian Balzer


Please don't remove the ML.
I'm not a support channel and if I reply to mails it is so that
others hopefully will learn from that. 
ML re-added.

On Mon, 11 Sep 2017 16:30:18 +0530 M Ranga Swami Reddy wrote:

> >>> >> Here I have NVMes from Intel. but as the support of these NVMes not
> >>> >> there from Intel, we decided not to use these NVMes as a journal.  
> >>> >
> >>> > You again fail to provide with specific model numbers...  
> >>>
> >>> NEMe - Intel DC P3608  - 1.6TB  
> >>
> >> 3DWPD, so you could put this in front (journal~ of 30 or so of those
> >> Samsungs and it still would last longer.  
> >
> >
> > Sure, I will try this and update the results.
> >
> > Btw,  the "osd bench" showing very number with these SSD based (as
> > compared with HDD baed) OSDs). For ex: HDD based OSDs showing around
> > 500 MB/s and SSD based OSDs showing < 300 MB/s. Its strage to see this
> > results.
> > Any thing do I miss here?  
> 
OSD bench still is the wrong test tool for pretty much anything.
There are no HDDs that write 500MB/s.
So this is either a RAID or something behind a controller with HW cache,
not the 140MB/s or so I'd expect to see with a directly connected HDD.
OSD bench also only writes 1GB by default, something that's easily cached
in such a setup.

The 300MB/s for your EVO SSDs could be the result of how OSD bench works
(sync writes, does it use the journal?) or something as simple and silly as
these SSDs being hooked up to SATA-2 (3Gb/s aka 300MB/s) ports.

> 
> After adding the NVMe drives ad journal, I could see the osd bench
> improved results (showing > 600 MB/s (without NVMe < 300MB/s)..
> 
What exactly did you do?
How many journals per NVMe?
6 or 7 of the EVOs (at 300 MB/s) will saturate the P3608 in a bandwidth
test. 
4 of the EVOs if their 500MB/s write speed can be achieved.

And since you never mentioned any other details, your cluster could also
be network or CPU bound for all we know.

> But  volumes created from SSD pool, not showing any performance
> improvements (like dd o/p, fio, rbd map, rados bench etc)..

If you're using fio with the rbd io module, I found it to be horribly
buggy.
Best real life test is a fio with a VM.
And to test for IOPS (4k ones), bandwidth is most likely NOT what you will
lack in production. 
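
A hedged example of such a test run from inside a VM (the device, runtime and
queue depth are placeholders; do not point this at a device holding data you
care about):

$ fio --name=4k-randwrite --filename=/dev/vdb --direct=1 --ioengine=libaio \
      --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based \
      --group_reporting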

> Do I miss any ceph config. setting to above good performacne numbers?
>
Not likely, no.
 
> Thanks
> Swami
> 
> 
> >>> > No support from Intel suggests that these may be consumer models again.
> >>> >
> >>> > Samsung also makes DC grade SSDs and NVMEs, as Adrian pointed out.
> >>> >  
> >>> >> Btw, if we split this SSD with multiple OSD (for ex: 1 SSD with 4 or 2
> >>> >> OSDs), is  this help any performance numbers?
> >>> >>  
> >>> > Of course not, if anything it will make it worse due to the overhead
> >>> > outside the SSD itself.
> >>> >
> >>> > Christian
> >>> >  
> >>> >> On Sun, Aug 20, 2017 at 9:33 AM, Christian Balzer  
> >>> >> wrote:  
> >>> >> >
> >>> >> > Hello,
> >>> >> >
> >>> >> > On Sat, 19 Aug 2017 23:22:11 +0530 M Ranga Swami Reddy wrote:
> >>> >> >  
> >>> >> >> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
> >>> >> >> MZ-75E4T0B/AM | Samsung
> >>> >> >>  
> >>> >> > And there's your answer.
> >>> >> >
> >>> >> > A bit of googling in the archives here would have shown you that 
> >>> >> > these are
> >>> >> > TOTALLY unsuitable for use with Ceph.
> >>> >> > Not only because of the horrid speed when used with/for Ceph 
> >>> >> > journaling
> >>> >> > (direct/sync I/O) but also their abysmal endurance of 0.04 DWPD over 
> >>> >> > 5
> >>> >> > years.
> >>> >> > Or in other words 160GB/day, which after the Ceph journal double 
> >>> >> > writes
> >>> >> > and FS journals, other overhead and write amplification in general
> >>> >> > probably means less that effective 40GB/day.
> >>> >> >
> >>> >> > In contrast the lowest endurance DC grade SSDs tend to be 0.3 DWPD 
> >>> >> > and
> >>> >> > more commonly 1 DWPD.
> >>> >> > And I'm not buying anything below 3 DWPD for use with Ceph.
> >>> >> >
> >>> >> > Your only chance to improve the speed here is to take the journals 
> >>> >> > off
> >>> >> > them and put them onto fast and durable enough NVMes like the Intel 
> >>> >> > DC P
> >>> >> > 3700 or at worst 3600 types.
> >>> >> >
> >>> >> > That still leaves you with their crappy endurance, only twice as 
> >>> >> > high than
> >>> >> > before with the journals offloaded.
> >>> >> >
> >>> >> > Christian
> >>> >> >  
> >>> >> >> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
> >>> >> >>  wrote:  
> >>> >> >> > Yes, Its in production and used the pg count as per the pg 
> >>> >> >> > calcuator @ ceph.com.
> >>> >> >> >
> >>> >> >> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:  
> >>> >> >> >> Which ssds are used? Are they in production? If so how is your 
> >>> >> >> >> PG Count?
> >>> >> >> >>
> >>> >> >> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> 

Re: [ceph-users] Ceph cluster with SSDs

2017-08-23 Thread Christian Balzer
On Wed, 23 Aug 2017 16:48:12 +0530 M Ranga Swami Reddy wrote:

> On Mon, Aug 21, 2017 at 5:37 PM, Christian Balzer  wrote:
> > On Mon, 21 Aug 2017 17:13:10 +0530 M Ranga Swami Reddy wrote:
> >  
> >> Thank you.
> >> Here I have NVMes from Intel. but as the support of these NVMes not
> >> there from Intel, we decided not to use these NVMes as a journal.  
> >
> > You again fail to provide with specific model numbers...  
> 
> NEMe - Intel DC P3608  - 1.6TB

3DWPD, so you could put this in front (journal) of 30 or so of those
Samsungs and it still would last longer.
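
The "30 or so" follows from the endurance figures quoted in this thread
(a 1.6TB NVMe at 3 DWPD versus a 4TB EVO at roughly 0.04 DWPD):

$ echo "1600 * 3" | bc                      # ~4800 GB/day the NVMe is rated to absorb
$ echo "4000 * 0.04" | bc                   # ~160 GB/day each EVO is rated for
$ echo "(1600 * 3) / (4000 * 0.04)" | bc    # = 30 EVOs' worth of rated endurance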

Christian

> 
> Thanks
> Swami
> 
> > No support from Intel suggests that these may be consumer models again.
> >
> > Samsung also makes DC grade SSDs and NVMEs, as Adrian pointed out.
> >  
> >> Btw, if we split this SSD with multiple OSD (for ex: 1 SSD with 4 or 2
> >> OSDs), is  this help any performance numbers?
> >>  
> > Of course not, if anything it will make it worse due to the overhead
> > outside the SSD itself.
> >
> > Christian
> >  
> >> On Sun, Aug 20, 2017 at 9:33 AM, Christian Balzer  wrote:  
> >> >
> >> > Hello,
> >> >
> >> > On Sat, 19 Aug 2017 23:22:11 +0530 M Ranga Swami Reddy wrote:
> >> >  
> >> >> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
> >> >> MZ-75E4T0B/AM | Samsung
> >> >>  
> >> > And there's your answer.
> >> >
> >> > A bit of googling in the archives here would have shown you that these 
> >> > are
> >> > TOTALLY unsuitable for use with Ceph.
> >> > Not only because of the horrid speed when used with/for Ceph journaling
> >> > (direct/sync I/O) but also their abysmal endurance of 0.04 DWPD over 5
> >> > years.
> >> > Or in other words 160GB/day, which after the Ceph journal double writes
> >> > and FS journals, other overhead and write amplification in general
> >> > probably means less that effective 40GB/day.
> >> >
> >> > In contrast the lowest endurance DC grade SSDs tend to be 0.3 DWPD and
> >> > more commonly 1 DWPD.
> >> > And I'm not buying anything below 3 DWPD for use with Ceph.
> >> >
> >> > Your only chance to improve the speed here is to take the journals off
> >> > them and put them onto fast and durable enough NVMes like the Intel DC P
> >> > 3700 or at worst 3600 types.
> >> >
> >> > That still leaves you with their crappy endurance, only twice as high 
> >> > than
> >> > before with the journals offloaded.
> >> >
> >> > Christian
> >> >  
> >> >> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
> >> >>  wrote:  
> >> >> > Yes, Its in production and used the pg count as per the pg calcuator 
> >> >> > @ ceph.com.
> >> >> >
> >> >> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:  
> >> >> >> Which ssds are used? Are they in production? If so how is your PG 
> >> >> >> Count?
> >> >> >>
> >> >> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> >> >> >> :  
> >> >> >>>
> >> >> >>> Hello,
> >> >> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate 
> >> >> >>> pool for
> >> >> >>> each.
> >> >> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 
> >> >> >>> MB/s
> >> >> >>> and SSD's OSD show around 280MB/s.
> >> >> >>>
> >> >> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40% 
> >> >> >>> high
> >> >> >>> as compared with HDD's OSD bench.
> >> >> >>>
> >> >> >>> Did I miss anything here? Any hint is appreciated.
> >> >> >>>
> >> >> >>> Thanks
> >> >> >>> Swami
> >> >> >>> 
> >> >> >>>
> >> >> >>> ceph-users mailing list
> >> >> >>> ceph-users@lists.ceph.com
> >> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >> >> >>
> >> >> >>
> >> >> >> ___
> >> >> >> ceph-users mailing list
> >> >> >> ceph-users@lists.ceph.com
> >> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >> >>  
> >> >> ___
> >> >> ceph-users mailing list
> >> >> ceph-users@lists.ceph.com
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>  
> >> >
> >> >
> >> > --
> >> > Christian BalzerNetwork/Systems Engineer
> >> > ch...@gol.com   Rakuten Communications  
> >>  
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Rakuten Communications  
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-23 Thread M Ranga Swami Reddy
On Mon, Aug 21, 2017 at 5:37 PM, Christian Balzer  wrote:
> On Mon, 21 Aug 2017 17:13:10 +0530 M Ranga Swami Reddy wrote:
>
>> Thank you.
>> Here I have NVMes from Intel. but as the support of these NVMes not
>> there from Intel, we decided not to use these NVMes as a journal.
>
> You again fail to provide with specific model numbers...

NVMe - Intel DC P3608  - 1.6TB


Thanks
Swami

> No support from Intel suggests that these may be consumer models again.
>
> Samsung also makes DC grade SSDs and NVMEs, as Adrian pointed out.
>
>> Btw, if we split this SSD with multiple OSD (for ex: 1 SSD with 4 or 2
>> OSDs), is  this help any performance numbers?
>>
> Of course not, if anything it will make it worse due to the overhead
> outside the SSD itself.
>
> Christian
>
>> On Sun, Aug 20, 2017 at 9:33 AM, Christian Balzer  wrote:
>> >
>> > Hello,
>> >
>> > On Sat, 19 Aug 2017 23:22:11 +0530 M Ranga Swami Reddy wrote:
>> >
>> >> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
>> >> MZ-75E4T0B/AM | Samsung
>> >>
>> > And there's your answer.
>> >
>> > A bit of googling in the archives here would have shown you that these are
>> > TOTALLY unsuitable for use with Ceph.
>> > Not only because of the horrid speed when used with/for Ceph journaling
>> > (direct/sync I/O) but also their abysmal endurance of 0.04 DWPD over 5
>> > years.
>> > Or in other words 160GB/day, which after the Ceph journal double writes
>> > and FS journals, other overhead and write amplification in general
>> > probably means less that effective 40GB/day.
>> >
>> > In contrast the lowest endurance DC grade SSDs tend to be 0.3 DWPD and
>> > more commonly 1 DWPD.
>> > And I'm not buying anything below 3 DWPD for use with Ceph.
>> >
>> > Your only chance to improve the speed here is to take the journals off
>> > them and put them onto fast and durable enough NVMes like the Intel DC P
>> > 3700 or at worst 3600 types.
>> >
>> > That still leaves you with their crappy endurance, only twice as high than
>> > before with the journals offloaded.
>> >
>> > Christian
>> >
>> >> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
>> >>  wrote:
>> >> > Yes, Its in production and used the pg count as per the pg calcuator @ 
>> >> > ceph.com.
>> >> >
>> >> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
>> >> >> Which ssds are used? Are they in production? If so how is your PG 
>> >> >> Count?
>> >> >>
>> >> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
>> >> >> :
>> >> >>>
>> >> >>> Hello,
>> >> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate pool 
>> >> >>> for
>> >> >>> each.
>> >> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
>> >> >>> and SSD's OSD show around 280MB/s.
>> >> >>>
>> >> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40% high
>> >> >>> as compared with HDD's OSD bench.
>> >> >>>
>> >> >>> Did I miss anything here? Any hint is appreciated.
>> >> >>>
>> >> >>> Thanks
>> >> >>> Swami
>> >> >>> 
>> >> >>>
>> >> >>> ceph-users mailing list
>> >> >>> ceph-users@lists.ceph.com
>> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> >>
>> >> >> ___
>> >> >> ceph-users mailing list
>> >> >> ceph-users@lists.ceph.com
>> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>> > --
>> > Christian BalzerNetwork/Systems Engineer
>> > ch...@gol.com   Rakuten Communications
>>
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-21 Thread Christian Balzer
On Mon, 21 Aug 2017 17:13:10 +0530 M Ranga Swami Reddy wrote:

> Thank you.
> Here I have NVMes from Intel. but as the support of these NVMes not
> there from Intel, we decided not to use these NVMes as a journal.

You again fail to provide specific model numbers...
No support from Intel suggests that these may be consumer models again.

Samsung also makes DC grade SSDs and NVMEs, as Adrian pointed out.

> Btw, if we split this SSD with multiple OSD (for ex: 1 SSD with 4 or 2
> OSDs), is  this help any performance numbers?
> 
Of course not, if anything it will make it worse due to the overhead
outside the SSD itself.

Christian

> On Sun, Aug 20, 2017 at 9:33 AM, Christian Balzer  wrote:
> >
> > Hello,
> >
> > On Sat, 19 Aug 2017 23:22:11 +0530 M Ranga Swami Reddy wrote:
> >  
> >> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
> >> MZ-75E4T0B/AM | Samsung
> >>  
> > And there's your answer.
> >
> > A bit of googling in the archives here would have shown you that these are
> > TOTALLY unsuitable for use with Ceph.
> > Not only because of the horrid speed when used with/for Ceph journaling
> > (direct/sync I/O) but also their abysmal endurance of 0.04 DWPD over 5
> > years.
> > Or in other words 160GB/day, which after the Ceph journal double writes
> > and FS journals, other overhead and write amplification in general
> > probably means less that effective 40GB/day.
> >
> > In contrast the lowest endurance DC grade SSDs tend to be 0.3 DWPD and
> > more commonly 1 DWPD.
> > And I'm not buying anything below 3 DWPD for use with Ceph.
> >
> > Your only chance to improve the speed here is to take the journals off
> > them and put them onto fast and durable enough NVMes like the Intel DC P
> > 3700 or at worst 3600 types.
> >
> > That still leaves you with their crappy endurance, only twice as high than
> > before with the journals offloaded.
> >
> > Christian
> >  
> >> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
> >>  wrote:  
> >> > Yes, Its in production and used the pg count as per the pg calcuator @ 
> >> > ceph.com.
> >> >
> >> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:  
> >> >> Which ssds are used? Are they in production? If so how is your PG Count?
> >> >>
> >> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> >> >> :  
> >> >>>
> >> >>> Hello,
> >> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate pool 
> >> >>> for
> >> >>> each.
> >> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
> >> >>> and SSD's OSD show around 280MB/s.
> >> >>>
> >> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40% high
> >> >>> as compared with HDD's OSD bench.
> >> >>>
> >> >>> Did I miss anything here? Any hint is appreciated.
> >> >>>
> >> >>> Thanks
> >> >>> Swami
> >> >>> 
> >> >>>
> >> >>> ceph-users mailing list
> >> >>> ceph-users@lists.ceph.com
> >> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >> >>
> >> >>
> >> >> ___
> >> >> ceph-users mailing list
> >> >> ceph-users@lists.ceph.com
> >> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>  
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>  
> >
> >
> > --
> > Christian BalzerNetwork/Systems Engineer
> > ch...@gol.com   Rakuten Communications  
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-21 Thread M Ranga Swami Reddy
Thank you.
Here I have NVMes from Intel, but as support for these NVMes is not
available from Intel, we decided not to use these NVMes as a journal.
Btw, if we split this SSD across multiple OSDs (for ex: 1 SSD with 4 or 2
OSDs), does this help any performance numbers?
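For illustration, a rough sketch of how one fast device can be carved into
several OSDs; this assumes a release where ceph-volume's batch mode exists
(Luminous and later), while on Jewel the equivalent is manual partitioning plus
one ceph-disk prepare per partition (device names are placeholders):

# ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1   (Luminous or later)
# ceph-disk prepare /dev/nvme0n1p1                         (Jewel, once per pre-created partition)

Note that this only changes how many OSD daemons share the device; it does not
change the raw speed of the device itself.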

On Sun, Aug 20, 2017 at 9:33 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Sat, 19 Aug 2017 23:22:11 +0530 M Ranga Swami Reddy wrote:
>
>> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
>> MZ-75E4T0B/AM | Samsung
>>
> And there's your answer.
>
> A bit of googling in the archives here would have shown you that these are
> TOTALLY unsuitable for use with Ceph.
> Not only because of the horrid speed when used with/for Ceph journaling
> (direct/sync I/O) but also their abysmal endurance of 0.04 DWPD over 5
> years.
> Or in other words 160GB/day, which after the Ceph journal double writes
> and FS journals, other overhead and write amplification in general
> probably means less that effective 40GB/day.
>
> In contrast the lowest endurance DC grade SSDs tend to be 0.3 DWPD and
> more commonly 1 DWPD.
> And I'm not buying anything below 3 DWPD for use with Ceph.
>
> Your only chance to improve the speed here is to take the journals off
> them and put them onto fast and durable enough NVMes like the Intel DC P
> 3700 or at worst 3600 types.
>
> That still leaves you with their crappy endurance, only twice as high than
> before with the journals offloaded.
>
> Christian
>
>> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
>>  wrote:
>> > Yes, Its in production and used the pg count as per the pg calcuator @ 
>> > ceph.com.
>> >
>> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
>> >> Which ssds are used? Are they in production? If so how is your PG Count?
>> >>
>> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
>> >> :
>> >>>
>> >>> Hello,
>> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate pool for
>> >>> each.
>> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
>> >>> and SSD's OSD show around 280MB/s.
>> >>>
>> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40% high
>> >>> as compared with HDD's OSD bench.
>> >>>
>> >>> Did I miss anything here? Any hint is appreciated.
>> >>>
>> >>> Thanks
>> >>> Swami
>> >>> 
>> >>>
>> >>> ceph-users mailing list
>> >>> ceph-users@lists.ceph.com
>> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-20 Thread Christian Balzer
On Mon, 21 Aug 2017 01:48:49 + Adrian Saul wrote:

> > SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage - MZ-
> > 75E4T0B/AM | Samsung  
> 
> The performance difference between these and the SM or PM863 range is night 
> and day.  I would not use these for anything you care about with performance, 
> particularly IOPS or latency.
> Their write latency is highly variable and even at best is still 5x higher 
> than what the SM863 range does.  When we compared them we could not get them 
> below 6ms and they frequently spiked to much higher values (25-30ms).  With 
> the SM863s they were a constant sub 1ms and didn't fluctuate.  I believe it 
> was the garbage collection on the Evos that causes the issue.  Here was the 
> difference in average latencies from a pool made of half Evo and half SM863:
> 
> Write latency - Evo 7.64ms - SM863 0.55ms
> Read Latency - Evo 2.56ms - SM863  0.16ms
> 
Yup, you get this unpredictable (and thus unsuitable) randomness and
generally higher latency with nearly all consumer SSDs.
And yes, it is typically GC related.

The reason it's so slow with sync writes is with near certainty that their
large DRAM cache is useless with these, as said cache isn't protected
against power failures and thus needs to be bypassed. 
Other consumer SSDs (IIRC Intel 510s amongst them) used to blatantly lie
about sync writes and thus appeared fast while putting your data at
significant risk.

Christian

> Add to that Christian's remarks on the write endurance and they are only good 
> for desktops that wont exercise them that much.   You are far better 
> investing in DC/Enterprise grade devices.
> 
> 
> 
> 
> >
> > On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
> >  wrote:  
> > > Yes, Its in production and used the pg count as per the pg calcuator @  
> > ceph.com.  
> > >
> > > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:  
> > >> Which ssds are used? Are they in production? If so how is your PG Count?
> > >>
> > >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> > >> :  
> > >>>
> > >>> Hello,
> > >>> I am using the Ceph cluster with HDDs and SSDs. Created separate
> > >>> pool for each.
> > >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500
> > >>> MB/s and SSD's OSD show around 280MB/s.
> > >>>
> > >>> Ideally, what I expected was - SSD's OSDs should be at-least 40%
> > >>> high as compared with HDD's OSD bench.
> > >>>
> > >>> Did I miss anything here? Any hint is appreciated.
> > >>>
> > >>> Thanks
> > >>> Swami
> > >>> 
> > >>>
> > >>> ceph-users mailing list
> > >>> ceph-users@lists.ceph.com
> > >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> > >>
> > >>
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >>  
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> Confidentiality: This email and any attachments are confidential and may be 
> subject to copyright, legal or some other professional privilege. They are 
> intended solely for the attention and use of the named addressee(s). They may 
> only be copied, distributed or disclosed with the consent of the copyright 
> owner. If you have received this email by mistake or by breach of the 
> confidentiality clause, please notify the sender immediately by return email 
> and delete or destroy all copies of the email. Any confidentiality, privilege 
> or copyright is not waived or lost because this email has been sent to you by 
> mistake.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-20 Thread Adrian Saul
> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage - MZ-
> 75E4T0B/AM | Samsung

The performance difference between these and the SM or PM863 range is night and 
day.  I would not use these for anything you care about with performance, 
particularly IOPS or latency.
Their write latency is highly variable and even at best is still 5x higher than 
what the SM863 range does.  When we compared them we could not get them below 
6ms and they frequently spiked to much higher values (25-30ms).  With the 
SM863s they were a constant sub 1ms and didn't fluctuate.  I believe it was the 
garbage collection on the Evos that causes the issue.  Here was the difference 
in average latencies from a pool made of half Evo and half SM863:

Write latency - Evo 7.64ms - SM863 0.55ms
Read Latency - Evo 2.56ms - SM863  0.16ms

Add to that Christian's remarks on the write endurance and they are only good 
for desktops that won't exercise them that much. You are far better off investing 
in DC/Enterprise grade devices.
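For anyone who wants to reproduce that kind of comparison, a hedged sketch of a
latency-oriented test; it is destructive to data on the target device, so only
run it against a spare disk, and /dev/sdX is a placeholder:

# fio --name=synclat --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 \
      --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based

The completion latency (clat) percentiles in the fio output are the numbers to
compare between drives.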




>
> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
>  wrote:
> > Yes, Its in production and used the pg count as per the pg calcuator @
> ceph.com.
> >
> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
> >> Which ssds are used? Are they in production? If so how is your PG Count?
> >>
> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> >> :
> >>>
> >>> Hello,
> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate
> >>> pool for each.
> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500
> >>> MB/s and SSD's OSD show around 280MB/s.
> >>>
> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40%
> >>> high as compared with HDD's OSD bench.
> >>>
> >>> Did I miss anything here? Any hint is appreciated.
> >>>
> >>> Thanks
> >>> Swami
> >>> 
> >>>
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Confidentiality: This email and any attachments are confidential and may be 
subject to copyright, legal or some other professional privilege. They are 
intended solely for the attention and use of the named addressee(s). They may 
only be copied, distributed or disclosed with the consent of the copyright 
owner. If you have received this email by mistake or by breach of the 
confidentiality clause, please notify the sender immediately by return email 
and delete or destroy all copies of the email. Any confidentiality, privilege 
or copyright is not waived or lost because this email has been sent to you by 
mistake.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-20 Thread Christian Balzer
On Sun, 20 Aug 2017 08:38:54 +0200 Sinan Polat wrote:

> What has DWPD to do with performance / IOPS? The SSD will just fail earlier, 
> but it should not have any affect on the performance, right?
> 
Nothing, I listed BOTH reasons why these are unsuitable.

You just don't buy something huge like 4TB SSDs and expect to write only
40GB/day to them.

> Correct me if I am wrong, just want to learn.
>
Learning is easy if you're willing to make a little effort.
Just like the OP, you failed to search for "Samsung Evo Ceph" and find all
the bad news, like in this result:

https://forum.proxmox.com/threads/slow-ceph-journal-on-samsung-850-pro.27733/
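The test those threads usually run is a small sequential write with the
sync/direct flags set, which roughly mimics what the filestore journal does; a
minimal sketch (destructive to the target device, /dev/sdX is a placeholder):

# dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

DC-grade SSDs typically sustain this at tens of MB/s or better, while many
consumer models collapse to a few MB/s.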
 
Christian
> 
> > Op 20 aug. 2017 om 06:03 heeft Christian Balzer  het 
> > volgende geschreven:
> > 
> > DWPD  
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-20 Thread Sinan Polat
What does DWPD have to do with performance / IOPS? The SSD will just fail earlier, 
but it should not have any effect on the performance, right?

Correct me if I am wrong, just want to learn.


> Op 20 aug. 2017 om 06:03 heeft Christian Balzer  het volgende 
> geschreven:
> 
> DWPD

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-19 Thread Christian Balzer

Hello,

On Sat, 19 Aug 2017 23:22:11 +0530 M Ranga Swami Reddy wrote:

> SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
> MZ-75E4T0B/AM | Samsung
>
And there's your answer.

A bit of googling in the archives here would have shown you that these are
TOTALLY unsuitable for use with Ceph.
Not only because of the horrid speed when used with/for Ceph journaling
(direct/sync I/O) but also their abysmal endurance of 0.04 DWPD over 5
years.
Or in other words 160GB/day, which after the Ceph journal double writes
and FS journals, other overhead and write amplification in general
probably means less than an effective 40GB/day.
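For anyone following along, the arithmetic behind those figures is roughly:

  4000 GB (4TB drive) x 0.04 DWPD = ~160 GB of warranted writes per day
  160 GB/day / 2 (journal double write) / ~2 (FS journal, other overhead,
  write amplification) = ~40 GB/day effective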

In contrast the lowest endurance DC grade SSDs tend to be 0.3 DWPD and
more commonly 1 DWPD.
And I'm not buying anything below 3 DWPD for use with Ceph.

Your only chance to improve the speed here is to take the journals off
them and put them onto fast and durable enough NVMes like the Intel DC P
3700 or at worst 3600 types.

That still leaves you with their crappy endurance, only twice as high as
before with the journals offloaded.
 
Christian

> On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
>  wrote:
> > Yes, Its in production and used the pg count as per the pg calcuator @ 
> > ceph.com.
> >
> > On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:  
> >> Which ssds are used? Are they in production? If so how is your PG Count?
> >>
> >> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> >> :  
> >>>
> >>> Hello,
> >>> I am using the Ceph cluster with HDDs and SSDs. Created separate pool for
> >>> each.
> >>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
> >>> and SSD's OSD show around 280MB/s.
> >>>
> >>> Ideally, what I expected was - SSD's OSDs should be at-least 40% high
> >>> as compared with HDD's OSD bench.
> >>>
> >>> Did I miss anything here? Any hint is appreciated.
> >>>
> >>> Thanks
> >>> Swami
> >>> 
> >>>
> >>> ceph-users mailing list
> >>> ceph-users@lists.ceph.com
> >>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>  
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-19 Thread M Ranga Swami Reddy
SSD make details : SSD 850 EVO 2.5" SATA III 4TB Memory & Storage -
MZ-75E4T0B/AM | Samsung

On Sat, Aug 19, 2017 at 10:44 PM, M Ranga Swami Reddy
 wrote:
> Yes, Its in production and used the pg count as per the pg calcuator @ 
> ceph.com.
>
> On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
>> Which ssds are used? Are they in production? If so how is your PG Count?
>>
>> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
>> :
>>>
>>> Hello,
>>> I am using the Ceph cluster with HDDs and SSDs. Created separate pool for
>>> each.
>>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
>>> and SSD's OSD show around 280MB/s.
>>>
>>> Ideally, what I expected was - SSD's OSDs should be at-least 40% high
>>> as compared with HDD's OSD bench.
>>>
>>> Did I miss anything here? Any hint is appreciated.
>>>
>>> Thanks
>>> Swami
>>> 
>>>
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-19 Thread M Ranga Swami Reddy
I did not run only "osd bench". I also mapped an rbd image and ran a dd test on
it... there as well I got a much lower number with the image on the SSD pool
as compared with the image on the HDD pool.
As per the SSD datasheet they claim 500 MB/s, but I am getting somewhere
near 50 MB/s with the dd cmd.
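For what it's worth, dd numbers on a mapped image depend heavily on the flags
used; a hedged sketch of the variants worth comparing (device and path are
placeholders):

# dd if=/dev/zero of=/dev/rbd0 bs=4M count=1000 oflag=direct              (bypasses the page cache)
# dd if=/dev/zero of=/dev/rbd0 bs=4k count=10000 oflag=direct,dsync       (syncs every block, worst case for consumer SSDs)
# dd if=/dev/zero of=/mnt/rbd/testfile bs=4M count=1000 conv=fdatasync    (buffered, flushed once at the end)

A plain buffered dd without any sync flag mostly measures RAM, and the 500 MB/s
datasheet figure is a sequential burst number that is not directly comparable
to small sync writes.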


On Fri, Aug 18, 2017 at 6:32 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Fri, 18 Aug 2017 00:00:09 +0200 Mehmet wrote:
>
>> Which ssds are used? Are they in production? If so how is your PG Count?
>>
> What he wrote.
> W/o knowing which apples you're comparing to what oranges, this is
> pointless.
>
> Also testing osd bench is the LEAST relevant test you can do, as it only
> deals with local bandwidth, while what people nearly always want/need in
> the end is IOPS and low latency.
> Which you test best from a real client perspective.
>
> Christian
>
>> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy 
>> :
>> >Hello,
>> >I am using the Ceph cluster with HDDs and SSDs. Created separate pool
>> >for each.
>> >Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
>> >and SSD's OSD show around 280MB/s.
>> >
>> >Ideally, what I expected was - SSD's OSDs should be at-least 40% high
>> >as compared with HDD's OSD bench.
>> >
>> >Did I miss anything here? Any hint is appreciated.
>> >
>> >Thanks
>> >Swami
>> >___
>> >ceph-users mailing list
>> >ceph-users@lists.ceph.com
>> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-19 Thread M Ranga Swami Reddy
Yes, it's in production and we used the pg count as per the pg calculator @ ceph.com.
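For reference, the rule of thumb that calculator is based on is roughly:

  PGs per pool = (number of OSDs x 100) / replica size, rounded up to the next power of two
  e.g. 12 OSDs with size 3: (12 x 100) / 3 = 400 -> 512 PGs

and that total is meant to be split across all pools sharing the same OSDs.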

On Fri, Aug 18, 2017 at 3:30 AM, Mehmet  wrote:
> Which ssds are used? Are they in production? If so how is your PG Count?
>
> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy
> :
>>
>> Hello,
>> I am using the Ceph cluster with HDDs and SSDs. Created separate pool for
>> each.
>> Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
>> and SSD's OSD show around 280MB/s.
>>
>> Ideally, what I expected was - SSD's OSDs should be at-least 40% high
>> as compared with HDD's OSD bench.
>>
>> Did I miss anything here? Any hint is appreciated.
>>
>> Thanks
>> Swami
>> 
>>
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph Cluster attempt to access beyond end of device

2017-08-17 Thread Hauke Homburg
Am 15.08.2017 um 16:34 schrieb ZHOU Yuan:
> Hi Hauke,
>
> It's possibly the XFS issue as discussed in the previous thread. I
> also saw this issue in some JBOD setup, running with RHEL 7.3
>
>
> Sincerely, Yuan
>
> On Tue, Aug 15, 2017 at 7:38 PM, Hauke Homburg
> > wrote:
>
> Hello,
>
>
> I found some error in the Cluster with dmes -T:
>
> attempt to access beyond end of device
>
> I found the following Post:
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html
> 
>
> Is this a Problem with the Size of the Filesystem itself oder "only"
> eine Driver Bug? I ask becaue we habe in each Node 8 HDD with a
> Hardware
> RAID 6 running. In this RAID we have the XFS Partition.
>
> Also we have one big Filesystem in 1 OSD in each Server instead of 1
> Filesystem per HDD at 8 HDD in each Server.
>
> greetings
>
> Hauke
>
>
> --
> www.w3-creative.de 
>
> www.westchat.de 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
>
>

Hello,

I upgraded the CentOS7 kernel to Kernel 4.12

https://www.tecmint.com/install-upgrade-kernel-version-in-centos-7/

After the upgrade the errors are now gone.

Thanks for the help.

Greetings


Hauke

-- 
www.w3-creative.de

www.westchat.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-17 Thread Christian Balzer

Hello,

On Fri, 18 Aug 2017 00:00:09 +0200 Mehmet wrote:

> Which ssds are used? Are they in production? If so how is your PG Count?
>
What he wrote.
W/o knowing which apples you're comparing to what oranges, this is
pointless.

Also testing osd bench is the LEAST relevant test you can do, as it only
deals with local bandwidth, while what people nearly always want/need in
the end is IOPS and low latency.
Which you test best from a real client perspective.
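A couple of hedged examples of such client-side tests (pool and image names are
placeholders; the fio one assumes fio was built with rbd support):

# rados bench -p ssdpool 60 write -b 4096 -t 16 --no-cleanup
# rados bench -p ssdpool 60 rand -t 16
# fio --name=rbdtest --ioengine=rbd --pool=ssdpool --rbdname=test \
      --rw=randwrite --bs=4k --iodepth=32 --direct=1 --runtime=60 --time_based

The rand pass reuses the objects left behind by the write pass, which is why
--no-cleanup is there.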

Christian
 
> Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy 
> :
> >Hello,
> >I am using the Ceph cluster with HDDs and SSDs. Created separate pool
> >for each.
> >Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
> >and SSD's OSD show around 280MB/s.
> >
> >Ideally, what I expected was - SSD's OSDs should be at-least 40% high
> >as compared with HDD's OSD bench.
> >
> >Did I miss anything here? Any hint is appreciated.
> >
> >Thanks
> >Swami
> >___
> >ceph-users mailing list
> >ceph-users@lists.ceph.com
> >http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster with SSDs

2017-08-17 Thread Mehmet
Which ssds are used? Are they in production? If so how is your PG Count?

Am 17. August 2017 20:04:25 MESZ schrieb M Ranga Swami Reddy 
:
>Hello,
>I am using the Ceph cluster with HDDs and SSDs. Created separate pool
>for each.
>Now, when I ran the "ceph osd bench", HDD's OSDs show around 500 MB/s
>and SSD's OSD show around 280MB/s.
>
>Ideally, what I expected was - SSD's OSDs should be at-least 40% high
>as compared with HDD's OSD bench.
>
>Did I miss anything here? Any hint is appreciated.
>
>Thanks
>Swami
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph cluster with SSDs

2017-08-17 Thread M Ranga Swami Reddy
Hello,
I am using a Ceph cluster with HDDs and SSDs, and created a separate pool for each.
Now, when I run "ceph osd bench", the HDD OSDs show around 500 MB/s
and the SSD OSDs show around 280 MB/s.

Ideally, what I expected was that the SSD OSDs should be at least 40% higher
as compared with the HDD OSD bench.

Did I miss anything here? Any hint is appreciated.

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Mandar Naik
Thanks a lot for the reply. To eliminate the issues of the root not being
present and of duplicate entries in the crush map, I have updated my crush map.
Now I have a default root and a crush hierarchy without duplicate entries.

I have now created one pool local to host "ip-10-0-9-233" and another pool
local to host "ip-10-0-9-126", using the respective crush rules pasted below.
After host "ip-10-0-9-233" gets full, requests to write new keys to the pool
on host "ip-10-0-9-126" time out. From the "ceph pg dump" output I see PGs
only getting stored on their respective hosts, so PG interference across pools
does not seem to be an issue, to me at least.

The purpose of keeping one pool local to a host is not locality as such. With
the use case of a single point of solution for both local as well as replicated
data, clients need to know only the pool name during read/write operations.

I am not sure if this use case fits with Ceph. So I am trying to determine if
there is any option in Ceph to make it understand that only one host is full,
so that it could still serve new write requests as long as they do not touch
the OSD that is full.
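In case it is useful, the knobs involved on a Jewel-era cluster look roughly
like this; raising the ratios is only a temporary escape hatch while data is
deleted or rebalanced, not a fix, and newer releases use "ceph osd
set-full-ratio" instead, so treat this as a hedged sketch:

# ceph health detail               (names the full/nearfull OSD)
# ceph osd df                      (per-OSD utilisation)
# ceph pg set_nearfull_ratio 0.90
# ceph pg set_full_ratio 0.97

As far as I understand, the cluster-wide full flag is raised as soon as any
single OSD crosses the full ratio, which is why writes to pools that never
touch that OSD block as well.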


Test output:


#ceph osd dump


epoch 93

fsid 7a238d99-67ed-4610-540a-449043b3c24e

created 2017-08-16 09:34:15.580112

modified 2017-08-16 11:55:40.676234

flags sortbitwise,require_jewel_osds

pool 7 'ip-10-0-9-233-pool' replicated size 1 min_size 1 crush_ruleset 1
object_hash rjenkins pg_num 128 pgp_num 128 last_change 87 flags hashpspool
stripe_width 0

pool 8 'ip-10-0-9-126-pool' replicated size 1 min_size 1 crush_ruleset 2
object_hash rjenkins pg_num 128 pgp_num 128 last_change 92 flags hashpspool
stripe_width 0

max_osd 3


# ceph -s

cluster 7a238d99-67ed-4610-540a-449043b3c24e

health HEALTH_OK

monmap e3: 3 mons at {ip-10-0-9-126=10.0.9.126:6789/0,ip-10-0-9-233=10.0.9.
233:6789/0,ip-10-0-9-250=10.0.9.250:6789/0}

election epoch 8, quorum 0,1,2 ip-10-0-9-126,ip-10-0-9-233,
ip-10-0-9-250

osdmap e93: 3 osds: 3 up, 3 in

flags sortbitwise,require_jewel_osds

  pgmap v679: 256 pgs, 2 pools, 0 bytes data, 0 objects

106 MB used, 134 GB / 134 GB avail

 256 active+clean


# ceph osd tree

ID WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY

-1 0.13197 root default

-5 0.04399 rack ip-10-0-9-233-rack

-3 0.04399  host ip-10-0-9-233

0 0.04399  osd.0 up  1.0   1.0

-7 0.04399 rack ip-10-0-9-126-rack

-6 0.04399  host ip-10-0-9-126

1 0.04399  osd.1 up  1.0   1.0

-9 0.04399 rack ip-10-0-9-250-rack

-8 0.04399  host ip-10-0-9-250

2 0.04399  osd.2 up  1.0   1.0


# ceph osd crush rule list

[

"ip-10-0-9-233_ruleset",

"ip-10-0-9-126_ruleset",

"ip-10-0-9-250_ruleset",

"replicated_ruleset"

]


# ceph osd crush rule dump ip-10-0-9-233_ruleset

{

"rule_id": 0,

"rule_name": "ip-10-0-9-233_ruleset",

"ruleset": 1,

"type": 1,

"min_size": 1,

"max_size": 10,

"steps": [

{

"op": "take",

"item": -5,

"item_name": "ip-10-0-9-233-rack"

},

{

"op": "chooseleaf_firstn",

"num": 0,

"type": "host"

},

{

"op": "emit"

}

]

}



# ceph osd crush rule dump ip-10-0-9-126_ruleset

{

"rule_id": 1,

"rule_name": "ip-10-0-9-126_ruleset",

"ruleset": 2,

"type": 1,

"min_size": 1,

"max_size": 10,

"steps": [

{

"op": "take",

"item": -7,

"item_name": "ip-10-0-9-126-rack"

},

{

"op": "chooseleaf_firstn",

"num": 0,

"type": "host"

},

{

"op": "emit"

}

]

}


# ceph osd crush rule dump replicated_ruleset

{

"rule_id": 4,

"rule_name": "replicated_ruleset",

"ruleset": 4,

"type": 1,

"min_size": 1,

"max_size": 10,

"steps": [

{

"op": "take",

"item": -1,

"item_name": "default"

},

{

"op": "chooseleaf_firstn",

"num": 0,

"type": "host"

},

{

"op": "emit"

}

]


# ceph -s

cluster 7a238d99-67ed-4610-540a-449043b3c24e

health HEALTH_ERR

1 full osd(s)

full,sortbitwise,require_jewel_osds flag(s) set

monmap e3: 3 mons at {ip-10-0-9-126=10.0.9.126:6789/0,ip-10-0-9-233=10.0.9.
233:6789/0,ip-10-0-9-250=10.0.9.250:6789/0}

election epoch 8, quorum 0,1,2 ip-10-0-9-126,ip-10-0-9-233,
ip-10-0-9-250

osdmap e99: 3 osds: 3 up, 3 in

flags full,sortbitwise,require_jewel_osds

  pgmap v920: 256 pgs, 2 pools, 42696 MB data, 2 objects

44844 MB used, 93324 MB / 134 GB avail

 256 active+clean


# ceph osd df

ID WEIGHT  REWEIGHT SIZE   USE AVAIL  %USE  VAR  PGS

0 0.04399  1.0 46056M 43801M  2255M 95.10 3.00 128

1 0.04399  1.0 46056M 36708k 46020M  0.08 0.00 128

2 0.04399  1.0 46056M 34472k 46022M  0.07 0.00   0

  TOTAL   134G 43870M 94298M 31.75

MIN/MAX VAR: 0.00/3.00  STDDEV: 44.80


# ceph df

GLOBAL:

SIZE AVAIL   RAW USED %RAW USED


Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Etienne Menguy
Hi,


Your crushmap has issues.

You don't have any root and you have duplicate entries. Currently you store 
data on a single OSD.


You can manually fix the crushmap by decompiling, editing and compiling.

http://docs.ceph.com/docs/hammer/rados/operations/crush-map/#editing-a-crush-map

(if you have some production data, do a backup first)
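A minimal sketch of that cycle (file names are arbitrary):

# ceph osd getcrushmap -o crush.bin
# crushtool -d crush.bin -o crush.txt
  ... edit crush.txt: add a single root, remove the duplicated host entries ...
# crushtool -c crush.txt -o crush.new
# crushtool -i crush.new --test --show-utilization --rule 0 --num-rep 3   (optional sanity check)
# ceph osd setcrushmap -i crush.new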


Étienne



From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Mandar Naik 
<mandar.p...@gmail.com>
Sent: Wednesday, August 16, 2017 09:39
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% 
of total capacity

Hi,
I just wanted to give a friendly reminder for this issue. I would appreciate if 
someone
can help me out here. Also, please do let me know in case some more information 
is
required here.

On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik 
<mandar.p...@gmail.com> wrote:
Hi Peter,
Thanks a lot for the reply. Please find 'ceph osd df' output here -

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
 2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
 1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
 0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
 0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
 1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
 2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
  TOTAL   134G 43925M 94244M 31.79
MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85

I setup this cluster by manipulating CRUSH map using CLI. I had a default root
before but it gave me an impression that since every rack is under a single
root bucket its marking entire cluster down in case one of the osd is 95% full. 
So I
removed root bucket but that still did not help me. No crush rule is referring
to root bucket in the above mentioned case.

Yes, I added one osd under two racks by linking host bucket from one rack to 
another
using following command -

"osd crush link   [...] :  link existing entry for  
under location "


On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney 
<peter.malo...@brockmann-consult.de> 
wrote:
I think a `ceph osd df` would be useful.

And how did you set up such a cluster? I don't see a root, and you have each 
osd in there more than once...is that even possible?



On 08/10/17 08:46, Mandar Naik wrote:

Hi,

I am evaluating ceph cluster for a solution where ceph could be used for 
provisioning

pools which could be either stored local to a node or replicated across a 
cluster.  This

way ceph could be used as single point of solution for writing both local as 
well as replicated

data. Local storage helps avoid possible storage cost that comes with 
replication factor of more

than one and also provide availability as long as the data host is alive.


So I tried an experiment with Ceph cluster where there is one crush rule which 
replicates data across

nodes and other one only points to a crush bucket that has local ceph osd. 
Cluster configuration

is pasted below.


Here I observed that if one of the disk is full (95%) entire cluster goes into 
error state and stops

accepting new writes from/to other nodes. So ceph cluster became unusable even 
though it’s only

32% full. The writes are blocked even for pools which are not touching the full 
osd.


I have tried playing around crush hierarchy but it did not help. So is it 
possible to store data in the above

manner with Ceph ? If yes could we get cluster state in usable state after one 
of the node is full ?



# ceph df


GLOBAL:

   SIZE AVAIL  RAW USED %RAW USED

   134G 94247M   43922M 31.79


# ceph –s


   cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f

health HEALTH_ERR

   1 full osd(s)

   full,sortbitwise,require_jewel_osds flag(s) set

monmap e3: 3 mons at 
{ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}

   election epoch 14, quorum 0,1,2 
ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210

osdmap e93: 3 osds: 3 up, 3 in

   flags full,sortbitwise,require_jewel_osds

 pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects

   43922 MB used, 94247 MB / 134 GB avail

384 active+clean


# ceph osd tree


ID WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY

-9 0.04399 rack ip-10-0-9-146-rack

-8 0.04399 host ip-10-0-9-146

2 0.04399 osd.2up  1.0  1.0

-7 0.04399 rack ip-10-0-9-210-rack

-6 0.04399 host ip-10-0-9-210

1 0.04399 osd.1up  1.0  1.0

-5 0.04399 rack ip-10-0-9-122-rack

-3 0.04399 host ip-10-0-9-122

0 0.04399 osd.0up  1.0   

Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Luis Periquito
Leaving aside the obvious (that crush map just does not look correct or even
sane, and the policy itself doesn't sound very sane either - but I'm sure
you'll understand the caveats and issues it may present)...

what's most probably happening is that one (or several) pools are using
those same OSDs and the requests to those PGs are also getting blocked
because of the full disk. This means that some (or all) of the
remaining OSDs are waiting for that one to complete some IO, and
whilst those OSDs have IOs waiting to complete they also stop
responding to the IO that was only local.

Adding more insanity to your architecture, what should work (the keyword
here is should, as I never tested, saw or even thought of such a
scenario) would be to have some OSDs for local storage and other OSDs for
distributed storage.

As for the architecture itself, and not knowing much about your use case,
it may make sense to keep the local storage in something other than Ceph -
you're not using any of the facilities it provides you, and you are paying
some overhead - or to use a different strategy for it. IIRC there was
a way to hint data locality to Ceph...
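One way to confirm that theory is to look at what actually maps to the full
OSD; a hedged sketch, where the pool name comes from the earlier output and the
object name is a placeholder:

# ceph pg ls-by-osd 0                          (every PG, and therefore pool, that touches osd.0)
# ceph osd map ip-10-0-9-126-pool someobject   (where a given object of a pool would land)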


On Wed, Aug 16, 2017 at 8:39 AM, Mandar Naik  wrote:
> Hi,
> I just wanted to give a friendly reminder for this issue. I would appreciate
> if someone
> can help me out here. Also, please do let me know in case some more
> information is
> required here.
>
> On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik  wrote:
>>
>> Hi Peter,
>> Thanks a lot for the reply. Please find 'ceph osd df' output here -
>>
>> # ceph osd df
>> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
>>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>>   TOTAL   134G 43925M 94244M 31.79
>> MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85
>>
>> I setup this cluster by manipulating CRUSH map using CLI. I had a default
>> root
>> before but it gave me an impression that since every rack is under a
>> single
>> root bucket its marking entire cluster down in case one of the osd is 95%
>> full. So I
>> removed root bucket but that still did not help me. No crush rule is
>> referring
>> to root bucket in the above mentioned case.
>>
>> Yes, I added one osd under two racks by linking host bucket from one rack
>> to another
>> using following command -
>>
>> "osd crush link   [...] :  link existing entry for
>>  under location "
>>
>>
>> On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney
>>  wrote:
>>>
>>> I think a `ceph osd df` would be useful.
>>>
>>> And how did you set up such a cluster? I don't see a root, and you have
>>> each osd in there more than once...is that even possible?
>>>
>>>
>>>
>>> On 08/10/17 08:46, Mandar Naik wrote:
>>>
>>> Hi,
>>>
>>> I am evaluating ceph cluster for a solution where ceph could be used for
>>> provisioning
>>>
>>> pools which could be either stored local to a node or replicated across a
>>> cluster.  This
>>>
>>> way ceph could be used as single point of solution for writing both local
>>> as well as replicated
>>>
>>> data. Local storage helps avoid possible storage cost that comes with
>>> replication factor of more
>>>
>>> than one and also provide availability as long as the data host is alive.
>>>
>>>
>>> So I tried an experiment with Ceph cluster where there is one crush rule
>>> which replicates data across
>>>
>>> nodes and other one only points to a crush bucket that has local ceph
>>> osd. Cluster configuration
>>>
>>> is pasted below.
>>>
>>>
>>> Here I observed that if one of the disk is full (95%) entire cluster goes
>>> into error state and stops
>>>
>>> accepting new writes from/to other nodes. So ceph cluster became unusable
>>> even though it’s only
>>>
>>> 32% full. The writes are blocked even for pools which are not touching
>>> the full osd.
>>>
>>>
>>> I have tried playing around crush hierarchy but it did not help. So is it
>>> possible to store data in the above
>>>
>>> manner with Ceph ? If yes could we get cluster state in usable state
>>> after one of the node is full ?
>>>
>>>
>>>
>>> # ceph df
>>>
>>>
>>> GLOBAL:
>>>
>>>SIZE AVAIL  RAW USED %RAW USED
>>>
>>>134G 94247M   43922M 31.79
>>>
>>>
>>> # ceph –s
>>>
>>>
>>>cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
>>>
>>> health HEALTH_ERR
>>>
>>>1 full osd(s)
>>>
>>>full,sortbitwise,require_jewel_osds flag(s) set
>>>
>>> monmap e3: 3 mons at
>>> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}
>>>
>>>election epoch 14, quorum 0,1,2
>>> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
>>>
>>> osdmap e93: 3 osds: 3 up, 3 

Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-16 Thread Mandar Naik
Hi,
I just wanted to give a friendly reminder about this issue. I would appreciate
it if someone could help me out here. Also, please do let me know in case some
more information is required.

On Thu, Aug 10, 2017 at 2:41 PM, Mandar Naik  wrote:

> Hi Peter,
> Thanks a lot for the reply. Please find 'ceph osd df' output here -
>
> # ceph osd df
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>  0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
>  1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
>  2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
>   TOTAL   134G 43925M 94244M 31.79
> MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85
>
> I setup this cluster by manipulating CRUSH map using CLI. I had a default
> root
> before but it gave me an impression that since every rack is under a single
> root bucket its marking entire cluster down in case one of the osd is 95%
> full. So I
> removed root bucket but that still did not help me. No crush rule is
> referring
> to root bucket in the above mentioned case.
>
> Yes, I added one osd under two racks by linking host bucket from one rack
> to another
> using following command -
>
> "osd crush link   [...] :  link existing entry for
>  under location "
>
>
> On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney  consult.de> wrote:
>
>> I think a `ceph osd df` would be useful.
>>
>> And how did you set up such a cluster? I don't see a root, and you have
>> each osd in there more than once...is that even possible?
>>
>>
>>
>> On 08/10/17 08:46, Mandar Naik wrote:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> * Hi, I am evaluating ceph cluster for a solution where ceph could be
>> used for provisioning pools which could be either stored local to a node or
>> replicated across a cluster.  This way ceph could be used as single point
>> of solution for writing both local as well as replicated data. Local
>> storage helps avoid possible storage cost that comes with replication
>> factor of more than one and also provide availability as long as the data
>> host is alive.   So I tried an experiment with Ceph cluster where there is
>> one crush rule which replicates data across nodes and other one only points
>> to a crush bucket that has local ceph osd. Cluster configuration is pasted
>> below. Here I observed that if one of the disk is full (95%) entire cluster
>> goes into error state and stops accepting new writes from/to other nodes.
>> So ceph cluster became unusable even though it’s only 32% full. The writes
>> are blocked even for pools which are not touching the full osd. I have
>> tried playing around crush hierarchy but it did not help. So is it possible
>> to store data in the above manner with Ceph ? If yes could we get cluster
>> state in usable state after one of the node is full ? # ceph df GLOBAL:
>>SIZE AVAIL  RAW USED %RAW USED134G 94247M
>>   43922M 31.79 # ceph –scluster
>> ba658a02-757d-4e3c-7fb3-dc4bf944322f health HEALTH_ERR1
>> full osd(s)full,sortbitwise,require_jewel_osds flag(s) set
>> monmap e3: 3 mons at
>> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0
>> }
>>election epoch 14, quorum 0,1,2
>> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210 osdmap e93: 3 osds: 3 up, 3
>> inflags full,sortbitwise,require_jewel_osds  pgmap v630:
>> 384 pgs, 6 pools, 43772 MB data, 18640 objects43922 MB used,
>> 94247 MB / 134 GB avail 384 active+clean # ceph osd tree ID
>> WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY -9
>> 0.04399 rack ip-10-0-9-146-rack -8 0.04399 host ip-10-0-9-146 2 0.04399
>> osd.2up  1.0  1.0 -7 0.04399 rack
>> ip-10-0-9-210-rack -6 0.04399 host ip-10-0-9-210 1 0.04399
>> osd.1up  1.0  1.0 -5 0.04399 rack
>> ip-10-0-9-122-rack -3 0.04399 host ip-10-0-9-122 0 0.04399
>> osd.0up  1.0  1.0 -4 0.13197 rack
>> rep-rack -3 0.04399 host ip-10-0-9-122 0 0.04399 osd.0
>>up  1.0  1.0 -6 0.04399 host
>> ip-10-0-9-210 1 0.04399 osd.1up  1.0
>>  1.0 -8 0.04399 host ip-10-0-9-146 2 0.04399 osd.2
>>up  1.0  1.0 # ceph osd crush rule list [
>>"rep_ruleset","ip-10-0-9-122_ruleset","ip-10-0-9-210_ruleset",
>>"ip-10-0-9-146_ruleset" ] # ceph osd crush rule dump rep_ruleset {
>>"rule_id": 0,"rule_name": "rep_ruleset","ruleset": 0,"type":
>> 1,

Re: [ceph-users] ceph Cluster attempt to access beyond end of device

2017-08-15 Thread ZHOU Yuan
Hi Hauke,

It's possibly the XFS issue as discussed in the previous thread. I also saw
this issue in some JBOD setup, running with RHEL 7.3


Sincerely, Yuan

On Tue, Aug 15, 2017 at 7:38 PM, Hauke Homburg 
wrote:

> Hello,
>
>
> I found some error in the Cluster with dmes -T:
>
> attempt to access beyond end of device
>
> I found the following Post:
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html
>
> Is this a Problem with the Size of the Filesystem itself oder "only"
> eine Driver Bug? I ask becaue we habe in each Node 8 HDD with a Hardware
> RAID 6 running. In this RAID we have the XFS Partition.
>
> Also we have one big Filesystem in 1 OSD in each Server instead of 1
> Filesystem per HDD at 8 HDD in each Server.
>
> greetings
>
> Hauke
>
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph Cluster attempt to access beyond end of device

2017-08-15 Thread David Turner
The error found in that thread, iirc, is that the block size of the disk
does not match the block size of the FS and is trying to access the rest of
a block at the end of a disk. I also remember that the error didn't cause
any problems.
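If someone wants to check whether that mismatch applies to their own setup, a
rough sketch (device and OSD mount point are placeholders):

# blockdev --getss --getpbsz /dev/sdX                          (logical and physical sector size of the device)
# xfs_info /var/lib/ceph/osd/ceph-0 | grep -E 'sectsz|bsize'   (sector and block sizes the XFS filesystem was made with)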

Why RAID 6? With RAID 6 it seems like your cluster would have worse
degraded performance while rebuilding the RAID after a dead drive than if
you only had individual OSDs and lost a drive. I suppose you wouldn't be in
a situation of the cluster seeing degraded objects/PGs, so if your use case
needs that, it makes sense. From an overall architecture standpoint, though,
it doesn't make sense.

On Tue, Aug 15, 2017, 5:39 AM Hauke Homburg  wrote:

> Hello,
>
>
> I found some error in the Cluster with dmes -T:
>
> attempt to access beyond end of device
>
> I found the following Post:
>
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html
>
> Is this a Problem with the Size of the Filesystem itself oder "only"
> eine Driver Bug? I ask becaue we habe in each Node 8 HDD with a Hardware
> RAID 6 running. In this RAID we have the XFS Partition.
>
> Also we have one big Filesystem in 1 OSD in each Server instead of 1
> Filesystem per HDD at 8 HDD in each Server.
>
> greetings
>
> Hauke
>
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph Cluster attempt to access beyond end of device

2017-08-15 Thread Hauke Homburg
Hello,


I found some errors in the cluster with dmesg -T:

attempt to access beyond end of device

I found the following Post:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg39101.html

Is this a problem with the size of the filesystem itself or "only"
a driver bug? I ask because we have in each node 8 HDDs with a hardware
RAID 6 running. In this RAID we have the XFS partition.

Also we have one big filesystem in 1 OSD in each server instead of 1
filesystem per HDD (8 HDDs in each server).

greetings

Hauke


-- 
www.w3-creative.de

www.westchat.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Cluster with Deep Scrub Error

2017-08-14 Thread Hauke Homburg
Am 04.07.2017 um 17:58 schrieb Etienne Menguy:
> rados list-inconsistent-ob

Hello,


Sorry for my late reply. We installed some new servers and now we have
osd pool default size = 3.

At this point I tried again to repair the PG with ceph pg repair and ceph pg
deep-scrub. I also tried again to delete the rados objects as described in
https://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/.

So after Ceph synced the 3 objects I was thinking that the cluster would now
repair itself.

I then tried to touch the previously deleted object on the primary OSD, and on
all 3 OSDs, restarted the OSDs and ran ceph pg repair again.

Today I ran ceph health detail and saw that nothing has happened.

The last entries in the logs are the errors from the first event.

What can I do to repair this?
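For completeness, the usual sequence around the list-inconsistent tooling
mentioned above looks roughly like this on Jewel (the pool name is a
placeholder, pg 1.4c is taken from the query below); repair is not guaranteed
to pick the good copy, so check the inconsistent-obj output first:

# rados list-inconsistent-pg <pool>
# rados list-inconsistent-obj 1.4c --format=json-pretty
# ceph pg deep-scrub 1.4c
# ceph pg repair 1.4c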

A ceph pg query shows:

{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"epoch": 3442,
"up": [
6,
4,
5
],
"acting": [
6,
4,
5
],
"actingbackfill": [
"4",
"5",
"6"
],
"info": {
"pgid": "1.4c",
"last_update": "3257'297879",
"last_complete": "3257'297879",
"log_tail": "3017'294194",
"last_user_version": 297274,
"last_backfill": "MAX",
"last_backfill_bitwise": 1,
"purged_snaps": "[1~3]",
"history": {
"epoch_created": 30,
"last_epoch_started": 3439,
"last_epoch_clean": 3439,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 3437,
"same_interval_since": 3438,
"same_primary_since": 3433,
"last_scrub": "3257'297879",
"last_scrub_stamp": "2017-08-10 07:11:19.390861",
"last_deep_scrub": "3257'297879",
"last_deep_scrub_stamp": "2017-08-10 07:11:19.390861",
"last_clean_scrub_stamp": "2017-07-06 19:55:03.540097"
},
"stats": {
"version": "3257'297879",
"reported_seq": "646820",
"reported_epoch": "3439",
"state": "active+clean+inconsistent",
"last_fresh": "2017-08-14 18:40:20.668895",
"last_change": "2017-08-14 18:40:20.668895",
"last_active": "2017-08-14 18:40:20.668895",
"last_peered": "2017-08-14 18:40:20.668895",
"last_clean": "2017-08-14 18:40:20.668895",
"last_became_active": "2017-08-14 18:40:20.624556",
"last_became_peered": "2017-08-14 18:40:20.624556",
"last_unstale": "2017-08-14 18:40:20.668895",
"last_undegraded": "2017-08-14 18:40:20.668895",
"last_fullsized": "2017-08-14 18:40:20.668895",
"mapping_epoch": 3437,
"log_start": "3017'294194",
"ondisk_log_start": "3017'294194",
"created": 30,
"last_epoch_clean": 3439,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "3257'297879",
"last_scrub_stamp": "2017-08-10 07:11:19.390861",
"last_deep_scrub": "3257'297879",
"last_deep_scrub_stamp": "2017-08-10 07:11:19.390861",
"last_clean_scrub_stamp": "2017-07-06 19:55:03.540097",
"log_size": 3685,
"ondisk_log_size": 3685,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"stat_sum": {
"num_bytes": 202010947584,
"num_objects": 48309,
"num_object_clones": 0,
"num_object_copies": 144927,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 48309,
"num_whiteouts": 0,
"num_read": 1033913,
"num_read_kb": 128529336,
"num_write": 547200,
"num_write_kb": 754707388,
"num_scrub_errors": 2,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 2,
"num_objects_recovered": 280690,
"num_bytes_recovered": 1170817835008,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
 

Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-10 Thread Mandar Naik
Hi Peter,
Thanks a lot for the reply. Please find 'ceph osd df' output here -

# ceph osd df
ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  PGS
 2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
 1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
 0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
 0 0.04399  1.0 46056M 43851M  2205M 95.21 2.99 192
 1 0.04399  1.0 46056M 40148k 46017M  0.09 0.00 384
 2 0.04399  1.0 46056M 35576k 46021M  0.08 0.00   0
  TOTAL   134G 43925M 94244M 31.79
MIN/MAX VAR: 0.00/2.99  STDDEV: 44.85

I set up this cluster by manipulating the CRUSH map using the CLI. I had a
default root before, but it gave me the impression that, since every rack is
under a single root bucket, it marks the entire cluster down in case one of
the OSDs is 95% full. So I removed the root bucket, but that still did not
help me. No crush rule refers to the root bucket in the above mentioned case.

Yes, I added one OSD under two racks by linking the host bucket from one rack
to another using the following command -

"osd crush link   [...] :  link existing entry for 
under location "
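As an illustration with the bucket names from the tree above, the invocation
would look something like this (hypothetical, double-check against your own
crush map before running it):

# ceph osd crush link ip-10-0-9-122 rack=rep-rack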


On Thu, Aug 10, 2017 at 1:40 PM, Peter Maloney <
peter.malo...@brockmann-consult.de> wrote:

> I think a `ceph osd df` would be useful.
>
> And how did you set up such a cluster? I don't see a root, and you have
> each osd in there more than once...is that even possible?
>
>
>
> On 08/10/17 08:46, Mandar Naik wrote:

Re: [ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-10 Thread Peter Maloney
I think a `ceph osd df` would be useful.

And how did you set up such a cluster? I don't see a root, and you have
each osd in there more than once...is that even possible?


On 08/10/17 08:46, Mandar Naik wrote:
>
> Hi,
>
> I am evaluating ceph cluster for a solution where ceph could be used
> for provisioning
>
> pools which could be either stored local to a node or replicated
> across a cluster.  This
>
> way ceph could be used as single point of solution for writing both
> local as well as replicated
>
> data. Local storage helps avoid possible storage cost that comes with
> replication factor of more
>
> than one and also provide availability as long as the data host is
> alive.  
>
>
> So I tried an experiment with Ceph cluster where there is one crush
> rule which replicates data across
>
> nodes and other one only points to a crush bucket that has local ceph
> osd. Cluster configuration
>
> is pasted below.
>
>
> Here I observed that if one of the disk is full (95%) entire cluster
> goes into error state and stops
>
> accepting new writes from/to other nodes. So ceph cluster became
> unusable even though it’s only
>
> 32% full. The writes are blocked even for pools which are not touching
> the full osd.
>
>
> I have tried playing around crush hierarchy but it did not help. So is
> it possible to store data in the above
>
> manner with Ceph ? If yes could we get cluster state in usable state
> after one of the node is full ?
>
>
>
> # ceph df
>
>
> GLOBAL:
>
>SIZE AVAIL  RAW USED %RAW USED
>
>134G 94247M   43922M 31.79
>
>
> # ceph –s
>
>
>cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
>
> health HEALTH_ERR
>
>1 full osd(s)
>
>full,sortbitwise,require_jewel_osds flag(s) set
>
> monmap e3: 3 mons at
> {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0
> }
>
>election epoch 14, quorum 0,1,2
> ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
>
> osdmap e93: 3 osds: 3 up, 3 in
>
>flags full,sortbitwise,require_jewel_osds
>
>  pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects
>
>43922 MB used, 94247 MB / 134 GB avail
>
> 384 active+clean
>
>
> # ceph osd tree
>
>
> ID WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
>
> -9 0.04399 rack ip-10-0-9-146-rack
>
> -8 0.04399 host ip-10-0-9-146
>
> 2 0.04399 osd.2up  1.0  1.0
>
> -7 0.04399 rack ip-10-0-9-210-rack
>
> -6 0.04399 host ip-10-0-9-210
>
> 1 0.04399 osd.1up  1.0  1.0
>
> -5 0.04399 rack ip-10-0-9-122-rack
>
> -3 0.04399 host ip-10-0-9-122
>
> 0 0.04399 osd.0up  1.0  1.0
>
> -4 0.13197 rack rep-rack
>
> -3 0.04399 host ip-10-0-9-122
>
> 0 0.04399 osd.0up  1.0  1.0
>
> -6 0.04399 host ip-10-0-9-210
>
> 1 0.04399 osd.1up  1.0  1.0
>
> -8 0.04399 host ip-10-0-9-146
>
> 2 0.04399 osd.2up  1.0  1.0
>
>
> # ceph osd crush rule list
>
> [
>
>"rep_ruleset",
>
>"ip-10-0-9-122_ruleset",
>
>"ip-10-0-9-210_ruleset",
>
>"ip-10-0-9-146_ruleset"
>
> ]
>
>
> # ceph osd crush rule dump rep_ruleset
>
> {
>
>"rule_id": 0,
>
>"rule_name": "rep_ruleset",
>
>"ruleset": 0,
>
>"type": 1,
>
>"min_size": 1,
>
>"max_size": 10,
>
>"steps": [
>
>{
>
>"op": "take",
>
>"item": -4,
>
>"item_name": "rep-rack"
>
>},
>
>{
>
>"op": "chooseleaf_firstn",
>
>"num": 0,
>
>"type": "host"
>
>},
>
>{
>
>"op": "emit"
>
>}
>
>]
>
> }
>
>
> # ceph osd crush rule dump ip-10-0-9-122_ruleset
>
> {
>
>"rule_id": 1,
>
>"rule_name": "ip-10-0-9-122_ruleset",
>
>"ruleset": 1,
>
>"type": 1,
>
>"min_size": 1,
>
>"max_size": 10,
>
>"steps": [
>
>{
>
>"op": "take",
>
>"item": -5,
>
>"item_name": "ip-10-0-9-122-rack"
>
>},
>
>{
>
>"op": "chooseleaf_firstn",
>
>"num": 0,
>
>"type": "host"
>
>},
>
>{
>
>"op": "emit"
>
>}
>
>]
>
> }
>
>
> -- 
> Thanks,
> Mandar Naik.
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 


Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.malo...@brockmann-consult.de
Internet: http://www.brockmann-consult.de



[ceph-users] Ceph cluster in error state (full) with raw usage 32% of total capacity

2017-08-10 Thread Mandar Naik
Hi,

I am evaluating ceph cluster for a solution where ceph could be used for
provisioning pools which could be either stored local to a node or replicated
across a cluster.  This way ceph could be used as single point of solution for
writing both local as well as replicated data. Local storage helps avoid
possible storage cost that comes with replication factor of more than one and
also provide availability as long as the data host is alive.

So I tried an experiment with Ceph cluster where there is one crush rule which
replicates data across nodes and other one only points to a crush bucket that
has local ceph osd. Cluster configuration is pasted below.

Here I observed that if one of the disk is full (95%) entire cluster goes into
error state and stops accepting new writes from/to other nodes. So ceph cluster
became unusable even though it’s only 32% full. The writes are blocked even for
pools which are not touching the full osd.

I have tried playing around crush hierarchy but it did not help. So is it
possible to store data in the above manner with Ceph ? If yes could we get
cluster state in usable state after one of the node is full ?

# ceph df
GLOBAL:
    SIZE AVAIL  RAW USED %RAW USED
    134G 94247M   43922M     31.79

# ceph -s
    cluster ba658a02-757d-4e3c-7fb3-dc4bf944322f
     health HEALTH_ERR
            1 full osd(s)
            full,sortbitwise,require_jewel_osds flag(s) set
     monmap e3: 3 mons at {ip-10-0-9-122=10.0.9.122:6789/0,ip-10-0-9-146=10.0.9.146:6789/0,ip-10-0-9-210=10.0.9.210:6789/0}
            election epoch 14, quorum 0,1,2 ip-10-0-9-122,ip-10-0-9-146,ip-10-0-9-210
     osdmap e93: 3 osds: 3 up, 3 in
            flags full,sortbitwise,require_jewel_osds
      pgmap v630: 384 pgs, 6 pools, 43772 MB data, 18640 objects
            43922 MB used, 94247 MB / 134 GB avail
                 384 active+clean

# ceph osd tree
ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-9 0.04399 rack ip-10-0-9-146-rack
-8 0.04399     host ip-10-0-9-146
 2 0.04399         osd.2                up  1.0      1.0
-7 0.04399 rack ip-10-0-9-210-rack
-6 0.04399     host ip-10-0-9-210
 1 0.04399         osd.1                up  1.0      1.0
-5 0.04399 rack ip-10-0-9-122-rack
-3 0.04399     host ip-10-0-9-122
 0 0.04399         osd.0                up  1.0      1.0
-4 0.13197 rack rep-rack
-3 0.04399     host ip-10-0-9-122
 0 0.04399         osd.0                up  1.0      1.0
-6 0.04399     host ip-10-0-9-210
 1 0.04399         osd.1                up  1.0      1.0
-8 0.04399     host ip-10-0-9-146
 2 0.04399         osd.2                up  1.0      1.0

# ceph osd crush rule list
[
    "rep_ruleset",
    "ip-10-0-9-122_ruleset",
    "ip-10-0-9-210_ruleset",
    "ip-10-0-9-146_ruleset"
]

# ceph osd crush rule dump rep_ruleset
{
    "rule_id": 0,
    "rule_name": "rep_ruleset",
    "ruleset": 0,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -4,
            "item_name": "rep-rack"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

# ceph osd crush rule dump ip-10-0-9-122_ruleset
{
    "rule_id": 1,
    "rule_name": "ip-10-0-9-122_ruleset",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -5,
            "item_name": "ip-10-0-9-122-rack"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}

-- 
Thanks,
Mandar Naik.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph cluster experiencing major performance issues

2017-08-08 Thread Nick Fisk


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mclean, Patrick
> Sent: 08 August 2017 20:13
> To: David Turner <drakonst...@gmail.com>; ceph-us...@ceph.com
> Cc: Colenbrander, Roelof <roderick.colenbran...@sony.com>; Payno,
> Victor <victor.pa...@sony.com>; Yip, Rae <rae....@sony.com>
> Subject: Re: [ceph-users] ceph cluster experiencing major performance issues
> 
> On 08/08/17 10:50 AM, David Turner wrote:
> > Are you also seeing osds marking themselves down for a little bit and
> > then coming back up?  There are 2 very likely problems
> > causing/contributing to this.  The first is if you are using a lot of
> > snapshots.  Deleting snapshots is a very expensive operation for your
> > cluster and can cause a lot of slowness.  The second is PG subfolder
> > splitting.  This will show as blocked requests and osds marking
> > themselves down and coming back up a little later without any errors
> > in the log.  I linked a previous thread where someone was having these
> > problems where both causes were investigated.
> >
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36923.html
> 
> We are not seeing OSDs marking themselves down a little bit and coming
> back as far as we can tell. We will do some more investigation into this.
> 
> We are creating and deleting quite a few snapshots, is there anything we can
> do to make this less expensive? We are going to attempt to create less
> snapshots in our systems, but unfortunately we have to create a fair number
> due to our use case.

That's most likely your problem. Upgrade to 10.2.9 and enable the snap trim
sleep option on your OSDs at somewhere around 0.1; it has a massive effect on
snapshot removal.
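
A minimal sketch of what that looks like in practice; the ceph.conf stanza and
the runtime injection below are the two usual ways to set it, and 0.1 is simply
the value suggested above:

[osd]
osd_snap_trim_sleep = 0.1

# or, without restarting the OSDs:
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.1'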

> 
> Is slow snapshot deletion likely to cause a slow backlog of purged snaps? In
> some cases we are seeing ~40k snaps still in cached_removed_snaps.
> 
> > If you have 0.94.9 or 10.2.5 or later, then you can split your PG
> > subfolders sanely while your osds are temporarily turned off using the
> > 'ceph-objectstore-tool apply-layout-settings'.  There are a lot of
> > ways to skin the cat of snap trimming, but it depends greatly on your use
> > case.
> 
> We are currently running 10.2.5, and are planning to update to 10.2.9 at
> some point soon. Our clients are using the 4.9 kernel RBD driver (which sort
> of forces us to keep our snapshot count down below 510), we are currently
> testing the possibility of using the nbd-rbd driver as an alternative.
> 
> > On Mon, Aug 7, 2017 at 11:49 PM Mclean, Patrick
> > <patrick.mcl...@sony.com <mailto:patrick.mcl...@sony.com>> wrote:
> >
> > High CPU utilization and inexplicably slow I/O requests
> >
> > We have been having similar performance issues across several ceph
> > clusters. When all the OSDs are up in the cluster, it can stay HEALTH_OK
> > for a while, but eventually performance worsens and becomes (at first
> > intermittently, but eventually continually) HEALTH_WARN due to slow I/O
> > request blocked for longer than 32 sec. These slow requests are
> > accompanied by "currently waiting for rw locks", but we have not found
> > any network issue that normally is responsible for this warning.
> >
> > Examining the individual slow OSDs from `ceph health detail` has been
> > unproductive; there don't seem to be any slow disks and if we stop the
> > OSD the problem just moves somewhere else.
> >
> > We also think this trends with increased number of RBDs on the clusters,
> > but not necessarily a ton of Ceph I/O. At the same time, user %CPU time
> > spikes up to 95-100%, at first frequently and then consistently,
> > simultaneously across all cores. We are running 12 OSDs on a 2.2 GHz CPU
> > with 6 cores and 64GiB RAM per node.
> >
> > ceph1 ~ $ sudo ceph status
> > cluster ----
> >  health HEALTH_WARN
> > 547 requests are blocked > 32 sec
> >  monmap e1: 3 mons at
> > {cephmon1.XXX=XXX.XXX.XXX.XXX:/0,cephmon1.XXX=XXX.XXX.XXX.XX:/0,cephmon1.XXX=XXX.XXX.XXX.XXX:/0}
> > election epoch 16, quorum 0,1,2
> > cephmon1.XXX,cephmon1.XXX,cephmon1.XXX
> >  osdmap e577122: 72 osds: 68 up, 68 in
> > flags sortbitwise,require_jewel_osds
> >   pgmap v6799002: 4096 p

Re: [ceph-users] ceph cluster experiencing major performance issues

2017-08-08 Thread Mclean, Patrick
On 08/08/17 10:50 AM, David Turner wrote:
> Are you also seeing osds marking themselves down for a little bit and
> then coming back up?  There are 2 very likely problems
> causing/contributing to this.  The first is if you are using a lot of
> snapshots.  Deleting snapshots is a very expensive operation for your
> cluster and can cause a lot of slowness.  The second is PG subfolder
> splitting.  This will show as blocked requests and osds marking
> themselves down and coming back up a little later without any errors in
> the log.  I linked a previous thread where someone was having these
> problems where both causes were investigated.
> 
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36923.html  

We are not seeing OSDs marking themselves down a little bit and coming
back as far as we can tell. We will do some more investigation into this.

We are creating and deleting quite a few snapshots, is there anything we
can do to make this less expensive? We are going to attempt to create
less snapshots in our systems, but unfortunately we have to create a
fair number due to our use case.

Is slow snapshot deletion likely to cause a slow backlog of purged
snaps? In some cases we are seeing ~40k snaps still in cached_removed_snaps.
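
A rough way to keep an eye on that backlog, for anyone reproducing this: the
removed-snaps interval set is printed per pool in the OSD map, so its growth
can be eyeballed with something like the following (exact output format varies
by release):

ceph osd dump | grep removed_snaps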

> If you have 0.94.9 or 10.2.5 or later, then you can split your PG
> subfolders sanely while your osds are temporarily turned off using the
> 'ceph-objectstore-tool apply-layout-settings'.  There are a lot of ways
> to skin the cat of snap trimming, but it depends greatly on your use case.

We are currently running 10.2.5, and are planning to update to 10.2.9 at
some point soon. Our clients are using the 4.9 kernel RBD driver (which
sort of forces us to keep our snapshot count down below 510), we are
currently testing the possibility of using the nbd-rbd driver as an
alternative.
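
For reference, the userspace mapping being evaluated there looks roughly like
this (the pool and image names are placeholders; rbd-nbd has shipped with the
Ceph packages since Jewel):

# map through the nbd kernel module instead of krbd, so image features
# are handled by librbd in userspace
rbd-nbd map rbd/vm-disk-01
# ... use /dev/nbd0 ...
rbd-nbd unmap /dev/nbd0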

> On Mon, Aug 7, 2017 at 11:49 PM Mclean, Patrick wrote:
> 
> High CPU utilization and inexplicably slow I/O requests
> 
> We have been having similar performance issues across several ceph
> clusters. When all the OSDs are up in the cluster, it can stay HEALTH_OK
> for a while, but eventually performance worsens and becomes (at first
> intermittently, but eventually continually) HEALTH_WARN due to slow I/O
> request blocked for longer than 32 sec. These slow requests are
> accompanied by "currently waiting for rw locks", but we have not found
> any network issue that normally is responsible for this warning.
> 
> Examining the individual slow OSDs from `ceph health detail` has been
> unproductive; there don't seem to be any slow disks and if we stop the
> OSD the problem just moves somewhere else.
> 
> We also think this trends with increased number of RBDs on the clusters,
> but not necessarily a ton of Ceph I/O. At the same time, user %CPU time
> spikes up to 95-100%, at first frequently and then consistently,
> simultaneously across all cores. We are running 12 OSDs on a 2.2 GHz CPU
> with 6 cores and 64GiB RAM per node.
> 
> ceph1 ~ $ sudo ceph status
> cluster ----
>  health HEALTH_WARN
> 547 requests are blocked > 32 sec
>  monmap e1: 3 mons at
> 
> {cephmon1.XXX=XXX.XXX.XXX.XXX:/0,cephmon1.XXX=XXX.XXX.XXX.XX:/0,cephmon1.XXX=XXX.XXX.XXX.XXX:/0}
> election epoch 16, quorum 0,1,2
> 
> cephmon1.XXX,cephmon1.XXX,cephmon1.XXX
>  osdmap e577122: 72 osds: 68 up, 68 in
> flags sortbitwise,require_jewel_osds
>   pgmap v6799002: 4096 pgs, 4 pools, 13266 GB data, 11091 kobjects
> 126 TB used, 368 TB / 494 TB avail
> 4084 active+clean
>   12 active+clean+scrubbing+deep
>   client io 113 kB/s rd, 11486 B/s wr, 135 op/s rd, 7 op/s wr
> 
> ceph1 ~ $ vmstat 5 5
> procs ---memory-- ---swap-- -io -system--
> --cpu-
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
> id wa st
> 27  1  0 3112660 165544 3626169200   472  127401 22
> 1 76  1  0
> 25  0  0 3126176 165544 3624650800   858 12692 12122 110478
> 97  2  1  0  0
> 22  0  0 3114284 165544 3625813600 1  6118 9586 118625
> 97  2  1  0  0
> 11  0  0 3096508 165544 3627624400 8  6762 10047 188618
> 89  3  8  0  0
> 18  0  0 2990452 165544 3638404800  1209 21170 11179 179878
> 85  4 11  0  0
> 
> There is no apparent memory shortage, and none of the HDDs or SSDs show
> consistently high utilization, slow service times, or any other form of
> hardware saturation, other than user CPU utilization. Can CPU starvation

Re: [ceph-users] ceph cluster experiencing major performance issues

2017-08-08 Thread David Turner
Are you also seeing osds marking themselves down for a little bit and then
coming back up?  There are 2 very likely problems causing/contributing to
this.  The first is if you are using a lot of snapshots.  Deleting
snapshots is a very expensive operation for your cluster and can cause a
lot of slowness.  The second is PG subfolder splitting.  This will show as
blocked requests and osds marking themselves down and coming back up a
little later without any errors in the log.  I linked a previous thread
where someone was having these problems where both causes were investigated.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36923.html

If you have 0.94.9 or 10.2.5 or later, then you can split your PG
subfolders sanely while your osds are temporarily turned off using the
'ceph-objectstore-tool apply-layout-settings'.  There are a lot of ways to
skin the cat of snap trimming, but it depends greatly on your use case.
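
A rough sketch of that offline split, assuming the filestore split/merge
targets have already been raised in ceph.conf, and with osd.12 and the rbd pool
as stand-in names (exact flags can vary slightly between releases):

# the OSD must be stopped first; apply-layout-settings rewrites the
# on-disk PG directory layout to match the current filestore settings
systemctl stop ceph-osd@12
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
    --op apply-layout-settings --pool rbd
systemctl start ceph-osd@12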

On Mon, Aug 7, 2017 at 11:49 PM Mclean, Patrick wrote:

> High CPU utilization and inexplicably slow I/O requests
>
> We have been having similar performance issues across several ceph
> clusters. When all the OSDs are up in the cluster, it can stay HEALTH_OK
> for a while, but eventually performance worsens and becomes (at first
> intermittently, but eventually continually) HEALTH_WARN due to slow I/O
> request blocked for longer than 32 sec. These slow requests are
> accompanied by "currently waiting for rw locks", but we have not found
> any network issue that normally is responsible for this warning.
>
> Examining the individual slow OSDs from `ceph health detail` has been
> unproductive; there don't seem to be any slow disks and if we stop the
> OSD the problem just moves somewhere else.
>
> We also think this trends with increased number of RBDs on the clusters,
> but not necessarily a ton of Ceph I/O. At the same time, user %CPU time
> spikes up to 95-100%, at first frequently and then consistently,
> simultaneously across all cores. We are running 12 OSDs on a 2.2 GHz CPU
> with 6 cores and 64GiB RAM per node.
>
> ceph1 ~ $ sudo ceph status
> cluster ----
>  health HEALTH_WARN
> 547 requests are blocked > 32 sec
>  monmap e1: 3 mons at
>
> {cephmon1.XXX=XXX.XXX.XXX.XXX:/0,cephmon1.XXX=XXX.XXX.XXX.XX:/0,cephmon1.XXX=XXX.XXX.XXX.XXX:/0}
> election epoch 16, quorum 0,1,2
>
> cephmon1.XXX,cephmon1.XXX,cephmon1.XXX
>  osdmap e577122: 72 osds: 68 up, 68 in
> flags sortbitwise,require_jewel_osds
>   pgmap v6799002: 4096 pgs, 4 pools, 13266 GB data, 11091 kobjects
> 126 TB used, 368 TB / 494 TB avail
> 4084 active+clean
>   12 active+clean+scrubbing+deep
>   client io 113 kB/s rd, 11486 B/s wr, 135 op/s rd, 7 op/s wr
>
> ceph1 ~ $ vmstat 5 5
> procs ---memory-- ---swap-- -io -system--
> --cpu-
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
> id wa st
> 27  1  0 3112660 165544 3626169200   472  127401 22
> 1 76  1  0
> 25  0  0 3126176 165544 3624650800   858 12692 12122 110478
> 97  2  1  0  0
> 22  0  0 3114284 165544 3625813600 1  6118 9586 118625
> 97  2  1  0  0
> 11  0  0 3096508 165544 3627624400 8  6762 10047 188618
> 89  3  8  0  0
> 18  0  0 2990452 165544 3638404800  1209 21170 11179 179878
> 85  4 11  0  0
>
> There is no apparent memory shortage, and none of the HDDs or SSDs show
> consistently high utilization, slow service times, or any other form of
> hardware saturation, other than user CPU utilization. Can CPU starvation
> be responsible for "waiting for rw locks"?
>
> Our main pool (the one with all the data) currently has 1024 PGs,
> leaving us room to add more PGs if needed, but we're concerned if we do
> so that we'd consume even more CPU.
>
> We have moved to running Ceph + jemalloc instead of tcmalloc, and that
> has helped with CPU utilization somewhat, but we still see occurrences of
> 95-100% CPU with not terribly high Ceph workload.
>
> Any suggestions of what else to look at? We have a peculiar use case
> where we have many RBDs but only about 1-5% of them are active at the
> same time, and we're constantly making and expiring RBD snapshots. Could
> this lead to aberrant performance? For instance, is it normal to have
> ~40k snaps still in cached_removed_snaps?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph cluster experiencing major performance issues

2017-08-07 Thread Mclean, Patrick
High CPU utilization and inexplicably slow I/O requests

We have been having similar performance issues across several ceph
clusters. When all the OSDs are up in the cluster, it can stay HEALTH_OK
for a while, but eventually performance worsens and becomes (at first
intermittently, but eventually continually) HEALTH_WARN due to slow I/O
request blocked for longer than 32 sec. These slow requests are
accompanied by "currently waiting for rw locks", but we have not found
any network issue that normally is responsible for this warning.

Examining the individual slow OSDs from `ceph health detail` has been
unproductive; there don't seem to be any slow disks and if we stop the
OSD the problem just moves somewhere else.

We also think this trends with increased number of RBDs on the clusters,
but not necessarily a ton of Ceph I/O. At the same time, user %CPU time
spikes up to 95-100%, at first frequently and then consistently,
simultaneously across all cores. We are running 12 OSDs on a 2.2 GHz CPU
with 6 cores and 64GiB RAM per node.

ceph1 ~ $ sudo ceph status
cluster ----
 health HEALTH_WARN
547 requests are blocked > 32 sec
 monmap e1: 3 mons at
{cephmon1.XXX=XXX.XXX.XXX.XXX:/0,cephmon1.XXX=XXX.XXX.XXX.XX:/0,cephmon1.XXX=XXX.XXX.XXX.XXX:/0}
election epoch 16, quorum 0,1,2
cephmon1.XXX,cephmon1.XXX,cephmon1.XXX
 osdmap e577122: 72 osds: 68 up, 68 in
flags sortbitwise,require_jewel_osds
  pgmap v6799002: 4096 pgs, 4 pools, 13266 GB data, 11091 kobjects
126 TB used, 368 TB / 494 TB avail
4084 active+clean
  12 active+clean+scrubbing+deep
  client io 113 kB/s rd, 11486 B/s wr, 135 op/s rd, 7 op/s wr

ceph1 ~ $ vmstat 5 5
procs ---memory-- ---swap-- -io -system--
--cpu-
 r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy
id wa st
27  1  0 3112660 165544 3626169200   472  127401 22 
1 76  1  0
25  0  0 3126176 165544 3624650800   858 12692 12122 110478
97  2  1  0  0
22  0  0 3114284 165544 3625813600 1  6118 9586 118625
97  2  1  0  0
11  0  0 3096508 165544 3627624400 8  6762 10047 188618
89  3  8  0  0
18  0  0 2990452 165544 3638404800  1209 21170 11179 179878
85  4 11  0  0

There is no apparent memory shortage, and none of the HDDs or SSDs show
consistently high utilization, slow service times, or any other form of
hardware saturation, other than user CPU utilization. Can CPU starvation
be responsible for "waiting for rw locks"?

Our main pool (the one with all the data) currently has 1024 PGs,
leaving us room to add more PGs if needed, but we're concerned if we do
so that we'd consume even more CPU.

We have moved to running Ceph + jemalloc instead of tcmalloc, and that
has helped with CPU utilization somewhat, but we still see occurrences of
95-100% CPU with not terribly high Ceph workload.

Any suggestions of what else to look at? We have a peculiar use case
where we have many RBDs but only about 1-5% of them are active at the
same time, and we're constantly making and expiring RBD snapshots. Could
this lead to aberrant performance? For instance, is it normal to have
~40k snaps still in cached_removed_snaps?



[global]

cluster = 
fsid = ----

keyring = /etc/ceph/ceph.keyring

auth_cluster_required = none
auth_service_required = none
auth_client_required = none

mon_host = 
cephmon1.XXX,cephmon1.XXX,cephmon1.XXX
mon_addr = XXX.XXX.XXX.XXX:,XXX.XXX.XXX.XXX:XXX,XXX.XXX.XXX.XXX:
mon_initial_members = 
cephmon1.XXX,cephmon1.XXX,cephmon1.XXX

cluster_network = 172.20.0.0/18
public_network = XXX.XXX.XXX.XXX/20

mon osd full ratio = .80
mon osd nearfull ratio = .60

rbd default format = 2
rbd default order = 25
rbd_default_features = 1

osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 1024
osd pool default pgp num = 1024

osd_recovery_op_priority = 1

osd_max_backfills = 1

osd_recovery_threads = 1

osd_recovery_max_active = 1

osd_recovery_max_single_start = 1

osd_scrub_thread_suicide_timeout = 300

osd scrub during recovery = false

osd scrub sleep = 60
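
If the recovery and backfill throttles above ever need adjusting without an OSD
restart, runtime injection is one option; a sketch with purely illustrative
values:

ceph tell osd.* injectargs '--osd_max_backfills 2 --osd_recovery_max_active 3'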
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

