[ceph-users] Re: PG inconsistent+failed_repair

2021-06-24 Thread Vladimir Prokofev
Follow-up. This is what's written in the logs when I try to fix one PG:
ceph pg repair 3.60

primary osd log:
2021-06-25 01:07:32.146 7fc006339700 -1 log_channel(cluster) log [ERR] :
repair 3.53 3:cb4336ff:::rbd_data.e2d302dd699130.69b3:6aa5 : is
an unexpected clone
2021-06-25 01:07:32.146 7fc006339700 -1 osd.6 pg_epoch: 210926 pg[3.53( v
210926'64271902 (210920'64268839,210926'64271902]
local-lis/les=210882/210883 n=6046 ec=56/56 lis/c 210882/210882 les/c/f
210883/210883/5620 210811/210882/210882) [6,22,12] r=0 lpr=210882
luod=210926'64271899 crt=210926'64271902 lcod 210926'64271898 mlcod
210926'64271898 active+clean+scrubbing+deep+inconsistent+repair]
_scan_snaps no clone_snaps for
3:cb4336ff:::rbd_data.e2d302dd699130.69b3:6aa5 in 6aa5=[6aa5]:{}

secondary osd 1:
2021-06-25 01:07:31.934 7f9eae8fa700 -1 osd.22 pg_epoch: 210926 pg[3.53( v
210926'64271899 (210920'64268839,210926'64271899]
local-lis/les=210882/210883 n=6046 ec=56/56 lis/c 210882/210882 les/c/f
210883/210883/5620 210811/210882/210882) [6,22,12] r=1 lpr=210882 luod=0'0
lua=210881'64265352 crt=210926'64271899 lcod 210926'64271898
active+inconsistent mbc={}] _scan_snaps no clone_snaps for
3:cb4336ff:::rbd_data.e2d302dd699130.69b3:6aa5 in 6aa5=[6aa5]:{}

secondary osd 2:
2021-06-25 01:07:30.828 7f94d6e61700 -1 osd.12 pg_epoch: 210926 pg[3.53( v
210926'64271899 (210920'64268839,210926'64271899]
local-lis/les=210882/210883 n=6046 ec=56/56 lis/c 210882/210882 les/c/f
210883/210883/5620 210811/210882/210882) [6,22,12] r=2 lpr=210882 luod=0'0
lua=210881'64265352 crt=210926'64271899 lcod 210926'64271898
active+inconsistent mbc={}] _scan_snaps no clone_snaps for
3:cb4336ff:::rbd_data.e2d302dd699130.69b3:6aa5 in 6aa5=[6aa5]:{}

And nothing happens, it's still in a failed_repair state.
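For anyone hitting the same state: before removing anything with
ceph-objectstore-tool, I'd first dump what deep-scrub actually recorded for the
PG. A minimal Python sketch, untested here - the rados list-inconsistent-obj /
list-inconsistent-snapset subcommands exist at least in Nautilus, but the JSON
key names ("inconsistents", "object", "errors") are my assumption and may differ:

import json
import subprocess
import sys

# dump scrub findings for a PG, e.g. `python3 dump_inconsistent.py 3.53`
pgid = sys.argv[1] if len(sys.argv) > 1 else "3.53"

for subcmd in ("list-inconsistent-obj", "list-inconsistent-snapset"):
    raw = subprocess.check_output(["rados", subcmd, pgid, "--format=json-pretty"])
    doc = json.loads(raw)
    print("###", subcmd, "epoch", doc.get("epoch"))
    for item in doc.get("inconsistents", []):
        # list-inconsistent-obj nests the name under "object";
        # list-inconsistent-snapset (as far as I recall) keeps it at the top level
        obj = item.get("object", item)
        print("  name=%s snap=%s errors=%s" % (
            obj.get("name", "?"),
            obj.get("snap", "?"),
            ",".join(item.get("errors", [])),
        ))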

Fri, 25 Jun 2021 at 00:36, Vladimir Prokofev :

> Hello.
>
> Today we've experienced a complete CEPH cluster outage - total loss of
> power in the whole infrastructure.
> 6 osd nodes and 3 monitors went down at the same time. CEPH 14.2.10
>
> This resulted in unfound objects, which were "reverted" in a hurry with
> ceph pg  mark_unfound_lost revert
> In retrospect that was probably a mistake as the "have" part stated 0'0.
>
> But then deep-scrubs started and they found inconsistent PGs. We tried
> repairing them, but they just switched to failed_repair.
>
> Here's a log example:
> 2021-06-25 00:08:07.693645 osd.0 [ERR] 3.c shard 6
> 3:3163e703:::rbd_data.be08c566ef438d.2445:head : missing
> 2021-06-25 00:08:07.693710 osd.0 [ERR] repair 3.c
> 3:3163e2ee:::rbd_data.efa86358d15f4a.004b:6ab1 : is an
> unexpected clone
> 2021-06-25 00:11:55.128951 osd.0 [ERR] 3.c repair 1 missing, 0
> inconsistent objects
> 2021-06-25 00:11:55.128969 osd.0 [ERR] 3.c repair 2 errors, 1 fixed
>
> I tried manually deleting conflicting objects from secondary OSDs
> with ceph-objectstore-tool like this:
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c
> rbd_data.efa86358d15f4a.004b:6ab1 remove
> It removes the object, but without any positive impact. I'm pretty sure I
> don't understand the concept.
>
> So currently I have the following thoughts:
>  - is there any doc on the object placement specifics and what all of
> those numbers in their names mean? I've seen objects with similar prefix/mid
> but different suffix, and I have no idea what that means;
>  - I'm actually not sure what the production impact is at this point,
> because everything seems to work so far. So I'm wondering whether it's possible
> to kill the replicas on secondary OSDs with ceph-objectstore-tool and just let
> CEPH create a replica from the primary PG?
>
> I have 8 scrub errors and 4 inconsistent+failed_repair PGs, and I'm afraid
> that further deep scrubs will reveal more errors.
> Any thoughts appreciated.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] PG inconsistent+failed_repair

2021-06-24 Thread Vladimir Prokofev
Hello.

Today we've experienced a complete CEPH cluster outage - total loss of
power in the whole infrastructure.
6 osd nodes and 3 monitors went down at the same time. CEPH 14.2.10

This resulted in unfound objects, which were "reverted" in a hurry with
ceph pg  mark_unfound_lost revert
In retrospect that was probably a mistake as the "have" part stated 0'0.

But then deep-scrubs started and they found inconsistent PGs. We tried
repairing them, but they just switched to failed_repair.

Here's a log example:
2021-06-25 00:08:07.693645 osd.0 [ERR] 3.c shard 6
3:3163e703:::rbd_data.be08c566ef438d.2445:head : missing
2021-06-25 00:08:07.693710 osd.0 [ERR] repair 3.c
3:3163e2ee:::rbd_data.efa86358d15f4a.004b:6ab1 : is an
unexpected clone
2021-06-25 00:11:55.128951 osd.0 [ERR] 3.c repair 1 missing, 0 inconsistent
objects
2021-06-25 00:11:55.128969 osd.0 [ERR] 3.c repair 2 errors, 1 fixed

I tried manually deleting conflicting objects from secondary OSDs
with ceph-objectstore-tool like this:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-22 --pgid 3.c
rbd_data.efa86358d15f4a.004b:6ab1 remove
It removes the object, but without any positive impact. I'm pretty sure I
don't understand the concept.

So currently I have the following thoughts:
 - is there any doc on the object placement specifics and what all of those
numbers in their names mean? I've seen objects with similar prefix/mid but
different suffix, and I have no idea what that means;
 - I'm actually not sure what the production impact is at this point,
because everything seems to work so far. So I'm wondering whether it's possible
to kill the replicas on secondary OSDs with ceph-objectstore-tool and just let
CEPH create a replica from the primary PG?

I have 8 scrub errors and 4 inconsistent+failed_repair PGs, and I'm afraid
that further deep scrubs will reveal more errors.
Any thoughts appreciated.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: JSON output schema

2021-06-16 Thread Vladimir Prokofev
This is a great start, thank you! Basically I can look through the code to
get the keys I need.
But maybe I'm approaching this task the wrong way? Maybe there's already a
better solution for monitoring cluster health details?

Wed, 16 Jun 2021 at 02:47, Anthony D'Atri :

> Before Luminous, mon clock skew was part of the health status JSON.  With
> Luminous and later releases, one has to invoke a separate command to get
> the info.
>
> This is a royal PITA for monitoring / metrics infrastructure and I’ve
> never seen a reason why it was done.
>
> You might find the code here
> https://github.com/digitalocean/ceph_exporter
> useful.  Note that there are multiple branches, which can be confusing.
>
> > On Jun 15, 2021, at 4:21 PM, Vladimir Prokofev  wrote:
> >
> > Good day.
> >
> > I'm writing some code for parsing output data for monitoring purposes.
> > The data is that of "ceph status -f json", "ceph df -f json", "ceph osd
> > perf -f json" and "ceph osd pool stats -f json".
> > I also need support for all major CEPH releases, from Jewel through
> > Pacific.
> >
> > What I've stumbled upon is that:
> > - keys in JSON output are not present if there's no appropriate data.
> > For example the key ['pgmap', 'read_bytes_sec'] will not be present in
> > "ceph status" output if there's no read activity in the cluster;
> > - some keys changed between versions. For example ['health']['status']
> key
> > is not present in Jewel, but is available in all the following versions;
> > vice-versa, key ['osdmap', 'osdmap'] is not present in Pacific, but is in
> > all the previous versions.
> >
> > So I need to get a list of all possible keys for all CEPH releases. Any
> > ideas how this can be achieved? My only thought atm is to build a
> "failing"
> > cluster with all the possible states and get reference data out of it.
> > Not only is this tedious work, since it requires every possible cluster
> > version, but it is also prone to error.
> > Is there any publicly available JSON schema for output?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] JSON output schema

2021-06-15 Thread Vladimir Prokofev
Good day.

I'm writing some code for parsing output data for monitoring purposes.
The data is that of "ceph status -f json", "ceph df -f json", "ceph osd
perf -f json" and "ceph osd pool stats -f json".
I also need support for all major CEPH releases, from Jewel through
Pacific.

What I've stumbled upon is that:
 - keys in the JSON output are not present if there's no corresponding data.
For example, the key ['pgmap', 'read_bytes_sec'] will not be present in the
"ceph status" output if there's no read activity in the cluster;
 - some keys changed between versions. For example, the ['health']['status']
key is not present in Jewel, but is available in all following versions;
vice versa, the key ['osdmap', 'osdmap'] is not present in Pacific, but is in
all previous versions.

So I need to get a list of all possible keys for all CEPH releases. Any
ideas how this can be achieved? My only thought atm is to build a "failing"
cluster with all the possible states and get reference data out of it.
Not only is this tedious work, since it requires every possible cluster
version, but it is also prone to error.
Is there any publicly available JSON schema for output?
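In the meantime, the direction I'm considering is defensive lookups with
per-version fallback paths instead of a fixed schema. A minimal sketch (the
helpers and the fallback paths are only illustrative, not a complete key list):

import json
import subprocess

def ceph_json(*args):
    # e.g. ceph_json("status") runs `ceph status -f json` and parses it
    out = subprocess.check_output(["ceph", *args, "-f", "json"])
    return json.loads(out)

def pick(doc, *paths, default=None):
    # return the value at the first path that exists in the document
    for path in paths:
        node = doc
        for key in path:
            if not isinstance(node, dict) or key not in node:
                break
            node = node[key]
        else:
            return node
    return default

status = ceph_json("status")
# ['health']['status'] on newer releases; Jewel used a different key
# ('overall_status', if I remember correctly)
health = pick(status, ("health", "status"), ("health", "overall_status"))
# absent when there is no read activity, so fall back to 0
read_bps = pick(status, ("pgmap", "read_bytes_sec"), default=0)
print(health, read_bps)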
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] python API

2020-12-20 Thread Vladimir Prokofev
Hello.

I'm trying to write some Python code to analyze my RBD images' storage
usage; the rbd and rados package versions are 14.2.16.
Basically I want the same data that I can get from the shell 'rbd du
' and 'rbd info ' commands, but through the Python API.
At the moment I can connect to the cluster and get an image list from the
pool, but that's about it.
As far as I can tell, there's no API method to obtain image info such as
provisioned/used storage, etc.
Is this really not possible with the current Python API, or am I missing
something?

Documentation used:
https://docs.ceph.com/en/latest/rbd/api/librbdpy/
https://docs.ceph.com/en/latest/rados/api/python/
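For reference, a minimal sketch of the direction that looks workable to me with
the 14.2 bindings (untested here; the pool and image names are placeholders, and
using diff_iterate() with no starting snapshot to sum allocated extents is my
assumption of how 'rbd du' arrives at its numbers):

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("rbd")                             # placeholder pool
    try:
        with rbd.Image(ioctx, "myimage", read_only=True) as img:  # placeholder image
            info = img.stat()          # roughly `rbd info`: size, obj_size, num_objs, ...
            provisioned = img.size()   # provisioned (virtual) size in bytes

            used = [0]
            def extent_cb(offset, length, exists):
                # called for each extent; summing allocated ones approximates `rbd du`
                if exists:
                    used[0] += length

            # from_snapshot=None diffs against an empty image, i.e. all allocated data
            img.diff_iterate(0, provisioned, None, extent_cb)
            print(info["obj_size"], provisioned, used[0])
    finally:
        ioctx.close()
finally:
    cluster.shutdown()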
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] librbdpy examples

2020-11-30 Thread Vladimir Prokofev
Hi.

Just want to point out that if you google for ceph python lib examples, it
leads to a 404:

https://www.google.ru/search?hl=ru=ceph+python+rbd
https://docs.ceph.com/en/latest/rbd/api/librbdpy/

Some 3rd-party sites and the Chinese version work fine though:
http://docs.ceph.org.cn/rbd/librbdpy/
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/block_device_guide/librbd_python
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seriously degraded performance after update to Octopus

2020-11-02 Thread Vladimir Prokofev
Just shooting in the dark here, but you may be affected by a similar issue I
had a while back; it was discussed here:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/ZOPBOY6XQOYOV6CQMY27XM37OC6DKWZ7/

In short, they changed the bluefs_buffered_io setting to false in a
recent Nautilus release. I guess the same was applied to newer releases.
That led to severe performance issues and similar symptoms, i.e. lower
memory usage on OSD nodes. Worth checking out.

Of course, it may be something completely different. You should look into
monitoring all your OSDs separately, checking their utilization, await, and
other parameters, and comparing them to their pre-upgrade values, to
find the root cause.
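For the per-OSD comparison, something along these lines is what I have in mind
(a rough sketch; the exact JSON layout of `ceph osd perf -f json` differs between
releases, and the field names below are from memory, hence the defensive lookups):

import json
import subprocess

raw = json.loads(subprocess.check_output(["ceph", "osd", "perf", "-f", "json"]))
# newer releases wrap the list in "osdstats", older ones keep it at the top level
infos = raw.get("osdstats", raw).get("osd_perf_infos", [])
for entry in sorted(infos, key=lambda e: e.get("id", -1)):
    stats = entry.get("perf_stats", {})
    print("osd.%s commit_latency_ms=%s apply_latency_ms=%s" % (
        entry.get("id"),
        stats.get("commit_latency_ms"),
        stats.get("apply_latency_ms"),
    ))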

Mon, 2 Nov 2020 at 11:55, Marc Roos :

>
> I have been advocating for a long time for publishing test data from some
> basic test cluster against different ceph releases. Just a basic ceph
> cluster that covers most configs and runs the same tests, so you can
> compare just ceph performance. That would mean a lot to smaller
> companies that do not have access to a good test environment. I have
> also asked about this at some ceph seminar.
>
>
>
> -Original Message-
> From: Martin Rasmus Lundquist Hansen [mailto:han...@imada.sdu.dk]
> Sent: Monday, November 02, 2020 7:53 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Seriously degraded performance after update to
> Octopus
>
> Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to
> Octopus (15.2.5), an update that was long overdue. We used the Ansible
> playbooks to perform a rolling update and, except for a few minor
> problems with the Ansible code, the update went well. The Ansible
> playbooks were also used for setting up the cluster in the first place.
> Before updating the Ceph software we also performed a full update of
> CentOS and the Linux kernel (this part of the update had already been
> tested on one of the OSD nodes the week before and we didn't notice any
> problems).
>
> However, after the update we are seeing a serious decrease in
> performance, more than a factor of 10x in some cases. I spent a week
> trying to come up with an explanation or solution, but I am completely
> blank. Independently of Ceph I tested the network performance and the
> performance of the OSD disks, and I am not really seeing any problems
> here.
>
> The specifications of the cluster is:
> - 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU
> @ 1.80GHz, 16 cores, 196 GB RAM)
> - 14x OSD nodes, each with 18 HDDs and 1 NVME (Intel(R) Xeon(R) Gold
> 6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
> - CentOS 7.8 and Kernel 5.4.51
> - 100 Gbps Infiniband
>
> We are collecting various metrics using Prometheus, and on the OSD nodes
> we are seeing some clear differences when it comes to CPU and Memory
> usage. I collected some graphs here: http://mitsted.dk/ceph . After the
> update the system load is highly reduced, there is almost no longer any
> iowait for the CPU, and the free memory is no longer used for Buffers (I
> can confirm that the changes in these metrics are not due to the update
> of CentOS or the Linux kernel). All in all, now the OSD nodes are almost
> completely idle all the time (and so are the monitors). On the linked
> page I also attached two RADOS benchmarks. The first benchmark was
> performed when the cluster was initially configured, and the second is
> the same benchmark after the update to Octopus. When comparing these
> two, it is clear that the performance has changed dramatically. For
> example, in the write test the bandwidth is reduced from 320 MB/s to 21
> MB/s and the number of IOPS has also dropped significantly.
>
> I temporarily tried to disable the firewall and SELinux on all nodes to
> see if it made any difference, but it didn't look like it (I did not
> restart any services during this test, I am not sure if that could be
> necessary).
>
> Any suggestions for finding the root cause of this performance decrease
> would be greatly appreciated.
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: read latency

2020-11-02 Thread Vladimir Prokofev
With sequential reads you get "read ahead" mechanics attached, which helps a
lot.
So let's say you do 4KB seq reads with fio.
By default, Ubuntu, for example, has a 128KB read-ahead size. That means when
you request that 4KB of data, the driver will actually request 128KB. When your
IO is served and you request the next sequential 4KB, it's already in the VM's
memory, so no new read IO is necessary.
All of those 128KB will likely reside on the same OSD, depending on your CEPH
object size.
When you reach the end of that 128KB of data and request the next chunk, it
will once again likely reside in the same rbd object as before, assuming 4MB
object size, so depending on internal mechanics I'm not really familiar with,
that data can be either in the host's memory, or at least in the OSD node's
memory, so no real physical IO will be necessary.
What you're thinking about is the worst-case scenario - when that 128KB is
split between 2 objects residing on 2 different OSDs - well, you just get 2
real physical IOs for your 1 virtual IO, and at that moment you'll have a slower
request, but after that read-ahead helps again for a lot of seq IOs.
In the end, read-ahead with sequential IOs leads to far fewer real
physical reads than random reads, hence the IOPS difference.
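Just to put rough numbers on the read-ahead effect, a back-of-envelope sketch
(not a benchmark; it only assumes the 4KB requests and 128KB read-ahead
mentioned above, the rest is arithmetic):

# physical reads needed for 1 GiB of 4 KiB requests
io_kib = 4
readahead_kib = 128
total_kib = 1024 * 1024            # 1 GiB read in total

logical_ios = total_kib // io_kib
seq_physical = total_kib // readahead_kib   # roughly one physical read per read-ahead window
rand_physical = logical_ios                 # roughly one physical read per request, caches aside

print(logical_ios, seq_physical, rand_physical)
# 262144 logical IOs; ~8192 physical for sequential vs ~262144 for random,
# i.e. a ~32x gap - the same order of magnitude as the 20x IOPS difference reported.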

Mon, 2 Nov 2020 at 06:20, Tony Liu :

> Another confusion about read vs. random read. My understanding is
> that, when fio does a read, it reads from the test file sequentially.
> When it does a random read, it reads from the test file randomly.
> That file read inside the VM comes down to a volume read handled by the RBD
> client, which distributes reads to PGs and eventually to OSDs. So a
> sequential file read inside the VM won't be a sequential read on the OSD disk.
> Is that right?
> Then what difference do seq. and rand. reads make on the OSD disk?
> Is it a rand. read on the OSD disk in both cases?
> Then how to explain the performance difference between seq. and rand.
> reads inside the VM? (seq. read IOPS is 20x that of rand. read; Ceph is
> with 21 HDDs on 3 nodes, 7 on each)
>
> Thanks!
> Tony
> > -Original Message-
> > From: Vladimir Prokofev 
> > Sent: Sunday, November 1, 2020 5:58 PM
> > Cc: ceph-users 
> > Subject: [ceph-users] Re: read latency
> >
> > Not exactly. You can also tune network/software.
> > Network - go for lower latency interfaces. If you have 10G go to 25G or
> > 100G. 40G will not do though, afaik they're just 4x10G so their latency
> > is the same as in 10G.
> > Software - it's closely tied to your network card queues and processor
> > cores. In short - tune affinity so that the packet receive queues and
> > osds processes run on the same corresponding cores. Disabling process
> > power saving features helps a lot. Also watch out for NUMA interference.
> > But overall all these tricks will save you less than switching from HDD
> > to SSD.
> >
> > Mon, 2 Nov 2020 at 02:45, Tony Liu :
> >
> > > Hi,
> > >
> > > AFAIK, the read latency primarily depends on HW latency, not much can
> > > be tuned in SW. Is that right?
> > >
> > > I ran a fio random read with iodepth 1 within a VM backed by Ceph with
> > > HDD OSD and here is what I got.
> > > =
> > >read: IOPS=282, BW=1130KiB/s (1157kB/s)(33.1MiB/30001msec)
> > > slat (usec): min=4, max=181, avg=14.04, stdev=10.16
> > > clat (usec): min=178, max=393831, avg=3521.86, stdev=5771.35
> > >  lat (usec): min=188, max=393858, avg=3536.38, stdev=5771.51
> > > = I checked HDD average latency is 2.9 ms. Looks like
> > > the test result makes perfect sense, doesn't it?
> > >
> > > If I want to get shorter latency (more IOPS), I will have to go for
> > > better disk, eg. SSD. Right?
> > >
> > >
> > > Thanks!
> > > Tony
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > > email to ceph-users-le...@ceph.io
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: read latency

2020-11-01 Thread Vladimir Prokofev
Not exactly. You can also tune the network/software.
Network - go for lower-latency interfaces. If you have 10G, go to 25G or
100G. 40G will not do though; afaik it's just 4x10G, so its latency is
the same as 10G.
Software - it's closely tied to your network card queues and processor
cores. In short, tune affinity so that the packet receive queues and OSD
processes run on the same corresponding cores. Disabling processor power
saving features helps a lot. Also watch out for NUMA interference.
But overall all these tricks will buy you less than switching from HDD to
SSD.

Mon, 2 Nov 2020 at 02:45, Tony Liu :

> Hi,
>
> AFAIK, the read latency primarily depends on HW latency,
> not much can be tuned in SW. Is that right?
>
> I ran a fio random read with iodepth 1 within a VM backed by
> Ceph with HDD OSD and here is what I got.
> =
>read: IOPS=282, BW=1130KiB/s (1157kB/s)(33.1MiB/30001msec)
> slat (usec): min=4, max=181, avg=14.04, stdev=10.16
> clat (usec): min=178, max=393831, avg=3521.86, stdev=5771.35
>  lat (usec): min=188, max=393858, avg=3536.38, stdev=5771.51
> =
> I checked HDD average latency is 2.9 ms. Looks like the test
> result makes perfect sense, doesn't it?
>
> If I want to get shorter latency (more IOPS), I will have to go
> for better disk, eg. SSD. Right?
>
>
> Thanks!
> Tony
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: block.db/block.wal device performance dropped after upgrade to 14.2.10

2020-08-04 Thread Vladimir Prokofev
I really would not focus that much on a particular device model.
Yes, Kingston SSDs are slower for reads; we've known that since we tested them.
But that was before they were used as block.db devices - they were first
intended purely as block.wal devices. This was actually even before bluestore,
so their primary function was to serve as XFS journals.
I see increased read activity on all block.db devices while snaptrims are
active, be they Intel or Kingston; the latter just suffer more.
This read load was not there before the 14.2.10/system upgrade.

I'm actually thinking about rolling back to 14.2.8. Any ideas how safe that
procedure is? I suppose it should be safe since there was no change in the
actual data storage scheme?

Tue, 4 Aug 2020 at 14:33, Vladimir Prokofev :

> > What Kingston SSD model?
>
> === START OF INFORMATION SECTION ===
> Model Family: SandForce Driven SSDs
> Device Model: KINGSTON SE50S3100G
> Serial Number:
> LU WWN Device Id: 
> Firmware Version: 611ABBF0
> User Capacity:100,030,242,816 bytes [100 GB]
> Sector Size:  512 bytes logical/physical
> Rotation Rate:Solid State Device
> Form Factor:  2.5 inches
> Device is:In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:Tue Aug  4 14:31:36 2020 MSK
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> Tue, 4 Aug 2020 at 14:17, Eneko Lacunza :
>
>> Hi Vladimir,
>>
>> What Kingston SSD model?
>>
>> El 4/8/20 a las 12:22, Vladimir Prokofev escribió:
>> > Here's some more insight into the issue.
>> > Looks like the load is triggered because of a snaptrim operation. We
>> have a
>> > backup pool that serves as Openstack cinder-backup storage, performing
>> > snapshot backups every night. Old backups are also deleted every night,
>> so
>> > snaptrim is initiated.
>> > This snaptrim increased load on the block.db devices after upgrade, and
>> > just kills one SSD's performance in particular. It serves as a
>> block.db/wal
>> > device for one of the fatter backup pool OSDs which has more PGs placed
>> > there.
>> > This is a Kingston SSD, and we see this issue on other Kingston SSD
>> > journals too, Intel SSD journals are not that affected, though they too
>> > experience increased load.
>> > Nevertheless, there're now a lot of read IOPS on block.db devices after
>> > upgrade that were not there before.
>> > I wonder how 600 IOPS can destroy SSDs performance that hard.
>> >
>> >> Tue, 4 Aug 2020 at 12:54, Vladimir Prokofev :
>> >
>> >> Good day, cephers!
>> >>
>> >> We've recently upgraded our cluster from 14.2.8 to 14.2.10 release,
>> also
>> >> performing full system packages upgrade(Ubuntu 18.04 LTS).
>> >> After that performance dropped significantly, the main reason being that
>> >> journal SSDs now have no merges, huge queues, and increased
>> latency.
>> >> There's a few screenshots in attachments. This is for an SSD journal
>> that
>> >> supports block.db/block.wal for 3 spinning OSDs, and it looks like
>> this for
>> >> all our SSD block.db/wal devices across all nodes.
>> >> Any ideas what may cause that? Maybe I've missed something important in
>> >> release notes?
>> >>
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> --
>> Eneko Lacunza   | Tel.  943 569 206
>>  | Email elacu...@binovo.es
>> Director Técnico| Site. https://www.binovo.es
>> BINOVO IT HUMAN PROJECT S.L | Dir.  Astigarragako Bidea, 2 - 2º izda.
>> Oficina 10-11, 20180 Oiartzun
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: block.db/block.wal device performance dropped after upgrade to 14.2.10

2020-08-04 Thread Vladimir Prokofev
> What Kingston SSD model?

=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: KINGSTON SE50S3100G
Serial Number:
LU WWN Device Id: 
Firmware Version: 611ABBF0
User Capacity:100,030,242,816 bytes [100 GB]
Sector Size:  512 bytes logical/physical
Rotation Rate:Solid State Device
Form Factor:  2.5 inches
Device is:In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:Tue Aug  4 14:31:36 2020 MSK
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Tue, 4 Aug 2020 at 14:17, Eneko Lacunza :

> Hi Vladimir,
>
> What Kingston SSD model?
>
> El 4/8/20 a las 12:22, Vladimir Prokofev escribió:
> > Here's some more insight into the issue.
> > Looks like the load is triggered because of a snaptrim operation. We
> have a
> > backup pool that serves as Openstack cinder-backup storage, performing
> > snapshot backups every night. Old backups are also deleted every night,
> so
> > snaptrim is initiated.
> > This snaptrim increased load on the block.db devices after upgrade, and
> > just kills one SSD's performance in particular. It serves as a
> block.db/wal
> > device for one of the fatter backup pool OSDs which has more PGs placed
> > there.
> > This is a Kingston SSD, and we see this issue on other Kingston SSD
> > journals too, Intel SSD journals are not that affected, though they too
> > experience increased load.
> > Nevertheless, there're now a lot of read IOPS on block.db devices after
> > upgrade that were not there before.
> > I wonder how 600 IOPS can destroy SSDs performance that hard.
> >
> > Tue, 4 Aug 2020 at 12:54, Vladimir Prokofev :
> >
> >> Good day, cephers!
> >>
> >> We've recently upgraded our cluster from 14.2.8 to 14.2.10 release, also
> >> performing full system packages upgrade(Ubuntu 18.04 LTS).
> >> After that performance dropped significantly, the main reason being that
> >> journal SSDs now have no merges, huge queues, and increased latency.
> >> There's a few screenshots in attachments. This is for an SSD journal
> that
> >> supports block.db/block.wal for 3 spinning OSDs, and it looks like this
> for
> >> all our SSD block.db/wal devices across all nodes.
> >> Any ideas what may cause that? Maybe I've missed something important in
> >> release notes?
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> --
> Eneko Lacunza   | Tel.  943 569 206
>  | Email elacu...@binovo.es
> Director Técnico| Site. https://www.binovo.es
> BINOVO IT HUMAN PROJECT S.L | Dir.  Astigarragako Bidea, 2 - 2º izda.
> Oficina 10-11, 20180 Oiartzun
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: block.db/block.wal device performance dropped after upgrade to 14.2.10

2020-08-04 Thread Vladimir Prokofev
Here's some more insight into the issue.
Looks like the load is triggered because of a snaptrim operation. We have a
backup pool that serves as Openstack cinder-backup storage, performing
snapshot backups every night. Old backups are also deleted every night, so
snaptrim is initiated.
This snaptrim increased the load on the block.db devices after the upgrade, and
it just kills one SSD's performance in particular. That SSD serves as the
block.db/wal device for one of the fatter backup pool OSDs, which has more PGs
placed on it.
This is a Kingston SSD, and we see this issue on other Kingston SSD
journals too; Intel SSD journals are not as affected, though they too
experience increased load.
Nevertheless, there are now a lot of read IOPS on the block.db devices after the
upgrade that were not there before.
I wonder how 600 IOPS can destroy an SSD's performance that badly.

Tue, 4 Aug 2020 at 12:54, Vladimir Prokofev :

> Good day, cephers!
>
> We've recently upgraded our cluster from 14.2.8 to 14.2.10 release, also
> performing full system packages upgrade(Ubuntu 18.04 LTS).
> After that performance dropped significantly, the main reason being that
> journal SSDs now have no merges, huge queues, and increased latency.
> There's a few screenshots in attachments. This is for an SSD journal that
> supports block.db/block.wal for 3 spinning OSDs, and it looks like this for
> all our SSD block.db/wal devices across all nodes.
> Any ideas what may cause that? Maybe I've missed something important in
> release notes?
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] block.db/block.wal device performance dropped after upgrade to 14.2.10

2020-08-04 Thread Vladimir Prokofev
Good day, cephers!

We've recently upgraded our cluster from 14.2.8 to the 14.2.10 release, also
performing a full system package upgrade (Ubuntu 18.04 LTS).
After that, performance dropped significantly, the main reason being that
the journal SSDs now have no merges, huge queues, and increased latency.
There are a few screenshots in the attachments. This is for an SSD journal that
supports block.db/block.wal for 3 spinning OSDs, and it looks like this for
all our SSD block.db/wal devices across all nodes.
Any ideas what may cause that? Maybe I've missed something important in the
release notes?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fwd: BlueFS spillover yet again

2020-02-05 Thread Vladimir Prokofev
Thank you for the insight.
> If you're using the default options for rocksdb, then the size of L3 will
be 25GB
Where does this number come from? Is there any documentation I can read?
I want to have a better understanding of how the DB size is calculated.
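To make the arithmetic concrete, a small sketch assuming the rocksdb defaults
Ceph ends up with (max_bytes_for_level_base = 256 MiB and
max_bytes_for_level_multiplier = 10; these can be overridden via
bluestore_rocksdb_options, so treat the numbers as approximate):

base_mib = 256        # max_bytes_for_level_base (rocksdb default)
multiplier = 10       # max_bytes_for_level_multiplier (rocksdb default)

cumulative_mib = 0
for level in range(1, 5):
    size_mib = base_mib * multiplier ** (level - 1)
    cumulative_mib += size_mib
    print("L%d ~ %6.1f GiB   (L1..L%d ~ %6.1f GiB)"
          % (level, size_mib / 1024, level, cumulative_mib / 1024))

# L1 ~ 0.25 GiB, L2 ~ 2.5 GiB, L3 ~ 25 GiB, L4 ~ 250 GiB.
# A 20 GiB block.db fits L1+L2, but a level is only kept on the fast device if
# it fits there entirely, so anything that would land in L3 spills to the slow
# device - which matches the 30GB+ recommendation in the quoted replies.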

Wed, 5 Feb 2020 at 18:53, Moreno, Orlando :

> Hi Vladimir,
>
> If you're using the default options for rocksdb, then the size of L3 will
> be 25GB. Since your block-db is only 20GB and L3 can only be filled if the
> entire level's size is available, bluefs will begin spillover. Like Igor
> said, having 30GB+ is recommended if you want to host up to 3 levels of
> rocksdb in the SSD.
>
> Thanks,
> Orlando
>
> -Original Message-
> From: Igor Fedotov 
> Sent: Wednesday, February 5, 2020 7:04 AM
> To: Vladimir Prokofev ; ceph-users@ceph.io
> Subject: [ceph-users] Re: Fwd: BlueFS spillover yet again
>
> Hi Vladimir,
>
> there have been plenty of discussions/recommendations around db volume size
> selection here.
>
> In short, it's advised to have a DB volume of 30-64GB for most use cases.
>
> Thanks,
>
> Igor
>
> On 2/5/2020 4:21 PM, Vladimir Prokofev wrote:
> > Cluster upgraded from 12.2.12 to 14.2.5. All went smooth, except
> > BlueFS spillover warning.
> > We create OSDs with ceph-deploy, command goes like this:
> > ceph-deploy osd create --bluestore --data /dev/sdf --block-db
> > /dev/sdb5 --block-wal /dev/sdb6 ceph-osd3 where block-db and block-wal
> > are SSD partitions.
> > Default ceph-deploy settings created partitions ~1GB which is, of
> > course, too small. So we redeployed OSDs using manually partitioned
> > SSD for block-db/block-wal with sizes of 20G/5G respectively.
> > But now we still get BlueFS spillover warning for redeployed OSDs:
> >   osd.10 spilled over 2.4 GiB metadata from 'db' device (2.8 GiB
> > used of
> > 19 GiB) to slow device
> >   osd.19 spilled over 3.7 GiB metadata from 'db' device (2.7 GiB
> > used of
> > 19 GiB) to slow device
> >   osd.20 spilled over 4.2 GiB metadata from 'db' device (2.6 GiB
> > used of
> > 19 GiB) to slow device
> > osd size is 1.8 TiB.
> >
> > These OSDs are used primarily for RBD as a backup drives, so a lot of
> > snapshots held there. They also have RGW pool assigned to them, but it
> > has no data.
> > I know of sizing recommendations[1] for block-db/block-wal, but I
> > assumed since it's primarily RBD 1%(~20G) should be enough.
> > Also, compaction stats doesn't make sense to me[2]. It states that sum
> > of DB is only 5.08GB, that should be placed on block-db without a
> problem?
> > Am I understanding all this wrong? Should block-db size be greater in
> > my case?
> >
> > [1]
> > https://docs.ceph.com/docs/master/rados/configuration/bluestore-config
> > -ref/#sizing
> > [2] osd.10 logs as an example
> > https://pastebin.com/hC6w6jSn
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> > email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: BlueFS spillover yet again

2020-02-05 Thread Vladimir Prokofev
The cluster was upgraded from 12.2.12 to 14.2.5. All went smoothly, except for a
BlueFS spillover warning.
We create OSDs with ceph-deploy; the command goes like this:
ceph-deploy osd create --bluestore --data /dev/sdf --block-db /dev/sdb5
--block-wal /dev/sdb6 ceph-osd3
where block-db and block-wal are SSD partitions.
The default ceph-deploy settings created ~1GB partitions, which are, of course,
too small. So we redeployed the OSDs using a manually partitioned SSD for
block-db/block-wal with sizes of 20G/5G respectively.
But now we still get a BlueFS spillover warning for the redeployed OSDs:
 osd.10 spilled over 2.4 GiB metadata from 'db' device (2.8 GiB used of
19 GiB) to slow device
 osd.19 spilled over 3.7 GiB metadata from 'db' device (2.7 GiB used of
19 GiB) to slow device
 osd.20 spilled over 4.2 GiB metadata from 'db' device (2.6 GiB used of
19 GiB) to slow device
osd size is 1.8 TiB.

These OSDs are used primarily for RBD as backup drives, so a lot of
snapshots are held there. They also have an RGW pool assigned to them, but it
has no data.
I know of the sizing recommendations[1] for block-db/block-wal, but I assumed
that since it's primarily RBD, 1% (~20G) should be enough.
Also, the compaction stats don't make sense to me[2]. They state that the sum of
the DB is only 5.08GB, which should fit on the block-db without a problem?
Am I understanding all this wrong? Should the block-db size be greater in my
case?

[1]
https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
[2] osd.10 logs as an example
https://pastebin.com/hC6w6jSn
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io