[ceph-users] RFI: Prometheus, Etc, Services - Optimum Number To Run

2024-01-19 Thread duluxoz

Hi All,

In regards to the monitoring services on a Ceph cluster (i.e. Prometheus, 
Grafana, Alertmanager, Loki, Node-Exporter, Promtail, etc.): how many 
instances should/can we run for fault-tolerance purposes? I can't seem 
to recall that advice being in the doco anywhere (but of course, I 
probably missed it).


I'm concerned about HA on those services - will they continue to run if 
the Ceph Node they're on fails?


At the moment we're running only one instance of each in the cluster, but 
several Ceph nodes are capable of running each service - i.e. three nodes are 
configured as candidates, but the placement only has count:1.
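
For reference, raising that count is a one-line change in the cephadm 
placement; a minimal sketch, assuming count:2 purely for illustration:

ceph orch apply prometheus --placement="count:2"

# or the equivalent service spec, applied with "ceph orch apply -i prometheus.yaml"
service_type: prometheus
placement:
  count: 2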


This is on the latest version of Reef using cephadm (if it makes a 
huge difference :-) ).


So any advice would be greatly appreciated, including whether we should 
be running any services not mentioned (not Mgr, Mon, OSD, or iSCSI, 
obviously :-) ).


Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd map snapshot, mount lv, node crash

2024-01-19 Thread Ilya Dryomov
On Fri, Jan 19, 2024 at 2:38 PM Marc  wrote:
>
> Am I doing something weird when I do the following on a ceph node (nautilus, el7):
>
> rbd snap ls vps-test -p rbd
> rbd map vps-test@vps-test.snap1 -p rbd
>
> mount -o ro /dev/mapper/VGnew-LVnew /mnt/disk <--- reset/reboot ceph node

Hi Marc,

It's not clear where /dev/mapper/VGnew-LVnew points to, but in
principle this should work.  There is nothing weird about mapping
a snapshot, so I'd suggest capturing the kernel panic splat for
debugging.
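
A hard reset usually loses the in-memory kernel log, so a hedged way to make
sure the splat survives on el7 is kdump plus a persistent journal (package and
service names are the stock defaults; crashkernel= must be on the kernel
command line for kdump to work):

# keep the journal across reboots so "journalctl -k -b -1" has something to show
mkdir -p /var/log/journal && systemctl restart systemd-journald

# capture a vmcore on panic
yum install -y kexec-tools && systemctl enable --now kdump

# after the next crash, pull the previous boot's kernel messages
journalctl -k -b -1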

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-19 Thread Mark Nelson

HI Roman,

The fact that changing the pg_num for the index pool drops the latency 
back down might be a clue.  Do you have a lot of deletes happening on 
this cluster?  If you have a lot of deletes and long pauses between 
writes, you could be accumulating tombstones that you have to keep 
iterating over during bucket listing.  Those get cleaned up during 
compaction.  If there are no writes, you might not be compacting the 
tombstones away enough.  Just a theory, but when you rearrange the PG 
counts, Ceph does a bunch of writes to move the data around, triggering 
compaction, and deleting the tombstones.
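
One hedged way to test that theory without touching pg_num is to trigger a
manual compaction on the index-pool OSDs and see whether latency resets the
same way (the OSD id is a placeholder; compacting everything at once can be
heavy, so do it off-peak):

# compact a single suspect OSD
ceph tell osd.12 compact

# or all OSDs
ceph tell osd.* compact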


In v17.2.7 we enabled a feature that automatically performs a compaction 
if too many tombstones are present during iteration in RocksDB.  It 
might be worth upgrading to see if it helps (you might have to try 
tweaking the settings if the defaults aren't helping enough).  The PR is 
here:


https://github.com/ceph/ceph/pull/50893


Mark


On 1/16/24 04:22, Roman Pashin wrote:

Hello Ceph users,

we see a strange issue on a recent Ceph installation, v17.6.2. We store
data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an
NVMe partition. Benchmarks didn't expose any issues with the cluster, but since
we placed production load on it we see constantly growing OSD read latency
(osd_read_latency) on the SSD disks (where the index pool is located). Latency
grows day by day, yet the disks are not even 50% utilized. Interestingly, when
we move the index pool from SSD to NVMe disks (disk space allows it for now),
OSD latency drops to zero and starts climbing again from scratch. We also
noticed that any change of pg_num for the index pool (from 256 to 128, for
instance) drops latency to zero as well, and it then starts growing again
(https://postimg.cc/5YHk9bby).

 From the client's perspective it looks like each operation takes longer and
longer day after day, and operation time drops each time we make some change
to the index pool. I've enabled debug_optracker 10/0 and it shows that the OSD
spends most of its time in the `queued_for_pg` state, while physical disk
utilization is only about 10-20%. Per the logs the longest operation is
ListBucket, and it is strange that with fewer than 100'000 items in the bucket,
a listing even with 'max_keys=1' takes 3-40 seconds.
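
For what it's worth, a hedged way to pull the slowest recent ops straight from
an OSD's admin socket (osd.59 is just one of the OSDs named in the subops line
below):

# slowest recently completed ops, with per-event timestamps like the ones below
ceph daemon osd.59 dump_historic_ops

# what is queued right now
ceph daemon osd.59 dump_ops_in_flight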

If it matters client is Apache Flink doing checkpoints via S3 protocol.

Here is an example of operation with debug_optracking logs:

2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call 

[ceph-users] Re: OSD read latency grows over time

2024-01-19 Thread Roman Pashin
Hi Stefan,

> Do you make use of a separate db partition as well? And if so, where is
> it stored?

No, only the WAL is on a separate NVMe partition. I'm not sure whether
ceph-ansible can install Ceph with the DB on a separate device on
v17.6.2.

> Do you only see latency increase in reads? And not writes?

Exactly, I see it on reads only. Write latency looks pretty constant.

> Not sure what metrics you are looking at, but remember that some metrics
> are "long running averages" (from the start of the daemon). If you
> restart the daemon it might look like things dramatically changed, while
> in real life that need not be the case.

Metrics are the standard Ceph metrics "ceph_osd_op_r_latency_sum" and
"ceph_osd_op_r_latency_count", and the graph shows "rate(latency_sum[1m]) /
rate(latency_count[1m])". The thing is that not only are the metrics growing,
client response times are growing with them. But whenever we do any
transformation on the index pool (migration to other OSDs or changing
pg_num), latency drops and then starts growing again.
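
For completeness, the PromQL behind that graph, plus a hedged topk variant
(just a convenience for spotting outlier OSDs, not something we currently
graph):

# per-OSD average read latency over the last minute
rate(ceph_osd_op_r_latency_sum[1m]) / rate(ceph_osd_op_r_latency_count[1m])

# the ten worst offenders
topk(10, rate(ceph_osd_op_r_latency_sum[1m]) / rate(ceph_osd_op_r_latency_count[1m]))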

I haven't caught what is causing this yet, but I see that sometimes latency
drops without any manual intervention and starts rising again (as on this
graph: https://postimg.cc/p9ys3yX5).
--
Thank you,
Roman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance impact of Heterogeneous environment

2024-01-19 Thread Mark Nelson

On 1/18/24 03:40, Frank Schilder wrote:


For multi- vs. single-OSD per flash drive decision the following test might be 
useful:

We found dramatic improvements using multiple OSDs per flash drive with Octopus 
*if* the bottleneck is the kv_sync_thread. Apparently each OSD has only one, and 
when saturated this thread effectively serializes otherwise asynchronous IO.

There was a dev discussion about having more kv_sync_threads per OSD daemon by 
splitting up the RocksDB instances per PG, but I don't know if this ever materialized.


I advocated for it at one point, but Sage was pretty concerned about how 
much we'd be disturbing Bluestore's write path. Instead, Adam ended up 
implementing the column family sharding inside RocksDB which got us some 
(but not all) of the benefit.


A lot of the work that has gone into refactoring the RocksDB settings in 
Reef has been to help mitigate some of the overhead in the kv sync 
thread.  The gist of it is that we are trying to balance keeping the 
memtables large enough to keep short-lived items like pg log entries 
from regularly leaking into the DB, while simultaneously keeping the 
memtables as small as possible to reduce the number of comparisons 
RocksDB needs to do to keep them in sorted order during inserts (which 
is a significant overhead in the kv sync thread during heavy small 
random writes).  This is also part of the reason that Igor was 
experimenting with implementing a native BlueStore WAL rather than 
relying on the one in RocksDB.
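
If you want to see what an OSD is actually running with, a hedged way to
inspect the effective RocksDB tuning on a live daemon (osd.0 is a placeholder):

# via the admin socket on the host running osd.0
ceph daemon osd.0 config get bluestore_rocksdb_options

# or centrally, via the mon
ceph config show osd.0 bluestore_rocksdb_options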




My guess is that for good NVMe drives it is possible that a single 
kv_sync_thread can saturate the device, and then there will be no advantage to 
having more OSDs per device. On not-so-good drives (SATA/SAS flash), multi-OSD 
deployments are usually better, because the on-disk controller requires 
concurrency to saturate the drive. It's not possible to saturate typical 
SAS/SATA SSDs with iodepth=1.


Oddly enough, we do see some effect on small random reads.  The way that 
the async msgr / shards / threads interact doesn't scale well past about 
14-16 CPU threads (and increasing the shard/thread counts has 
complicated effects; it may not always help).  If you look at the 1 vs 
2 OSDs per NVMe article on the ceph.io blog, you'll see that once you hit 
14 CPU threads the 2 OSD/NVMe configuration keeps scaling while the 
single-OSD configuration tops out.





With good NVMe drives I have seen fio tests with direct IO saturate the drive 
with 4K random IO at iodepth=1. You need enough PCIe lanes per drive for that, 
and I could imagine that here 1 OSD/drive is sufficient. For such drives, 
storage access quickly becomes CPU bound, so some benchmarking taking all 
system properties into account is required. If you are already CPU bound (too 
many NVMe drives per core - many standard servers with 24+ NVMe drives have that 
property) there is no point adding extra CPU load with more OSD daemons.

Don't just look at single disks, look at the whole system.
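
A hedged sketch of that kind of fio test (device path and runtime are
assumptions; as written it is read-only, but double-check the target before
switching --rw to randwrite, which is destructive):

fio --name=lat-test --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
    --rw=randread --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based \
    --group_reporting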

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Bailey Allison 
Sent: Thursday, January 18, 2024 12:36 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous environment

+1 to this, great article and great research. Something we've been keeping a 
very close eye on ourselves.

Overall we've mostly settled on the old keep-it-simple-stupid methodology, with 
good results, especially as the benefits have shrunk the more recent your Ceph 
version. We've been rocking a single OSD per NVMe, but as always everything is 
workload dependent and there is sometimes a need for doubling up.

Regards,

Bailey



-Original Message-
From: Maged Mokhtar 
Sent: January 17, 2024 4:59 PM
To: Mark Nelson ; ceph-users@ceph.io
Subject: [ceph-users] Re: Performance impact of Heterogeneous
environment

Very informative article you did, Mark.

IMHO if you find yourself with a very high per-OSD core count, it may be logical
to just pack/add more NVMes per host; you'd be getting the best price per
performance and capacity.

/Maged


On 17/01/2024 22:00, Mark Nelson wrote:

It's a little tricky.  In the upstream lab we don't strictly see an
IOPS or average latency advantage with heavy parallelism by running
multiple OSDs per NVMe drive until per-OSD core counts get very high.
There does seem to be a fairly consistent tail latency advantage even
at moderately low core counts, however.  Results are here:

https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/

Specifically for jitter, there is probably an advantage to using 2
cores per OSD unless you are very CPU starved, but how much that
actually helps in practice for a typical production workload is
questionable imho.  You do pay some overhead for running 2 OSDs per
NVMe as well.
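
If you do want to try the 2-OSDs-per-NVMe layout, a hedged sketch of the usual
way to carve a drive up (the device path is a placeholder; with cephadm the
same thing is expressed as osds_per_device in an OSD service spec):

# plain ceph-volume: two OSDs on one NVMe
ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1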


Mark


On 1/17/24 12:24, Anthony D'Atri wrote:

Conventional wisdom is that with recent Ceph releases there is no
longer a clear advantage to this.



[ceph-users] rbd map snapshot, mount lv, node crash

2024-01-19 Thread Marc
Am I doing something weird when I do the following on a ceph node (nautilus, el7):

rbd snap ls vps-test -p rbd
rbd map vps-test@vps-test.snap1 -p rbd

mount -o ro /dev/mapper/VGnew-LVnew /mnt/disk <--- reset/reboot ceph node
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-19 Thread Roman Pashin
Hi Eugen,

> How is the data growth in your cluster? Is the pool size rather stable or
> is it constantly growing?

Pool size is fairly constant, with a tiny upward trend. Its growth doesn't
correlate with the increase in OSD read latency. I've combined pool usage and
OSD read latency on one graph to give an overall picture of what it looks
like. Here is the graph: https://postimg.cc/p9ys3yX5 .

Write latency, by the way, doesn't show the same trend; it is more or less
constant, unlike read latency.
--
Thank you,
Roman
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-19 Thread Stefan Kooman

On 16-01-2024 11:22, Roman Pashin wrote:

Hello Ceph users,

we see a strange issue on a recent Ceph installation, v17.6.2. We store
data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an
NVMe partition.


Do you make use of a separate db partition as well? And if so, where is 
it stored?


 Benchmarks didn't expose any issues with the cluster, but since we
placed production load on it we see constantly growing OSD read latency
(osd_read_latency) on the SSD disks (where the index pool is located). Latency
grows day by day, yet the disks are not even 50% utilized. Interestingly, when
we move the index pool from SSD to NVMe disks (disk space allows it for now),
OSD latency drops to zero and starts climbing again from scratch. 


Do you only see latency increase in reads? And not writes?

Not sure what metrics you are looking at, but remember that some metrics 
are "long running averages" (from the start of the daemon). If you 
restart the daemon it might look like things dramatically changed, while 
in real life that need not be the case.


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm orchestrator and special label _admin in 17.2.7

2024-01-19 Thread Eugen Block
Oh that does sound strange indeed. I don't have a good idea right now,  
hopefully someone from the dev team can shed some light on this.


Zitat von Robert Sander :


Hi,

more strange behaviour:

When I issue "ceph mgr fail" a backup MGR takes over and updates  
all config files on all hosts including /etc/ceph/ceph.conf.


At first I thought that this was the solution but when I now remove  
the _admin label and add it again the new MGR also does not update  
/etc/ceph/ceph.conf.


Only when I again do "ceph mgr fail" the new MGR will update  
/etc/ceph/ceph.conf on the hosts labeled with _admin.


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD read latency grows over time

2024-01-19 Thread Eugen Block

Hi,

I checked two production clusters which don't use RGW too heavily,  
both on Pacific though. There's no latency increase visible there. How  
is the data growth in your cluster? Is the pool size rather stable or  
is it constantly growing?


Thanks,
Eugen

Zitat von Roman Pashin :


Hello Ceph users,

we see a strange issue on a recent Ceph installation, v17.6.2. We store
data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an
NVMe partition. Benchmarks didn't expose any issues with the cluster, but since
we placed production load on it we see constantly growing OSD read latency
(osd_read_latency) on the SSD disks (where the index pool is located). Latency
grows day by day, yet the disks are not even 50% utilized. Interestingly, when
we move the index pool from SSD to NVMe disks (disk space allows it for now),
OSD latency drops to zero and starts climbing again from scratch. We also
noticed that any change of pg_num for the index pool (from 256 to 128, for
instance) drops latency to zero as well, and it then starts growing again
(https://postimg.cc/5YHk9bby).

From the client's perspective it looks like each operation takes longer and
longer day after day, and operation time drops each time we make some change
to the index pool. I've enabled debug_optracker 10/0 and it shows that the OSD
spends most of its time in the `queued_for_pg` state, while physical disk
utilization is only about 10-20%. Per the logs the longest operation is
ListBucket, and it is strange that with fewer than 100'000 items in the bucket,
a listing even with 'max_keys=1' takes 3-40 seconds.

If it matters client is Apache Flink doing checkpoints via S3 protocol.

Here is an example of operation with debug_optracking logs:

2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774
.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.
1227774.0:22575820 7.19 7.a84f1a59 (undecoded)
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774
.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.
9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call
rgw.bucket_prepare_op in=331b] snapc 0=[]
ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op:
osd_op(client.1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146204+0300, event: commit_sent, op: osd_op(client.
1227774.0:22575820 7.19
7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio
e83833)
2023-12-29T16:24:38.146216+0300, event: done, op:
osd_op(client.1227774.0:22575820
7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head
[stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op
in=331b] snapc 0=[] 

[ceph-users] Re: Cephadm orchestrator and special label _admin in 17.2.7

2024-01-19 Thread Robert Sander

Hi,

more strange behaviour:

When I issue "ceph mgr fail" a backup MGR takes over and updates all 
config files on all hosts including /etc/ceph/ceph.conf.


At first I thought that this was the solution but when I now remove the 
_admin label and add it again the new MGR also does not update 
/etc/ceph/ceph.conf.


Only when I again do "ceph mgr fail" the new MGR will update 
/etc/ceph/ceph.conf on the hosts labeled with _admin.
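
For anyone reproducing this, the sequence boils down to the following
(host1 is a placeholder for a host carrying the _admin label):

ceph orch host label rm host1 _admin
ceph orch host label add host1 _admin   # expected to refresh /etc/ceph/ceph.conf, but doesn't
ceph mgr fail                           # forces a mgr failover, after which the file is updated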


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Keyring location for ceph-crash?

2024-01-19 Thread Jan Kasprzak
Hi Eugen,

thanks for verifying this. I have created a tracker issue:
https://tracker.ceph.com/issues/64102

-Yenya

Eugen Block wrote:
: Hi,
: 
: I checked the behaviour on Octopus, Pacific and Quincy, I can
: confirm. I don't have the time to dig deeper right now, but I'd
: suggest to open a tracker issue.
: 
: Thanks,
: Eugen
: 
: Zitat von Jan Kasprzak :
: 
: >Hello, Ceph users,
: >
: >what is the correct location of keyring for ceph-crash?
: >I tried to follow this document:
: >
: >https://docs.ceph.com/en/latest/mgr/crash/
: >
: ># ceph auth get-or-create client.crash mon 'profile crash' mgr
: >'profile crash' > /etc/ceph/ceph.client.crash.keyring
: >
: >and copy this file to all nodes. When I run ceph-crash.service on the node
: >where client.admin.keyring is available, it works. But on the rest of nodes,
: >it tries to access the admin keyring anyway. Journalctl -u ceph-crash
: >says this:
: >
: >Jan 18 18:03:35 my.node.name systemd[1]: Started Ceph crash dump collector.
: >Jan 18 18:03:35 my.node.name ceph-crash[2973164]:
: >INFO:ceph-crash:pinging cluster to exercise our key
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]:
: >2024-01-18T18:03:35.786+0100 7eff34016640 -1 auth: unable to find
: >a keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
: >(2) No such file or directory
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]:
: >2024-01-18T18:03:35.786+0100 7eff34016640 -1
: >AuthRegistry(0x7eff2c063ce8) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
: >disabling cephx
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]:
: >2024-01-18T18:03:35.787+0100 7eff34016640 -1 auth: unable to find
: >a keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
: >(2) No such file or directory
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]:
: >2024-01-18T18:03:35.787+0100 7eff34016640 -1
: >AuthRegistry(0x7eff2c067de0) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
: >disabling cephx
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]:
: >2024-01-18T18:03:35.788+0100 7eff34016640 -1 auth: unable to find
: >a keyring on 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
: >(2) No such file or directory
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]:
: >2024-01-18T18:03:35.788+0100 7eff34016640 -1
: >AuthRegistry(0x7eff340150c0) no keyring found at 
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
: >disabling cephx
: >Jan 18 18:03:35 my.node.name ceph-crash[2973166]: [errno 2] RADOS
: >object not found (error connecting to the cluster)
: >Jan 18 18:03:35 my.node.name ceph-crash[2973164]:
: >INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s
: >
: >Is the documentation outdated, or am I doing anything wrong? Thanks for
: >any hint. This is non-containerized Ceph 18.2.1 on AlmaLinux 9.
: >
: >-Yenya
: >
: >--
: >| Jan "Yenya" Kasprzak  |
: >| https://www.fi.muni.cz/~kas/GPG: 4096R/A45477D5 |
: >We all agree on the necessity of compromise. We just can't agree on
: >when it's necessary to compromise. --Larry Wall
: >___
: >ceph-users mailing list -- ceph-users@ceph.io
: >To unsubscribe send an email to ceph-users-le...@ceph.io
: 
: 
: ___
: ceph-users mailing list -- ceph-users@ceph.io
: To unsubscribe send an email to ceph-users-le...@ceph.io

-- 
| Jan "Yenya" Kasprzak  |
| https://www.fi.muni.cz/~kas/GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Degraded PGs on EC pool when marking an OSD out

2024-01-19 Thread Hector Martin
I'm having a bit of a weird issue with cluster rebalances with a new EC
pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD).
Until now I've been using an erasure coded k=5 m=3 pool for most of my
data. I've recently started to migrate to a k=5 m=4 pool, so I can
configure the CRUSH rule to guarantee that data remains available if a
whole host goes down (3 chunks per host, 9 total). I also moved the 5,3
pool to this setup, although by nature I know its PGs will become
inactive if a host goes down (need at least k+1 OSDs to be up).

I've only just started migrating data to the 5,4 pool, but I've noticed
that any time I trigger any kind of backfilling (e.g. take one OSD out),
a bunch of PGs in the 5,4 pool become degraded (instead of just
misplaced/backfilling). This always seems to happen on that pool only,
and the object count is a significant fraction of the total pool object
count (it's not just "a few recently written objects while PGs were
repeering" or anything like that, I know about that effect).

Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6
crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 14133 lfor 0/11307/11305 flags
hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6
crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode
warn last_change 14509 lfor 0/0/14234 flags
hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8

They both use the same CRUSH rule, which is designed to select 9 OSDs
balanced across the hosts (of which only 8 slots get used for the older
5,3 pool):

rule hdd-ec-x3 {
id 7
type erasure
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 3 type host
step choose indep 3 type osd
step emit
}
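
As an aside, a hedged way to sanity-check that a rule like this can always map
9 OSDs (and to see where it puts them) is to run it through crushtool offline:

ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-mappings
# --show-bad-mappings prints only the mappings that came up short
crushtool -i crushmap.bin --test --rule 7 --num-rep 9 --show-bad-mappings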

If I take out an OSD (14), I get something like this:

health: HEALTH_WARN
Degraded data redundancy: 37631/120155160 objects degraded
(0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and the total object count is
around 50k, so this is *most* of the data in the pool becoming degraded
just because I marked an OSD out (without stopping it). If I mark the
OSD in again, the degraded state goes away.

Example degraded PGs:

# ceph pg dump | grep degraded
dumped all
14.3c812   0   838  00
119250277580   0  1088 0  1088
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.786745+0900 15440'1088 15486:10772
[18,17,16,1,3,2,11,13,12]  18[18,17,16,1,3,2,11,NONE,12]
 18  14537'432  2024-01-12T11:25:54.168048+0900
0'0  2024-01-08T15:18:21.654679+0900  02
 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900
  2410
14.3d772   0  1602  00
113032802230   0  1283 0  1283
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.919971+0900 15470'1283 15486:13384
[18,17,16,3,1,0,13,11,12]  18  [18,17,16,3,1,0,NONE,NONE,12]
 18  14990'771  2024-01-15T12:15:59.397469+0900
0'0  2024-01-08T15:18:21.654679+0900  03
 periodic scrub scheduled @ 2024-01-23T15:56:58.912801+0900
  5340
14.3e806   0   832  00
118430196970   0  1035 0  1035
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:42.297251+0900 15465'1035 15486:15423
[18,16,17,12,13,11,1,3,0]  18[18,16,17,12,13,NONE,1,3,0]
 18  14623'500  2024-01-13T08:54:55.709717+0900
0'0  2024-01-08T15:18:21.654679+0900  01
 periodic scrub scheduled @ 2024-01-22T09:54:51.278368+0900
  3310
14.3f782   0   813  00
115983930340   0  1083 0  1083
active+recovery_wait+undersized+degraded+remapped
2024-01-19T18:06:41.845173+0900 15465'1083 15486:18496
[17,18,16,3,0,1,11,12,13]  17[17,18,16,3,0,1,11,NONE,13]
 17  14990'800  2024-01-15T16:42:08.037844+0900
14990'800  2024-01-15T16:42:08.037844+0900  0
   40  periodic scrub scheduled @ 2024-01-23T10:44:06.083985+0900
563   

[ceph-users] Re: Keyring location for ceph-crash?

2024-01-19 Thread Eugen Block

Hi,

I checked the behaviour on Octopus, Pacific and Quincy, I can confirm.  
I don't have the time to dig deeper right now, but I'd suggest to open  
a tracker issue.


Thanks,
Eugen

Zitat von Jan Kasprzak :


Hello, Ceph users,

what is the correct location of keyring for ceph-crash?
I tried to follow this document:

https://docs.ceph.com/en/latest/mgr/crash/

# ceph auth get-or-create client.crash mon 'profile crash' mgr  
'profile crash' > /etc/ceph/ceph.client.crash.keyring


and copy this file to all nodes. When I run ceph-crash.service on the node
where client.admin.keyring is available, it works. But on the rest of nodes,
it tries to access the admin keyring anyway. Journalctl -u ceph-crash
says this:

Jan 18 18:03:35 my.node.name systemd[1]: Started Ceph crash dump collector.
Jan 18 18:03:35 my.node.name ceph-crash[2973164]:  
INFO:ceph-crash:pinging cluster to exercise our key
Jan 18 18:03:35 my.node.name ceph-crash[2973166]:  
2024-01-18T18:03:35.786+0100 7eff34016640 -1 auth: unable to find a  
keyring on  
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or  
directory
Jan 18 18:03:35 my.node.name ceph-crash[2973166]:  
2024-01-18T18:03:35.786+0100 7eff34016640 -1  
AuthRegistry(0x7eff2c063ce8) no keyring found at  
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling  
cephx
Jan 18 18:03:35 my.node.name ceph-crash[2973166]:  
2024-01-18T18:03:35.787+0100 7eff34016640 -1 auth: unable to find a  
keyring on  
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or  
directory
Jan 18 18:03:35 my.node.name ceph-crash[2973166]:  
2024-01-18T18:03:35.787+0100 7eff34016640 -1  
AuthRegistry(0x7eff2c067de0) no keyring found at  
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling  
cephx
Jan 18 18:03:35 my.node.name ceph-crash[2973166]:  
2024-01-18T18:03:35.788+0100 7eff34016640 -1 auth: unable to find a  
keyring on  
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin: (2) No such file or  
directory
Jan 18 18:03:35 my.node.name ceph-crash[2973166]:  
2024-01-18T18:03:35.788+0100 7eff34016640 -1  
AuthRegistry(0x7eff340150c0) no keyring found at  
/etc/ceph/ceph.client.admin.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin, disabling  
cephx
Jan 18 18:03:35 my.node.name ceph-crash[2973166]: [errno 2] RADOS  
object not found (error connecting to the cluster)
Jan 18 18:03:35 my.node.name ceph-crash[2973164]:  
INFO:ceph-crash:monitoring path /var/lib/ceph/crash, delay 600s


Is the documentation outdated, or am I doing anything wrong? Thanks for
any hint. This is non-containerized Ceph 18.2.1 on AlmaLinux 9.

-Yenya

--
| Jan "Yenya" Kasprzak  |
| https://www.fi.muni.cz/~kas/GPG: 4096R/A45477D5 |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io