[ceph-users] Re: activating+undersized+degraded+remapped

2024-03-17 Thread Joachim Kraftmayer - ceph ambassador

also helpful is the output of:

ceph pg {poolnum}.{pg-id} query
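
For example (a quick sketch; pg 4.3d is taken from the report below, and jq 
is assumed to be installed):

```
# Query one of the stuck PGs and pull out the fields that usually matter:
ceph pg 4.3d query | jq '.state, .up, .acting, .recovery_state[0]'
```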

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 16.03.24 at 13:52, Eugen Block wrote:
Yeah, the whole story would help to give better advice. With EC the 
default min_size is k+1; you could reduce min_size to 5 temporarily, 
which might bring the PGs back online. But the long-term fix is to have 
all required OSDs up and enough OSDs to sustain an 
outage.
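
A minimal sketch of that temporary workaround (the pool name cephfs_ec is 
a placeholder; 6 is the k+1 default for a 5+3 profile):

```
ceph osd pool get cephfs_ec min_size     # check the current value first
ceph osd pool set cephfs_ec min_size 5   # temporary: allow I/O with only k shards
# ... once the missing OSDs are back and the PGs are active+clean again:
ceph osd pool set cephfs_ec min_size 6   # restore k+1
```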


Quoting Wesley Dillingham:

Please share "ceph osd tree" and "ceph osd df tree" I suspect you 
have not

enough hosts to satisfy the EC

On Sat, Mar 16, 2024, 8:04 AM Deep Dish  wrote:


Hello

I found myself in the following situation:

[WRN] PG_AVAILABILITY: Reduced data availability: 3 pgs inactive

    pg 4.3d is stuck inactive for 8d, current state
activating+undersized+degraded+remapped, last acting
[4,NONE,46,NONE,10,13,NONE,74]

    pg 4.6e is stuck inactive for 9d, current state
activating+undersized+degraded+remapped, last acting
[NONE,27,77,79,55,48,50,NONE]

    pg 4.cb is stuck inactive for 8d, current state
activating+undersized+degraded+remapped, last acting
[6,NONE,42,8,60,22,35,45]


I have one cephfs with two backing pools -- one for replicated data, the 
other for erasure data. Each pool is mapped to REPLICATED/ vs. ERASURE/ 
directories on the filesystem.


The above PGs are affecting the ERASURE pool (5+3) backing the FS. How 
can I get ceph to recover these three PGs?



Thank you.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Re: Emergency, I lost 4 monitors but all osd disk are safe

2023-11-02 Thread Joachim Kraftmayer - ceph ambassador

Hi,

Another short note regarding the documentation: the paths are designed 
for a package installation.

The paths for a container installation look a bit different, e.g.: 
/var/lib/ceph//osd.y/


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 02.11.23 at 12:02, Robert Sander wrote:

Hi,

On 11/2/23 11:28, Mohamed LAMDAOUAR wrote:


   I have 7 machines in the CEPH cluster; the ceph services run in docker
containers. Each machine has 4 HDDs for data (available) and 2 NVMe SSDs
(bricked). During a reboot, the SSDs bricked on 4 machines. The data are
available on the HDD disks, but the NVMes are bricked and the system is
not available. Is it possible to recover the data of the cluster (the
data disks are all available)?


You can try to recover the MON db from the OSDs, as they keep a copy 
of it:


https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-mon/#monitor-store-failures 
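
For reference, a heavily condensed sketch of that procedure (the mon 
store path and keyring path are placeholders, and for container 
deployments the OSD paths differ as noted above; the linked page has the 
full, authoritative steps):

```
# Run on each surviving OSD host with its OSDs stopped; this accumulates the
# cluster maps from every OSD into one external mon store directory.
ms=/root/mon-store
mkdir -p $ms
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path "$osd" --no-mon-config \
        --op update-mon-db --mon-store-path "$ms"
done
# After collecting from all hosts, rebuild the mon store and reuse it for a new mon:
ceph-monstore-tool $ms rebuild -- --keyring /etc/ceph/ceph.client.admin.keyring
```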



Regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stickyness of writing vs full network storage writing

2023-10-28 Thread Joachim Kraftmayer - ceph ambassador

Hi,

I know similar requirements, the motivation and the need behind them.
We have chosen a clear approach to this, which also does not make the 
whole setup too complicated to operate.
1.) Everything that doesn't require strong consistency we do with other 
tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies 
with high IOPs and low latencies.


2.) Everything that requires high data security, strong consistency and 
failure domains higher than host we do with Ceph.


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 27.10.23 at 17:58, Anthony D'Atri wrote:

Ceph is all about strong consistency and data durability.  There can also be a 
distinction between performance of the cluster in aggregate vs a single client, 
especially in a virtualization scenario where to avoid the noisy-neighbor 
dynamic you deliberately throttle iops and bandwidth per client.


For my discussion I am assuming current PCIe-based NVMe drives, which are 
capable of writing about 8 GiB/s, which is about 64 Gbit/s.

Written how, though?  Benchmarks sometimes are written with 100% sequential 
workloads, top-SKU CPUs that mortals can't afford, and especially with a queue 
depth of like 256.

With most Ceph deployments, the IO a given drive experiences is often pretty 
much random and with lower QD.  And depending on the drive, significant read 
traffic may impact write bandwidth to a degree.  At Mountpoint (Vancouver 
BC, 2018) someone gave a presentation about the difficulties of saturating NVMe 
bandwidth.


Now consider the situation that you have 5 nodes, each with 4 of those drives;
that will make all small and mid-sized companies go bankrupt ;-) just from buying 
the corresponding networking switches.

Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives, for most purposes "read 
intensive" (~1DWPD) (or less, sometimes) are plenty.  But please please please stick with real 
enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often 
get SSDs elsewhere for half of what they cost from your chassis manufacturer.


   But the server hardware is still simple commodity hardware which can 
saturate any given commodity network hardware easily.
If I want to be able to use the full 64 Gbit/s I would require at least 100 Gbit/s 
networking or tons of trunked ports and cabling with lower-bandwidth switches.

Throughput and latency are different things, though.  Also, are you assuming 
here the traditional topology of separate public and 
cluster/private/replication networks?  With modern networking (and Ceph 
releases) that is often overkill and you can leave out the replication network.

Also, would your clients have the same networking provisioned?  If you're


   If we now also consider distributing the nodes over racks, buildings at the same 
location, or distributed datacenters, the costs will be even more painful.

Don't you already have multiple racks?  They don't need to be dedicated only to 
Ceph.


The ceph commit requirement will be 2 copies on different OSDs (comparable to a 
mirrored drive) and in total 3 or 4 copies on the cluster (comparable to a RAID 
with multiple-disk redundancy).

Not entirely comparable, but the distinctions mostly don't matter here.


In all our tests so far, we could not control how ceph persists these 2 
copies. It will always try to persist them somehow over the network.
Q1: Is this behavior mandatory?

It's a question of how important the data is, and how bad it would be to lose 
some.


   Our common workload, and afaik nearly all webservice-based applications, is:
- a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
- and probably mostly a 1-write-to-4-reads or even 1:6 ratio when utilizing the cluster

QLC might help your costs, look into the D5-P5430, D5-P5366, etc.  Though these 
days if you shop smart you can get TLC for close to the same cost.  Won't always 
be true though, and you can't get a 60TB TLC SKU ;)


Hope I could explain the situation here well enough.
 Now assuming my ideal world with ceph:
if ceph would do:
1. commit 2 copies to local drives on the node the ceph client is connected to
2. after the commit, sync (optimized/queued) the data over the network to fulfill 
the common needs of ceph storage with 4 copies

You could I think craft a CRUSH rule to do that.  Default for replicated pools 
FWIW is 3 copies not 4.


3. maybe optionally move 1 copy away from the initial node which still holds the 
2 local copies...

I don't know of an elegant way to change placement after the fact.


   this behaviour would ensure that:
- the perceived performance for the OSD clients will be the full bandwidth of the 
local NVMes, since 2 copies are delivered to the local NVMes at 64 Gbit/s, and 
the latency would be comparable to writing locally
- 

[ceph-users] Re: Remove empty orphaned PGs not mapped to a pool

2023-10-05 Thread Joachim Kraftmayer - ceph ambassador

@Eugen

We saw the same problems 8 years ago. I can only recommend never 
to use cache tiering in production.
At Cephalocon this was part of my talk, and as far as I remember cache 
tiering will also disappear from ceph soon.


Cache tiering has been deprecated in the Reef release as it has lacked a 
maintainer for a very long time. This does not mean it will be certainly 
removed, but we may choose to remove it without much further notice.


https://docs.ceph.com/en/latest/rados/operations/cache-tiering/

Regards, Joachim


___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 05.10.23 at 10:02, Eugen Block wrote:
Which ceph version is this? I'm trying to understand how removing a 
pool leaves the PGs of that pool... Do you have any logs or something 
from when you removed the pool?
We'll have to deal with a cache tier in the foreseeable future too, 
so this is quite relevant for us as well. Maybe I'll try to reproduce 
it in a test cluster first.
Are those SSDs exclusively for the cache tier or are they used by 
other pools as well? If they were used only for the cache tier you 
should be able to just remove them without any risk. But as I said, 
I'd rather try to understand before purging them.



Quoting Malte Stroem:


Hello Eugen,

yes, we followed the documentation and everything worked fine. The 
cache is gone.


Removing the pool worked well. Everything is clean.

The PGs are empty active+clean.

Possible solutions:

1.

ceph pg {pg-id} mark_unfound_lost delete

I do not think this is the right way since it is for PGs with status 
unfound. But it could work also.


2.

Set the following for the three disks:

ceph osd lost {osd-id}

I am not sure how the cluster will react to this.

3.

ceph-objectstore-tool --data-path /path/to/osd --op remove --pgid 3.0 
--force


Now, will the cluster accept the removed PG status?

4.

The three disks are still present in the crush rule, class ssd, 
each single OSD under one host entry.


What if I remove them from crush?
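
If option 4 turns out to be the way to go, roughly this (a sketch, using 
osd.23 from the listing below; only once the PGs are confirmed empty and 
the pool is really gone):

```
ceph osd out 23
systemctl stop ceph-osd@23        # or the cephadm/rook equivalent
ceph osd crush remove osd.23      # drop it from the CRUSH map
ceph auth del osd.23
ceph osd rm 23
```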

Do you have a better idea, Eugen?

Best,
Malte

On 04.10.23 at 09:21, Eugen Block wrote:

Hi,

just for clarity, you're actually talking about the cache tier as 
described in the docs [1]? And you followed the steps until 'ceph 
osd tier remove cold-storage hot-storage' successfully? And the pool 
has been really deleted successfully ('ceph osd pool ls detail')?


[1] 
https://docs.ceph.com/en/latest/rados/operations/cache-tiering/#removing-a-cache-tier


Quoting Malte Stroem:


Hello,

we removed an SSD cache tier and its pool.

The PGs for the pool do still exist.

The cluster is healthy.

The PGs are empty and they reside on the cache tier pool's SSDs.

We would like to take out the disks but it is not possible. The cluster 
sees the PGs and answers with a HEALTH_WARN.


Because of the replication of three there are still 128 PGs on 
three of the 24 OSDs. We were able to remove the other OSDs.


Summary:

- pool removed
- 3 x 128 empty PGs still exist
- 3 of 24 OSDs still exist

How is it possible to remove these empty and healthy PGs?

The only way I found was something like:

ceph pg {pg-id} mark_unfound_lost delete

Is that the right way?

Some output of:

ceph pg ls-by-osd 23

PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED         UP            ACTING        SCRUB_STAMP                      DEEP_SCRUB_STAMP
3.0  0        0         0          0        0      0            0           0    active+clean  27h    0'0      2627265:196316   [15,6,23]p15  [15,6,23]p15  2023-09-28T12:41:52.982955+0200  2023-09-27T06:48:23.265838+0200
3.1  0        0         0          0        0      0            0           0    active+clean  9h     0'0      2627266:19330    [6,23,15]p6   [6,23,15]p6   2023-09-29T06:30:57.630016+0200  2023-09-27T22:58:21.992451+0200
3.2  0        0         0          0        0      0            0           0    active+clean  2h     0'0      2627265:1135185  [23,15,6]p23  [23,15,6]p23  2023-09-29T13:42:07.346658+0200  2023-09-24T14:31:52.844427+0200
3.3  0        0         0          0        0      0            0           0    active+clean  13h    0'0      2627266:193170   [6,15,23]p6   [6,15,23]p6   2023-09-29T01:56:54.517337+0200  2023-09-27T17:47:24.961279+0200
3.4  0        0         0          0        0      0            0           0    active+clean  14h    0'0      2627265:2343551  [23,6,15]p23  [23,6,15]p23  2023-09-29T00:47:47.548860+0200  2023-09-25T09:39:51.259304+0200
3.5  0        0         0          0        0      0            0           0    active+clean  2h     0'0      2627265:194111   [15,6,23]p15  [15,6,23]p15  2023-09-29T13:28:48.879959+0200  2023-09-26T15:35:44.217302+0200
3.6  0        0         0          0        0      0            0           0    active+clean  6h     0'0      2627265:2345717  [23,15,6]p23  [23,15,6]p23  2023-09-29T09:26:02.534825+0200  2023-09-27T21:56:57.500126+0200


Best regards,
Malte

[ceph-users] Re: Balancer blocked as autoscaler not acting on scaling change

2023-10-04 Thread Joachim Kraftmayer - ceph ambassador

Hi,
we have often seen strange behavior and also interesting pg targets from 
pg_autoscaler in the last years.

That's why we disable it globally.
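
A rough sketch of what that looks like (the rgw data pool name is taken 
from the quoted report below; the cluster-wide noautoscale flag only 
exists on newer releases):

```
# Either flip the default and the cluster-wide switch ...
ceph config set global osd_pool_default_pg_autoscale_mode off
ceph osd pool set noautoscale                  # newer releases only
# ... or turn it off per pool:
ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off
```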

The commands:
ceph osd reweight-by-utilization
ceph osd test-reweight-by-utilization
are from the time before the upmap balancer was introduced and did not 
solve the problem in the long run in an active cluster.


I have seen more than once that the balancer skips a pool while it is 
rebalancing.


Why pg_autoscaler behaves this way would have to be analyzed in more 
detail. As mentioned above, we normally turn it off.


Eugen's idea would help once the pool rebalancing is done.

Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 22.09.23 at 11:22, b...@sanger.ac.uk wrote:

Hi Folks,

We are currently running with one nearfull OSD and 15 nearfull pools. The most 
full OSD is about 86% full but the average is 58% full. However, the balancer 
is skipping a pool on which the autoscaler is trying to complete a pg_num 
reduction from 131,072 to 32,768 (default.rgw.buckets.data pool). The 
autoscaler has been working on this for the last 20 days; it works through a 
list of objects that are misplaced, but when it gets close to the end, more 
objects get added to the list.

This morning I observed the list get down to c. 7,000 objects misplaced with 2 
PGs active+remapped+backfilling, one PG completed the backfilling then the list 
shot up to c. 70,000 objects misplaced with 3 PGs active+remapped+backfilling.

Has anyone come across this behaviour before? If so, what was your remediation?

Thanks in advance for sharing.
Bruno

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Separating Mons and OSDs in Ceph Cluster

2023-09-12 Thread Joachim Kraftmayer - ceph ambassador

Another possibility is ceph mon discovery via DNS:

https://docs.ceph.com/en/quincy/rados/configuration/mon-lookup-dns/#looking-up-monitors-through-dns
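
Roughly, that looks like this (zone, monitor names and ports are 
placeholders; 3300/6789 are the default msgr2/msgr1 ports):

```
# BIND-style SRV records the clients resolve instead of a static mon_host list:
#   _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 3300 mon1.example.com.
#   _ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
#   (... one pair per monitor ...)
# ceph.conf then only needs the SRV service name (ceph-mon is the default):
cat >> /etc/ceph/ceph.conf <<'EOF'
[global]
mon_dns_srv_name = ceph-mon
EOF
```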

Regards, Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 11.09.23 at 09:32, Robert Sander wrote:

Hi,

On 9/9/23 09:34, Ramin Najjarbashi wrote:


The primary goal is to deploy new Monitors on different servers without
causing service interruptions or disruptions to data availability.


Just do that. New MONs will be added to the mon map which will be 
distributed to all running components. All OSDs will immediately know 
about the new MONs.


The same goes when removing an old MON.

After that you have to update the ceph.conf on each host to make the 
change "reboot safe".


No need to restart any other component including OSDs.

Regards

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replacing all disks in a stretch mode ceph cluster

2023-07-19 Thread Joachim Kraftmayer - ceph ambassador

Hi,

Short note: if you replace the disks with larger disks, the weight of the 
OSD and host will change, and this will force data migration.


Perhaps read a bit more about the upmap balancer if you want to 
avoid data migration during the upgrade phase.


Regards, Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 19.07.23 at 09:00, Eugen Block wrote:

Hi,
during cluster upgrades from L to N or later one had to rebuild OSDs 
which were originally deployed by ceph-disk switching to ceph-volume. 
We've done this on multiple clusters and redeployed one node by one. 
We did not drain the nodes beforehand because the EC resiliency 
configuration was well planned. So we also didn't set any flags 
because tearing down OSDs only took a few minutes, we didn't care if a 
few MB or GB had been recovered. You just need to be careful with the 
mon_max_pg_per_osd limit (default 250) as this can easily be hit with 
the first rebuilt OSD that's coming up, or the last one being removed 
as well. So you could temporarily increase it during the rebuild 
procedure. In your case you probably should set flags because you'll 
need time to exchange the physical disks; I guess the ones you already 
mentioned would suffice. With stretch mode and replicated size 4 you 
should also be able to tear down an entire node at a time and rebuild 
it, I wouldn't rebuild more than one though.
I have not done this yet with cephadm clusters but if your 
drivegroup.yml worked correctly before it should work here as well, 
hoping that you don't find another ceph-volume bug. ;-)
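
For reference, a sketch of the knobs mentioned above (the values are only 
examples):

```
# Temporarily raise the per-OSD PG limit while OSD counts fluctuate:
ceph config set global mon_max_pg_per_osd 500
# Optional flags while a node's disks are physically being swapped:
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover
# ... swap the disks, redeploy the OSDs, then undo everything:
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph config rm global mon_max_pg_per_osd
```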


Regards,
Eugen

Quoting Zoran Bošnjak:


Hello ceph users,
my ceph configuration is
- ceph version 17.2.5 on ubuntu 20.04
- stretch mode
- 2 rooms with OSDs and monitors + additional room for the tiebreaker 
monitor

- 4 OSD servers in each room
- 6 OSDs per OSD server
- ceph installation/administration is manual (without ansible, 
orch... or any other tool like this)


Ceph health is currently OK.
Raw usage is around 60%,
Pools usage is below 75%

I need to replace all OSD disks in the cluster with larger capacity 
disks (500G to 1000G). So the eventual configuration will contain the 
same number of OSDs and servers.


I understand I can replace OSDs one by one, following the documented 
procedure (removing old and adding new OSD to the configuration) and 
waiting for health OK. But in this case, ceph will probably copy data 
around like crazy after each step. So, my question is:


What is the recommended procedure in this case of replacing ALL disks 
and keeping the ceph operational during the upgrade?


In particular:
Should I use any of "nobackfill, norebalance, norecover..." flags 
during the process? If yes, which?
Should I do one OSD at the time, server at the time or even room at 
the time?


Thanks for the suggestions.

regards,
Zoran
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: CEPH orch made osd without WAL

2023-07-10 Thread Joachim Kraftmayer - ceph ambassador
You can also test directly with ceph bench whether the WAL is on the 
flash device:


https://www.clyso.com/blog/verify-ceph-osd-db-and-wal-setup/
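
Roughly along these lines (a sketch; osd.8 is the OSD from the listing 
below, and the byte/block sizes are just example values small enough to 
exercise the WAL):

```
# Small 4 KiB writes go through the WAL; on an NVMe-backed WAL this should
# report flash-like throughput and latency:
ceph tell osd.8 bench 12288000 4096
# Cross-check that the WAL is actually being written to:
ceph daemon osd.8 perf dump bluefs | grep wal
```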

Joachim


___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 10.07.23 at 09:12, Eugen Block wrote:
Yes, because you did *not* specify a dedicated WAL device. This is 
also reflected in the OSD metadata:


$ ceph osd metadata 6 | grep dedicated
    "bluefs_dedicated_db": "1",
    "bluefs_dedicated_wal": "0"

Only if you had specified a dedicated WAL device you would see it in 
the lvm list output, so this is all as expected.
You can check out the perf dump of an OSD to see that it actually 
writes to the WAL:


# ceph daemon osd.6 perf dump bluefs | grep wal
    "wal_total_bytes": 0,
    "wal_used_bytes": 0,
    "files_written_wal": 1588,
    "bytes_written_wal": 1090677563392,
    "max_bytes_wal": 0,


Quoting Jan Marek:


Hello,

but when I try to list the device config with ceph-volume, I can see
a DB device, but no WAL device:

ceph-volume lvm list

== osd.8 ===

  [db]          /dev/ceph-5aa92e38-077b-48e2-bda6-5b7db7b7701c/osd-db-bfd11468-d109-4f85-9723-75976f51bfb9

      block device              /dev/ceph-eaf5f0d7-ad50-4009-9ee6-04b8204b5b1a/osd-block-26b1d4b7-2425-4a2f-912b-111cf66a5970
      block uuid                j4s9lv-wS9n-xg2W-I4Y0-fUSu-Vuvl-9gOB2P
      cephx lockbox secret
      cluster fsid              2c565e24-7850-47dc-a751-a6357cbbaf2a
      cluster name              ceph
      crush device class
      db device                 /dev/ceph-5aa92e38-077b-48e2-bda6-5b7db7b7701c/osd-db-bfd11468-d109-4f85-9723-75976f51bfb9
      db uuid                   d9MZ2r-ImXX-Xod0-TNDS-tqi5-oG5Y-wrXFtW
      encrypted                 0
      osd fsid                  26b1d4b7-2425-4a2f-912b-111cf66a5970
      osd id                    8
      osdspec affinity          osd_spec_default
      type                      db
      vdo                       0
      devices                   /dev/nvme0n1

  [block]       /dev/ceph-eaf5f0d7-ad50-4009-9ee6-04b8204b5b1a/osd-block-26b1d4b7-2425-4a2f-912b-111cf66a5970

      block device              /dev/ceph-eaf5f0d7-ad50-4009-9ee6-04b8204b5b1a/osd-block-26b1d4b7-2425-4a2f-912b-111cf66a5970
      block uuid                j4s9lv-wS9n-xg2W-I4Y0-fUSu-Vuvl-9gOB2P
      cephx lockbox secret
      cluster fsid              2c565e24-7850-47dc-a751-a6357cbbaf2a
      cluster name              ceph
      crush device class
      db device                 /dev/ceph-5aa92e38-077b-48e2-bda6-5b7db7b7701c/osd-db-bfd11468-d109-4f85-9723-75976f51bfb9
      db uuid                   d9MZ2r-ImXX-Xod0-TNDS-tqi5-oG5Y-wrXFtW
      encrypted                 0
      osd fsid                  26b1d4b7-2425-4a2f-912b-111cf66a5970
      osd id                    8
      osdspec affinity          osd_spec_default
      type                      block
      vdo                       0
      devices                   /dev/sdi

(part of listing...)

Sincerely
Jan Marek


On Mon, Jul 10, 2023 at 08:10:58 CEST, Eugen Block wrote:

Hi,

if you don't specify a different device for WAL it will be automatically 
colocated on the same device as the DB. So you're good with this 
configuration.

Regards,
Eugen


Quoting Jan Marek:

> Hello,
>
> I've tried to add to CEPH cluster OSD node with a 12 rotational
> disks and 1 NVMe. My YAML was this:
>
> service_type: osd
> service_id: osd_spec_default
> service_name: osd.osd_spec_default
> placement:
>   host_pattern: osd8
> spec:
>   block_db_size: 64G
>   data_devices:
> rotational: 1
>   db_devices:
> paths:
> - /dev/nvme0n1
>   filter_logic: AND
>   objectstore: bluestore
>
> Now I have 12 OSD with DB on NVMe device, but without WAL. How I
> can add WAL to this OSD?
>
> NVMe device still have 128GB free place.
>
> Thanks a lot.
>
> Sincerely
> Jan Marek
> --
> Ing. Jan Marek
> University of South Bohemia
> Academic Computer Centre
> Phone: +420389032080
> http://www.gnu.org/philosophy/no-word-attachments.cs.html


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Ing. Jan Marek
University of South Bohemia
Academic Computer Centre
Phone: +420389032080
http://www.gnu.org/philosophy/no-word-attachments.cs.html



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: Rook on bare-metal?

2023-07-06 Thread Joachim Kraftmayer - ceph ambassador

Hello

we have been following rook since 2018 and have had our experiences both 
on bare-metal and in the hyperscalers.

In the same way, we have been following cephadm from the beginning.

Meanwhile, we have been using both in production for years, and the 
decision which orchestrator to use depends on the project; 
e.g., the features of the two projects are not identical.


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 06.07.23 at 07:16, Nico Schottelius wrote:

Morning,

we are running some ceph clusters with rook on bare metal and can very
much recommend it. You should have proper k8s knowledge, knowing how to
change objects such as configmaps or deployments, in case things go
wrong.

In regards to stability, the rook operator is written rather defensively,
not changing monitors or the cluster if the quorum is not met and
checking the osd status on removal/adding of osds.

So TL;DR: very much usable and rather k8s native.

BR,

Nico

zs...@tuta.io writes:


Hello!

I am looking to simplify ceph management on bare-metal by deploying
Rook onto kubernetes that has been deployed on bare metal (rke). I
have used rook in a cloud environment but I have not used it on
bare-metal. I am wondering if anyone here runs rook on bare-metal?
Would you recommend it over cephadm or would you steer clear of it?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: Deleting millions of objects

2023-05-17 Thread Joachim Kraftmayer - ceph ambassador

Hi Rok,

try this:


rgw_delete_multi_obj_max_num - Max number of objects in a single 
multi-object delete request

  (int, advanced)
  Default: 1000
  Can update at runtime: true
  Services: [rgw]


ceph config set <who> <key> <value>


WHO: client. or client.rgw

KEY: rgw_delete_multi_obj_max_num

VALUE: 1
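
Put together, a sketch (10000 is just an example value):

```
# Raise the per-request multi-object delete cap on the RGW daemons:
ceph config set client.rgw rgw_delete_multi_obj_max_num 10000
# Verify what the running daemons will pick up:
ceph config get client.rgw rgw_delete_multi_obj_max_num
```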

Regards, Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 17.05.23 at 14:24, Rok Jaklič wrote:

thx.

I tried with:
ceph config set mon rgw_delete_multi_obj_max_num 1
ceph config set client rgw_delete_multi_obj_max_num 1
ceph config set global rgw_delete_multi_obj_max_num 1

but still only 1000 objects get deleted.

Is the target something different?

On Wed, May 17, 2023 at 11:58 AM Robert Hish 
wrote:


I think this is capped at 1000 by the config setting. I've used the aws
and s3cmd clients to delete more than 1000 objects at a time and it
works even with the config setting capped at 1000. But it is a bit slow.

#> ceph config help rgw_delete_multi_obj_max_num

rgw_delete_multi_obj_max_num - Max number of objects in a single multi-
object delete request
   (int, advanced)
   Default: 1000
   Can update at runtime: true
   Services: [rgw]
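
For completeness, the client-side variants mentioned above (a sketch; the 
endpoint URL is a placeholder and archive/veeam is the bucket/prefix from 
the original question):

```
# awscli batches the DeleteObjects calls itself, so the 1000-per-request cap
# does not limit the total:
aws --endpoint-url https://rgw.example.com s3 rm s3://archive/veeam --recursive
# s3cmd equivalent:
s3cmd del --recursive --force s3://archive/veeam
```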

On Wed, 2023-05-17 at 10:51 +0200, Rok Jaklič wrote:

Hi,

I would like to delete millions of objects in RGW instance with:
mc rm --recursive --force ceph/archive/veeam

but it seems it allows only 1000 (or 1002 exactly) removals per
command.

How can I delete/remove all objects with some prefix?

Kind regards,
Rok
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




[ceph-users] Re: CEPH Version choice

2023-05-15 Thread Joachim Kraftmayer - ceph ambassador

Adam & Mark topics: bluestore and bluestore v2

https://youtu.be/FVUoGw6kY5k

https://youtu.be/7D5Bgd5TuYw


___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 15.05.23 at 16:47, Jens Galsgaard wrote:

https://www.youtube.com/playlist?list=PLrBUGiINAakPd9nuoorqeOuS9P9MTWos3


-Original Message-
From: Marc 
Sent: Monday, May 15, 2023 4:42 PM
To: Joachim Kraftmayer - ceph ambassador ; Frank Schilder 
; Tino Todino 
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: CEPH Version choice


By the way, regarding performance I recommend the Cephalocon
presentations by Adam and Mark. There you can learn what efforts are
made to improve ceph performance for current and future versions.


Link?
___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm does not honor container_image default value

2023-05-15 Thread Joachim Kraftmayer - ceph ambassador
Don't know if it helps, but we have also experienced something similar 
with osd images. We changed the image tag from version to sha and it did 
not happen again.


___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 15.05.23 at 14:50, Adam King wrote:

I think with the `config set` commands there is logic to notify the
relevant mgr modules and update their values. That might not exist with
`config rm`, so it's still using the last set value. Looks like a real bug.
Curious what happens if the mgr restarts after the `config rm`. Whether it
goes back to the default image in that case or not. Might take a look later.

On Mon, May 15, 2023 at 7:37 AM Daniel Krambrock <
krambr...@hrz.uni-marburg.de> wrote:


Hello.

I think i found a bug in cephadm/ceph orch:
Redeploying a container image (tested with alertmanager) after removing
a custom `mgr/cephadm/container_image_alertmanager` value, deploys the
previous container image and not the default container image.

I'm running `cephadm` from ubuntu 22.04 pkg 17.2.5-0ubuntu0.22.04.3 and
`ceph` version 17.2.6.

Here is an example. Node clrz20-08 is the node altermanager is running
on, clrz20-01 the node I'm controlling ceph from:

* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
"alertmanager")| .container_image_name'
"quay.io/prometheus/alertmanager:v0.23.0"
```

* Set alertmanager image
```
root@clrz20-01:~# ceph config set mgr
mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager
root@clrz20-01:~# ceph config get mgr
mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager
```

* redeploy altermanager
```
root@clrz20-01:~# ceph orch redeploy alertmanager
Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
```

* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
"alertmanager")| .container_image_name'
"quay.io/prometheus/alertmanager:latest"
```

* Remove alertmanager image setting, revert to default:
```
root@clrz20-01:~# ceph config rm mgr
mgr/cephadm/container_image_alertmanager
root@clrz20-01:~# ceph config get mgr
mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager:v0.23.0
```

* redeploy altermanager
```
root@clrz20-01:~# ceph orch redeploy alertmanager
Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
```

* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
"alertmanager")| .container_image_name'
"quay.io/prometheus/alertmanager:latest"
```
-> `mgr/cephadm/container_image_alertmanager` is set to
`quay.io/prometheus/alertmanager:v0.23.0`
, but redeploy uses
`quay.io/prometheus/alertmanager:latest`
. This looks like a bug.

* Set alertmanager image explicitly to the default value
```
root@clrz20-01:~# ceph config set mgr
mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager:v0.23.0
root@clrz20-01:~# ceph config get mgr
mgr/cephadm/container_image_alertmanager
quay.io/prometheus/alertmanager:v0.23.0
```

* redeploy altermanager
```
root@clrz20-01:~# ceph orch redeploy alertmanager
Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
```

* Get alertmanager version
```
root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
"alertmanager")| .container_image_name'
"quay.io/prometheus/alertmanager:v0.23.0"
```
-> Setting `mgr/cephadm/container_image_alertmanager` to the default
setting fixes the issue.



Bests,
Daniel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io





[ceph-users] Re: CEPH Version choice

2023-05-15 Thread Joachim Kraftmayer - ceph ambassador

Hi,


I know the problems that Frank has raised. However, it should also be 
mentioned that many critical bugs have been fixed in the major versions. 
We are working on the fixes ourselves.


We and others have written a lot of tools for ourselves in the last 10 
years to improve migration/update and upgrade paths/strategy.


From version to version, we also test for up to 6 months before putting 
them into production.


However, our goal is always to use Ceph versions that still get 
backports and on the other hand, only use the features we really need.
Our developers also always aim to bring bug fixes upstream and into the 
supported versions.


By the way, regarding performance I recommend the Cephalocon 
presentations by Adam and Mark. There you can learn what efforts are 
made to improve ceph performance for current and future versions.


Regards, Joachim


___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 15.05.23 at 12:11, Frank Schilder wrote:

What are the main reasons for not upgrading to the latest and greatest?

Because more often than not it isn't.

I guess when you write "latest and greatest" you talk about features. When we admins talk about 
"latest and greatest" we talk about stability. The times that one could jump with a production 
system onto a "stable" release with the ending .2 are long gone. Anyone who becomes an early 
adopter is more and more likely to experience serious issues. Which leads to more admins waiting with 
upgrades. Which in turn leads to more bugs discovered only at late releases. Which again makes more admins 
postpone an upgrade. A vicious cycle.

A long time ago there was a discussion about exactly this problem and the 
admins were pretty much in favor of lengthening the release cycle to at least 
4 years if not longer. It's simply too many releases with too many serious bugs 
not fixed, lately not even during their official life time. Octopus still has 
serious bugs but is EOL.

I'm not surprised that admins give up on upgrading entirely and stay on a 
version until their system dies.

To give you one from my own experience, upgrading from mimic latest to octopus latest. 
This experience almost certainly applies to every upgrade that involves an OSD format 
change (the infamous "quick fix" that could take several days per OSD and crush 
entire clusters).

There is an OSD conversion involved in this upgrade and we found out that out 
of 2 possible upgrade paths, one leads to a heavily performance degraded 
cluster with no possibility to recover other than redeploying all OSDs step by 
step. Funnily enough, the problematic procedure is the one described in the 
documentation - it hasn't been updated until today despite users still getting 
caught in this trap.

To give you an idea of what amount of work is now involved in an attempt to 
avoid such pitfalls, here our path:

We set up a test cluster with a script producing realistic workload and started 
testing an upgrade under load. This took about a month (meaning repeating the 
upgrade with a cluster on mimic deployed and populated from scratch every time) 
to confirm that we managed to get onto a robust path avoiding a number of 
pitfalls along the way - mainly the serious performance degradation due to OSD 
conversion, but also an issue with stray entries plus noise. A month! Once we 
were convinced that it would work - meaning we did run it a couple of times 
without any further issues being discovered, we started upgrading our 
production cluster.

Went smooth until we started the OSD conversion of our FS meta data OSDs. They 
had a special performance optimized deployment resulting in a large number of 
100G OSDs with about 30-40% utilization. These OSDs started crashing with some 
weird corruption. Turns out - thanks Igor! - that while spill-over from fast to 
slow drive was handled, the other direction was not. Our OSDs crashed because 
Octopus apparently required substantially more space on the slow device and 
couldn't use the plenty of fast space that was actually available.

The whole thing ended in 3 days of complete downtime and me working 12 hour 
days on the weekend. We managed to recover from this only because we had a 
larger delivery of hardware already on-site and I could scavenge parts from 
there.

So, the story was that after 1 month of testing we still run into 3 days of 
downtime, because there was another unannounced change that broke a config that 
was working fine for years on mimic.

To say the same thing with different words: major version upgrades have become 
very disruptive and require a lot of effort to get halfway right. And I'm not 
talking about the deployment system here.

Add to this list the still open cases discussed on the list about MDS dentry 
corruption, snapshots disappearing/corrupting together with a lack of good 
built-in tools for detection and repair, 

[ceph-users] Re: Veeam backups to radosgw seem to be very slow

2023-04-26 Thread Joachim Kraftmayer - ceph ambassador

"bucket does not exist" or "permission denied".
I had received similar error messages with another client program. The default 
region did not match the region of the cluster.

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

Am 25.04.23 um 15:01 schrieb Boris Behrens:

We have a customer that tries to use veeam with our rgw objectstorage and
it seems to be blazingly slow.

What also seems to be strange, that veeam sometimes show "bucket does not
exist" or "permission denied".

I've tested parallel and everything seems to work fine from the s3cmd/aws
cli standpoint.

Does anyone here ever experienced veeam problems with rgw?

Cheers
  Boris
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



[ceph-users] Re: OSD_TOO_MANY_REPAIRS on random OSDs causing clients to hang

2023-04-26 Thread Joachim Kraftmayer - ceph ambassador

Hello Thomas,
I would strongly recommend that you read the messages on the mailing list 
regarding ceph versions 16.2.11, 16.2.12 and 16.2.13.


Joachim

___
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 26.04.23 at 13:24, Thomas Hukkelberg wrote:

Hi all,

Over the last 2 weeks we have experienced several OSD_TOO_MANY_REPAIRS errors 
that we struggle to handle in a non-intrusive manner. Restarting MDS + 
hypervisor that accessed the object in question seems to be the only way we can 
clear the error so we can repair the PG and recover access. Any pointers on how 
to handle this issue in a more gentle way than rebooting the hypervisor and 
failing the MDS would be welcome!


The problem seems to only affect one specific pool (id 42) that is used for 
cephfs_data. This pool is our second cephfs data pool in this cluster. The data 
in the pool is accessible via LXC container via Samba and have the cephfs 
filesystem bind-mounted from hypervisor.

Ceph is recently updated to version 16.2.11 (pacific) -- kernel version is 
5.13.19-6-pve on OSD-hosts/samba-containers and 5.19.17-2-pve on MDS-hosts.


The following warnings are issued:
$ ceph health detail
HEALTH_WARN 1 clients failing to respond to capability release; Too many 
repaired reads on 1 OSDs; Degraded data redundancy: 1/2648430090 objects 
degraded (0.000%), 1 pg degraded; 1 slow ops, oldest one blocked 
for 608 sec, osd.34 has slow ops
[WRN] MDS_CLIENT_LATE_RELEASE: 1 clients failing to respond to capability 
release
 mds.hk-cephnode-65(mds.0): Client hk-cephnode-56 failing to respond to 
capability release client_id: 9534859837
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 1 OSDs
 osd.34 had 9936 reads repaired
[WRN] PG_DEGRADED: Degraded data redundancy: 1/2648430090 objects degraded 
(0.000%), 1 pg degraded
 pg 42.e2 is active+recovering+degraded+repair, acting [34,275,284]
[WRN] SLOW_OPS: 1 slow ops, oldest one blocked for 608 sec, osd.34 has slow ops



The logs for OSD.34 are flooded with these messages:
root@hk-cephnode-53:~# tail /var/log/ceph/ceph-osd.34.log
2023-04-26T11:41:00.760+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 missing primary copy of 42:4703efac:::10003d86a99.0001:head, will try 
copies on 275,284
2023-04-26T11:41:00.784+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0x on 
42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0x on 
42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.812+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 missing primary copy of 42:4703efac:::10003d86a99.0001:head, will try 
copies on 275,284
2023-04-26T11:41:00.824+0200 7f03a821f700 -1 osd.34 1352563 get_health_metrics 
reporting 1 slow ops, oldest is osd_op(client.9534859837.0:20412906 42.e2 
42:4703efac:::10003d86a99.0001:head [read 0~1048576 [307@0] out=1048576b] 
snapc 0=[] RETRY=5 ondisk+retry+read+known_if_redirected e1352553)
2023-04-26T11:41:00.824+0200 7f03a821f700  0 log_channel(cluster) log [WRN] : 1 
slow requests (by type [ 'delayed' : 1 ] most affected pool [ 'qa-cephfs_data' 
: 1 ])
2023-04-26T11:41:00.840+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0x on 
42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0x on 
42:4703efac:::10003d86a99.0001:head
2023-04-26T11:41:00.864+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 missing primary copy of 42:4703efac:::10003d86a99.0001:head, will try 
copies on 275,284
2023-04-26T11:41:00.888+0200 7f03921f3700 -1 log_channel(cluster) log [ERR] : 
42.e2 full-object read crc 0xebd673ed != expected 0x on 
42:4703efac:::10003d86a99.0001:head



We have tried the following:
  - Restarting the OSD in question clears the error for a few seconds but then 
we also we get OSD_TOO_MANY_REPAIRS on OSDs with PGs that holds the object that 
have blocked I/O.

  - Trying to repair the PG seems to restart every 10 second and not actually 
do anything/progressing. (Is there a way to check repair progress?)

  - Restarting the MDS and hypervisor clears the error (the hypervisor hangs 
for several minutes before timing out). However if the object is requested 
again the error reoccurs. If we don't access the object we are able to 
eventually repair the PG.

  - Occasionally setting the primary-affinity to 0 for the primary OSD in the 
PG clears the error after restarting all affected OSD and we are able to repair 
the PG (unless the object is accessed during recovery) and access to the object 
is OK afterwards.

  - Finding and