[ceph-users] Re: Clients failing to advance oldest client?

2024-03-25 Thread David Yang
You can use the "ceph health detail" command to see which clients are
not responding.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Clients failing to advance oldest client?

2024-03-25 Thread Erich Weiler
Ok! Thank you. Is there a way to tell which client is slow?

> On Mar 25, 2024, at 9:06 PM, David Yang  wrote:
> 
> It is recommended to disconnect the client first and then observe
> whether the cluster's slow requests recover.
> 
> Erich Weiler  于2024年3月26日周二 05:02写道:
>> 
>> Hi Y'all,
>> 
>> I'm seeing this warning via 'ceph -s' (this is on Reef):
>> 
>> # ceph -s
>>   cluster:
>> id: 58bde08a-d7ed-11ee-9098-506b4b4da440
>> health: HEALTH_WARN
>> 3 clients failing to advance oldest client/flush tid
>> 1 MDSs report slow requests
>> 1 MDSs behind on trimming
>> 
>>   services:
>> mon: 5 daemons, quorum
>> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)
>> mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
>> mds: 1/1 daemons up, 1 standby
>> osd: 46 osds: 46 up (since 3d), 46 in (since 2w)
>> 
>>   data:
>> volumes: 1/1 healthy
>> pools:   4 pools, 1313 pgs
>> objects: 258.13M objects, 454 TiB
>> usage:   688 TiB used, 441 TiB / 1.1 PiB avail
>> pgs: 1303 active+clean
>>  8active+clean+scrubbing
>>  2active+clean+scrubbing+deep
>> 
>>   io:
>> client:   131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr
>> 
>> I googled around and looked at the docs and it seems like this isn't a
>> critical problem, but I couldn't find a clear path to resolution.  Does
>> anyone have any advice on what I can do to resolve the health issues up top?
>> 
>> My CephFS filesystem is incredibly busy so I have a feeling that has
>> some impact here, but not 100% sure...
>> 
>> Thanks as always for the help!
>> 
>> cheers,
>> erich
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Clients failing to advance oldest client?

2024-03-25 Thread David Yang
It is recommended to disconnect the client first and then observe
whether the cluster's slow requests recover.

Erich Weiler  于2024年3月26日周二 05:02写道:
>
> Hi Y'all,
>
> I'm seeing this warning via 'ceph -s' (this is on Reef):
>
> # ceph -s
>cluster:
>  id: 58bde08a-d7ed-11ee-9098-506b4b4da440
>  health: HEALTH_WARN
>  3 clients failing to advance oldest client/flush tid
>  1 MDSs report slow requests
>  1 MDSs behind on trimming
>
>services:
>  mon: 5 daemons, quorum
> pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)
>  mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
>  mds: 1/1 daemons up, 1 standby
>  osd: 46 osds: 46 up (since 3d), 46 in (since 2w)
>
>data:
>  volumes: 1/1 healthy
>  pools:   4 pools, 1313 pgs
>  objects: 258.13M objects, 454 TiB
>  usage:   688 TiB used, 441 TiB / 1.1 PiB avail
>  pgs: 1303 active+clean
>   8active+clean+scrubbing
>   2active+clean+scrubbing+deep
>
>io:
>  client:   131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr
>
> I googled around and looked at the docs and it seems like this isn't a
> critical problem, but I couldn't find a clear path to resolution.  Does
> anyone have any advice on what I can do to resolve the health issues up top?
>
> My CephFS filesystem is incredibly busy so I have a feeling that has
> some impact here, but not 100% sure...
>
> Thanks as always for the help!
>
> cheers,
> erich
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2024-03-25 Thread Marc
> "complexity, OMG!!!111!!!" is not enough of a statement. You have to explain
> what complexity you gain and what complexity you reduce.
> Installing SeaweedFS consists of the following: `cd seaweedfs/weed && make
> install`
> This is the type of problem that Ceph is trying to solve, and starting a
> discussion by saying that everything is bad, without providing any helpful
> message is useless FUD.
> 
> Max

Max stop eating seaweed
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard



On 25-03-2024 23:07, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty 
OSDs. How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backill set at 3 just makes no sense to me but 
alas.


It seems I can just increase osd_max_backfill even further to get the 
numbers I want so that will do. Thank you all for taking the time to 
look at this.


It's a huge change and 42% of you data need to be moved.
And this move is not only to the new OSD but also between the existing 
OSD, but

they are busy with backfilling so they have no free backfill reservation.

I do recommend this document by Joshua Baergen at Digital Ocean that 
explains
backfilling and the problem with it and there solution, a tool called 
pgremapper.


Forgot the link
https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf


Thanks again, seems the explanation for the low number of concurrent 
backfills is then simply that backfill_wait can hold partial reservations.


Mvh.

Torkil

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] put bucket notification configuration - access denied

2024-03-25 Thread Giada Malatesta

Hello everyone,

we are facing a problem regarding the s3 operation put bucket 
notification configuration.


We are using Ceph version 17.2.6. We are trying to configure buckets in 
our cluster so that  a notification message is sent via amqps protocol 
when the content of the bucket change. To do so, we created a local rgw 
user with "special" capabilities and we wrote ad hoc policies for this 
user (list of all buckets, read access to all buckets and possibility to 
add, list and delete bucket configurations).


The problems regards the configurations of all buckets except the one he 
owns, when doing this put bucket notification configuration 
cross-account operation  we get an access denied error.


I have the suspect that this problem is related to the version we are 
using, because when we were doing tests on another cluster we were using 
version 18.2.1 and we did not face this problem. Can you confirm my 
hypothesis?


Thanks,

GM.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph RGW reply "ERROR: S3 error: 404 (NoSuchKey)" but rgw object metadata exist

2024-03-25 Thread xuchenhuig
Hi, 
My ceph cluster has 9 nodes for Ceph Object Store.  Recently, I have 
experienced data loss that reply 404 (NoSuchKey) by s3cmd get xxx command.  
However, I can get metadata info by s3cmd ls  xxx. The RGW object size is above 
1GB that have many multipart object.  Commanding 'rados -p 
default.rgw.buckets.data stats object' show that it only have head object, all 
of multipart and shadow part have gone. The bucket data only support write and 
read operation, no delete, and has no lifecycle policy.

   I have found similar problem in https://tracker.ceph.com/issues/47866 that 
had repaired in v16.0.0.  Maybe this is new data loss problem that very serious 
for us. 

ceph version: 16.2.5
#command info:
s3cmd ls 
s3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz
 
2024-03-13 09:27   1208269953  
s3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz

s3cmd get 
s3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz
download: 
's3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz'
 -> './CC-MAIN-20200118052321-20200118080321-
00547.warc.gz'  [1 of 1] ERROR: Download of 
'./CC-MAIN-20200118052321-20200118080321-00547.warc.gz' failed (Reason: 404 
(NoSuchKey))
ERROR: S3 error: 404 (NoSuchKey)

# head exist and size is 0, multipart and shadow had lost
 rados -p default.rgw.buckets.data stat 
df8c0fe6-01c8-4c07-b310-2d102356c004.76248.1__multipart_batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz.2~C2M72EJLHrNe_fnHnifS4N7pw70hVmE.1
 error stat-ing 
eck6m2.rgw.buckets.data/df8c0fe6-01c8-4c07-b310-2d102356c004.76248.1__multipart_batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz.2~C2M72EJLHrNe_fnHnifS4N7pw70hVmE.1:
 (2) No such file or directory

thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Quincy/Dashboard: Object Gateway not accessible after applying self-signed cert to rgw service

2024-03-25 Thread stephan . budach
Hi,

I am running a Ceph cluster and configured RGW for S3, initially w/o SSL. The 
service works nicely and I updated the service usinfg SSL certs, signed by our 
own CA, just as I already did for the dashboard itself. However, as soon as I 
applied the new config, the dashboard wasn't able to access and display the 
service anymore, while the service itself still works, now using the supplied 
SSL certificate. The error supplied is:

Error 500
The server encountered an unexpected condition which prevented it from 
fulfilling the request.

My guess is, that the dashboard for some reason doesn't like the certificate, 
the rgw service is providing, despite the fact, that itself is using it. Any 
hints on how to make dashboard display the Object Gatway again?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mounting A RBD Via Kernal Modules

2024-03-25 Thread Alwin Antreich
Hi,


March 24, 2024 at 8:19 AM, "duluxoz"  wrote:
> 
> Hi,
> 
> Yeah, I've been testing various configurations since I sent my last 
> 
> email - all to no avail.
> 
> So I'm back to the start with a brand new 4T image which is rbdmapped to 
> 
> /dev/rbd0.
> 
> Its not formatted (yet) and so not mounted.
> 
> Every time I attempt a mkfs.xfs /dev/rbd0 (or mkfs.xfs 
> 
> /dev/rbd/my_pool/my_image) I get the errors I previous mentioned and the 
> 
> resulting image then becomes unusable (in ever sense of the word).
> 
> If I run a fdisk -l (before trying the mkfs.xfs) the rbd image shows up 
> 
> in the list - no, I don't actually do a full fdisk on the image.
> 
> An rbd info my_pool:my_image shows the same expected values on both the 
> 
> host and ceph cluster.
> 
> I've tried this with a whole bunch of different sized images from 100G 
> 
> to 4T and all fail in exactly the same way. (My previous successful 100G 
> 
> test I haven't been able to reproduce).
> 
> I've also tried all of the above using an "admin" CephX(sp?) account - I 
> 
> always can connect via rbdmap, but as soon as I try an mkfs.xfs it 
> 
> fails. This failure also occurs with a mkfs.ext4 as well (all size drives).
> 
> The Ceph Cluster is good (self reported and there are other hosts 
> 
> happily connected via CephFS) and this host also has a CephFS mapping 
> 
> which is working.
> 
> Between running experiments I've gone over the Ceph Doco (again) and I 
> 
> can't work out what's going wrong.
> 
> There's also nothing obvious/helpful jumping out at me from the 
> 
> logs/journal (sample below):
> 
> ~~~
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
> 
> 524773 0~65536 result -1
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
> 
> 524772 65536~4128768 result -1
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: blk_print_req_error: 119 
> 
> callbacks suppressed
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector 
> 
> 4298932352 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
> 
> 524774 0~65536 result -1
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 
> 
> 524773 65536~4128768 result -1
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
> 
> Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector 
> 
> 4298940544 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
> 
> ~~~
> 
> Any ideas what I should be looking at?

Could you please share the command you've used to create the RBD?

Cheers,
Alwin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Linux Laptop Losing CephFS mounts on Sleep/Hibernate

2024-03-25 Thread matthew
Hi All,

So I've got a Ceph Reef Cluster (latest version) with a CephFS system set up 
with a number of directories on it.

On a Laptop (running Rocky Linux (latest version)) I've used fstab to mount a 
number of those directories - all good, everything works, happy happy joy joy! 
:-)

However, when the laptop goes into sleep or hibernate mode (ie when I close the 
lid) and then bring it back out of sleep/hibernate (ie open the lid) the CephFS 
mounts are "not present". The only way to get them back is to run `mount -a` as 
either root or as sudo. This, as I'm sure you'll agree, is less than ideal - 
especially as this is a pilot project for non-admin users (ie they won't have 
access to the root account or sudo on their own (corporate) laptops).

So, my question to the combined wisdom of the Community is what's the best way 
to resolve this issue?

I've looked at autofs, and even tried (half-heartedly - it was late, and I 
wanted to go home  :-) ) to get this running, but I'm note sure if this is the 
best way to resolve things.

All help and advice on this greatly appreciated - thank in advance

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard



On 25-03-2024 22:58, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty OSDs. 
How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backill set at 3 just makes no sense to me but 
alas.


It seems I can just increase osd_max_backfill even further to get the 
numbers I want so that will do. Thank you all for taking the time to 
look at this.


It's a huge change and 42% of you data need to be moved.
And this move is not only to the new OSD but also between the existing 
OSD, but

they are busy with backfilling so they have no free backfill reservation.


If I have 60 backfills going on that would be 60 read reservations and 
60 write reservations if I understand it correctly. The only way I can 
see that getting stuck at 60 backfills with osd_max_backfill = 3 is for 
those 60 reservations to be tied up on 20 OSDs being the only ones 
either read from or written to, and all other OSDs waiting on those.

 > I do recommend this document by Joshua Baergen at Digital Ocean that

explains
backfilling and the problem with it and there solution, a tool called 
pgremapper.


Thanks, I'll take a look at that =)

Mvh.

Torkil


--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Mounting A RBD Image via Kernal Modules

2024-03-25 Thread matthew
Hi All,

I'm looking for a bit of advice on the subject of this post. I've been "staring 
at the trees so long I can't see the forest any more".  :-)

Rocky Linux Client latest version.
Ceph Reef latest version.

I have read *all* the doco on the Ceph website.
I have created a pool (my_pool) and an image (my_image).
I had activated the pool for RBD.
I can run the `rbdmap map` command on the client and the image shows up as 
/dev/rbd0 (and also /dev/rbd/my_pool/my_image).

But here's where I'm running into issues - and I'm pretty sure it's a 'Level 8' 
issue, so it'll be something simple that I'm just not "getting":

Do I need to run `mkfs` on /dev/rbd0 before I try `mount 
/dev/rbd/my_pool/my_image /mnt/rbd_image`?

The reason I ask is that I've tried to mount the image before I run 'mkfs' and 
I get back `mount: /mnt/rbd_image: wrong fs type, bad option, bad superblock on 
/dev/rbdo, missing codepage or helper program, or other error`.

I've also tried to mount the image after I run 'mkfs' and I get back `mount: 
/mnt/rbd_image: can't read superblock on /dev/rbd0`.

Basically, as I've said, I'm missing or don't understand *something* about this 
process - which is why I'm now seeking the collective wisdom of the Community.

All help and advice greatly appreciated - thanks in advance

Cheers

Dulux-Oz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why a lot of pgs are degraded after host(+osd) restarted?

2024-03-25 Thread jaemin joo
I understood the mechanism more through your answer.
I'm using erasure coding and backfilling step took quite a long time :(
If there was just a lot of pg peering. I think it's reasonable. but I was 
curious why there was a lot of backfill_wait instead of peering.
(e.g. pg 9.5a is stuck undersized for 39h, current state 
active+undersized+degraded+remapped+backfill_wait )

let me know if you have the tips to increase the performance of backfill or 
prevent unnecessary backfill.
Thank you for your answer.

Joshua Baergen wrote:
> Hi Jaemin,
> 
> It is normal for PGs to become degraded during a host reboot, since a
> copy of the data was taken offline and needs to be resynchronized
> after the host comes back. Normally this is quick, as the recovery
> mechanism only needs to modify those objects that have changed while
> the host is down.
> 
> However, if you have backfills ongoing and reboot a host that contains
> OSDs involved in those backfills, then those backfills become
> degraded, and you will need to wait for them to complete for
> degradation to clear. Do you know if you had backfills at the time the
> host was rebooted? If so, the way to avoid this is to wait for
> backfill to complete before taking any OSDs/hosts down for
> maintenance.
> 
> Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephadm host keeps trying to set osd_memory_target to less than minimum

2024-03-25 Thread mads2a
I have a virtual ceph cluster running 17.2.6 with 4 ubuntu 22.04 hosts in it, 
each with 4 OSD's attached. The first 2 servers hosting mgr's have 32GB of RAM 
each, and the remaining have 24gb 
For some reason i am unable to identify, the first host in the cluster appears 
to constantly be trying to set the osd_memory_target variable to roughly half 
of what the calculated minimum is for the cluster, i see the following spamming 
the logs constantly
Unable to set osd_memory_target on my-ceph01 to 480485376: error parsing value: 
Value '480485376' is below minimum 939524096
Default is set to 4294967296.
I did double check and osd_memory_base (805306368) + osd_memory_cache_min 
(134217728) adds up to minimum exactly
osd_memory_target_autotune is currently enabled. But i cannot for the life of 
me figure out how it is arriving at 480485376 as a value for that particular 
host that even has the most RAM. Neither the cluster or the host is even 
approaching max utilization on memory, so it's not like there are processes 
competing for resources.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] #1359 (update) Ceph filesystem failure | Ceph filesystem probleem

2024-03-25 Thread Postmaster C (Simon)

English follows Dutch

## Update 2024-03-19

Positief nieuws; we zijn nu bezig met het kopiëren van data uit
CephFS. We konden het filesystem weer mounten met hulp van 42on, onze
support club. We kopiëren nu de data en dat lijkt goed te gaan.  Op elk
moment kunnen we tegen problematische metadata aanlopen, dus afwachten 
en duimen. Als we de data uit CephFS hebben gekopieerd op tijdelijke 
storage, kunnen we verdere oplossingen voor de toekomst verzinnen en 
implementeren.


NB: neem contact op met postmaster (mailto:postmas...@science.ru.nl)
als je een urgent verzoek hebt voor een *kleine* set van files in een
specifieke locatie, dan kunnen we daar prioriteit aan geven. Een 
Petabyte aan data kopiëren duurt weken/maanden, kleine datasets (< 1TB) 
kan wel relatief snel.



bron: https://cncz.science.ru.nl/nl/cpk/1359

=

## Update 2024-03-19

Somewhat good news; we are now copying the RDR data out of CephFS. We
have been able to mount CephFS again with help from 42on (our support
party). Copying to temporary storage is going OK, but we can run into
issues at any time (or not). We're hopeful that this process will let
us recover the data stored in CephFs and then we can look for future
solutions.

NB: contact postmaster (mailto:postmas...@science.ru.nl) if you have
urgent requests for *small* sets of files in a particular location, so
we may restore this with priority. A Petabyte of data takes weeks/months
to copy, but a small amount (< 1TB) can be retrieved relatively fast.


source: https://cncz.science.ru.nl/en/cpk/1359

--
Postmaster: Simon Oosthoek
Postmaster Phone: +31 24 365 3535
Personal Phone: +31 24 365 2097


OpenPGP_signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] S3 Partial Reads from Erasure Pool

2024-03-25 Thread E3GH75
I am dealing with a cluster that is having terrible performance with partial 
reads from an erasure coded pool. Warp tests and s3bench tests result in 
acceptable performance but when the application hits the data, performance 
plummets. Can anyone clear this up for me, When radosgw gets a partial read 
does it have to assemble all the rados objects that make up the s3 object 
before returning the range? With a replicated poll i am seeing 6 to 7 GiB/s of 
read performance and only 1GiB/s of read from the erasure coded pool which 
leads me to believe that the replicated pool is returning just the rados 
objects for the partial s3 object and the erasure coded pool is not.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Dashboard Clear Cache

2024-03-25 Thread ashar . khan
Hello Ceph members,

How do I clear the Ceph dashboard cache? Kindly guide me on how to do this.


Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2024-03-25 Thread Aaron Moate
We were having this same error;  after some troubleshooting it turned out that 
the 17.2.7 cephadm orchestrator's ssh client was choking on the 
keyboard-interactive AuthenticationMethod (which is really PAM);  Our sshd 
configuration was:

AuthenticationMethods keyboard-interactive publickey,keyboard-interactive 
gssapi-with-mic,keyboard-interactive

Thus, cephadm was trying to use "publickey,keyboard-interactive";  publickey 
would succeed, but the cephadm ssh client would close the connection as soon as 
a follow-up keyboard-interactive method was attempted.  Adding this to 
sshd_config for each orchestrator seemed to fix it by using only publickey 
AuthenticationMethod for just the cephadm orchestrators, but using the standard 
config for everybody else:

Match Address 
  # For some reason the 17.2.7 cephadm orchestrator
  # chokes on keyboard-interactive (PAM)
  # AuthenticationMethod;  thus, exclude it.
  AuthenticationMethods publickey
  PermitRootLogin yes

Match Address 
  # For some reason the 17.2.7 cephadm orchestrator
  # chokes on keyboard-interactive (PAM)
  # AuthenticationMethod;  thus, exclude it.
  AuthenticationMethods publickey
  PermitRootLogin yes

Cheers.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph-Cluster integration with Ovirt-Cluster

2024-03-25 Thread ankit
Hi Guys,

I have a running ovirt-4.3 cluster with 1 manager and 4 hypervisors nodes and 
for storage using traditional SAN storage which is connect using iscsi. where i 
can create VM's and assign storge from SAN. This is running fine since a decade 
but now i want to move from traditional SAN storage to ceph storage cluster due 
to slow speed and scalability issues of storage. 

But i am a very newbie in ceph but able to install ceph-cluster reef_v18 in my 
lab environment with 5 VM's (2 Mon's and Manager and 3 OSD's) on Debian 12.  
And i have installed ovirt 4.5 with 1 hypervisor on VM's on centos stream 9 and 
want to integrate it with ceph-cluster so that i can use ceph-cluster storage 
like i am using SAN with my ovirt and I have tried to do some google and read 
ceph documentation but did not find anything regarding this. 

So how can i do it? 

If you need any other information, please let me know!

Many Thanks!
PJ111288
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MANY_OBJECT_PER_PG on 1 pool which is cephfs_metadata

2024-03-25 Thread e . fazenda
Dear Eugen,

Sorry i forgot to update the case.

I have upgraded to the latest pacific release 16.2.15 and i have done the 
necessary for the pg_num :) 

Thanks for the followup on this.

Topic can be closed.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Adding new OSD's - slow_ops and other issues.

2024-03-25 Thread jskr
Hi. 

We have a cluster working very nicely since it was put up more than a year ago. 
Now we needed to add more NVMe drives to expand. 

After setting all the "no" flags.. we added them using

$ ceph orch osd add  

The twist is that we have managed to get the default weights set to 1 for all 
disks not 7.68 (as the default for the ceph orch command. 

Thus we did a subsequent reweight to change weight -- and then removed the "no" 
flags. 

As a consequence we had a bunch of OSD's delivering slow_ops and -- after 
manually restarting osd's to get rid of them - the system returned to normal. 

... second try... 

Same drill - but somehow the ceph orch command failed to bring the new OSD 
online before we ran the reweight command ... and it works flawlessly 

... third try ... 

Same drill - but now ceph orch brought the new OSD into the system - and we saw 
excactly the same problem again. Being a bit wiser - we forcefully restarted 
the new OSD.. and everything whet back into normal mode again. 

Thus it seems like the "reweight" command on online OSD's have a bad effect on 
our setup - causing major service disruption. 

1) Is it possible to "bulk" change default weights on all OSD's without a huge 
data movement going on? 
2) or Is it possible to instrurct "ceph orch osd add" to set default weight 
before it putting the new OSD into the system? 

I would not expect above to be expected behaviour - if someone has ideas about 
what goes on more than above please share? 

Setup: 
# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

43 7.68 TB NVMe's over 12 OSD hosts - all connected using 2x 100GbitE 


Thanks Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG damaged "failed_repair"

2024-03-25 Thread romain . lebbadi-breteau
Hi,

Sorry for the broken formatting. Here are the outputs again.

ceph osd df:

ID  CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA  OMAP META 
AVAIL%USE   VAR   PGS  STATUS
 3hdd  1.81879 0  0 B  0 B   0 B  0 B  0 B  
0 B  0 00down
12hdd  1.81879   1.0  1.8 TiB  385 GiB   383 GiB  6.7 MiB  1.4 GiB  1.4 
TiB  20.66  1.73   18  up
13hdd  1.81879   1.0  1.8 TiB  422 GiB   421 GiB  5.8 MiB  1.3 GiB  1.4 
TiB  22.67  1.90   17  up
15hdd  1.81879   1.0  1.8 TiB  264 GiB   263 GiB  4.6 MiB  1.1 GiB  1.6 
TiB  14.17  1.19   14  up
16hdd  9.09520   1.0  9.1 TiB  1.0 TiB  1023 GiB  8.8 MiB  2.6 GiB  8.1 
TiB  11.01  0.92   65  up
17hdd  1.81879   1.0  1.8 TiB  319 GiB   318 GiB  6.1 MiB  1.0 GiB  1.5 
TiB  17.13  1.43   15  up
 1hdd  5.45749   1.0  5.5 TiB  546 GiB   544 GiB  7.8 MiB  1.4 GiB  4.9 
TiB   9.76  0.82   29  up
 4hdd  5.45749   1.0  5.5 TiB  801 GiB   799 GiB  8.3 MiB  2.4 GiB  4.7 
TiB  14.34  1.20   44  up
 8hdd  5.45749   1.0  5.5 TiB  708 GiB   706 GiB  9.7 MiB  2.1 GiB  4.8 
TiB  12.67  1.06   36  up
11hdd  5.45749 0  0 B  0 B   0 B  0 B  0 B  
0 B  0 00down
14hdd  1.81879   1.0  1.8 TiB  200 GiB   198 GiB  3.8 MiB  1.3 GiB  1.6 
TiB  10.71  0.90   10  up
 0hdd  9.09520 0  0 B  0 B   0 B  0 B  0 B  
0 B  0 00down
 5hdd  9.09520   1.0  9.1 TiB  859 GiB   857 GiB   17 MiB  2.1 GiB  8.3 
TiB   9.23  0.77   46  up
 9hdd  9.09520   1.0  9.1 TiB  924 GiB   922 GiB   11 MiB  2.3 GiB  8.2 
TiB   9.92  0.83   55  up
   TOTAL   53 TiB  6.3 TiB   6.3 TiB   90 MiB   19 GiB   46 
TiB  11.95   
MIN/MAX VAR: 0.77/1.90  STDDEV: 4.74

ceph osd pool ls detail :

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 1 pgp_num 1 autoscale_mode on last_change 32 flags hashpspool 
stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 9327 lfor 0/0/104 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 9018 lfor 0/0/104 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins 
pg_num 32 pgp_num 32 autoscale_mode on last_change 9149 lfor 0/0/106 flags 
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 5 'polyphoto_backup' replicated size 3 min_size 2 crush_rule 0 object_hash 
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 372 lfor 0/0/362 
flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm snappy 
compression_mode aggressive application rbd

The error seems to come from a software error in Ceph. I see this error in the 
logs : "FAILED ceph_assert(clone_overlap.count(clone))"

Thanks,
Romain Lebbadi-Breteau
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why you might want packages not containers for Ceph deployments

2024-03-25 Thread maxadamo
Dear Nico,

do you think it is sensible and it's a precise statement saying that "we can't 
reduce complexity by adding a layer of complexity"?  
Containers are always adding a so-called layer, but people keep using them, and 
in some cases, they offload complexity from another side. 
Claiming the "complexity", without explaining examining the details is pure 
FUD. 
"complexity, OMG!!!111!!!" is not enough of a statement. You have to explain 
what complexity you gain and what complexity you reduce. 
Installing SeaweedFS consists of the following: `cd seaweedfs/weed && make 
install`
This is the type of problem that Ceph is trying to solve, and starting a 
discussion by saying that everything is bad, without providing any helpful 
message is useless FUD. 

Max
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck

2024-03-25 Thread e . fazenda
Dear Eugen,

Thanks again for the help.

We managed to upgrade to a minor release 16.2.3, next week we will upgrade to 
latest 16.2.15.
You were right about the number of manager which was blocking the update.

Thanks again for the help.

I

Topic solved.

Best Regards.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Kai Stian Olstad

On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty 
OSDs. How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backill set at 3 just makes no sense to me 
but alas.


It seems I can just increase osd_max_backfill even further to get 
the numbers I want so that will do. Thank you all for taking the 
time to look at this.


It's a huge change and 42% of you data need to be moved.
And this move is not only to the new OSD but also between the existing OSD, but
they are busy with backfilling so they have no free backfill reservation.

I do recommend this document by Joshua Baergen at Digital Ocean that explains
backfilling and the problem with it and there solution, a tool called 
pgremapper.


Forgot the link
https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Kai Stian Olstad

On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
My tally came to 412 out of 539 OSDs showing up in a blocked_by list 
and that is about every OSD with data prior to adding ~100 empty OSDs. 
How 400 read targets and 100 write targets can only equal ~60 
backfills with osd_max_backill set at 3 just makes no sense to me but 
alas.


It seems I can just increase osd_max_backfill even further to get the 
numbers I want so that will do. Thank you all for taking the time to 
look at this.


It's a huge change and 42% of you data need to be moved.
And this move is not only to the new OSD but also between the existing OSD, but
they are busy with backfilling so they have no free backfill reservation.

I do recommend this document by Joshua Baergen at Digital Ocean that explains
backfilling and the problem with it and there solution, a tool called 
pgremapper.

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] quincy-> reef upgrade non-cephadm

2024-03-25 Thread Christopher Durham
Hi,
I am upgrading my test cluster from 17.2.6 (quincy) to 18.2.2 (reef).
As it was an rpm install, i am following the directions here:
Reef — Ceph Documentation

| 
| 
|  | 
Reef — Ceph Documentation


 |

 |

 |


The upgrade worked, but I have some observations and questions before I move to 
my production cluster:

1. I see no systemd units with the fsid in them, as described in the document 
above. Both before and after the upgrade, my mon and other units are:
ceph-mon@.serviceceph-osd@[N].service
etc
Should I be concerned?
2. Does order matter? Based on past upgrades, I do not think so, but I wanted 
to be sure. For example, can I update:
mon/mds/radosgw/mgrs first, then afterwards update the osds? This is what i 
have done in previous updates and and all was well.
3. Again on order, if a server serves say, a mon and mds, I can't really easily 
update one without the other, based on shared libraries and such. 
It appears that that is ok, based on my test cluster, but wanted to be sure. 
Again if an mds is one of the servers to update, I know I have to updatethe 
remaining one after max_mds is set to 1 and others are stopped, first.

4. After upgrade of my mgr node I get:
"Module [several module names] has missing NOTIFY_TYPES member"
in ceph-mgr..log 

But the mgr starts up eventually

The system is Rocky Linux 8.9
Thanks for any thoughts
-Chris

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Clients failing to advance oldest client?

2024-03-25 Thread Erich Weiler

Hi Y'all,

I'm seeing this warning via 'ceph -s' (this is on Reef):

# ceph -s
  cluster:
id: 58bde08a-d7ed-11ee-9098-506b4b4da440
health: HEALTH_WARN
3 clients failing to advance oldest client/flush tid
1 MDSs report slow requests
1 MDSs behind on trimming

  services:
mon: 5 daemons, quorum 
pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)

mgr: pr-md-01.jemmdf(active, since 3w), standbys: pr-md-02.emffhz
mds: 1/1 daemons up, 1 standby
osd: 46 osds: 46 up (since 3d), 46 in (since 2w)

  data:
volumes: 1/1 healthy
pools:   4 pools, 1313 pgs
objects: 258.13M objects, 454 TiB
usage:   688 TiB used, 441 TiB / 1.1 PiB avail
pgs: 1303 active+clean
 8active+clean+scrubbing
 2active+clean+scrubbing+deep

  io:
client:   131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr

I googled around and looked at the docs and it seems like this isn't a 
critical problem, but I couldn't find a clear path to resolution.  Does 
anyone have any advice on what I can do to resolve the health issues up top?


My CephFS filesystem is incredibly busy so I have a feeling that has 
some impact here, but not 100% sure...


Thanks as always for the help!

cheers,
erich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard
Neither downing or restarting the OSD cleared the bogus blocked_by. I 
guess it makes no sense to look further at blocked_by as the cause when 
the data can't be trusted and there is no obvious smoking gun like a few 
OSDs blocking everything.


My tally came to 412 out of 539 OSDs showing up in a blocked_by list and 
that is about every OSD with data prior to adding ~100 empty OSDs. How 
400 read targets and 100 write targets can only equal ~60 backfills with 
osd_max_backill set at 3 just makes no sense to me but alas.


It seems I can just increase osd_max_backfill even further to get the 
numbers I want so that will do. Thank you all for taking the time to 
look at this.


Mvh.

Torkil

On 25-03-2024 20:44, Anthony D'Atri wrote:

First try "ceph osd down 89"


On Mar 25, 2024, at 15:37, Alexander E. Patrakov  wrote:

On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:




On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?


I'll take a look at that tomorrow, perhaps we can script something
meaningful.


Hi Alexander

While working on a script querying all PGs and making a list of all OSDs
found in a blocked_by list, and how many times for each, I discovered
something odd about pool 38:

"
[root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
OSDs blocking other OSDs:




All PGs in the pool are active+clean so why are there any blocked_by at
all? One example attached.


I don't know. In any case, it doesn't match the "one OSD blocks them
all" scenario that I was looking for. I think this is something bogus
that can probably be cleared in your example by restarting osd.89
(i.e, the one being blocked).



Mvh.

Torkil


I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt


Output attached.

Thanks again.

Mvh.

Torkil


On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
wrote:


Hi Alex

New query output attached after restarting both OSDs. OSD 237 is no
longer mentioned but it unfortunately made no difference for the number
of backfills which went 59->62->62.

Mvh.

Torkil

On 23-03-2024 22:26, Alexander E. Patrakov wrote:

Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
wrote:




On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.


Thank you for taking the time =)


What happens if you increase the osd_max_backfills setting
temporarily?


We already had the mclock override option in place and I re-enabled
our
babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
on how full they are. Active backfills went from 16 to 53 which is
probably because default osd_max_backfills for mclock is 1.

I think 53 is still a low number of active backfills given the large
percentage misplaced.


It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.


A few samples attached.


Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs
from
two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you 

[ceph-users] mark direct Zabbix support deprecated? Re: Ceph versus Zabbix: failure: no data sent

2024-03-25 Thread John Jasen
Well, at least on my RHEL Ceph cluster, turns out zabbix-sender,
zabbix-agent, etc aren't in the container image. Doesn't explain why it
didn't work with the Debian/proxmox version, but *shrug*.

It appears there is no interest in adding them back in, per:
https://github.com/ceph/ceph-container/issues/1651

As such, may I recommend marking the Ceph documentation to this effect?
Possibly referring to Zabbix instructions with Agent 2?




On Fri, Mar 22, 2024 at 7:04 PM John Jasen  wrote:

> If the documentation is to be believed, it's just install the zabbix
> sender, then;
>
> ceph mgr module enable zabbix
>
> ceph zabbix config-set zabbix_host my-zabbix-server
>
> (Optional) Set the identifier to the fsid.
>
> And poof. I should now have a discovered entity on my zabbix server to add
> templates to.
>
> However, this has not worked yet on either of my ceph clusters (one RHEL,
> one proxmox).
>
> Reference: https://docs.ceph.com/en/latest/mgr/zabbix/
>
> On Reddit advice, I installed the Ceph templates for Zabbix.
> https://raw.githubusercontent.com/ceph/ceph/master/src/pybind/mgr/zabbix/zabbix_template.xml
>
> Still no dice.  No traffic at all seems to be generated, that I've seen
> from packet traces,
>
> ... OK.
>
> I su'ed to the ceph user on both clusters, and ran zabbix_send:
>
> zabbix_sender -v -z 10.0.0.1 -s "$my_fsid" -k ceph.osd_avg_pgs -o 1
>
> Response from "10.0.0.1:10051": "processed: 1; failed: 0; total: 1;
> seconds spent: 0.42"
>
> sent: 1; skipped: 0; total: 1
>
> As the ceph user, ceph zabbix send/discovery still fail.
>
> I am officially stumped.
>
> Any ideas as to which tree I should be barking up?
>
> Thanks in advance!
>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Anthony D'Atri
First try "ceph osd down 89"

> On Mar 25, 2024, at 15:37, Alexander E. Patrakov  wrote:
> 
> On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:
>> 
>> 
>> 
>> On 24/03/2024 01:14, Torkil Svensgaard wrote:
>>> On 24-03-2024 00:31, Alexander E. Patrakov wrote:
 Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
 Thanks for the update. Even though the improvement is small, it is
 still an improvement, consistent with the osd_max_backfills value, and
 it proves that there are still unsolved peering issues.
 
 I have looked at both the old and the new state of the PG, but could
 not find anything else interesting.
 
 I also looked again at the state of PG 37.1. It is known what blocks
 the backfill of this PG; please search for "blocked_by." However, this
 is just one data point, which is insufficient for any conclusions. Try
 looking at other PGs. Is there anything too common in the non-empty
 "blocked_by" blocks?
>>> 
>>> I'll take a look at that tomorrow, perhaps we can script something
>>> meaningful.
>> 
>> Hi Alexander
>> 
>> While working on a script querying all PGs and making a list of all OSDs
>> found in a blocked_by list, and how many times for each, I discovered
>> something odd about pool 38:
>> 
>> "
>> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
>> OSDs blocking other OSDs:
> 
> 
>> All PGs in the pool are active+clean so why are there any blocked_by at
>> all? One example attached.
> 
> I don't know. In any case, it doesn't match the "one OSD blocks them
> all" scenario that I was looking for. I think this is something bogus
> that can probably be cleared in your example by restarting osd.89
> (i.e, the one being blocked).
> 
>> 
>> Mvh.
>> 
>> Torkil
>> 
 I think we have to look for patterns in other ways, too. One tool that
 produces good visualizations is TheJJ balancer. Although it is called
 a "balancer," it can also visualize the ongoing backfills.
 
 The tool is available at
 https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
 
 Run it as follows:
 
 ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
>>> 
>>> Output attached.
>>> 
>>> Thanks again.
>>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
 On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
 wrote:
> 
> Hi Alex
> 
> New query output attached after restarting both OSDs. OSD 237 is no
> longer mentioned but it unfortunately made no difference for the number
> of backfills which went 59->62->62.
> 
> Mvh.
> 
> Torkil
> 
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>> Hi Torkil,
>> 
>> I have looked at the files that you attached. They were helpful: pool
>> 11 is problematic, it complains about degraded objects for no obvious
>> reason. I think that is the blocker.
>> 
>> I also noted that you mentioned peering problems, and I suspect that
>> they are not completely resolved. As a somewhat-irrational move, to
>> confirm this theory, you can restart osd.237 (it is mentioned at the
>> end of query.11.fff.txt, although I don't understand why it is there)
>> and then osd.298 (it is the primary for that pg) and see if any
>> additional backfills are unblocked after that. Also, please re-query
>> that PG again after the OSD restart.
>> 
>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
>> wrote:
>>> 
>>> 
>>> 
>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
 Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
 I have looked at the CRUSH rules, and the equivalent rules work on my
 test cluster. So this cannot be the cause of the blockage.
>>> 
>>> Thank you for taking the time =)
>>> 
 What happens if you increase the osd_max_backfills setting
 temporarily?
>>> 
>>> We already had the mclock override option in place and I re-enabled
>>> our
>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
>>> on how full they are. Active backfills went from 16 to 53 which is
>>> probably because default osd_max_backfills for mclock is 1.
>>> 
>>> I think 53 is still a low number of active backfills given the large
>>> percentage misplaced.
>>> 
 It may be a good idea to investigate a few of the stalled PGs. Please
 run commands similar to this one:
 
 ceph pg 37.0 query > query.37.0.txt
 ceph pg 37.1 query > query.37.1.txt
 ...
 and the same for the other affected pools.
>>> 
>>> A few samples attached.
>>> 
 Still, I must say that some of your rules are actually unsafe.
 
 The 4+2 rule as used by rbd_ec_data will not survive a
 datacenter-offline incident. Namely, for each PG, it chooses OSDs
 from
 two hosts in each datacenter, so 6 OSDs total. When 

[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread John Mulligan
On Monday, March 25, 2024 3:22:26 PM EDT Alexander E. Patrakov wrote:
> On Mon, Mar 25, 2024 at 11:01 PM John Mulligan
> 
>  wrote:
> > On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote:
> > > Hi John,
> > > 
> > > > A few major features we have planned include:
> > > > * Standalone servers (internally defined users/groups)
> > > 
> > > No concerns here
> > > 
> > > > * Active Directory Domain Member Servers
> > > 
> > > In the second case, what is the plan regarding UID mapping? Is NFS
> > > coexistence planned, or a concurrent mount of the same directory using
> > > CephFS directly?
> > 
> > In the immediate future the plan is to have a very simple, fairly
> > "opinionated" idmapping scheme based on the autorid backend.
> 
> OK, the docs for clustered SAMBA do mention the autorid backend in
> examples. It's a shame that the manual page does not explicitly list
> it as compatible with clustered setups.
> 
> However, please consider that the majority of Linux distributions
> (tested: CentOS, Fedora, Alt Linux, Ubuntu, OpenSUSE) use "realmd" to
> join AD domains by default (where "default" means a pointy-clicky way
> in a workstation setup), which uses SSSD, and therefore, by this
> opinionated choice of the autorid backend, you create mappings that
> disagree with the supposed majority and the default. This will create
> problems in the future when you do consider NFS coexistence.
> 

Thanks, I'll keep that in mind.

> Well, it's a different topic that most organizations that I have seen
> seem to ignore this default. Maybe those that don't have any problems
> don't have any reason to talk to me? I think that more research is
> needed here on whether RedHat's and GNOME's push of SSSD is something
> not-ready or indeed the de-facto standard setup.
> 

I think it's a bit of a mix, but am not sure either. 


> Even if you don't want to use SSSD, providing an option to provision a
> few domains with idmap rid backend with statically configured ranges
> (as an override to autorid) would be a good step forward, as this can
> be made compatible with the default RedHat setup.

That's reasonable. Thanks for the suggestion.


> 
> > Sharing the same directories over both NFS and SMB at the same time, also
> > known as "multi-protocol", is not planned for now, however we're all aware
> > that there's often a demand for this feature and we're aware of the
> > complexity it brings. I expect we'll work on that at some point but not
> > initially. Similarly, sharing the same directories over a SMB share and
> > directly on a cephfs mount won't be blocked but we won't recommend it.
> 
> OK. Feature request: in the case if there are several CephFS
> filesystems, support configuration of which one to serve.
> 

Putting it on the list.

> > > In fact, I am quite skeptical, because, at least in my experience,
> > > every customer's SAMBA configuration as a domain member is a unique
> > > snowflake, and cephadm would need an ability to specify arbitrary UID
> > > mapping configuration to match what the customer uses elsewhere - and
> > > the match must be precise.
> > 
> > I agree - our initial use case is something along the lines:
> > Users of a Ceph Cluster that have Windows systems, Mac systems, or
> > appliances that are joined to an existing AD
> > but are not currently interoperating with the Ceph cluster.
> > 
> > I expect to add some idpapping configuration and agility down the line,
> > especially supporting some form of rfc2307 idmapping (where unix IDs are
> > stored in AD).
> 
> Yes, for whatever reason, people do this, even though it is cumbersome
> to manage.
> 
> > But those who already have idmapping schemes and samba accessing ceph will
> > probably need to just continue using the existing setups as we don't have
> > an immediate plan for migrating those users.
> > 
> > > Here is what I have seen or was told about:
> > > 
> > > 1. We don't care about interoperability with NFS or CephFS, so we just
> > > let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
> > > idmap backend. It's completely OK that workstations get different UIDs
> > > and GIDs, as only SIDs traverse the wire.
> > 
> > This is pretty close to our initial plan but I'm not clear why you'd think
> > that "workstations get different UIDs and GIDs". For all systems acessing
> > the (same) ceph cluster the id mapping should be consistent.
> > You did make me consider multi-cluster use cases with something like
> > cephfs
> > volume mirroring - that's something that I hadn't thought of before *but*
> > using an algorithmic mapping backend like autorid (and testing) I think
> > we're mostly OK there.
> 
> The tdb2 backend (used in my example) is not algorithmic, it is
> allocating. That is, it sequentially allocates IDs on the
> first-seen-first-allocated basis. Yet this is what this customer uses,
> presumably because it is the only backend that explicitly specifies
> clustering operation in its manual page.
> 
> 

[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Alexander E. Patrakov
On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:
>
>
>
> On 24/03/2024 01:14, Torkil Svensgaard wrote:
> > On 24-03-2024 00:31, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> Thanks for the update. Even though the improvement is small, it is
> >> still an improvement, consistent with the osd_max_backfills value, and
> >> it proves that there are still unsolved peering issues.
> >>
> >> I have looked at both the old and the new state of the PG, but could
> >> not find anything else interesting.
> >>
> >> I also looked again at the state of PG 37.1. It is known what blocks
> >> the backfill of this PG; please search for "blocked_by." However, this
> >> is just one data point, which is insufficient for any conclusions. Try
> >> looking at other PGs. Is there anything too common in the non-empty
> >> "blocked_by" blocks?
> >
> > I'll take a look at that tomorrow, perhaps we can script something
> > meaningful.
>
> Hi Alexander
>
> While working on a script querying all PGs and making a list of all OSDs
> found in a blocked_by list, and how many times for each, I discovered
> something odd about pool 38:
>
> "
> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
> OSDs blocking other OSDs:


> All PGs in the pool are active+clean so why are there any blocked_by at
> all? One example attached.

I don't know. In any case, it doesn't match the "one OSD blocks them
all" scenario that I was looking for. I think this is something bogus
that can probably be cleared in your example by restarting osd.89
(i.e, the one being blocked).

>
> Mvh.
>
> Torkil
>
> >> I think we have to look for patterns in other ways, too. One tool that
> >> produces good visualizations is TheJJ balancer. Although it is called
> >> a "balancer," it can also visualize the ongoing backfills.
> >>
> >> The tool is available at
> >> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> >>
> >> Run it as follows:
> >>
> >> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
> >
> > Output attached.
> >
> > Thanks again.
> >
> > Mvh.
> >
> > Torkil
> >
> >> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
> >> wrote:
> >>>
> >>> Hi Alex
> >>>
> >>> New query output attached after restarting both OSDs. OSD 237 is no
> >>> longer mentioned but it unfortunately made no difference for the number
> >>> of backfills which went 59->62->62.
> >>>
> >>> Mvh.
> >>>
> >>> Torkil
> >>>
> >>> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>  Hi Torkil,
> 
>  I have looked at the files that you attached. They were helpful: pool
>  11 is problematic, it complains about degraded objects for no obvious
>  reason. I think that is the blocker.
> 
>  I also noted that you mentioned peering problems, and I suspect that
>  they are not completely resolved. As a somewhat-irrational move, to
>  confirm this theory, you can restart osd.237 (it is mentioned at the
>  end of query.11.fff.txt, although I don't understand why it is there)
>  and then osd.298 (it is the primary for that pg) and see if any
>  additional backfills are unblocked after that. Also, please re-query
>  that PG again after the OSD restart.
> 
>  On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
>  wrote:
> >
> >
> >
> > On 23-03-2024 21:19, Alexander E. Patrakov wrote:
> >> Hi Torkil,
> >
> > Hi Alexander
> >
> >> I have looked at the CRUSH rules, and the equivalent rules work on my
> >> test cluster. So this cannot be the cause of the blockage.
> >
> > Thank you for taking the time =)
> >
> >> What happens if you increase the osd_max_backfills setting
> >> temporarily?
> >
> > We already had the mclock override option in place and I re-enabled
> > our
> > babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
> > on how full they are. Active backfills went from 16 to 53 which is
> > probably because default osd_max_backfills for mclock is 1.
> >
> > I think 53 is still a low number of active backfills given the large
> > percentage misplaced.
> >
> >> It may be a good idea to investigate a few of the stalled PGs. Please
> >> run commands similar to this one:
> >>
> >> ceph pg 37.0 query > query.37.0.txt
> >> ceph pg 37.1 query > query.37.1.txt
> >> ...
> >> and the same for the other affected pools.
> >
> > A few samples attached.
> >
> >> Still, I must say that some of your rules are actually unsafe.
> >>
> >> The 4+2 rule as used by rbd_ec_data will not survive a
> >> datacenter-offline incident. Namely, for each PG, it chooses OSDs
> >> from
> >> two hosts in each datacenter, so 6 OSDs total. When a datacenter is
> >> offline, you will, therefore, have only 4 OSDs up, which is exactly
> >> the number of data chunks. However, the pool requires min_size 5, so
> 

[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread Alexander E. Patrakov
On Mon, Mar 25, 2024 at 11:01 PM John Mulligan
 wrote:
>
> On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote:
> > Hi John,
> >
> > > A few major features we have planned include:
> > > * Standalone servers (internally defined users/groups)
> >
> > No concerns here
> >
> > > * Active Directory Domain Member Servers
> >
> > In the second case, what is the plan regarding UID mapping? Is NFS
> > coexistence planned, or a concurrent mount of the same directory using
> > CephFS directly?
>
> In the immediate future the plan is to have a very simple, fairly
> "opinionated" idmapping scheme based on the autorid backend.

OK, the docs for clustered SAMBA do mention the autorid backend in
examples. It's a shame that the manual page does not explicitly list
it as compatible with clustered setups.

However, please consider that the majority of Linux distributions
(tested: CentOS, Fedora, Alt Linux, Ubuntu, OpenSUSE) use "realmd" to
join AD domains by default (where "default" means a pointy-clicky way
in a workstation setup), which uses SSSD, and therefore, by this
opinionated choice of the autorid backend, you create mappings that
disagree with the supposed majority and the default. This will create
problems in the future when you do consider NFS coexistence.

Well, it's a different topic that most organizations that I have seen
seem to ignore this default. Maybe those that don't have any problems
don't have any reason to talk to me? I think that more research is
needed here on whether RedHat's and GNOME's push of SSSD is something
not-ready or indeed the de-facto standard setup.

Even if you don't want to use SSSD, providing an option to provision a
few domains with idmap rid backend with statically configured ranges
(as an override to autorid) would be a good step forward, as this can
be made compatible with the default RedHat setup.

> Sharing the same directories over both NFS and SMB at the same time, also
> known as "multi-protocol", is not planned for now, however we're all aware
> that there's often a demand for this feature and we're aware of the complexity
> it brings. I expect we'll work on that at some point but not initially.
> Similarly, sharing the same directories over a SMB share and directly on a
> cephfs mount won't be blocked but we won't recommend it.

OK. Feature request: in the case if there are several CephFS
filesystems, support configuration of which one to serve.

>
> >
> > In fact, I am quite skeptical, because, at least in my experience,
> > every customer's SAMBA configuration as a domain member is a unique
> > snowflake, and cephadm would need an ability to specify arbitrary UID
> > mapping configuration to match what the customer uses elsewhere - and
> > the match must be precise.
> >
>
> I agree - our initial use case is something along the lines:
> Users of a Ceph Cluster that have Windows systems, Mac systems, or appliances
> that are joined to an existing AD
> but are not currently interoperating with the Ceph cluster.
>
> I expect to add some idpapping configuration and agility down the line,
> especially supporting some form of rfc2307 idmapping (where unix IDs are
> stored in AD).

Yes, for whatever reason, people do this, even though it is cumbersome
to manage.

>
> But those who already have idmapping schemes and samba accessing ceph will
> probably need to just continue using the existing setups as we don't have an
> immediate plan for migrating those users.
>
> > Here is what I have seen or was told about:
> >
> > 1. We don't care about interoperability with NFS or CephFS, so we just
> > let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
> > idmap backend. It's completely OK that workstations get different UIDs
> > and GIDs, as only SIDs traverse the wire.
>
> This is pretty close to our initial plan but I'm not clear why you'd think
> that "workstations get different UIDs and GIDs". For all systems acessing the
> (same) ceph cluster the id mapping should be consistent.
> You did make me consider multi-cluster use cases with something like cephfs
> volume mirroring - that's something that I hadn't thought of before *but*
> using an algorithmic mapping backend like autorid (and testing) I think we're
> mostly OK there.

The tdb2 backend (used in my example) is not algorithmic, it is
allocating. That is, it sequentially allocates IDs on the
first-seen-first-allocated basis. Yet this is what this customer uses,
presumably because it is the only backend that explicitly specifies
clustering operation in its manual page.

And the "autorid" backend is also not fully algorithmic, it allocates
ranges to domains on the same sequential basis (see
https://github.com/samba-team/samba/blob/6fb98f70c6274e172787c8d5f73aa93920171e7c/source3/winbindd/idmap_autorid_tdb.c#L82),
and therefore can create mismatching mappings if two workstations or
servers have seen the users DOMA\usera and DOMB\userb in a different
order. It is even mentioned in the manual page. 

[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread John Mulligan
On Monday, March 25, 2024 1:46:26 PM EDT Ralph Boehme wrote:
> Hi John,
> 
> On 3/21/24 20:12, John Mulligan wrote:
> 
> > I'd like to formally let the wider community know of some work I've been
> > involved with for a while now: adding Managed SMB Protocol Support to
> > Ceph.
 SMB being the well known network file protocol native to Windows
> > systems and supported by MacOS (and Linux). The other key word "managed"
> > meaning integrating with Ceph management tooling - in this particular
> > case cephadm for orchestration and eventually a new MGR module for
> > managing SMB shares. 
> > The effort is still in it's very early stages. We have a PR adding
> > initial
> > support for Samba Containers to cephadm [1] and a prototype for an smb
> > MGR
> > module [2]. We plan on using container images based on the
> > samba-container
> > project [3] - a team I am already part of. What we're aiming for is a
> > feature
 set similar to the current NFS integration in Ceph, but with a
> > focus on bridging non-Linux/Unix clients to CephFS using a protocol built
> > into those systems.
> > 
> > A few major features we have planned include:
> > * Standalone servers (internally defined users/groups)
> > * Active Directory Domain Member Servers
> > * Clustered Samba support
> > * Exporting Samba stats via Prometheus metrics
> > * A `ceph` cli workflow loosely based on the nfs mgr module
> > 
> > I wanted to share this information in case there's wider community
> > interest in
 this effort.
> 
> 
> certainly! :)
> 
> If it makes sense, you may want to pull in samba-technical where it 
> makes sense.

Absolutely.  I'm currently focusing on the basics and those are mostly good-
to-go for our needs in current samba releases.  In the future, I'm sure we'll 
run into times where technical help or changes will be needed.

> If there's a need, you can also pull me in directly  into 
> meetings or other channels to discuss things.
> 

Thanks! I appreciate it!


> Looking forward to seeing you at SambaXP, at least virtually.

You too. :-)

> Any plans  to attend SDC from you or others from your team?
> 

I'm unsure. I'll ask around.




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread Ralph Boehme

Hi John,

On 3/21/24 20:12, John Mulligan wrote:

I'd like to formally let the wider community know of some work I've been
involved with for a while now: adding Managed SMB Protocol Support to Ceph.
SMB being the well known network file protocol native to Windows systems and
supported by MacOS (and Linux). The other key word "managed" meaning
integrating with Ceph management tooling - in this particular case cephadm for
orchestration and eventually a new MGR module for managing SMB shares.

The effort is still in it's very early stages. We have a PR adding initial
support for Samba Containers to cephadm [1] and a prototype for an smb MGR
module [2]. We plan on using container images based on the samba-container
project [3] - a team I am already part of. What we're aiming for is a feature
set similar to the current NFS integration in Ceph, but with a focus on
bridging non-Linux/Unix clients to CephFS using a protocol built into those
systems.

A few major features we have planned include:
* Standalone servers (internally defined users/groups)
* Active Directory Domain Member Servers
* Clustered Samba support
* Exporting Samba stats via Prometheus metrics
* A `ceph` cli workflow loosely based on the nfs mgr module

I wanted to share this information in case there's wider community interest in
this effort.


certainly! :)

If it makes sense, you may want to pull in samba-technical where it 
makes sense. If there's a need, you can also pull me in directly  into 
meetings or other channels to discuss things.


Looking forward to seeing you at SambaXP, at least virtually. Any plans 
to attend SDC from you or others from your team?


Cheers!
-slow



OpenPGP_signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Spam in log file

2024-03-25 Thread Patrick Donnelly
Nope.

On Mon, Mar 25, 2024 at 8:33 AM Albert Shih  wrote:
>
> Le 25/03/2024 à 08:28:54-0400, Patrick Donnelly a écrit
> Hi,
>
> >
> > The fix is in one of the next releases. Check the tracker ticket:
> > https://tracker.ceph.com/issues/63166
>
> Oh thanks. Didn't find it with google.
>
> Is they are any risk/impact for the cluster ?
>
> Regards.
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> lun. 25 mars 2024 13:31:27 CET
>


-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread John Mulligan
On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote:
> Hi John,
> 
> > A few major features we have planned include:
> > * Standalone servers (internally defined users/groups)
> 
> No concerns here
> 
> > * Active Directory Domain Member Servers
> 
> In the second case, what is the plan regarding UID mapping? Is NFS
> coexistence planned, or a concurrent mount of the same directory using
> CephFS directly?

In the immediate future the plan is to have a very simple, fairly 
"opinionated" idmapping scheme based on the autorid backend.
Sharing the same directories over both NFS and SMB at the same time, also 
known as "multi-protocol", is not planned for now, however we're all aware 
that there's often a demand for this feature and we're aware of the complexity 
it brings. I expect we'll work on that at some point but not initially. 
Similarly, sharing the same directories over a SMB share and directly on a 
cephfs mount won't be blocked but we won't recommend it.

> 
> In fact, I am quite skeptical, because, at least in my experience,
> every customer's SAMBA configuration as a domain member is a unique
> snowflake, and cephadm would need an ability to specify arbitrary UID
> mapping configuration to match what the customer uses elsewhere - and
> the match must be precise.
> 

I agree - our initial use case is something along the lines:
Users of a Ceph Cluster that have Windows systems, Mac systems, or appliances 
that are joined to an existing AD
but are not currently interoperating with the Ceph cluster.

I expect to add some idpapping configuration and agility down the line, 
especially supporting some form of rfc2307 idmapping (where unix IDs are 
stored in AD).

But those who already have idmapping schemes and samba accessing ceph will 
probably need to just continue using the existing setups as we don't have an 
immediate plan for migrating those users.

> Here is what I have seen or was told about:
> 
> 1. We don't care about interoperability with NFS or CephFS, so we just
> let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
> idmap backend. It's completely OK that workstations get different UIDs
> and GIDs, as only SIDs traverse the wire.

This is pretty close to our initial plan but I'm not clear why you'd think 
that "workstations get different UIDs and GIDs". For all systems acessing the 
(same) ceph cluster the id mapping should be consistent.
You did make me consider multi-cluster use cases with something like cephfs 
volume mirroring - that's something that I hadn't thought of before *but* 
using an algorithmic mapping backend like autorid (and testing) I think we're 
mostly OK there.

> 2. [not seen in the wild, the customer did not actually implement it,
> it's a product of internal miscommunication, and I am not sure if it
> is valid at all] We don't care about interoperability with CephFS,
> and, while we have NFS, security guys would not allow running NFS
> non-kerberized. Therefore, no UIDs or GIDs traverse the wire, only
> SIDs and names. Therefore, all we need is to allow both SAMBA and NFS
> to use shared UID mapping allocated on as-needed basis using the
> "tdb2" idmap module, and it doesn't matter that these UIDs and GIDs
> are inconsistent with what clients choose.

Unfortunately, I don't really understand this item. Fortunately, you say it 
was only considered not implemented. :-)

> 3. We don't care about ACLs at all, and don't care about CephFS
> interoperability. We set ownership of all new files to root:root 0666
> using whatever options are available [well, I would rather use a
> dedicated nobody-style uid/gid here]. All we care about is that only
> authorized workstations or authorized users can connect to each NFS or
> SMB share, and we absolutely don't want them to be able to set custom
> ownership or ACLs.

Some times known as the "drop-box" use case I think (not to be confused with 
the cloud app of a similar name).
We could probably implement something like that as an option but I had not 
considered it before.

> 4. We care about NFS and CephFS file ownership being consistent with
> what Windows clients see. We store all UIDs and GIDs in Active
> Directory using the rfc2307 schema, and it's mandatory that all
> servers (especially SAMBA - thanks to the "ad" idmap backend) respect
> that and don't try to invent anything [well, they do - BUILTIN/Users
> gets its GID through tdb2]. Oh, and by the way, we have this strangely
> low-numbered group that everybody gets wrong unless they set "idmap
> config CORP : range = 500-99".

This is oh so similar to a project I worked on prior to working with Ceph.
I think we'll need to do this one eventually but maybe not this year.
One nice side-effect of running in containers is that the low-id number is less 
of an issue because the ids only matter within the container context (and only 
then if using the kernel file system access methods). We have much more 
flexibility with IDs in a container.

> 

[ceph-users] March Ceph Science Virtual User Group

2024-03-25 Thread Kevin Hrpcek

Hey All,

We will be having a Ceph science/research/big cluster call on Wednesday 
March 27th. If anyone wants to discuss something specific they can add 
it to the pad linked below. If you have questions or comments you can 
contact me.


This is an informal open call of community members mostly from 
hpc/htc/research/big cluster environments (though anyone is welcome) 
where we discuss whatever is on our minds regarding ceph. Updates, 
outages, features, maintenance, etc...there is no set presenter but I do 
attempt to keep the conversation lively.


Pad URL:
https://pad.ceph.com/p/Ceph_Science_User_Group_20240327

Virtual event details:
March 27, 2024
14:00 UTC
3pm Central European
9am Central US

Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph Youtube channel.

To join the meeting on a computer or mobile phone: 
https://meet.jit.si/ceph-science-wg


Kevin

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS/TROPICS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Spam in log file

2024-03-25 Thread Albert Shih
Le 25/03/2024 à 08:28:54-0400, Patrick Donnelly a écrit
Hi, 

> 
> The fix is in one of the next releases. Check the tracker ticket:
> https://tracker.ceph.com/issues/63166

Oh thanks. Didn't find it with google. 

Is they are any risk/impact for the cluster ? 

Regards. 
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
lun. 25 mars 2024 13:31:27 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Spam in log file

2024-03-25 Thread Patrick Donnelly
Hi Albert,

The fix is in one of the next releases. Check the tracker ticket:
https://tracker.ceph.com/issues/63166

On Mon, Mar 25, 2024 at 8:23 AM Albert Shih  wrote:
>
> Hi everyone.
>
> On my cluster I got spam by my cluster with message like
>
> Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return 
> metadata for mds.cephfs.cthulhu2.dqahyt: (2) No such file or directory
> Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return 
> metadata for mds.cephfs.cthulhu3.xvboir: (2) No such file or directory
> Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return 
> metadata for mds.cephfs.cthulhu5.kwmyyg: (2) No such file or directory
>
> I got 5 server for the service (cthulhu 1->5) and indeed when from cthulhu1 
> (or 2) I try :
>
> something:
>
> root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu2.dqahyt
> {}
> Error ENOENT:
> root@cthulhu2:
>
> but that works on 1 or 4
>
> root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu1.sikvjf
> {
> "addr": 
> "[v2:145.238.187.184:6800/1315478297,v1:145.238.187.184:6801/1315478297]",
> "arch": "x86_64",
> "ceph_release": "quincy",
> "ceph_version": "ceph version 17.2.7 
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
> "ceph_version_short": "17.2.7",
> "container_hostname": "cthulhu1",
> "container_image": 
> "quay.io/ceph/ceph@sha256:62465e744a80832bde6a57120d3ba076613e8a19884b274f9cc82580e249f6e1",
> "cpu": "Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz",
> "distro": "centos",
> "distro_description": "CentOS Stream 8",
> "distro_version": "8",
> "hostname": "cthulhu1",
> "kernel_description": "#1 SMP Debian 5.10.209-2 (2024-01-31)",
> "kernel_version": "5.10.0-28-amd64",
> "mem_swap_kb": "16777212",
> "mem_total_kb": "263803496",
> "os": "Linux"
> }
> root@cthulhu2:/etc/ceph#
>
> I check the caps and don't see anything special.
>
> I got also (I don't know if it's related) those message :
>
> Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
> from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready 
> for session (expect reconnect)
> Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
> from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready 
> for session (expect reconnect)
> Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
> from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready 
> for session (expect reconnect)
> Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
> from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready 
> for session (expect reconnect)
> Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
> from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready 
> for session (expect reconnect)
> Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
> from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready 
> for session (expect reconnect)
>
> Regards.
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> lun. 25 mars 2024 13:08:33 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Spam in log file

2024-03-25 Thread Albert Shih
Hi everyone.

On my cluster I got spam by my cluster with message like 

Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return 
metadata for mds.cephfs.cthulhu2.dqahyt: (2) No such file or directory
Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return 
metadata for mds.cephfs.cthulhu3.xvboir: (2) No such file or directory
Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return 
metadata for mds.cephfs.cthulhu5.kwmyyg: (2) No such file or directory

I got 5 server for the service (cthulhu 1->5) and indeed when from cthulhu1 (or 
2) I try :

something:

root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu2.dqahyt
{}
Error ENOENT:
root@cthulhu2:

but that works on 1 or 4

root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu1.sikvjf
{
"addr": 
"[v2:145.238.187.184:6800/1315478297,v1:145.238.187.184:6801/1315478297]",
"arch": "x86_64",
"ceph_release": "quincy",
"ceph_version": "ceph version 17.2.7 
(b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
"ceph_version_short": "17.2.7",
"container_hostname": "cthulhu1",
"container_image": 
"quay.io/ceph/ceph@sha256:62465e744a80832bde6a57120d3ba076613e8a19884b274f9cc82580e249f6e1",
"cpu": "Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz",
"distro": "centos",
"distro_description": "CentOS Stream 8",
"distro_version": "8",
"hostname": "cthulhu1",
"kernel_description": "#1 SMP Debian 5.10.209-2 (2024-01-31)",
"kernel_version": "5.10.0-28-amd64",
"mem_swap_kb": "16777212",
"mem_total_kb": "263803496",
"os": "Linux"
}
root@cthulhu2:/etc/ceph#

I check the caps and don't see anything special.

I got also (I don't know if it's related) those message :

Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready 
for session (expect reconnect)
Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready 
for session (expect reconnect)
Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready 
for session (expect reconnect)
Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready 
for session (expect reconnect)
Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready 
for session (expect reconnect)
Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open 
from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready 
for session (expect reconnect)

Regards.
-- 
Albert SHIH 嶺 
France
Heure locale/Local time:
lun. 25 mars 2024 13:08:33 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Torkil Svensgaard



On 24/03/2024 01:14, Torkil Svensgaard wrote:

On 24-03-2024 00:31, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


Thanks for the update. Even though the improvement is small, it is
still an improvement, consistent with the osd_max_backfills value, and
it proves that there are still unsolved peering issues.

I have looked at both the old and the new state of the PG, but could
not find anything else interesting.

I also looked again at the state of PG 37.1. It is known what blocks
the backfill of this PG; please search for "blocked_by." However, this
is just one data point, which is insufficient for any conclusions. Try
looking at other PGs. Is there anything too common in the non-empty
"blocked_by" blocks?


I'll take a look at that tomorrow, perhaps we can script something 
meaningful.


Hi Alexander

While working on a script querying all PGs and making a list of all OSDs 
found in a blocked_by list, and how many times for each, I discovered 
something odd about pool 38:


"
[root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
OSDs blocking other OSDs:
OSD 425: 5 instance(s)
OSD 426: 6 instance(s)
OSD 34: 7 instance(s)
OSD 36: 5 instance(s)
OSD 146: 3 instance(s)
OSD 6: 2 instance(s)
OSD 5: 8 instance(s)
OSD 131: 7 instance(s)
OSD 4: 9 instance(s)
OSD 3: 5 instance(s)
OSD 2: 5 instance(s)
OSD 1: 2 instance(s)
OSD 0: 4 instance(s)
OSD 167: 1 instance(s)
OSD 168: 3 instance(s)
OSD 450: 2 instance(s)
OSD 46: 6 instance(s)
OSD 154: 3 instance(s)
OSD 156: 2 instance(s)
OSD 90: 2 instance(s)
OSD 227: 4 instance(s)
OSD 10: 4 instance(s)
OSD 15: 6 instance(s)
OSD 449: 4 instance(s)
OSD 192: 2 instance(s)
OSD 67: 3 instance(s)
"

All PGs in the pool are active+clean so why are there any blocked_by at 
all? One example attached.


Mvh.

Torkil


I think we have to look for patterns in other ways, too. One tool that
produces good visualizations is TheJJ balancer. Although it is called
a "balancer," it can also visualize the ongoing backfills.

The tool is available at
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py

Run it as follows:

./placementoptimizer.py showremapped --by-osd | tee remapped.txt


Output attached.

Thanks again.

Mvh.

Torkil

On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard  
wrote:


Hi Alex

New query output attached after restarting both OSDs. OSD 237 is no
longer mentioned but it unfortunately made no difference for the number
of backfills which went 59->62->62.

Mvh.

Torkil

On 23-03-2024 22:26, Alexander E. Patrakov wrote:

Hi Torkil,

I have looked at the files that you attached. They were helpful: pool
11 is problematic, it complains about degraded objects for no obvious
reason. I think that is the blocker.

I also noted that you mentioned peering problems, and I suspect that
they are not completely resolved. As a somewhat-irrational move, to
confirm this theory, you can restart osd.237 (it is mentioned at the
end of query.11.fff.txt, although I don't understand why it is there)
and then osd.298 (it is the primary for that pg) and see if any
additional backfills are unblocked after that. Also, please re-query
that PG again after the OSD restart.

On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard  
wrote:




On 23-03-2024 21:19, Alexander E. Patrakov wrote:

Hi Torkil,


Hi Alexander


I have looked at the CRUSH rules, and the equivalent rules work on my
test cluster. So this cannot be the cause of the blockage.


Thank you for taking the time =)

What happens if you increase the osd_max_backfills setting 
temporarily?


We already had the mclock override option in place and I re-enabled 
our

babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
on how full they are. Active backfills went from 16 to 53 which is
probably because default osd_max_backfills for mclock is 1.

I think 53 is still a low number of active backfills given the large
percentage misplaced.


It may be a good idea to investigate a few of the stalled PGs. Please
run commands similar to this one:

ceph pg 37.0 query > query.37.0.txt
ceph pg 37.1 query > query.37.1.txt
...
and the same for the other affected pools.


A few samples attached.


Still, I must say that some of your rules are actually unsafe.

The 4+2 rule as used by rbd_ec_data will not survive a
datacenter-offline incident. Namely, for each PG, it chooses OSDs 
from

two hosts in each datacenter, so 6 OSDs total. When a datacenter is
offline, you will, therefore, have only 4 OSDs up, which is exactly
the number of data chunks. However, the pool requires min_size 5, so
all PGs will be inactive (to prevent data corruption) and will stay
inactive until the datacenter comes up again. However, please don't
set min_size to 4 - then, any additional incident (like a defective
disk) will lead to data loss, and the shards in the datacenter which
went offline would be useless because they do not correspond to the
updated shards written by the clients.


Thanks for the explanation. This is an 

[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-25 Thread Alexander E. Patrakov
Hi Denis,

As the vast majority of OSDs have bluestore_min_alloc_size = 65536, I
think you can safely ignore https://tracker.ceph.com/issues/64715. The
only consequence will be that 58 OSDs will be less full than others.
In other words, please use either the hybrid approach or the built-in
balancer right away.

As for migrating to the modern defaults for bluestore_min_alloc_size,
yes, recreating OSDs host-by-host (once you have the cluster balanced)
is the only way. You can keep using the built-in balancer while doing
that.

On Mon, Mar 25, 2024 at 5:04 PM Denis Polom  wrote:
>
> Hi Alexander,
>
> that sounds pretty promising to me.
>
> I've checked bluestore_min_alloc_size and most 1370 OSDs have value 65536.
>
> You mentioned: "You will have to do that weekly until you redeploy all
> OSDs that were created with 64K bluestore_min_alloc_size"
>
> Is it the only way to approach this, that each OSD has to be recreated?
>
> Thank you for reply
>
> dp
>
> On 3/24/24 12:44 PM, Alexander E. Patrakov wrote:
> > Hi Denis,
> >
> > My approach would be:
> >
> > 1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K
> > bluestore_min_alloc_size. If so, you cannot really use the built-in
> > balancer, as it would result in a bimodal distribution instead of a
> > proper balance, see https://tracker.ceph.com/issues/64715, but let's
> > ignore this little issue if you have enough free space.
> > 2. Change the weights as appropriate. Make absolutely sure that there
> > are no reweights other than 1.0. Delete all dead or destroyed OSDs
> > from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
> > warnings that appear, they will be gone during the next step.
> > 3. Run this little script from Cern to stop the data movement that was
> > just initiated:
> > https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py,
> > pipe its output to bash. This should cancel most of the data movement,
> > but not all - the script cannot stop the situation when two OSDs want
> > to exchange their erasure-coded shards, like this: [1,2,3,4] ->
> > [1,3,2,4].
> > 4. Set the "target max misplaced ratio" option for MGR to what you
> > think is appropriate. The default is 0.05, and this means that the
> > balancer will enable at most 5% of the PGs to participate in the data
> > movement. I suggest starting with 0.01 and increasing if there is no
> > visible impact of the balancing on the client traffic.
> > 5. Enable the balancer.
> >
> > If you think that https://tracker.ceph.com/issues/64715 is a problem
> > that would prevent you from using the built-in balancer:
> >
> > 4. Download this script:
> > https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
> > 5. Run it as follows: ./placementoptimizer.py -v balance --osdsize
> > device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash
> >
> > This will move at most 500 PGs to better places, starting with the
> > fullest OSDs. All weights are ignored, and the switches take care of
> > the bluestore_min_alloc_size overhead mismatch. You will have to do
> > that weekly until you redeploy all OSDs that were created with 64K
> > bluestore_min_alloc_size.
> >
> > A hybrid approach (initial round of balancing with TheJJ, then switch
> > to the built-in balancer) may also be viable.
> >
> > On Sun, Mar 24, 2024 at 7:09 PM Denis Polom  wrote:
> >> Hi guys,
> >>
> >> recently I took over a care of Ceph cluster that is extremely
> >> unbalanced. Cluster is running on Quincy 17.2.7 (upgraded Nautilus ->
> >> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.
> >>
> >> Crush failure domain is datacenter (there are 3), data pool is EC 3+3.
> >>
> >> This cluster had and has balancer disabled for years. And was "balanced"
> >> manually by changing OSDs crush weights. So now it is complete mess and
> >> I would like to change it to have OSDs crush weight same (3.63898)  and
> >> to enable balancer with upmap.
> >>
> >>   From `ceph osd df ` sorted from the least used to most used OSDs:
> >>
> >> IDCLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
> >> AVAIL %USE   VAR   PGS  STATUS
> >> MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
> >>TOTAL  5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5
> >> TiB   1.5 PiB  71.50
> >>428hdd  3.63898   1.0  3.6 TiB  2.0 TiB  2.0 TiB1 KiB  5.6
> >> GiB   1.7 TiB  54.55  0.76   96  up
> >>223hdd  3.63898   1.0  3.6 TiB  2.0 TiB  2.0 TiB3 KiB  5.6
> >> GiB   1.7 TiB  54.58  0.76   95  up
> >> ...
> >>
> >> ...
> >>
> >> ...
> >>
> >>591hdd  3.53999   1.0  3.6 TiB  3.0 TiB  3.0 TiB1 KiB  7.0
> >> GiB   680 GiB  81.74  1.14  125  up
> >>832hdd  3.5   1.0  3.6 TiB  3.0 TiB  3.0 TiB4 KiB  6.9
> >> GiB   680 GiB  81.75  1.14  114  up
> >>248hdd  3.63898   1.0  3.6 TiB  3.0 TiB  3.0 TiB3 KiB  7.2
> >> GiB   646 GiB  82.67  1.16  121  

[ceph-users] Re: ceph cluster extremely unbalanced

2024-03-25 Thread Denis Polom

Hi Alexander,

that sounds pretty promising to me.

I've checked bluestore_min_alloc_size and most 1370 OSDs have value 65536.

You mentioned: "You will have to do that weekly until you redeploy all 
OSDs that were created with 64K bluestore_min_alloc_size"


Is it the only way to approach this, that each OSD has to be recreated?

Thank you for reply

dp

On 3/24/24 12:44 PM, Alexander E. Patrakov wrote:

Hi Denis,

My approach would be:

1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K
bluestore_min_alloc_size. If so, you cannot really use the built-in
balancer, as it would result in a bimodal distribution instead of a
proper balance, see https://tracker.ceph.com/issues/64715, but let's
ignore this little issue if you have enough free space.
2. Change the weights as appropriate. Make absolutely sure that there
are no reweights other than 1.0. Delete all dead or destroyed OSDs
from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL
warnings that appear, they will be gone during the next step.
3. Run this little script from Cern to stop the data movement that was
just initiated:
https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py,
pipe its output to bash. This should cancel most of the data movement,
but not all - the script cannot stop the situation when two OSDs want
to exchange their erasure-coded shards, like this: [1,2,3,4] ->
[1,3,2,4].
4. Set the "target max misplaced ratio" option for MGR to what you
think is appropriate. The default is 0.05, and this means that the
balancer will enable at most 5% of the PGs to participate in the data
movement. I suggest starting with 0.01 and increasing if there is no
visible impact of the balancing on the client traffic.
5. Enable the balancer.

If you think that https://tracker.ceph.com/issues/64715 is a problem
that would prevent you from using the built-in balancer:

4. Download this script:
https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
5. Run it as follows: ./placementoptimizer.py -v balance --osdsize
device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash

This will move at most 500 PGs to better places, starting with the
fullest OSDs. All weights are ignored, and the switches take care of
the bluestore_min_alloc_size overhead mismatch. You will have to do
that weekly until you redeploy all OSDs that were created with 64K
bluestore_min_alloc_size.

A hybrid approach (initial round of balancing with TheJJ, then switch
to the built-in balancer) may also be viable.

On Sun, Mar 24, 2024 at 7:09 PM Denis Polom  wrote:

Hi guys,

recently I took over a care of Ceph cluster that is extremely
unbalanced. Cluster is running on Quincy 17.2.7 (upgraded Nautilus ->
Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it.

Crush failure domain is datacenter (there are 3), data pool is EC 3+3.

This cluster had and has balancer disabled for years. And was "balanced"
manually by changing OSDs crush weights. So now it is complete mess and
I would like to change it to have OSDs crush weight same (3.63898)  and
to enable balancer with upmap.

  From `ceph osd df ` sorted from the least used to most used OSDs:

IDCLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
AVAIL %USE   VAR   PGS  STATUS
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97
   TOTAL  5.1 PiB  3.7 PiB  3.7 PiB  2.9 MiB  8.5
TiB   1.5 PiB  71.50
   428hdd  3.63898   1.0  3.6 TiB  2.0 TiB  2.0 TiB1 KiB  5.6
GiB   1.7 TiB  54.55  0.76   96  up
   223hdd  3.63898   1.0  3.6 TiB  2.0 TiB  2.0 TiB3 KiB  5.6
GiB   1.7 TiB  54.58  0.76   95  up
...

...

...

   591hdd  3.53999   1.0  3.6 TiB  3.0 TiB  3.0 TiB1 KiB  7.0
GiB   680 GiB  81.74  1.14  125  up
   832hdd  3.5   1.0  3.6 TiB  3.0 TiB  3.0 TiB4 KiB  6.9
GiB   680 GiB  81.75  1.14  114  up
   248hdd  3.63898   1.0  3.6 TiB  3.0 TiB  3.0 TiB3 KiB  7.2
GiB   646 GiB  82.67  1.16  121  up
   559hdd  3.63799   1.0  3.6 TiB  3.0 TiB  3.0 TiB  0 B  7.0
GiB   644 GiB  82.70  1.16  123  up
   TOTAL  5.1 PiB  3.7 PiB  3.6 PiB  2.9 MiB  8.5
TiB   1.5 PiB  71.50
MIN/MAX VAR: 0.76/1.16  STDDEV: 5.97


crush rule:

{
  "rule_id": 10,
  "rule_name": "ec33hdd_rule",
  "type": 3,
  "steps": [
  {
  "op": "set_chooseleaf_tries",
  "num": 5
  },
  {
  "op": "set_choose_tries",
  "num": 100
  },
  {
  "op": "take",
  "item": -2,
  "item_name": "default~hdd"
  },
  {
  "op": "choose_indep",
  "num": 3,
  "type": "datacenter"
  },
  {
  "op": "choose_indep",
  "num": 2,
  "type": "osd"
  },
  {
  "op": "emit"
  }
  ]

[ceph-users] Re: Call for Interest: Managed SMB Protocol Support

2024-03-25 Thread Robert Sander

Hi,

On 3/22/24 19:56, Alexander E. Patrakov wrote:


In fact, I am quite skeptical, because, at least in my experience,
every customer's SAMBA configuration as a domain member is a unique
snowflake, and cephadm would need an ability to specify arbitrary UID
mapping configuration to match what the customer uses elsewhere - and
the match must be precise.


Yes, there has to be a great flexibility possible in the configuration 
of the SMB service.


BTW: It would be great of the orchestrator could configure Ganesha to 
export NFs shares with Kerberos security, but this is off-topic in this 
thread.



Oh, and by the way, we have this strangely
low-numbered group that everybody gets wrong unless they set "idmap
config CORP : range = 500-99".


This is because Debian changed the standard minimum uid/gid somewhere in 
the 2000s. And if you have an "old" company running Debian since before 
then you have user IDs and group IDs in the range 500 - 1000.


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io