[ceph-users] Re: Encryption per user Howto

2023-06-06 Thread Stefan Kooman

On 6/6/23 15:33, Frank Schilder wrote:

Yes, would be interesting. I understood that it mainly helps with buffered 
writes, but ceph is using direct IO for writes and that's where bypassing the 
queues helps.


Yeah, that makes sense.



Are there detailed instructions somewhere on how to set up a host to disable the 
queues? I don't have time to figure this out myself. It should be detailed 
enough that I just need to edit some configs, reboot, et voilà.


Do you want instructions for a package-based Ceph install, or a 
container-based one? I tested it with both deployment types. Container-based 
(cephadm) is a little bit more involved, but certainly doable.




I have a number of new hosts to deploy and I could use one of these to run a 
test. They have a mix of NVMe, SSD and HDD and I can run fio benchmarks before 
deploying OSDs in the way you did.


That would be great.

Gr. Stefan

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about xattr and subvolumes

2023-06-06 Thread Kotresh Hiremath Ravishankar
On Tue, Jun 6, 2023 at 4:30 PM Dario Graña  wrote:

> Hi,
>
> I'm installing a new instance (my first) of Ceph. Our cluster runs
> AlmaLinux9 + Quincy. Now I'm dealing with CephFS and quotas. I read
> documentation about setting up quotas with virtual attributes (xattr) and
> creating volumes and subvolumes with a prefixed size. I cannot distinguish
> which is the best option for us.
>

Creating a volume creates a filesystem, and subvolumes are essentially
directories inside that filesystem which are managed through the
mgr subvolume APIs. Subvolumes were introduced for the OpenStack and
OpenShift use cases, which expect these subvolumes
to be managed programmatically via APIs.

To answer the quota question: in CephFS, a quota is set using the virtual
xattr. Creating a subvolume with a size essentially
uses the same virtual xattr interface to set the quota.
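
To illustrate, a minimal sketch of both ways of ending up with the same quota
(the volume, group, subvolume names and the mount path are placeholders, not
taken from your setup):

# Quota via the virtual xattr on a plain CephFS directory (10 GiB):
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/projects/myproject

# Roughly the same result via the mgr subvolume API:
ceph fs subvolumegroup create cephfs projects
ceph fs subvolume create cephfs myproject --group_name projects --size 10737418240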


> Currently we create a directory with a project name and some subdirectories
> inside.
>

You can explore the subvolumegroup and subvolume mgr APIs if they fit your use
case. Please note that they are mainly designed for
OpenStack/OpenShift-style use cases, where each subvolume corresponds to a PVC
and data separation is maintained, e.g. there are no
hard links created across the subvolumes.


> I would like to understand the difference between both options.
>
> Thanks in advance.
>
> --
> Dario Graña
> PIC (Port d'Informació Científica)
> Campus UAB, Edificio D
> E-08193 Bellaterra, Barcelona
> http://www.pic.es
> Avis - Aviso - Legal Notice: http://legal.ifae.es
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Workload Separation in Ceph RGW Cluster - Recommended or Not?

2023-06-06 Thread Ramin Najjarbashi
Thank you for your response and for raising an important question regarding
the potential bottlenecks within the RGW or the overall Ceph cluster. I
appreciate your insight and would like to provide more information about
the issues I have been experiencing. In my deployment, RGW instances 17-20
have been encountering problems such as hanging or returning errors,
including "failed to read header: The socket was closed due to a timeout"
and "res_query() failed." These issues have led to disruptions and
congestions within the cluster. The index pool is indeed placed on a large
number of NVMe SSDs to ensure fast access and efficient indexing of data.
The number of Placement Groups (PGs) allocated for the index pool is also
configured to be sufficient for the workload

On Tue, Jun 6, 2023 at 21:27 Anthony D'Atri  wrote:

> Do you have reason to believe that your bottlenecks are within RGW not
> within the cluster?
>
> e.g. is your index pool on a large number of NVMe SSDs with sufficient
> PGs? Is your bucket data on SSD as well?
>
>
> On Jun 6, 2023, at 13:52, Ramin Najjarbashi 
> wrote:
>
> I would like to seek your insights and recommendations regarding the
> practice of workload separation in a Ceph RGW (RADOS Gateway) cluster. I
> have been facing challenges with large queues in my deployment and would
> appreciate your expertise in determining whether workload separation is a
> recommended approach or not.
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The pg_num from 1024 reduce to 32 spend much time, is there way to shorten the time?

2023-06-06 Thread Wesley Dillingham
Can you send along the responses from "ceph df detail" and "ceph osd
pool ls detail"?

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Tue, Jun 6, 2023 at 1:03 PM Eugen Block  wrote:

> I suspect the target_max_misplaced_ratio (default 0.05). You could try
> setting it to 1 and see if it helps. This has been discussed multiple
> times on this list, check out the archives for more details.
>
> Zitat von Louis Koo :
>
> > Thanks for your responses. I want to know why it takes so much time to
> > reduce the pg num?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Workload Separation in Ceph RGW Cluster - Recommended or Not?

2023-06-06 Thread Ramin Najjarbashi
Hi

I would like to seek your insights and recommendations regarding the
practice of workload separation in a Ceph RGW (RADOS Gateway) cluster. I
have been facing challenges with large queues in my deployment and would
appreciate your expertise in determining whether workload separation is a
recommended approach or not.

In my current Ceph cluster, I have 20 RGW instances. Client requests are
directed to RGW1-16, while RGW17-20 are dedicated to administrative tasks
and backend usage. However, I have been encountering errors and congestion
issues due to the accumulation of large queues within the RGW instances.

Considering the above scenario, I would like to inquire about your opinions
on workload separation as a potential solution. Specifically, I am
interested in knowing whether workload separation is recommended in a Ceph
RGW cluster.

To address the queue congestion and improve performance, my proposed
solution includes separating the RGW instances based on their specific
purposes. This entails allocating dedicated instances for client requests,
backend usage, administrative tasks, metadata synchronization with other
zone groups, garbage collection (GC), and lifecycle (LC) operations.
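
For context, a minimal sketch of how part of such a split can be expressed
(the instance names are placeholders, not my actual daemons;
rgw_enable_gc_threads and rgw_enable_lc_threads control whether an instance
runs GC/LC work):

# Client-facing instances: serve requests only, no GC/LC threads
ceph config set client.rgw.client-a rgw_enable_gc_threads false
ceph config set client.rgw.client-a rgw_enable_lc_threads false
# Dedicated background instance: leave GC/LC threads enabled (the default)
ceph config set client.rgw.backend-a rgw_enable_gc_threads true
ceph config set client.rgw.backend-a rgw_enable_lc_threads true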

I kindly request your feedback and insights on the following points:

1. Is workload separation considered a recommended practice in Ceph RGW
deployments?
2. What are the potential benefits and drawbacks of workload separation in
terms of performance, resource utilization, and manageability?
3. Are there any specific considerations or best practices to keep in mind
while implementing workload separation in a Ceph RGW cluster?
4. Can you share your experiences or any references/documentation that
highlight successful implementations of workload separation in Ceph RGW
deployments?

I truly value your expertise and appreciate your time and effort in
providing guidance on this matter. Your insights will contribute
significantly to optimizing the performance and stability of my Ceph RGW
cluster.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: The pg_num from 1024 reduce to 32 spend much time, is there way to shorten the time?

2023-06-06 Thread Eugen Block
I suspect the target_max_misplaced_ratio (default 0.05). You could try  
setting it to 1 and see if it helps. This has been discussed multiple  
times on this list, check out the archives for more details.
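
For reference, a minimal sketch of that change (and of restoring the default
afterwards):

# Allow (almost) all PGs to be misplaced at once so merges are not throttled:
ceph config set mgr target_max_misplaced_ratio 1.0
# ...once the PG merging has finished, put the default back:
ceph config set mgr target_max_misplaced_ratio 0.05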


Zitat von Louis Koo :

Thanks for your responses. I want to know why it takes so much time to
reduce the pg num?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy release -Swift integration with Keystone

2023-06-06 Thread Eugen Block

Hi,
It's not really useful to create multiple threads for the same
question. I wrote up some examples [1] which worked for me to
integrate Keystone and radosgw.


From the debug logs below, it appears that radosgw is still trying  
to authenticate with Swift instead of Keystone.

Any pointers will be appreciated.


Do you mean because you see the "swift" string? That should just be
the Keystone endpoint URL for your service; I wouldn't expect that to
be the issue here, at least not at first glance.
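
For comparison, with rgw_swift_account_in_url set to true the object-store
endpoint in Keystone is usually registered along these lines (the region is a
placeholder and the host/port are only taken from your config dump, so treat
this as a sketch rather than your required setup):

# Object-store endpoint that includes the project id, matching
# rgw_swift_account_in_url = true:
openstack endpoint create --region RegionOne swift public \
  "http://dev-ipp1-u1-object-store:7480/swift/v1/AUTH_%(project_id)s"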

Hope this helps.
Eugen

[1]  
https://serverfault.com/questions/1118004/cephadm-openstack-keystone-integration



Zitat von fs...@yahoo.com:


Hi folks,

My ceph cluster with Quincy and Rocky9 is up and running.
But I'm having issues with Swift authenticating with Keystone.
Was wondering if I've missed anything in the configuration.
From the debug logs below, it appears that radosgw is still trying  
to authenticate with Swift instead of Keystone.

Any pointers will be appreciated.

thanks,

Here is my configuration.

# ceph config dump | grep rgw
client                                         advanced  debug_rgw                      20/20
client                                         advanced  rgw_keystone_accepted_roles    admin,user       *
client                                         advanced  rgw_keystone_admin_domain      Default          *
client                                         advanced  rgw_keystone_admin_password                     *
client                                         advanced  rgw_keystone_admin_project     service          *
client                                         advanced  rgw_keystone_admin_user        ceph-ks-svc      *
client                                         advanced  rgw_keystone_api_version       3
client                                         advanced  rgw_keystone_implicit_tenants  false            *
client                                         advanced  rgw_keystone_token_cache_size  0
client                                         basic     rgw_keystone_url                                *
client                                         advanced  rgw_s3_auth_use_keystone       true
client                                         advanced  rgw_swift_account_in_url       true
client                                         basic     rgw_thread_pool_size           512
client.rgw.s_rgw.dev-ipp1-u1-control01.ojmddc  basic     rgw_frontends                  beast port=7480  *
client.rgw.s_rgw.dev-ipp1-u1-control02.adnjrx  basic     rgw_frontends                  beast port=7480



Here's the debug log.
If I interpret it correctly, it is trying to do a Swift
authentication and failing.

Am I missing any configuration for Keystone-based authentication?

Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: beast:  
0x7fddeb8e7710: 10.117.53.10 - - [03/Jun/2023:18:47:03.060 +]  
"GET /swift/v1/AUTH_c668ed224e434c88a9e0fce125056112?format=json  
HTTP/1.1" 401 119 - "openstacksdk/0.52.0 keystoneauth1/4.0.0  
python-requests/2.22.0 CPython/3.8.10" - latency=0.0s

Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: HTTP_ACCEPT=*/*
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:  
HTTP_ACCEPT_ENCODING=gzip, deflate

Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: HTTP_CONNECTION=close
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:
HTTP_HOST=dev-ipp1-u1-object-store
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:
HTTP_USER_AGENT=openstacksdk/0.52.0 keystoneauth1/4.0.0
python-requests/2.22.0 CPython/3.8.10

Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: HTTP_VERSION=1.1
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:  
HTTP_X_AUTH_TOKEN=gABke4qn779UQ_XMz0EDL3P3TgjBQsGG6p-MNhviJxLZTuMTnTDmpT5Yfi9UpgO_T3LOOsPjQAw6zoMUIaC22wPeryp5x-UumB3XwXOWp-qSXLbuN3b9oj_Qg5kCZWA0waWNRHzQ1mwtlEmmpTgvTXbU5V1ym6hEBOn6Q3RWhn34Hj3c

[ceph-users] Re: The pg_num from 1024 reduce to 32 spend much time, is there way to shorten the time?

2023-06-06 Thread Louis Koo
Thanks for your responses. I want to know why it takes so much time to reduce the 
pg num?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RADOSGW not authenticating with Keystone. Quincy release

2023-06-06 Thread fsbiz
Hi folks,

My ceph cluster with Quincy and Rocky9 is up and running.
But I'm having issues with RADOSGW authenticating with Keystone.
Was wondering if I've missed anything in the configuration.
From the debug logs below, it appears that radosgw is still trying to
authenticate with Swift instead of Keystone.
Any pointers will be appreciated.  

thanks,

Here is my configuration.

# ceph config dump | grep rgw
client                                         advanced  debug_rgw                      20/20
client                                         advanced  rgw_keystone_accepted_roles    admin,user       *
client                                         advanced  rgw_keystone_admin_domain      Default          *
client                                         advanced  rgw_keystone_admin_password                     *
client                                         advanced  rgw_keystone_admin_project     service          *
client                                         advanced  rgw_keystone_admin_user        ceph-ks-svc      *
client                                         advanced  rgw_keystone_api_version       3
client                                         advanced  rgw_keystone_implicit_tenants  false            *
client                                         advanced  rgw_keystone_token_cache_size  0
client                                         basic     rgw_keystone_url                                *
client                                         advanced  rgw_s3_auth_use_keystone       true
client                                         advanced  rgw_swift_account_in_url       true
client                                         basic     rgw_thread_pool_size           512
client.rgw.s_rgw.dev-ipp1-u1-control01.ojmddc  basic     rgw_frontends                  beast port=7480  *
client.rgw.s_rgw.dev-ipp1-u1-control02.adnjrx  basic     rgw_frontends                  beast port=7480


Here's the debug log.  
If I interpret it correctly, it is trying to do a Swift authentication and
failing.
Am I missing any configuration for Keystone-based authentication?

Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: beast: 0x7fddeb8e7710:
10.117.53.10 - - [03/Jun/2023:18:47:03.060 +] "GET
/swift/v1/AUTH_c668ed224e434c88a9e0fce125056112?format=json HTTP/1.1" 401 119 -
"openstacksdk/0.52.0 keystoneauth1/4.0.0 python-requests/2.22.0 CPython/3.8.10"
- latency=0.0s
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: HTTP_ACCEPT=*/*
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: 
HTTP_ACCEPT_ENCODING=gzip,
deflate
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: HTTP_CONNECTION=close
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:
HTTP_HOST=dev-ipp1-u1-object-store
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:
HTTP_USER_AGENT=openstacksdk/0.52.0
keystoneauth1/4.0.0 python-requests/2.22.0 CPython/3.8.10
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: HTTP_VERSION=1.1
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]:
HTTP_X_AUTH_TOKEN=gABke4qn779UQ_XMz0EDL3P3TgjBQsGG6p-MNhviJxLZTuMTnTDmpT5Yfi9UpgO_T3LOOsPjQAw6zoMUIaC22wPeryp5x-UumB3XwXOWp-qSXLbuN3b9oj_Qg5kCZWA0waWNRHzQ1mwtlEmmpTgvTXbU5V1ym6hEBOn6Q3RWhn34Hj3cF9o
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: 
HTTP_X_FORWARDED_FOR=10.117.148.3
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: QUERY_STRING=format=json
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: REMOTE_ADDR=10.117.53.10
Jun 03 11:47:03 dev-ipp1-u1-control02 radosgw[2802861]: REQUEST_METHOD=GET
Jun 03 11:47:03 dev-ipp1-u1-con

[ceph-users] RADOSGW integration with Keystone not working in Quincy release ??

2023-06-06 Thread fs...@yahoo.com
I have a ceph cluster installed using cephadm.
The cluster is up and running but I'm unable to get Keystone integration 
working with RADOSGW.
Is this a known issue?
thanks, Fred.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-06-06 Thread Frank Schilder
Yes, would be interesting. I understood that it mainly helps with buffered 
writes, but ceph is using direct IO for writes and that's where bypassing the 
queues helps.

Are there detailed instructions somewhere on how to set up a host to disable the 
queues? I don't have time to figure this out myself. It should be detailed 
enough that I just need to edit some configs, reboot, et voilà.

I have a number of new hosts to deploy and I could use one of these to run a 
test. They have a mix of NVMe, SSD and HDD and I can run fio benchmarks before 
deploying OSDs in the way you did.

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: Tuesday, June 6, 2023 3:20 PM
To: Frank Schilder; Anthony D'Atri; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Encryption per user Howto

On 6/6/23 14:26, Frank Schilder wrote:
> Hi Stefan,
>
> there are still users with large HDD installations and I think this will not 
> change anytime soon. What is the impact of encryption with the new settings 
> for HDD? Is it as bad as their continued omission from any statement suggests?

We only tested flash, as we don't have any spinners to test on, so I'm not
sure what the impact would be. From the blog post I got that the queuing
was there mainly to help workloads on spinning disks. But all of that was
valid in 2015. It specifically mentions a benefit for the CFQ scheduler, which
has since been removed from the Linux kernel.

So, it would be interesting to know if it is still needed on spinners
when used with Ceph ...

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-06-06 Thread Stefan Kooman

On 6/6/23 14:26, Frank Schilder wrote:

Hi Stefan,

there are still users with large HDD installations and I think this will not 
change anytime soon. What is the impact of encryption with the new settings for 
HDD? Is it as bad as their continued omission from any statement suggests?


We only tested flash, as we don't have any spinners to test on, so I'm not 
sure what the impact would be. From the blog post I got that the queuing 
was there mainly to help workloads on spinning disks. But all of that was 
valid in 2015. It specifically mentions a benefit for the CFQ scheduler, which 
has since been removed from the Linux kernel.


So, it would be interesting to know if it is still needed on spinners 
when used with Ceph ...


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Encryption per user Howto

2023-06-06 Thread Frank Schilder
Hi Stefan,

there are still users with large HDD installations and I think this will not 
change anytime soon. What is the impact of encryption with the new settings for 
HDD? Is it as bad as their continued omission from any statement suggests?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Stefan Kooman 
Sent: Friday, June 2, 2023 5:11 PM
To: Anthony D'Atri; ceph-users@ceph.io
Subject: [ceph-users] Re: Encryption per user Howto

On 6/2/23 16:33, Anthony D'Atri wrote:
> Stefan, how do you have this implemented? Earlier this year I submitted
> https://tracker.ceph.com/issues/58569
>  asking to enable just this.

Lol, I have never seen that tracker, otherwise I would have informed you
about it. I see the PR and tracker have been updated by you / Joshua, thanks
for that.

So yes, we have this implemented and running in production (currently
re-provisioning all OSDs). It's a locally patched 16.2.11 ceph-volume,
for that matter. The PR [1] needs some fixing (I need to sit down and
make it happen; there are just so many other things that take up my time), but
then this would be enabled by default for flash devices
(non-rotational). If used with cryptsetup 2.4.x, the appropriate
sector size is also used (based on the physical sector size). We use 4K on NVMe.

An added benefit of using cryptsetup 2.4.x is that it uses Argon2id as the
PBKDF for LUKS2.
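
For illustration only, roughly what this amounts to at the cryptsetup level
when formatting a device by hand (the device path is a placeholder, and
ceph-volume normally drives this step for an OSD):

# WARNING: destroys all data on the device.
# LUKS2 with an explicit 4K sector size and Argon2id as the PBKDF:
cryptsetup luksFormat --type luks2 --pbkdf argon2id --sector-size 4096 /dev/nvme0n1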

We created a backport of cryptsetup 2.4.3 for use in Ubuntu Focal (based
on Jammy) [2].

We are converting our whole cluster to LUKS2 with the work queues
bypassed. For the nodes that have already been converted it works just
fine. So, as multiple users seem to be waiting for this to be available
in Ceph ... I should hurry up and make sure the PR gets into proper shape
and is merged into main.
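
As a minimal sketch of the "work queues bypassed" part (assuming cryptsetup >=
2.3.4 and an already-open LUKS2 device; the device path and mapping name are
placeholders):

# Re-activate the dm-crypt mapping with the read/write work queues bypassed and
# store the flags persistently in the LUKS2 header (prompts for the key):
cryptsetup open --refresh --persistent \
  --perf-no_read_workqueue --perf-no_write_workqueue \
  /dev/nvme0n1 osd-block-dmcrypt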

Gr. Stefan

[1]: https://github.com/ceph/ceph/pull/49554
[2]: https://obit.bit.nl/ubuntu/focal/cryptsetup/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Question about xattr and subvolumes

2023-06-06 Thread Dario Graña
Hi,

I'm installing a new instance (my first) of Ceph. Our cluster runs
AlmaLinux9 + Quincy. Now I'm dealing with CephFS and quotas. I read
documentation about setting up quotas with virtual attributes (xattr) and
creating volumes and subvolumes with a prefixed size. I cannot distinguish
which is the best option for us.
Currently we create a directory with a project name and some subdirectories
inside.
I would like to understand the difference between both options.

Thanks in advance.

-- 
Dario Graña
PIC (Port d'Informació Científica)
Campus UAB, Edificio D
E-08193 Bellaterra, Barcelona
http://www.pic.es
Avis - Aviso - Legal Notice: http://legal.ifae.es
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-06 Thread Janek Bevendorff
I guess the mailing list didn't preserve the embedded image. Here's an 
Imgur link: https://imgur.com/a/WSmAOaG


I checked the logs as far back as we have them. The issue started 
appearing only after my last Ceph upgrade on 2 May, which introduced the 
new corruption assertion.



On 06/06/2023 09:16, Janek Bevendorff wrote:
I checked our Prometheus logs, and the number of log events of 
individual MONs does indeed randomly start to increase dramatically 
all of a sudden. I attached a picture of the curves.


The first instance you see there was when our metadata store filled 
up entirely. The second, smaller one was the more controlled fill-up. 
The last instance with only one runaway MDS is what I have just reported.


My unqualified wild guess is that the new safeguard to prevent the MDS 
from committing corrupt dentries is holding up the queue, so all of a 
sudden events are starting to pile up until the store is full.





On 05/06/2023 18:03, Janek Bevendorff wrote:
That said, our MON store size has also been growing slowly from 900MB 
to 5.4GB. But we also have a few remapped PGs right now. Not sure if 
that would have an influence.



On 05/06/2023 17:48, Janek Bevendorff wrote:

Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly 
corrupt dentry" bug [1]. Even though I couldn't see any particular 
reason for the SIGABRT at first, I then noticed one of these awfully 
familiar stack traces.


I rescheduled the two broken MDS ranks on two machines with 1.5TB 
RAM each (just to make sure it's not that) and then let them do 
their thing. The routine goes as follows: both replay the journal, 
then rank 4 goes into the "resolve" state, but as soon as rank 3 
also starts resolving, they both crash.


Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" 
and "clientreplay" took forever. I uploaded a compressed log of rank 
3 using ceph-post-file [2]. It's a log of several crash cycles, 
including the final successful attempt after changing the settings. 
The log decompresses to 815MB. I didn't censor any paths and they 
are not super-secret, but please don't share.


While writing this, the metadata pool size has reduced from 6TiB 
back to 440GiB. I am starting to think that the fill-ups may also be 
connected to the corruption issue. I also noticed that ranks 3 
and 4 always have huge journals. An inspection using 
cephfs-journal-tool takes forever and consumes 50GB of memory in the 
process. Listing the events in the journal is impossible without 
running out of RAM. Ranks 0, 1, and 2 don't have this problem and 
this wasn't a problem for ranks 3 and 4 either before the fill-ups 
started happening.


Hope that helps getting to the bottom of this. I reset the guardrail 
settings in the meantime.


Cheers
Janek


[1] "Newly corrupt dentry" ML link: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3


[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again that MDS were constantly reporting 
slow metadata IO and the pool was slowly growing. Hence I restarted 
the MDS and now ranks 4 and 5 don't come up again.


Every time, they get to the resolve stage, then crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea 
what the reason could be? I checked whether they have enough RAM, 
which seems to be the case (unless they try to allocate tens of GB 
in one allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name 
wasn’t. ;-)


Yes, I have five active MDS and five hot standbys. Static pinning 
isn’t really an option for our directory structure, so we’re 
using ephemeral pins.


Janek


On 31. May 2023, at 18:44, Dan van der Ster 
 wrote:


Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it.
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:
I checked our logs from yesterd

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-06 Thread Janek Bevendorff
I checked our Prometheus logs, and the number of log events of individual 
MONs does indeed randomly start to increase dramatically all of a 
sudden. I attached a picture of the curves.


The first instance you see there was when our metadata store filled up 
entirely. The second, smaller one was the more controlled fill-up. The 
last instance with only one runaway MDS is what I have just reported.


My unqualified wild guess is that the new safeguard to prevent the MDS 
from committing corrupt dentries is holding up the queue, so all of a 
sudden events are starting to pile up until the store is full.





On 05/06/2023 18:03, Janek Bevendorff wrote:
That said, our MON store size has also been growing slowly from 900MB 
to 5.4GB. But we also have a few remapped PGs right now. Not sure if 
that would have an influence.



On 05/06/2023 17:48, Janek Bevendorff wrote:

Hi Patrick, hi Dan!

I got the MDS back and I think the issue is connected to the "newly 
corrupt dentry" bug [1]. Even though I couldn't see any particular 
reason for the SIGABRT at first, I then noticed one of these awfully 
familiar stack traces.


I rescheduled the two broken MDS ranks on two machines with 1.5TB RAM 
each (just to make sure it's not that) and then let them do their 
thing. The routine goes as follows: both replay the journal, then 
rank 4 goes into the "resolve" state, but as soon as rank 3 also 
starts resolving, they both crash.


Then I set

ceph config set mds mds_abort_on_newly_corrupt_dentry false
ceph config set mds mds_go_bad_corrupt_dentry false

and this time I was able to recover the ranks, even though "resolve" 
and "clientreplay" took forever. I uploaded a compressed log of rank 
3 using ceph-post-file [2]. It's a log of several crash cycles, 
including the final successful attempt after changing the settings. 
The log decompresses to 815MB. I didn't censor any paths and they are 
not super-secret, but please don't share.


While writing this, the metadata pool size has reduced from 6TiB back 
to 440GiB. I am starting to think that the fill-ups may also be 
connected to the corruption issue. I also noticed that ranks 3 
and 4 always have huge journals. An inspection using 
cephfs-journal-tool takes forever and consumes 50GB of memory in the 
process. Listing the events in the journal is impossible without 
running out of RAM. Ranks 0, 1, and 2 don't have this problem and 
this wasn't a problem for ranks 3 and 4 either before the fill-ups 
started happening.


Hope that helps getting to the bottom of this. I reset the guardrail 
settings in the meantime.


Cheers
Janek


[1] "Newly corrupt dentry" ML link: 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/JNZ6V5WSYKQTNQPQPLWRBM2GEP2YSCRV/#PKQVZYWZCH7P76Q75D5WD5JEAVWOKJE3


[2] ceph-post-file ID: 7c039483-49fd-468c-ba40-fb10337aa7d6



On 05/06/2023 16:08, Janek Bevendorff wrote:
I just had the problem again that MDS were constantly reporting slow 
metadata IO and the pool was slowly growing. Hence I restarted the 
MDS and now ranks 4 and 5 don't come up again.


Every time, they get to the resolve stage, then crash with a SIGABRT 
without an error message (not even at debug_mds = 20). Any idea what 
the reason could be? I checked whether they have enough RAM, which 
seems to be the case (unless they try to allocate tens of GB in one 
allocation).


Janek


On 31/05/2023 21:57, Janek Bevendorff wrote:

Hi Dan,

Sorry, I meant Pacific. The version number was correct, the name 
wasn’t. ;-)


Yes, I have five active MDS and five hot standbys. Static pinning 
isn’t really an option for our directory structure, so we’re using 
ephemeral pins.


Janek


On 31. May 2023, at 18:44, Dan van der Ster 
 wrote:


Hi Janek,

A few questions and suggestions:
- Do you have multi-active MDS? In my experience back in nautilus if
something went wrong with mds export between mds's, the mds
log/journal could grow unbounded like you observed until that export
work was done. Static pinning could help if you are not using it
already.
- You definitely should disable the pg autoscaling on the mds metadata
pool (and other pools imho) -- decide the correct number of PGs for
your pools and leave it (see the example below).
- Which version are you running? You said nautilus but wrote 16.2.12
which is pacific... If you're running nautilus v14 then I recommend
disabling pg autoscaling completely -- IIRC it does not have a fix for
the OSD memory growth "pg dup" issue which can occur during PG
splitting/merging.
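
A minimal example of the autoscaler suggestion above (the pool name is a
placeholder; repeat per pool as desired):

# Turn the PG autoscaler off for the CephFS metadata pool:
ceph osd pool set cephfs_metadata pg_autoscale_mode off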

Cheers, Dan

__
Clyso GmbH | https://www.clyso.com


On Wed, May 31, 2023 at 4:03 AM Janek Bevendorff
 wrote:
I checked our logs from yesterday, the PG scaling only started today,
perhaps triggered by the snapshot trimming. I disabled it, but it didn't
change anything.

What did change something was restarting the MDS one by one, which had
got far behind with trimming their caches and with a bunch of stuck ops.

After restarting them, the pool size decreased qu