Hello,

Is it possible to restart rbd-target-api without restarting the entire
container?

Ceph version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5)

root@vxxx-sx-xx-iscsi0:~# docker ps
CONTAINER ID   IMAGE                                     COMMAND                  CREATED        STATUS        PORTS     NAMES
949dabe059eb   quay.io/ceph/ceph                         "/usr/bin/rbd-target…"   2 months ago   Up 2 months             ceph-c404fafe-767c-11ee-bc37-0509d00921ba-iscsi-sx-xx-vxxxgw-pool0-vxxx-sx-xx-iscsi0-kwjqrn
43a298fe835b   quay.io/ceph/ceph                         "/usr/bin/tcmu-runner"   2 months ago   Up 2 months             ceph-c404fafe-767c-11ee-bc37-0509d00921ba-iscsi-sx-xx-vxxxgw-pool0-vxxx-sx-xx-iscsi0-kwjqrn-tcmu
1eef0e6084a2   quay.io/prometheus/node-exporter:v1.5.0   "/bin/node_exporter …"   6 months ago   Up 6 months             ceph-c404fafe-767c-11ee-bc37-0509d00921ba-node-exporter-vxxx-sx-xx-iscsi0
4d94c24cebec   quay.io/ceph/ceph                         "/usr/bin/ceph-crash…"   6 months ago   Up 6 months             ceph-c404fafe-767c-11ee-bc37-0509d00921ba-crash-vxxx-sx-xx-iscsi0


root@vxxx-sx-xx-iscsi0:~# docker exec -it 949dabe059eb /bin/bash
[root@vxxx-sx-xx-iscsi0 /]# ps auxf

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      134572  0.2  0.0  14152  3284 pts/4    Ss   13:42   0:00 /bin/bash
root      134590  0.0  0.0  46800  3460 pts/4    R+   13:42   0:00  \_ ps auxf
root           1  0.0  0.0   1020   676 ?        Ss   Jul14   6:09 /sbin/docker-init -- /usr/bin/rbd-target-api
root           8  0.1  0.3 3785056 253040 ?      Sl   Jul14 152:47 /usr/bin/python3.6 -s /usr/bin/rbd-target-api



Best Regards,
Laszlo Kardos

-----Original Message-----
From: Anthony D'Atri <a...@dreamsnake.net>
Sent: Tuesday, September 30, 2025 6:05 PM
To: Laszlo Budai <las...@componentsoft.eu>
Cc: Kardos László <laszlo.kar...@acetelecom.hu>; ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph GWCLI issue



> The PG numbers are still very low in my opinion. you have 42 OSDs and only 
> 614 PGs that makes roughly 15 PG / OSD. That's quite far from the rule of 
> thumb of 100 PG/OSD.

I've been trying to clear up this nuance in the docs when I can.  The PGS 
field in `ceph osd df` and the target value (aka "PG ratio") are for PG 
replicas, not PGs, so one has to factor in replication.  For EC, the 
replication factor is k+m.

For a cluster with one pool:

pg_num = (#OSDs * ratio) / replication

ratio = (pg_num * replication) / #OSDs

Round to the nearest power of 2, if in doubt round up.
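The arithmetic above can be sketched quickly. A minimal, illustrative Python helper (the 42 OSDs come from this thread; the ratio target and EC profile are example inputs, not recommendations):

```python
import math

def pg_num(num_osds, target_ratio, replication):
    """Suggested pg_num for a single-pool cluster.

    target_ratio counts PG *replicas* per OSD, so divide by the
    replication factor (size for replicated pools, k+m for EC).
    """
    raw = num_osds * target_ratio / replication
    # Round to the nearest power of 2; on a tie, round up.
    lo = 2 ** math.floor(math.log2(raw))
    hi = lo * 2
    return hi if (raw - lo) >= (hi - raw) else lo

# 42 OSDs, a target ratio of 100, and a 3+1 EC pool (replication = 4):
print(pg_num(42, 100, 4))   # raw 1050 -> 1024
```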

When you have multiple pools it gets more complicated.  One can use the 
pgcalc, or leverage the PG autoscaler.  That said, the default target of 100 
is way too low, especially since it's a max, not a target as such.

global     advanced  mon_max_pg_per_osd                         600
global     advanced  mon_target_pg_per_osd                      300
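
For reference, settings like the above can be applied at runtime with `ceph config set` (a sketch; the values shown are the ones from this reply, not the defaults):

```shell
# Raise the per-OSD PG-replica ceiling and the autoscaler's target.
ceph config set global mon_max_pg_per_osd 600
ceph config set global mon_target_pg_per_osd 300

# Verify what the mons will use:
ceph config get mon mon_max_pg_per_osd
ceph config get mon mon_target_pg_per_osd
```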





> But maybe your problem is located in a different place. You may want to 
> check whether all your `rbd-target-api` services are up and running. gwcli 
> relies on them.
>
> Kind regards,
> Laszlo Budai
>
>
> On 9/30/25 10:31, Kardos László wrote:
>> Hello,
>> I apologize for sending the wrong pool details earlier.
>> We store the data in the following data pool:  xxxx0-data
>>
>> pool 15 'xxxx0-data' erasure profile laurel_ec size 4 min_size 3 crush_rule

If this is a 3+1 pool, note that a value of m=1 is ... fraught.  If this is a 
2+2 pool, note that with current releases, EC for RBD is usually a 
significant latency liability.  Tentacle's fast EC improves that dynamic.


>> 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 30830 lfor 0/0/30825 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 12288 application rbd,rgw
>>
>> cluster:
>>     id:     c404fafe-767c-11ee-bc37-0509d00921ba
>>     health: HEALTH_OK
>>
>>   services:
>>     mon:         5 daemons, quorum v188-ceph-mgr0,v188-ceph-mgr1,v188-ceph-iscsigw2,v188-ceph6,v188-ceph5 (age 5d)
>>     mgr:         v188-ceph-mgr0.rxcecw (active, since 11w), standbys: v188-ceph-mgr1.hmbuma
>>     mds:         1/1 daemons up, 1 standby
>>     osd:         42 osds: 42 up (since 2M), 42 in (since 3M)
>>     tcmu-runner: 10 portals active (4 hosts)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   11 pools, 614 pgs
>>     objects: 13.63M objects, 51 TiB
>>     usage:   75 TiB used, 71 TiB / 147 TiB avail
>>     pgs:     613 active+clean
>>              1   active+clean+scrubbing+deep
>>
>>   io:
>>     client:   8.1 MiB/s rd, 105 MiB/s wr, 320 op/s rd, 2.31k op/s wr
>>
>>
>> Best Regards,
>> Laszlo Kardos
>>
>> -----Original Message-----
>> From: Eugen Block <ebl...@nde.ag>
>> Sent: Tuesday, September 30, 2025 9:03 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: Ceph GWCLI issue
>>
>>
>> Hi,
>>
>> I don't have an answer why the image is in unknown state, but I'd be
>> concerned about the pool's pg_num. You have Terabytes in a pool with
>> a single PG? That's awful and should be increased to a more suitable
>> value. I can't say if that would fix anything regarding the unknown
>> issue, but that's definitely not good at all.
>>
>> What is the overall Ceph status (ceph -s)?
>>
>> Regards,
>> Eugen
>>
>>
>> Zitat von Kardos László <laszlo.kar...@acetelecom.hu>:
>>
>>> Hello,
>>>
>>> We have encountered the following issue in our production environment:
>>>
>>> A new RBD Image was created within an existing pool, and its status
>>> is reported as "unknown" in GWCLI. Based on our tests, this does not
>>> appear to cause operational issues, but we would like to investigate
>>> the root cause. No relevant information regarding this issue was
>>> found in the logs.
>>>
>>> GWCLI output:
>>>
>>>
>>>
>>> o- / ......................................................................... [...]
>>>   o- cluster ....................................................... [Clusters: 1]
>>>   | o- ceph .......................................................... [HEALTH_OK]
>>>   |   o- pools ....................................................... [Pools: 11]
>>>   |   | o- .mgr ................... [(x3), Commit: 0.00Y/15591725M (0%), Used: 194124K]
>>>   |   | o- .nfs ................... [(x3), Commit: 0.00Y/15591725M (0%), Used: 16924b]
>>>   |   | o- xxxx-test .............. [(2+1), Commit: 0.00Y/23727198M (0%), Used: 0.00Y]
>>>   |   | o- xxxxx-erasure-0 ........ [(2+1), Commit: 0.00Y/23727198M (0%), Used: 61519257668K]
>>>   |   | o- xxxxxx-repl ............ [(x3), Commit: 0.00Y/15591725M (0%), Used: 130084b]
>>>   |   | o- cephfs.cephfs-test.data  [(x3), Commit: 0.00Y/15591725M (0%), Used: 9090444K]
>>>   |   | o- cephfs.cephfs-test.meta  [(x3), Commit: 0.00Y/15591725M (0%), Used: 516415713b]
>>>   |   | o- xxxxx-data ............. [(3+1), Commit: 0.00Y/9604386M (0%), Used: 7547753556K]
>>>   |   | o- xxxxx-rpl .............. [(x3), Commit: 12.0T/4268616M (294%), Used: 85265b]
>>>   |   | o- xxxxx-data ............. [(3+1), Commit: 0.00Y/5011626M (0%), Used: 10955179612K]
>>>   |   | o- replicated_xxxx ........ [(x3), Commit: 25.0T/2280846592K (1176%), Used: 46912b]
>>>   |   o- topology .............................................. [OSDs: 42,MONs: 5]
>>>   o- disks .................................................... [37.0T, Disks: 3]
>>>   | o- xxxx-rpl ................................................ [xxxx-rpl (12.0T)]
>>>   | | o- xxxxx_lun0 ................... [xxxx-rpl/xxxxx_lun0 (Online, 12.0T)]
>>>   | o- replicated_xxxx ...................... [replicated_xxxx (25.0T)]
>>>   |   o- xxxx_lun0 .................... [replicated_xxxx/xxxx_lun0 (Online, 12.0T)]
>>>   |   o- xxxx_lun_new ................. [replicated_xxxx/xxxx_lun_new (Unknown, 13.0T)]
>>>
>>>
>>>
>>> The image (xxxx_lun_new) is provisioned to multiple ESXi hosts,
>>> mounted, and formatted with VMFS6. The datastore is writable and
>>> readable by the hosts.
>>>
>>> There is a change in the block size of the RBD Image: the older RBD
>>> Images use a 4 MiB block size, while the new RBD Image uses a 512
>>> KiB block size.
>>>
>>> RBD Image Parameters:
>>>
>>> For replicated_xxxx / xxxx_lun0 (Online status in GWCLI):
>>>
>>>
>>>
>>> rbd image 'xxxx_lun0':
>>>         size 12 TiB in 3145728 objects
>>>         order 22 (4 MiB objects)
>>>         snapshot_count: 0
>>>         id: 5c1b5ecfdfa46
>>>         data_pool: xxxx0-data
>>>         block_name_prefix: rbd_data.14.5c1b5ecfdfa46
>>>         format: 2
>>>         features: exclusive-lock, data-pool
>>>         op_features:
>>>         flags:
>>>         create_timestamp: Tue Jul  8 13:02:11 2025
>>>         access_timestamp: Thu Sep 25 13:49:47 2025
>>>         modify_timestamp: Thu Sep 25 13:50:05 2025
>>>
>>>
>>>
>>>
>>>
>>> For replicated_xxxx / xxxx_lun_new (Unknown status in GWCLI):
>>>
>>> rbd image 'xxxx_lun_new':
>>>         size 13 TiB in 27262976 objects
>>>         order 19 (512 KiB objects)
>>>         snapshot_count: 0
>>>         id: 1945d9cf9f41ab
>>>         data_pool: xxxx0-data
>>>         block_name_prefix: rbd_data.14.1945d9cf9f41ab
>>>         format: 2
>>>         features: exclusive-lock, data-pool
>>>         op_features:
>>>         flags:
>>>         create_timestamp: Wed Sep 24 11:21:21 2025
>>>         access_timestamp: Thu Sep 25 13:50:42 2025
>>>         modify_timestamp: Thu Sep 25 13:49:48 2025
>>>
>>>
>>>
>>> Pool Parameters:
>>>
>>> pool 14 'replicated_xxxx' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 30743 flags hashpspool stripe_width 0 application rbd,rgw
>>>
>>> Ceph version:
>>>
>>> ceph --version
>>>
>>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>>
>>>
>>>
>>> Question:
>>>
>>> What could be causing the RBD Image (xxxx_lun_new) to appear in an
>>> "unknown" state in GWCLI?
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
