[ceph-users] Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread Lo Re Giuseppe
Hi David,
Thanks a lot for your reply.
Yes, we have heavy client load on the same subtree. We have multiple MDSs that 
were set up in the hope of distributing the load among them, but this is not 
really happening: in moments of high load we see most of the load on one MDS.
We don't use pinning today.
We have placed more than one MDS on the same servers, as we noticed that the 
memory consumption allowed this.
Right now I have scaled the MDS service down to 1, as I have learned that the 
use of multiple active MDSs could have been a risky move. It was working 
reasonably well, though, until we upgraded from 17.2.5 to 17.2.7 and now 18.2.0.
I'll look more into client stability as per your suggestion.
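
If we do give pinning a try later, my understanding is that it is driven by
extended attributes on directories of the mounted file system; a minimal
sketch (the paths and rank numbers below are only illustrative):
“””
# pin a heavily used subtree to MDS rank 1
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/project-a
# or let Ceph spread a directory's immediate children across ranks (ephemeral pinning)
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home
# -v -1 removes an explicit pin again
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/project-a
“””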

Best,

Giuseppe

On 27.11.2023, 10:41, "David C."  wrote:


Hi Giuseppe,


Could it be that you have clients heavily loading the MDS with concurrent
access to the same trees?


Perhaps also look at the stability of all your clients (even if there are
many) [dmesg -T, ...].


How are your 4 active MDSs configured (pinning?)?


Probably nothing to do with it, but is it normal that 2 MDSs are on the same
host, "monitor-02"?





Best regards,


*David CASIER*


________






On Mon, 27 Nov 2023 at 10:09, Lo Re Giuseppe  wrote:


> Hi,
> We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we are
> having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
> cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_WARN
> 1 filesystem is degraded
> 3 clients failing to advance oldest client/flush tid
> 3 MDSs report slow requests
> 6 pgs not scrubbed in time
> 29 daemons have recently crashed
> …
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I noticed that one MDS had most of the slow operations:
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
> mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked
> > 30 secs
> mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are
> blocked > 30 secs
> mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked
> > 30 secs
> “””
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart
> mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host
> 'naret-monitor01'
> “””
>
> After that, CephFS entered this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ==
> RANK  STATE         MDS                            ACTIVITY    DNS    INOS   DIRS  CAPS
>  0    active        cephfs.naret-monitor01.nuakzo  Reqs: 0 /s  17.2k  16.2k  1892  14.3k
>  1    active        cephfs.naret-monitor02.ztdghf  Reqs: 0 /s  28.1k  10.3k   752   6881
>  2    clientreplay  cephfs.naret-monitor02.exceuo              63.0k   6491   541     66
>  3    active        cephfs.naret-monitor03.lqppte  Reqs: 0 /s  16.7k  13.4k  8233    990
>            POOL             TYPE      USED   AVAIL
> cephfs.cephfs.meta          metadata  5888M  18.5T
> cephfs.cephfs.data          data       119G   215T
> cephfs.cephfs.data.e_4_2    data      2289G  3241T
> cephfs.cephfs.data.e_8_3    data      9997G   470T
> STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0
> (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is totally unresponsive (we can mount it on client nodes
> but any operation, like a simple ls, hangs).
>
> During the night we had a lot of MDS crashes; I can share their content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread Lo Re Giuseppe
Hi,
We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we have 
been having CephFS issues.
For example this morning:
“””
[root@naret-monitor01 ~]# ceph -s
  cluster:
id: 63334166-d991-11eb-99de-40a6b72108d0
health: HEALTH_WARN
1 filesystem is degraded
3 clients failing to advance oldest client/flush tid
3 MDSs report slow requests
6 pgs not scrubbed in time
29 daemons have recently crashed
…
“””

The ceph orch, ceph crash and ceph fs status commands were hanging.

After a “ceph mgr fail” those commands started to respond.
Then I noticed that one MDS had most of the slow operations:

“””
[WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked > 30 
secs
mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are blocked > 
30 secs
mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked > 30 
secs
“””
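
(For reference, the blocked requests on that MDS can also be inspected
directly before restarting it; a minimal sketch with standard admin commands,
using the daemon name from the output above:)
“””
ceph tell mds.cephfs.naret-monitor01.uvevbf ops                # all in-flight ops
ceph tell mds.cephfs.naret-monitor01.uvevbf dump_blocked_ops   # only the blocked ones
ceph tell mds.cephfs.naret-monitor01.uvevbf session ls         # client sessions and caps held
“””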

Then I tried to restart it with

“””
[root@naret-monitor01 ~]# ceph orch daemon restart 
mds.cephfs.naret-monitor01.uvevbf
Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host 'naret-monitor01'
“””

After that, CephFS entered this situation:
“””
[root@naret-monitor01 ~]# ceph fs status
cephfs - 198 clients
==
RANK  STATE         MDS                            ACTIVITY    DNS    INOS   DIRS  CAPS
 0    active        cephfs.naret-monitor01.nuakzo  Reqs: 0 /s  17.2k  16.2k  1892  14.3k
 1    active        cephfs.naret-monitor02.ztdghf  Reqs: 0 /s  28.1k  10.3k   752   6881
 2    clientreplay  cephfs.naret-monitor02.exceuo              63.0k   6491   541     66
 3    active        cephfs.naret-monitor03.lqppte  Reqs: 0 /s  16.7k  13.4k  8233    990
           POOL             TYPE      USED   AVAIL
cephfs.cephfs.meta          metadata  5888M  18.5T
cephfs.cephfs.data          data       119G   215T
cephfs.cephfs.data.e_4_2    data      2289G  3241T
cephfs.cephfs.data.e_8_3    data      9997G   470T
 STANDBY MDS
cephfs.naret-monitor03.eflouf
cephfs.naret-monitor01.uvevbf
MDS version: ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) 
reef (stable)
“””

The file system is totally unresponsive (we can mount it on client nodes, but 
any operation, like a simple ls, hangs).

During the night we had a lot of MDS crashes; I can share their content.

Does anybody have an idea on how to tackle this problem?

Best,

Giuseppe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: librbd 4k read/write?

2023-08-11 Thread Lo Re Giuseppe
Hi,
In the case of a cluster where most pools use erasure code 4+2, what would you 
consider as the value for cluster_size?

Giuseppe

On 10.08.23, 21:06, "Zakhar Kirpichenko"  wrote:




Hi,


You can use the following formula to roughly calculate the IOPS you can get
from a cluster: (Drive_IOPS * Number_of_Drives * 0.75) / Cluster_Size.


For example, for 60 10K rpm SAS drives each capable of 200 4K IOPS and a
replicated pool with size 3: (~200 * 60 * 0.75) / 3 = ~3000 IOPS with block
size = 4K.


That's what the OP is getting, give or take.


/Z
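
As a rough, hedged adaptation of the same formula to the erasure-coded case
asked about above (my own assumption, not something stated in the thread: each
client write touches k+m OSDs, and sub-stripe 4K writes on EC pay an extra
read-modify-write penalty):
“””
60 drives * 200 IOPS * 0.75 = 9000 raw 4K IOPS
replicated size 3:   9000 / 3 = ~3000 IOPS
EC 4+2 (k+m = 6):    9000 / 6 = ~1500 IOPS, or less for small writes
“””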


On Thu, 10 Aug 2023 at 20:20, Anthony D'Atri  wrote:


>
>
> >
> > Good afternoon everybody!
> >
> > I have the following scenario:
> > Pool RBD replication x3
> > 5 hosts with 12 SAS spinning disks each
>
> Old hardware? SAS is mostly dead.
>
> > I'm using exactly the following line with FIO to test:
> > fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> > -iodepth=16 -rw=write -filename=./test.img
>
> On what kind of client?
>
> > If I increase the blocksize I can easily reach 1.5 GBps or more.
> >
> > But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> > which is quite annoying. I achieve the same rate if rw=read.
>
> If your client is VM especially, check if you have IOPS throttling. With
> small block sizes you'll throttle IOPS long before bandwidth.
>
> > Note: I tested it on another smaller cluster, with 36 SAS disks and got
> the
> > same result.
>
> SAS has a price premium over SATA, and still requires an HBA. Many
> chassis vendors really want you to buy an anachronistic RoC HBA.
>
> Eschewing SAS and the HBA helps close the gap to justify SSDs, the TCO
> just doesn't favor spinners.
>
> > Maybe the 5 host cluster is not
> > saturated by your current fio test. Try running 2 or 4 in parallel.
>
>
> Agreed that Ceph is a scale out solution, not DAS, but note the difference
> reported with a larger block size.
>
> >How is this related to 60 drives? His test is only on 3 drives at a time
> not?
>
> RBD volumes by and large will live on most or all OSDs in the pool.
>
>
>
>
> >
> > I don't know exactly what to look for or configure to have any
> improvement.
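
A minimal sketch of the parallel 4K run suggested above (values are
illustrative; it reuses the same test file as the original command):
“””
fio -ioengine=libaio -direct=1 -invalidate=1 -name=test4k -bs=4k -size=10G \
    -iodepth=16 -numjobs=4 -group_reporting -rw=write -filename=./test.img
“””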



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 16.2.7. to 16.2.11 failing on OSDs

2023-03-30 Thread Lo Re Giuseppe
To add to this: the issue seemed related to a process (ceph-volume) that was 
doing check operations on all devices. The OSD systemd service was timing out 
because of that, and the OSD daemon was going into an error state.
We noticed that version 17.2.5 had a change related to ceph-volume, in 
particular https://tracker.ceph.com/issues/57627.
We decided to skip 16.2.11 and jump to 17.2.5. This second attempt went well, 
so the issue is now solved.
Note: the upgrade 16.2.7 -> 16.2.11 went smoothly on a TDS cluster with 
identical OS/software but much smaller (3 nodes with a couple of disks each), 
so the issue really seems to be about the number of devices and nodes.
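
(Two hedged checks that follow from this: how long the device scan actually
takes, and whether a longer systemd start timeout would have been enough. The
fsid below is the one of this cluster; the timeout value is only an example.)
“””
# time a full ceph-volume device scan on an OSD host
time cephadm ceph-volume -- inventory

# optional workaround sketch: raise the start timeout of the OSD units
mkdir -p /etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service.d
cat > /etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=600
EOF
systemctl daemon-reload
“””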

Regards,

Giuseppe 

On 30.03.23, 16:56, "Lo Re Giuseppe"  wrote:


Dear all,


On one of our clusters I started the upgrade process from 16.2.7 to 16.2.11.
Mon, mgr and crash daemons were done easily/quickly; then, at the first 
attempt to upgrade an OSD container, the upgrade process stopped because the 
OSD process was not able to start after the upgrade.


Does anyone have any hint on how to unblock the upgrade?
Some details below:
Regards,


Giuseppe


I started the upgrade process with the cephadm command:


“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 16.2.11
Initiating upgrade to quay.io/ceph/ceph:v16.2.11
“””


After a short time:


“””
[root@naret-monitor01 ~]# ceph orch upgrade status
{
"target_image": 
quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add<mailto:quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add>,
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [
"crash",
"mon",
"mgr"
],
"progress": "64/2039 daemons upgraded",
"message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host 
naret-osd01 failed.",
"is_paused": true
}
“””


The ceph health command reports:


“””
[root@naret-monitor01 ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 
2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs 
undersized; Upgrading daemon osd.4 on host naret-osd01 failed.
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.22 on naret-osd01 is in error state
[WRN] OSD_DOWN: 1 osds down
osd.4 (root=default,host=naret-osd01) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 2654362/6721382840 objects 
degraded (0.039%), 14 pgs degraded, 14 pgs undersized
pg 28.88 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1373,1337,1508,852,2147483647,483]
pg 28.528 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1063,793,2147483647,931,338,1777]
pg 28.594 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1208,891,1651,364,2147483647,53]
pg 28.8b4 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [521,1273,1238,138,1539,2147483647]
pg 28.a90 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [237,1665,1836,2147483647,192,1410]
pg 28.ad6 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [870,466,350,885,1601,2147483647]
pg 28.b34 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [920,1596,2147483647,115,201,941]
pg 28.c14 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1389,424,2147483647,268,1646,632]
pg 28.dba is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1099,561,2147483647,1806,1874,1145]
pg 28.ee2 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1621,1904,1044,2147483647,1545,722]
pg 29.163 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1883,2147483647,1509,1697,1187,235]
pg 29.1c1 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [122,1226,962,1254,1215,2147483647]
pg 29.254 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [1782,1839,1545,412,196,2147483647]
pg 29.2a1 is stuck undersized for 6m, current state active+undersized+degraded, 
last acting [370,2147483647,575,1423,1755,446]
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 
failed.
Upgrade daemon: osd.4: cephadm exited with an error code: 1, stderr:Redeploy 
daemon osd.4 ...
Non-zero exit code 1 from systemctl start 
ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4
systemctl: stderr Job for ceph-63334166-d991-11eb-99de-40a6b7210...@osd.4.service

[ceph-users] Upgrade from 16.2.7. to 16.2.11 failing on OSDs

2023-03-30 Thread Lo Re Giuseppe
Dear all,

On one of our clusters I started the upgrade process from 16.2.7 to 16.2.11.
Mon, mgr and crash daemons were done easily/quickly; then, at the first 
attempt to upgrade an OSD container, the upgrade process stopped because the 
OSD process was not able to start after the upgrade.

Does anyone have any hint on how to unblock the upgrade?
Some details below:
Regards,

Giuseppe

I started the upgrade process with the cephadm command:

“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 16.2.11
Initiating upgrade to quay.io/ceph/ceph:v16.2.11
“””

After a short time:

“””
[root@naret-monitor01 ~]# ceph orch upgrade status
{
"target_image": 
quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add,
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [
"crash",
"mon",
"mgr"
],
"progress": "64/2039 daemons upgraded",
"message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host 
naret-osd01 failed.",
"is_paused": true
}
“””

The ceph health command reports:

“””
[root@naret-monitor01 ~]# ceph health  detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 
2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs 
undersized; Upgrading daemon osd.4 on host naret-osd01 failed.
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.22 on naret-osd01 is in error state
[WRN] OSD_DOWN: 1 osds down
osd.4 (root=default,host=naret-osd01) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 2654362/6721382840 objects 
degraded (0.039%), 14 pgs degraded, 14 pgs undersized
pg 28.88 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1373,1337,1508,852,2147483647,483]
pg 28.528 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1063,793,2147483647,931,338,1777]
pg 28.594 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1208,891,1651,364,2147483647,53]
pg 28.8b4 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [521,1273,1238,138,1539,2147483647]
pg 28.a90 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [237,1665,1836,2147483647,192,1410]
pg 28.ad6 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [870,466,350,885,1601,2147483647]
pg 28.b34 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [920,1596,2147483647,115,201,941]
pg 28.c14 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1389,424,2147483647,268,1646,632]
pg 28.dba is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1099,561,2147483647,1806,1874,1145]
pg 28.ee2 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1621,1904,1044,2147483647,1545,722]
pg 29.163 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1883,2147483647,1509,1697,1187,235]
pg 29.1c1 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [122,1226,962,1254,1215,2147483647]
pg 29.254 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [1782,1839,1545,412,196,2147483647]
pg 29.2a1 is stuck undersized for 6m, current state 
active+undersized+degraded, last acting [370,2147483647,575,1423,1755,446]
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 
failed.
Upgrade daemon: osd.4: cephadm exited with an error code: 1, 
stderr:Redeploy daemon osd.4 ...
Non-zero exit code 1 from systemctl start 
ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4
systemctl: stderr Job for 
ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
 failed because a timeout was exceeded.
systemctl: stderr See "systemctl status 
ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service"
 and "journalctl -xe" for details.
Traceback (most recent call last):
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 9248, in 
main()
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 9236, in main
r = ctx.func(ctx)
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2",
 line 1990, in _default_image
return func(ctx)
  File 
"/var/lib/ceph/63334166-d991-11eb-99de-40a6b7

[ceph-users] Mds crash at cscs

2023-01-19 Thread Lo Re Giuseppe
Dear all,

We have started to use CephFS more intensively for some WLCG-related workloads.
We have 3 active MDS instances spread over 3 servers, with 
mds_cache_memory_limit=12G; most of the other configs are defaults.
One of them crashed last night, leaving the log below.
Do you have any hint on what could be the cause and how to avoid it?
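
(For completeness, the same crash should also be visible through the crash
module; a short sketch, where the crash ID is only a placeholder:)
“””
ceph crash ls-new            # recent, not yet archived crashes
ceph crash info <crash-id>   # full backtrace and daemon metadata
“””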

Regards,

Giuseppe

[root@naret-monitor03 ~]# journalctl -u 
ceph-63334166-d991-11eb-99de-40a6b72108d0@mds.cephfs.naret-monitor03.lqppte.service
...
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific >
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  1: /lib64/libpthread.so.0(+0x12ce0) [0x7fe291e4fce0]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  2: abort()
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  3: /lib64/libstdc++.so.6(+0x987ba) [0x7fe2912567ba]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  4: /lib64/libstdc++.so.6(+0x9653c) [0x7fe29125453c]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  5: /lib64/libstdc++.so.6(+0x95559) [0x7fe291253559]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  6: __gxx_personality_v0()
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  7: /lib64/libgcc_s.so.1(+0x10b03) [0x7fe290c34b03]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  8: _Unwind_Resume()
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  9: /usr/bin/ceph-mds(+0x18c104) [0x5638351e7104]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  10: /lib64/libpthread.so.0(+0x12ce0) [0x7fe291e4fce0]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  11: gsignal()
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  12: abort()
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  13: /lib64/libstdc++.so.6(+0x9009b) [0x7fe29124e09b]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  14: /lib64/libstdc++.so.6(+0x9653c) [0x7fe29125453c]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  15: /lib64/libstdc++.so.6(+0x96597) [0x7fe291254597]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  16: /lib64/libstdc++.so.6(+0x967f8) [0x7fe2912547f8]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  17: /lib64/libtcmalloc.so.4(+0x19fa4) [0x7fe29bae6fa4]
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  18: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned int, int, vo>
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  19: (std::shared_ptr > InodeSt>
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  20: (CInode::_decode_base(ceph::buffer::v15_2_0::list::iterator_impl
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  21: (CInode::decode_import(ceph::buffer::v15_2_0::list::iterator_impl
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  22: (Migrator::decode_import_inode(CDentry*, ceph::buffer::v15_2_0::lis>
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  23: (Migrator::decode_import_dir(ceph::buffer::v15_2_0::list::iterator_>
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  24: (Migrator::handle_export_dir(boost::intrusive_ptr>
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  25: (Migrator::dispatch(boost::intrusive_ptr const&)+0x1>
Jan 19 04:49:40 naret-monitor03 
ceph-63334166-d991-11eb-99de-40a6b72108d0-mds-cephfs-naret-monitor03-lqppte[4397]:
  26: (MDSRank::handle_message(boo

[ceph-users] Re: MGR failures and pg autoscaler

2022-10-25 Thread Lo Re Giuseppe
I have found the logs showing the progress module failure:

debug 2022-10-25T05:06:08.877+ 7f40868e7700  0 [rbd_support INFO root] execute_trash_remove: task={"sequence": 150, "id": "fcc864a0-9bde-4512-9f84-be10976613db", "message": "Removing image fulen-hdd/f3f237d2f7e304 from trash", "refs": {"action": "trash remove", "pool_name": "fulen-hdd", "pool_namespace": "", "image_id": "f3f237d2f7e304"}, "in_progress": true, "progress": 0.0}
debug 2022-10-25T05:06:08.884+ 7f4106e90700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'progress' while running on mgr.naret-monitor03.escwyg: ('42efb95d-ceaa-4a91-a9b2-b91f65f1834d',)
debug 2022-10-25T05:06:08.884+ 7f4106e90700 -1 progress.serve:
debug 2022-10-25T05:06:08.897+ 7f4139e96700  0 log_channel(audit) log [DBG] : from='client.22182342 -' entity='client.combin' cmd=[{"format":"json","group_name":"combin","prefix":"fs subvolume info","sub_name":"combin-4b53e28d-2f59-11ed-8aa5-9aa9e2c5aae2","vol_name":"cephfs"}]: dispatch
debug 2022-10-25T05:06:08.884+ 7f4106e90700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/progress/module.py", line 716, in serve
    self._process_pg_summary()
  File "/usr/share/ceph/mgr/progress/module.py", line 629, in _process_pg_summary
    ev = self._events[ev_id]
KeyError: '42efb95d-ceaa-4a91-a9b2-b91f65f1834d'
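
When the progress module gets stuck on a stale event like this, a hedged
recovery sketch (standard commands; whether the tracked events are safe to
discard is an assumption on my part) is:
“””
ceph progress clear   # drop all tracked progress events
ceph mgr fail         # restart the active mgr so the module's serve loop restarts cleanly
“””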




On 25.10.22, 09:58, "Lo Re  Giuseppe"  wrote:

Hi,
Some weeks ago we started to use the pg autoscaler on our pools.
We run version 16.2.7.
Maybe a coincidence, maybe not, but for some weeks we have been experiencing 
mgr progress module failures:

“””
[root@naret-monitor01 ~]# ceph -s
  cluster:
    id:     63334166-d991-11eb-99de-40a6b72108d0
    health: HEALTH_ERR
            Module 'progress' has failed: ('346ee7e0-35f0-4fdf-960e-a36e7e2441e4',)
            1 pool(s) full

  services:
    mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 5d)
    mgr: naret-monitor02.ciqvgv(active, since 6d), standbys: naret-monitor03.escwyg, naret-monitor01.suwugf
    mds: 1/1 daemons up, 2 standby
    osd: 760 osds: 760 up (since 4d), 760 in (since 4d); 10 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   32 pools, 6250 pgs
    objects: 977.79M objects, 3.6 PiB
    usage:   5.7 PiB used, 5.1 PiB / 11 PiB avail
    pgs:     4602612/5990777501 objects misplaced (0.077%)
             6214 active+clean
             25   active+clean+scrubbing+deep
             10   active+remapped+backfilling
             1    active+clean+scrubbing

  io:
    client:   243 MiB/s rd, 292 MiB/s wr, 1.68k op/s rd, 842 op/s wr
    recovery: 430 MiB/s, 109 objects/s

  progress:
    Global Recovery Event (14h)
      [===.] (remaining: 70s)
“””

In the mgr logs I see:
“””

debug 2022-10-20T23:09:03.859+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 2 has overlapping roots: {-60, -1}
debug 2022-10-20T23:09:03.863+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 3 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.866+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 5 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.870+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 6 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.873+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 10 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.877+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 11 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.880+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 12 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.884+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 13 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.887+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 14 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.891+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 15 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.894+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 26 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.898+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 28 has overlapping roots: {-105, -60, -1, -2}

d

[ceph-users] MGR failures and pg autoscaler

2022-10-25 Thread Lo Re Giuseppe
Hi,
Some weeks ago we started to use the pg autoscaler on our pools.
We run version 16.2.7.
Maybe a coincidence, maybe not, but for some weeks we have been experiencing 
mgr progress module failures:

“””
[root@naret-monitor01 ~]# ceph -s
  cluster:
    id:     63334166-d991-11eb-99de-40a6b72108d0
    health: HEALTH_ERR
            Module 'progress' has failed: ('346ee7e0-35f0-4fdf-960e-a36e7e2441e4',)
            1 pool(s) full

  services:
    mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 5d)
    mgr: naret-monitor02.ciqvgv(active, since 6d), standbys: naret-monitor03.escwyg, naret-monitor01.suwugf
    mds: 1/1 daemons up, 2 standby
    osd: 760 osds: 760 up (since 4d), 760 in (since 4d); 10 remapped pgs
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   32 pools, 6250 pgs
    objects: 977.79M objects, 3.6 PiB
    usage:   5.7 PiB used, 5.1 PiB / 11 PiB avail
    pgs:     4602612/5990777501 objects misplaced (0.077%)
             6214 active+clean
             25   active+clean+scrubbing+deep
             10   active+remapped+backfilling
             1    active+clean+scrubbing

  io:
    client:   243 MiB/s rd, 292 MiB/s wr, 1.68k op/s rd, 842 op/s wr
    recovery: 430 MiB/s, 109 objects/s

  progress:
    Global Recovery Event (14h)
      [===.] (remaining: 70s)
“””

In the mgr logs I see:
“””

debug 2022-10-20T23:09:03.859+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 2 has overlapping roots: {-60, -1}
debug 2022-10-20T23:09:03.863+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 3 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.866+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 5 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.870+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 6 has overlapping roots: {-60, -1, -2}
debug 2022-10-20T23:09:03.873+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 10 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.877+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 11 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.880+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 12 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.884+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 13 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.887+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 14 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.891+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 15 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.894+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 26 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.898+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 28 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.901+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 29 has overlapping roots: {-105, -60, -1, -2}
debug 2022-10-20T23:09:03.905+ 7fba5f300700  0 [pg_autoscaler ERROR root] pool 30 has overlapping roots: {-105, -60, -1, -2}

...
“””
Do you have any explanation/fix for these errors?
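
(As a hedged starting point for the "overlapping roots" messages: the
autoscaler complains when pools use CRUSH rules whose roots overlap, for
example a mix of the default root and device-class shadow roots, so the
pool-to-rule-to-root mapping is worth checking:)
“””
ceph osd crush tree --show-shadow    # roots, including device-class shadow roots
ceph osd crush rule dump             # which root/class each rule starts from
for p in $(ceph osd pool ls); do
  echo -n "$p: "; ceph osd pool get "$p" crush_rule
done
“””
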
Regards,

Giuseppe

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting

2022-05-19 Thread Lo Re Giuseppe
Hi Eugen,

After the reboot of the two mgr servers, the mgr service is back to normal 
activity (it is no longer restarting every 3 minutes).
I noticed that the trash purge activity was also stuck; once the mgr service 
became stable, the purge operations resumed as well.
Now I guess I'll have to try the upgrade procedure with cephadm again and see 
whether it starts this time...
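
(If it stalls again, a hedged way to watch what the cephadm module is actually
doing, using the standard troubleshooting switches:)
“””
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug          # follow the orchestrator's debug log live
ceph orch upgrade status
ceph config set mgr mgr/cephadm/log_to_cluster_level info   # reset afterwards
“””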

Giuseppe

On 18.05.22, 14:19, "Eugen Block"  wrote:

Do you see anything suspicious in /var/log/ceph/cephadm.log? Also  
check the mgr logs for any hints.


    Zitat von Lo Re  Giuseppe :

> Hi,
>
> We have happily tested the upgrade from v15.2.16 to v16.2.7 with  
> cephadm on a test cluster made of 3 nodes and everything went  
> smoothly.
> Today we started the very same operation on the production one (20  
> OSD servers, 720 HDDs) and the upgrade process doesn’t do anything  
> at all…
>
> To be more specific, we have issued the command
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
>
> and soon after “ceph -s” reports
>
> Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
>   []
>
> But only for few seconds, after that
>
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum  
> naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
> mgr: naret-monitor01.tvddjv(active, since 60s), standbys:  
> naret-monitor02.btynnb
> mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
> osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
> rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi,  
> cscs-realm.naret-zone.naret-rgw02.pduagk,  
> cscs-realm.naret-zone.naret-rgw03.aqdkkb)
>
>   task status:
>
>   data:
> pools:   30 pools, 16497 pgs
> objects: 833.14M objects, 3.1 PiB
> usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
> pgs: 16460 active+clean
>  37active+clean+scrubbing+deep
>
>   io:
> client:   4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr
>
>   progress:
> Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
>   [] (remaining: 81m)
>
>
>
> The command “ceph orch upgrade status” says:
>
> {
> "target_image": "quay.io/ceph/ceph:v16.2.7",
> "in_progress": true,
> "services_complete": [],
> "message": ""
> }
>
> It doesn’t even pull the container image.
> I have tested that the podman pull command works, I was able to pull  
> quay.io/ceph/ceph:v16.2.7.
>
> “ceph -w” and “ceph -W cephadm” don’t report any activity related to  
> the upgrade.
>
>
> Has anyone seen anything similar?
> Do you have advice on how to understand what is holding the upgrade
> process back from actually starting?
>
> Thanks in advance,
>
> Giuseppe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: S3 and RBD backup

2022-05-19 Thread Lo Re Giuseppe
Hi Sanjeev,

This is something we have started on a test cluster for now; if it proves to 
be robust, we will bring it to production.
We are using the Ceph functionality described at 
https://docs.ceph.com/en/pacific/mgr/nfs/, available starting from Pacific.
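
(For the record, a rough sketch of the flow described in that page. The exact
CLI arguments differ between releases and the names below are placeholders, so
check the built-in help of "ceph nfs export create rgw" on your version:)
“””
ceph nfs cluster create backup-nfs "1 naret-monitor01"   # deploy an NFS (ganesha) service
ceph nfs export create rgw ...                           # export one RGW bucket; arguments vary by release
ceph nfs cluster info backup-nfs                         # endpoint to mount from the backup node
“””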

Best,

Giuseppe

From: Sanjeev Jha 
Date: Thursday, 19 May 2022 at 09:41
To: Lo Re Giuseppe , stéphane chalansonnet 

Cc: "ceph-users@ceph.io" 
Subject: Re: [ceph-users] Re: S3 and RBD backup

Hi Giuseppe,

Thanks for your suggesion.

Could you please elaborate more the term "exporting bucket as NFS share"? How 
you are exporting the bucket? Are you using S3FS for this or some other 
mechanism?

Best regards,
Sanjeev
____________
From: Lo Re Giuseppe 
Sent: Thursday, May 19, 2022 11:45 AM
To: stéphane chalansonnet ; Sanjeev Jha 

Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: S3 and RBD backup

Hi,

We are doing exactly the same: exporting a bucket as an NFS share and running 
our backup software on it to get the data to tape.
Given the data volumes, replication to another S3 disk-based endpoint is not 
viable for us.
Regards,

Giuseppe

On 18.05.22, 23:14, "stéphane chalansonnet"  wrote:

Hello,

In fact, S3 should be replicated to another region or AZ, and backups should
be managed with versioning on the bucket.

But, in our case, we needed to secure the backup of databases (on K8S) into
our external backup solution (EMC Networker).

We implemented Ganesha and created an NFS export pointing to the buckets of
some S3 users.
The NFS export was mounted on the storage backup node and backed up.

Not the simplest solution, but it works ;)

Regards,
Stephane



Le mer. 18 mai 2022 à 22:34, Sanjeev Jha  a écrit :

> Thanks Janne for the information in detail.
>
> We have RHCS 4.2 non-collocated setup in one DC only. There are few RBD
> volumes mapped to MariaDB Database.
> Also, S3 endpoint with bucket is being used to upload objects. There is no
> multisite zone has been implemented yet.
> My Requirement is to take backup of RBD images and database.
> How can S3 bucket backup and restore be possible?
> We are looking for many opensource tool like rclone for S3 and Benji for
> RBD but not able to make sure whether these tools would be enough to
> achieve backup goal.
> Your suggestion based on the above case would be much appreciated.
>
> Best,
> Sanjeev
>
> 
> From: Janne Johansson 
> Sent: Tuesday, May 17, 2022 1:01 PM
> To: Sanjeev Jha 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] S3 and RBD backup
>
> Den mån 16 maj 2022 kl 13:41 skrev Sanjeev Jha :
> > Could someone please let me know how to take S3 and RBD backup from Ceph
> side and possibility to take backup from Client/user side?
> > Which tool should I use for the backup?
>
> Backing data up, or replicating it is a choice between a lot of
> variables and options, and choosing something that has the least
> negative effects for your own environment and your own demands. Some
> options will cause a lot of network traffic, others will use a lot of
> CPU somewhere, others will waste disk on the destination for
> performance reasons and some will have long and complicated restore
> procedures. Some will be realtime copies but those might put extra
> load on the cluster while running, others will be asynchronous but
> might need a database at all times to keep track of what not to copy
> because it is already at the destination. Some synchronous options
> might even cause writes to be slower in order to guarantee that ALL
> copies are in place before sending clients an ACK, some will not and
> those might lose data that the client thought was delivered 100% ok.
>
> Without knowing what your demands are, or knowing what situation and
> environment you are in, it will be almost impossible to match the
> above into something that is good for you.
> Some might have a monetary cost, some may require a complete second
> cluster of equal size, some might have a cost in terms of setup work
> from clueful ceph admins that will take a certain amount of time and
> effort. Some options might require clients to change how they write
> data into the cluster in order to help the backup/replication system.
>
> There is unfortunately not a single best choice for all clusters,
> there might even not exist a good option just to cover both S3 and RBD
> since they are inherently very different.
> RBD will almost certainly be only full restores of a large complete
> image, S3 users might want to have the object
> foo

[ceph-users] Re: Upgrade from v15.2.16 to v16.2.7 not starting

2022-05-18 Thread Lo Re Giuseppe

Hi,

I didn't notice anything suspicious in the mgr logs, nor in the cephadm.log 
(attaching an extract of the latest).
What I have noticed is that the active mgr container gets restarted about 
every 3 minutes (as reported by ceph -w):
"""
2022-05-18T15:30:49.883238+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:30:49.889294+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:30:50.832200+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
2022-05-18T15:34:16.979735+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:34:16.985531+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:34:18.246784+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
2022-05-18T15:37:34.576159+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:37:34.582935+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:37:35.821200+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
2022-05-18T15:40:00.000148+0200 mon.naret-monitor01 [INF] overall HEALTH_OK
2022-05-18T15:40:52.456182+0200 mon.naret-monitor01 [INF] Active manager daemon 
naret-monitor01.tvddjv restarted
2022-05-18T15:40:52.461826+0200 mon.naret-monitor01 [INF] Activating manager 
daemon naret-monitor01.tvddjv
2022-05-18T15:40:53.787353+0200 mon.naret-monitor01 [INF] Manager daemon 
naret-monitor01.tvddjv is now available
"""
I'm also attaching the active mgr process logs.
The cluster is working fine, but I wonder whether this mgr/cephadm behaviour 
is itself wrong and might be causing the upgrade to stall.
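
(A hedged way to check why the active mgr keeps restarting; the unit name
below follows the naming pattern visible elsewhere in this thread:)
“””
ceph crash ls          # is the mgr actually crashing, or just being restarted?
journalctl -u ceph-63334166-d991-11eb-99de-40a6b72108d0@mgr.naret-monitor01.tvddjv.service --since "1 hour ago"
“””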

Thanks,

Giuseppe 
 

On 18.05.22, 14:19, "Eugen Block"  wrote:

Do you see anything suspicious in /var/log/ceph/cephadm.log? Also  
check the mgr logs for any hints.


Zitat von Lo Re  Giuseppe :

> Hi,
>
> We have happily tested the upgrade from v15.2.16 to v16.2.7 with  
> cephadm on a test cluster made of 3 nodes and everything went  
> smoothly.
> Today we started the very same operation on the production one (20  
> OSD servers, 720 HDDs) and the upgrade process doesn’t do anything  
> at all…
>
> To be more specific, we have issued the command
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
>
> and soon after “ceph -s” reports
>
> Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
>   []
>
> But only for few seconds, after that
>
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum  
> naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
> mgr: naret-monitor01.tvddjv(active, since 60s), standbys:  
> naret-monitor02.btynnb
> mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
> osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
> rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi,  
> cscs-realm.naret-zone.naret-rgw02.pduagk,  
> cscs-realm.naret-zone.naret-rgw03.aqdkkb)
>
>   task status:
>
>   data:
> pools:   30 pools, 16497 pgs
> objects: 833.14M objects, 3.1 PiB
> usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
> pgs: 16460 active+clean
>  37active+clean+scrubbing+deep
>
>   io:
> client:   4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr
>
>   progress:
> Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
>   [] (remaining: 81m)
>
>
>
> The command “ceph orch upgrade status” says:
>
> {
> "target_image": "quay.io/ceph/ceph:v16.2.7",
> "in_progress": true,
> "services_complete": [],
> "message": ""
> }
>
> It doesn’t even pull the container image.
> I have tested that the podman pull command works, I was able to pull  
> quay.io/ceph/ceph:v16.2.7.
>
> “ceph -w” and “ceph -W cephadm” don’t report any activity related to  
> the upgrade.
>
>
> Has anyone seen anything similar?
> Do you have advice on how to understand what is holding the upgrade
> process back from actually starting?
>
> Thanks in advance,
>
> Giuseppe

[ceph-users] Re: S3 and RBD backup

2022-05-18 Thread Lo Re Giuseppe
Hi,

We are doing exactly the same: exporting a bucket as an NFS share and running 
our backup software on it to get the data to tape.
Given the data volumes, replication to another S3 disk-based endpoint is not 
viable for us.
Regards,

Giuseppe

On 18.05.22, 23:14, "stéphane chalansonnet"  wrote:

Hello,

In fact, S3 should be replicated to another region or AZ, and backups should
be managed with versioning on the bucket.

But, in our case, we needed to secure the backup of databases (on K8S) into
our external backup solution (EMC Networker).

We implemented Ganesha and created an NFS export pointing to the buckets of
some S3 users.
The NFS export was mounted on the storage backup node and backed up.

Not the simplest solution, but it works ;)

Regards,
Stephane



Le mer. 18 mai 2022 à 22:34, Sanjeev Jha  a écrit :

> Thanks Janne for the information in detail.
>
> We have RHCS 4.2 non-collocated setup in one DC only. There are few RBD
> volumes mapped to MariaDB Database.
> Also, S3 endpoint with bucket is being used to upload objects. There is no
> multisite zone has been implemented yet.
> My Requirement is to take backup of RBD images and database.
> How can S3 bucket backup and restore be possible?
> We are looking for many opensource tool like rclone for S3 and Benji for
> RBD but not able to make sure whether these tools would be enough to
> achieve backup goal.
> Your suggestion based on the above case would be much appreciated.
>
> Best,
> Sanjeev
>
> 
> From: Janne Johansson 
> Sent: Tuesday, May 17, 2022 1:01 PM
> To: Sanjeev Jha 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] S3 and RBD backup
>
> Den mån 16 maj 2022 kl 13:41 skrev Sanjeev Jha :
> > Could someone please let me know how to take S3 and RBD backup from Ceph
> side and possibility to take backup from Client/user side?
> > Which tool should I use for the backup?
>
> Backing data up, or replicating it is a choice between a lot of
> variables and options, and choosing something that has the least
> negative effects for your own environment and your own demands. Some
> options will cause a lot of network traffic, others will use a lot of
> CPU somewhere, others will waste disk on the destination for
> performance reasons and some will have long and complicated restore
> procedures. Some will be realtime copies but those might put extra
> load on the cluster while running, others will be asynchronous but
> might need a database at all times to keep track of what not to copy
> because it is already at the destination. Some synchronous options
> might even cause writes to be slower in order to guarantee that ALL
> copies are in place before sending clients an ACK, some will not and
> those might lose data that the client thought was delivered 100% ok.
>
> Without knowing what your demands are, or knowing what situation and
> environment you are in, it will be almost impossible to match the
> above into something that is good for you.
> Some might have a monetary cost, some may require a complete second
> cluster of equal size, some might have a cost in terms of setup work
> from clueful ceph admins that will take a certain amount of time and
> effort. Some options might require clients to change how they write
> data into the cluster in order to help the backup/replication system.
>
> There is unfortunately not a single best choice for all clusters,
> there might even not exist a good option just to cover both S3 and RBD
> since they are inherently very different.
> RBD will almost certainly be only full restores of a large complete
> image, S3 users might want to have the object
> foo/bar/MyImportantWriting.doc from last wednesday back only and not
> revert the whole bucket or the whole S3 setup.
>
> I'm quite certain that there will not be a single
> cheap,fast,efficient,scalable,unnoticeable,easy solution that solves
> all these problems at once, but rather you will have to focus on what
> the toughest limitations are (money, time, disk, rackspace, network
> capacity, client and IO demands?) and look for solutions (or products)
> that work well with those restrictions.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Upgrade from v15.2.16 to v16.2.7 not starting

2022-05-18 Thread Lo Re Giuseppe
Hi,

We have happily tested the upgrade from v15.2.16 to v16.2.7 with cephadm on a 
test cluster made of 3 nodes and everything went smoothly.
Today we started the very same operation on the production one (20 OSD servers, 
720 HDDs) and the upgrade process doesn’t do anything at all…

To be more specific, we have issued the command

ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7

and soon after “ceph -s” reports

Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
  []

But only for few seconds, after that

[root@naret-monitor01 ~]# ceph -s
  cluster:
id: 63334166-d991-11eb-99de-40a6b72108d0
health: HEALTH_OK

  services:
mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 
7d)
mgr: naret-monitor01.tvddjv(active, since 60s), standbys: 
naret-monitor02.btynnb
mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi, 
cscs-realm.naret-zone.naret-rgw02.pduagk, 
cscs-realm.naret-zone.naret-rgw03.aqdkkb)

  task status:

  data:
pools:   30 pools, 16497 pgs
objects: 833.14M objects, 3.1 PiB
usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
pgs: 16460 active+clean
 37active+clean+scrubbing+deep

  io:
client:   4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr

  progress:
Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
  [] (remaining: 81m)



The command “ceph orch upgrade status” says:

{
"target_image": "quay.io/ceph/ceph:v16.2.7",
"in_progress": true,
"services_complete": [],
"message": ""
}

It doesn’t even pull the container image.
I have tested that the podman pull command works, I was able to pull 
quay.io/ceph/ceph:v16.2.7.

“ceph -w” and “ceph -W cephadm” don’t report any activity related to the 
upgrade.


Has anyone seen anything similar?
Do you have advice on how to understand what is holding the upgrade process 
back from actually starting?

Thanks in advance,

Giuseppe
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD map issue

2022-02-14 Thread Lo Re Giuseppe
Unfortunately, nothing.
Is there any way to make it more verbose?
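
(One hedged way to get more detail out of the kernel client itself: the
standard dynamic-debug switches for the rbd/libceph modules, assuming debugfs
is mounted and the kernel has dynamic debug enabled:)
“””
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module rbd +p'     > /sys/kernel/debug/dynamic_debug/control
rbd -c ceph.conf --id fulen --keyring client.fulen.keyring map fulen-nvme-meta/test-loreg-3
dmesg -T | tail -n 50    # the refused operation should now show up here
echo 'module libceph -p' > /sys/kernel/debug/dynamic_debug/control
echo 'module rbd -p'     > /sys/kernel/debug/dynamic_debug/control
“””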

On 14.02.22, 11:48, "Eugen Block"  wrote:

What does 'dmesg' reveal?


    Zitat von Lo Re  Giuseppe :

> root@fulen-w006:~# ll client.fulen.keyring
> -rw-r--r-- 1 root root 69 Feb 11 15:30 client.fulen.keyring
> root@fulen-w006:~# ll ceph.conf
> -rw-r--r-- 1 root root 118 Feb 11 19:15 ceph.conf
> root@fulen-w006:~# rbd -c ceph.conf --id fulen --keyring  
> client.fulen.keyring map fulen-nvme-meta/test-loreg-3
> rbd: sysfs write failed
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (1) Operation not permitted
>
> where test-loreg-3 is an image in a EC pool.
>
> root@fulen-w006:~# rbd -c ceph.conf --id fulen --keyring  
> client.fulen.keyring info fulen-nvme-meta/test-loreg-3
> rbd image 'test-loreg-3':
>   size 1 GiB in 256 objects
>   order 22 (4 MiB objects)
>   snapshot_count: 0
>   id: e2375bbddf414a
>   data_pool: fulen-hdd-data
>   block_name_prefix: rbd_data.36.e2375bbddf414a
>   format: 2
>   features: layering, exclusive-lock, data-pool
>   op_features:
>   flags:
>   create_timestamp: Thu Feb 10 18:17:42 2022
>   access_timestamp: Thu Feb 10 18:17:42 2022
>   modify_timestamp: Thu Feb 10 18:17:42 2022
>
> Giuseppe
>
> On 11.02.22, 14:52, "Eugen Block"  wrote:
>
> How are the permissions of the client keyring on both systems?
>
> Zitat von Lo Re  Giuseppe :
>
> > Hi,
> >
> > It's a single ceph cluster, I'm testing from 2 different client 
nodes.
> > The caps are below.
> > I think is unlikely that caps are the cause as they work from one
> > client node, same ceph user, and not from the other one...
> >
> > Cheers,
> >
> > Giuseppe
> >
> >
> > [root@naret-monitor01 ~]# ceph auth get client.fulen
> > exported keyring for client.fulen
> > [client.fulen]
> > key = 
> > caps mgr = "profile rbd pool=fulen-hdd, profile rbd
> > pool=fulen-nvme, profile rbd pool=fulen-dcache, profile rbd
> > pool=fulen-dcache-data, profile rbd pool=fulen-dcache-meta, profile
> > rbd pool=fulen-hdd-data, profile rbd pool=fulen-nvme-meta"
> > caps mon = "profile rbd"
> > caps osd = "profile rbd pool=fulen-hdd, profile rbd
> > pool=fulen-nvme, profile rbd pool=fulen-dcache, profile rbd
> > pool=fulen-dcache-data, profile rbd pool=fulen-dcache-meta, profile
> > rbd pool=fulen-hdd-data, profile rbd pool=fulen-nvme-meta"
> >
> >
> >
> > On 11.02.22, 13:22, "Eugen Block"  wrote:
> >
    > > Hi,
> >
> > the first thing coming to mind are the user's caps. Which  
> permissions
> > do they have? Have you compared 'ceph auth get  
> client.fulen' on both
> > clusters? Please paste the output from both clusters and redact
> > sensitive information.
> >
> >
> > Zitat von Lo Re  Giuseppe :
> >
> > > Hi all,
> > >
> > > This is my first post to this user group, I’m not a ceph 
expert,
> > > sorry if I say/ask anything trivial.
> > >
> > > On a Kubernetes cluster I have an issue in creating  
> volumes from a
> > > (csi) ceph EC pool.
> > >
> > > I can reproduce the problem from rbd cli like this from  
> one of the
> > > k8s worker nodes:
> > >
> > > “””
> > > root@fulen-w006:~# ceph -v
> > > ceph version 15.2.14 
(cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be)
> > > octopus (stable)
> > >
> > > root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen  
> --keyfile
> > > key create test-loreg-3 --data-pool fulen-hdd-data --pool
> > > fulen-nvme-meta --size 1G
> > >
> > > root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen  
> --keyfile
> > >

[ceph-users] Re: RBD map issue

2022-02-11 Thread Lo Re Giuseppe
root@fulen-w006:~# ll client.fulen.keyring
-rw-r--r-- 1 root root 69 Feb 11 15:30 client.fulen.keyring
root@fulen-w006:~# ll ceph.conf
-rw-r--r-- 1 root root 118 Feb 11 19:15 ceph.conf
root@fulen-w006:~# rbd -c ceph.conf --id fulen --keyring client.fulen.keyring 
map fulen-nvme-meta/test-loreg-3
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted

where test-loreg-3 is an image in a EC pool.

root@fulen-w006:~# rbd -c ceph.conf --id fulen --keyring client.fulen.keyring 
info fulen-nvme-meta/test-loreg-3
rbd image 'test-loreg-3':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: e2375bbddf414a
data_pool: fulen-hdd-data
block_name_prefix: rbd_data.36.e2375bbddf414a
format: 2
features: layering, exclusive-lock, data-pool
op_features:
flags:
create_timestamp: Thu Feb 10 18:17:42 2022
access_timestamp: Thu Feb 10 18:17:42 2022
modify_timestamp: Thu Feb 10 18:17:42 2022

Giuseppe 

On 11.02.22, 14:52, "Eugen Block"  wrote:

How are the permissions of the client keyring on both systems?

Zitat von Lo Re  Giuseppe :

> Hi,
>
> It's a single ceph cluster, I'm testing from 2 different client nodes.
> The caps are below.
> I think is unlikely that caps are the cause as they work from one  
> client node, same ceph user, and not from the other one...
>
> Cheers,
>
> Giuseppe
>
>
> [root@naret-monitor01 ~]# ceph auth get client.fulen
> exported keyring for client.fulen
> [client.fulen]
>   key = 
>   caps mgr = "profile rbd pool=fulen-hdd, profile rbd  
> pool=fulen-nvme, profile rbd pool=fulen-dcache, profile rbd  
> pool=fulen-dcache-data, profile rbd pool=fulen-dcache-meta, profile  
> rbd pool=fulen-hdd-data, profile rbd pool=fulen-nvme-meta"
>   caps mon = "profile rbd"
>   caps osd = "profile rbd pool=fulen-hdd, profile rbd  
> pool=fulen-nvme, profile rbd pool=fulen-dcache, profile rbd  
> pool=fulen-dcache-data, profile rbd pool=fulen-dcache-meta, profile  
> rbd pool=fulen-hdd-data, profile rbd pool=fulen-nvme-meta"
>
>
>
> On 11.02.22, 13:22, "Eugen Block"  wrote:
>
> Hi,
>
> the first thing coming to mind are the user's caps. Which permissions
> do they have? Have you compared 'ceph auth get client.fulen' on both
> clusters? Please paste the output from both clusters and redact
> sensitive information.
>
>
> Zitat von Lo Re  Giuseppe :
>
> > Hi all,
> >
> > This is my first post to this user group, I’m not a ceph expert,
> > sorry if I say/ask anything trivial.
> >
> > On a Kubernetes cluster I have an issue in creating volumes from a
> > (csi) ceph EC pool.
> >
> > I can reproduce the problem from rbd cli like this from one of the
> > k8s worker nodes:
> >
> > “””
> > root@fulen-w006:~# ceph -v
> > ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be)
> > octopus (stable)
> >
> > root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen --keyfile
> > key create test-loreg-3 --data-pool fulen-hdd-data --pool
> > fulen-nvme-meta --size 1G
> >
> > root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen --keyfile
> > key map fulen-nvme-meta/test-loreg-3
> > ...
> > rbd: sysfs write failed
> > ...
> > In some cases useful info is found in syslog - try "dmesg | tail".
> > rbd: map failed: (1) Operation not permitted
> > “””
> >
> > The same sequence of operations works on a different node (not part
> > of the k8s cluster, completely different setup):
> >
> > “””
> > root@storagesmw: # ceph -v
> > ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2)
> > octopus (stable)
> >
> > root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile
> > client.fulen.key create test-loreg-4 --data-pool fulen-hdd-data
> > --pool fulen-nvme-meta --size 1G
> >
> > root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile
> > client.fulen.key info fulen-

[ceph-users] Re: RBD map issue

2022-02-11 Thread Lo Re Giuseppe
Hi,

It's a single ceph cluster; I'm testing from 2 different client nodes.
The caps are below.
I think it is unlikely that the caps are the cause, as they work from one
client node, with the same ceph user, and not from the other one...

Cheers,

Giuseppe


[root@naret-monitor01 ~]# ceph auth get client.fulen
exported keyring for client.fulen
[client.fulen]
key = 
caps mgr = "profile rbd pool=fulen-hdd, profile rbd pool=fulen-nvme, 
profile rbd pool=fulen-dcache, profile rbd pool=fulen-dcache-data, profile rbd 
pool=fulen-dcache-meta, profile rbd pool=fulen-hdd-data, profile rbd 
pool=fulen-nvme-meta"
caps mon = "profile rbd"
caps osd = "profile rbd pool=fulen-hdd, profile rbd pool=fulen-nvme, 
profile rbd pool=fulen-dcache, profile rbd pool=fulen-dcache-data, profile rbd 
pool=fulen-dcache-meta, profile rbd pool=fulen-hdd-data, profile rbd 
pool=fulen-nvme-meta"
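
If one of those pools were ever missing, the caps could be extended with
something like the following (a sketch only, not something I ran; note that
'ceph auth caps' replaces the whole cap set, so every pool has to be listed
again):

# Sketch: re-grant caps, abbreviated here to the two pools used by the test
# image; in practice all pools from the listing above must be repeated.
ceph auth caps client.fulen \
  mon 'profile rbd' \
  mgr 'profile rbd pool=fulen-nvme-meta, profile rbd pool=fulen-hdd-data' \
  osd 'profile rbd pool=fulen-nvme-meta, profile rbd pool=fulen-hdd-data'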



On 11.02.22, 13:22, "Eugen Block"  wrote:

Hi,

the first thing coming to mind are the user's caps. Which permissions  
do they have? Have you compared 'ceph auth get client.fulen' on both  
clusters? Please paste the output from both clusters and redact  
sensitive information.


Quoting Lo Re Giuseppe:

> Hi all,
>
> This is my first post to this user group, I’m not a ceph expert,  
> sorry if I say/ask anything trivial.
>
> On a Kubernetes cluster I have an issue creating volumes from a
> (CSI) ceph EC pool.
>
> I can reproduce the problem with the rbd CLI from one of the
> k8s worker nodes, like this:
>
> “””
> root@fulen-w006:~# ceph -v
> ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be)  
> octopus (stable)
>
> root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> key create test-loreg-3 --data-pool fulen-hdd-data --pool  
> fulen-nvme-meta --size 1G
>
> root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> key map fulen-nvme-meta/test-loreg-3
> ...
> rbd: sysfs write failed
> ...
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (1) Operation not permitted
> “””
>
> The same sequence of operations works on a different node (not part  
> of the k8s cluster, completely different setup):
>
> “””
> root@storagesmw: # ceph -v
> ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2)  
> octopus (stable)
>
> root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> client.fulen.key create test-loreg-4 --data-pool fulen-hdd-data  
> --pool fulen-nvme-meta --size 1G
>
> root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> client.fulen.key info fulen-nvme-meta/test-loreg-4
> rbd image 'test-loreg-4':
> size 1 GiB in 256 objects
> order 22 (4 MiB objects)
> snapshot_count: 0
> id: cafc436ff3573
> data_pool: fulen-hdd-data
> block_name_prefix: rbd_data.36.cafc436ff3573
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff,  
> deep-flatten, data-pool
> op_features:
> flags:
> create_timestamp: Thu Feb 10 18:23:26 2022
> access_timestamp: Thu Feb 10 18:23:26 2022
> modify_timestamp: Thu Feb 10 18:23:26 2022
>
> root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> client.fulen.key map fulen-nvme-meta/test-loreg-4
> RBD image feature set mismatch. You can disable features unsupported  
> by the kernel with "rbd feature disable fulen-nvme-meta/test-loreg-4  
> object-map fast-diff deep-flatten".
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (6) No such device or address
>
> root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> client.fulen.key feature disable fulen-nvme-meta/test-loreg-4  
> object-map fast-diff deep-flatten
>
> root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile  
> client.fulen.key map fulen-nvme-meta/test-loreg-4
> /dev/rbd0
> “””
>
> > The two nodes' OS releases and kernels are below.
>
> Does anyone have any advice on how to debug this?
>
> Thanks in advance,
>
> Giuseppe
>
> == Fulen-w006:
>
> root@fulen-w006:~# cat /etc/os-release
> NAME="Ubuntu"
> VERSION="20.04.3 

[ceph-users] RBD map issue

2022-02-11 Thread Lo Re Giuseppe
Hi all,

This is my first post to this user group, I’m not a ceph expert, sorry if I 
say/ask anything trivial.

On a Kubernetes cluster I have an issue creating volumes from a (CSI) ceph
EC pool.

I can reproduce the problem with the rbd CLI from one of the k8s worker
nodes, like this:

“””
root@fulen-w006:~# ceph -v
ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)

root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen --keyfile key create 
test-loreg-3 --data-pool fulen-hdd-data --pool fulen-nvme-meta --size 1G

root@fulen-w006:~# rbd -m 148.187.20.141:6789 --id fulen --keyfile key map 
fulen-nvme-meta/test-loreg-3
...
rbd: sysfs write failed
...
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (1) Operation not permitted
“””

The same sequence of operations works on a different node (not part of the k8s 
cluster, completely different setup):

“””
root@storagesmw: # ceph -v
ceph version 15.2.13 (c44bc49e7a57a87d84dfff2a077a2058aa2172e2) octopus (stable)

root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile 
client.fulen.key create test-loreg-4 --data-pool fulen-hdd-data --pool 
fulen-nvme-meta --size 1G

root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile 
client.fulen.key info fulen-nvme-meta/test-loreg-4
rbd image 'test-loreg-4':
size 1 GiB in 256 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: cafc436ff3573
data_pool: fulen-hdd-data
block_name_prefix: rbd_data.36.cafc436ff3573
format: 2
features: layering, exclusive-lock, object-map, fast-diff, 
deep-flatten, data-pool
op_features:
flags:
create_timestamp: Thu Feb 10 18:23:26 2022
access_timestamp: Thu Feb 10 18:23:26 2022
modify_timestamp: Thu Feb 10 18:23:26 2022

root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile 
client.fulen.key map fulen-nvme-meta/test-loreg-4
RBD image feature set mismatch. You can disable features unsupported by the 
kernel with "rbd feature disable fulen-nvme-meta/test-loreg-4 object-map 
fast-diff deep-flatten".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address

root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile 
client.fulen.key feature disable fulen-nvme-meta/test-loreg-4 object-map 
fast-diff deep-flatten

root@storagesmw: # rbd -m 148.187.20.141:6789 --id fulen --keyfile 
client.fulen.key map fulen-nvme-meta/test-loreg-4
/dev/rbd0
“””
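
For completeness, the feature-disable step can be avoided by creating the
image with only the features that the older kernel supports, for example (a
sketch with a hypothetical image name, not a command I ran):

# Sketch: create the image with only krbd-friendly features, so no
# "rbd feature disable" is needed before mapping; --data-pool still places
# the data objects in the EC pool.
rbd -m 148.187.20.141:6789 --id fulen --keyfile client.fulen.key \
  create fulen-nvme-meta/test-loreg-5 --size 1G \
  --data-pool fulen-hdd-data \
  --image-feature layering --image-feature exclusive-lock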

The two nodes' OS releases and kernels are below.

Does anyone have any advice on how to debug this?

Thanks in advance,

Giuseppe

== Fulen-w006:

root@fulen-w006:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/";
SUPPORT_URL="https://help.ubuntu.com/";
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/";
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy";
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root@fulen-w006:~# uname -a
Linux fulen-w006.cscs.ch 5.4.0-96-generic #109-Ubuntu SMP Wed Jan 12 16:49:16 
UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Storagesmw:

root@storagesmw:~/loreg/ceph_conf # cat /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/";
BUG_REPORT_URL="https://bugzilla.redhat.com/";

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"
root@storagesmw:~/loreg/ceph_conf # uname -a
Linux storagesmw.cscs.ch 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 11 19:12:04 
EDT 2020 x86_64 x86_64 x86_64 GNU/Linux
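
The two kernels differ a lot (5.4 vs 3.10), which is consistent with the
feature mismatch seen on storagesmw; for the "Operation not permitted" on
fulen-w006 the kernel usually logs a more concrete reason, so right after a
failed map I would also check something like (sketch):

# The rbd/libceph kernel modules log why a map was rejected (e.g. auth or
# cap problems) to the kernel log.
dmesg -T | grep -iE 'libceph|rbd' | tail -n 20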





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io