[ceph-users] Re: Phantom host

2024-06-21 Thread Adam King
I don't remember how tightly the dashboard is connected to the orchestrator in
pacific, but the only thing I can think to do here is restart it
(ceph mgr module disable dashboard, then ceph mgr module enable dashboard). You
could also fail over the mgr entirely (ceph mgr fail), although that might
change the URL you need for the dashboard by changing where the active mgr
is.
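
Something like this, assuming the stale entry just needs the module/mgr to
reload its view (re-check what the cluster still reports afterwards):

ceph mgr module disable dashboard
ceph mgr module enable dashboard
# or, more drastically, fail over to a standby mgr
ceph mgr fail
# then re-check what is still being reported for the phantom host
ceph orch host ls
ceph health detail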

On Fri, Jun 21, 2024 at 10:14 AM Tim Holloway  wrote:

> Ceph Pacific
>
> Thanks to some misplaced thumbs-on-keyboard, I inadvertently managed to
> alias a non-ceph system's ip as a ceph host and ceph adopted it
> somehow.
>
> I fixed the fat-fingered IP, and have gone through the usual motions to
> delete a host, but some parts of the ceph ecosystem haven't caught up.
>
> The host no longer shows in "ceph orch host ls", but on the web control
> panel, it's still there and thinks it has an OSD attached. Ditto for
> the "ceph health detail". On the other hand, the webapp shows not one,
> but THREE OSD's associated with the phantom host on the dashboard/hosts
> detail expansion. It's claiming to own OSD daemons that are actually on
> other machines.
>
> Any assistance would be much appreciated!
>
> Tim
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph rgw zone create fails EINVAL

2024-06-19 Thread Adam King
I think this is at least partially a code bug in the rgw module. Where it's
actually failing in the traceback is while generating the return message for
the user at the end: it assumes `created_zones` will always be a list of
strings, and that seems not to be the case in any error scenario. That caused
this error to be printed instead of the actual underlying failure. From what I
can see in the code, one of three error conditions happened:

1) It may have actually failed parsing the spec. You may be able to test this
by using `ceph orch apply` with the --dry-run flag instead, which doesn't
actually create daemons but should still have to parse the spec.

2) If neither the inbuf (-i <file>) nor a zone_name and realm_token are
provided, it errors out. That one shouldn't be the case here since you used -i
in your command.

3) It actually failed creating the zone. Unfortunately, there seems to be a
whole bunch of reasons it could fail doing so, as it does much more than just
create the zone. From what I can see, this can fail if there is no realm token
provided, there is no zone provided, the zone already exists, the token
contains no endpoint, secret, access key or realm name, it fails to pull the
realm (radosgw-admin realm pull --url <endpoint> --access-key <access_key>
--secret <secret>), it fails to find the master zonegroup (using a
`radosgw-admin zonegroup get ...` command with the realm from the spec passed
in), it fails actually creating the zone (using a `radosgw-admin zone create`
command with the --master, --access-key, --secret and --endpoints params all
filled in), or it fails updating the period after creating the zone.

I wish I knew this module a bit better so I could provide something more
useful than a massive list of potential failure causes, but unfortunately I do
not. I am at least going to create a patch to fix the error handling issue
here though.
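
If it helps to narrow that down, a rough way to probe cases 1 and 3 by hand;
the angle-bracket values are placeholders from your realm token, not anything
I know:

# 1) check that the spec itself parses (no daemons are created with --dry-run)
ceph orch apply -i /root/rgw_secondary.yaml --dry-run
# 3) walk the underlying steps the module runs, using values from the token
radosgw-admin realm pull --url <endpoint> --access-key <access_key> --secret <secret>
radosgw-admin zonegroup get --rgw-realm <realm>
radosgw-admin zone create --rgw-zone codfw --access-key <access_key> --secret <secret> --endpoints <endpoints>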

On Wed, Jun 19, 2024 at 2:16 PM Matthew Vernon 
wrote:

> Hi,
>
> I'm running cephadm/reef 18.2.2. I'm trying to set up multisite.
>
> I created realm/zonegroup/master zone OK (I think!), edited the
> zonegroup json to include hostnames. I have this spec file for the
> secondary zone:
>
> rgw_zone: codfw
> rgw_realm_token: "SECRET"
> placement:
>label: "rgw"
>
> [I get "SECRET" by running ceph rgw realm tokens on the master and copying
> the field labelled "token"]
>
> If I then try and apply this with:
> ceph rgw zone create -i /root/rgw_secondary.yaml
>
> It doesn't work, and I get an unhelpful backtrace:
> Error EINVAL: Traceback (most recent call last):
>    File "/usr/share/ceph/mgr/mgr_module.py", line 1811, in _handle_command
>      return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
>    File "/usr/share/ceph/mgr/mgr_module.py", line 474, in call
>      return self.func(mgr, **kwargs)
>    File "/usr/share/ceph/mgr/rgw/module.py", line 96, in wrapper
>      return func(self, *args, **kwargs)
>    File "/usr/share/ceph/mgr/rgw/module.py", line 304, in _cmd_rgw_zone_create
>      return HandleCommandResult(retval=0, stdout=f"Zones {', '.join(created_zones)} created successfully")
>
> TypeError: sequence item 0: expected str instance, int found
>
> I assume I've messed up the spec file, but it looks like the one in the
> docs[0]. Can anyone point me in the right direction, please?
>
> [if the underlying command emits anything useful, I can't find it in the
> logs]
>
> Thanks,
>
> Matthew
>
> [0] https://docs.ceph.com/en/reef/mgr/rgw/#realm-credentials-token
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm basic questions: image config, OS reimages

2024-05-16 Thread Adam King
At least for the current up-to-date reef branch (not sure what reef version
you're on), when --image is not provided to the shell it should try to
infer the image in this order (a rough sketch of the first two follows the list):

   1. from the CEPHADM_IMAGE env. variable
   2. if you pass --name with a daemon name to the shell command, it will
   try to get the image that daemon uses
   3. next it tries to find the image being used by any ceph container on
   the host
   4. The most recently built ceph image it can find on the host (by
   CreatedAt metadata field for the image)
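
As a quick sketch of the first two options (the image tag and daemon name here
are just examples, not anything cephadm requires):

# 1. point the shell at an explicit image via the env variable
CEPHADM_IMAGE=quay.io/ceph/ceph:v18.2.2 cephadm shell
# 2. infer the image from a specific daemon's container
cephadm shell --name mon.ceph1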


There is a `ceph cephadm osd activate <host>` command that is meant to do
something similar in terms of OSD activation. If I'm being honest I haven't
looked at it in some time, but it does have some CI test coverage via
https://github.com/ceph/ceph/blob/main/qa/suites/orch/cephadm/osds/2-ops/rmdir-reactivate.yaml
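
For the reimage question specifically, the (untested) flow I'd expect is to
re-add the host and then reactivate the existing OSDs, roughly:

ceph orch host add <reimaged-host> <ip>
ceph cephadm osd activate <reimaged-host>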

On Thu, May 16, 2024 at 11:45 AM Matthew Vernon 
wrote:

> Hi,
>
> I've some experience with Ceph, but haven't used cephadm much before,
> and am trying to configure a pair of reef clusters with cephadm. A
> couple of newbie questions, if I may:
>
> * cephadm shell image
>
> I'm in an isolated environment, so pulling from a local repository. I
> bootstrapped OK with
> cephadm --image docker-registry.wikimedia.org/ceph bootstrap ...
>
> And that worked nicely, but if I want to run cephadm shell (to do any
> sort of admin), then I have to specify
> cephadm --image docker-registry.wikimedia.org/ceph shell
>
> (otherwise it just hangs failing to talk to quay.io).
>
> I found the docs, which refer to setting lots of other images, but not
> the one that cephadm uses:
>
> https://docs.ceph.com/en/reef/cephadm/install/#deployment-in-an-isolated-environment
>
> I found an old tracker in this area: https://tracker.ceph.com/issues/47274
>
> ...but is there a good way to arrange for cephadm to use the
> already-downloaded image without having to remember to specify --image
> each time?
>
> * OS reimages
>
> We do OS upgrades by reimaging the server (which doesn't touch the
> storage disks); on an old-style deployment you could then use
> ceph-volume to re-start the OSDs and away you went; how does one do this
> in a cephadm cluster?
> [I presume involves telling cephadm to download a new image for podman
> to use and suchlike]
>
> Would the process be smoother if we arranged to leave /var/lib/ceph
> intact between reimages?
>
> Thanks,
>
> Matthew
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT meeting notes May 6th 2024

2024-05-06 Thread Adam King
   - DigitalOcean credits
      - things to ask
         - what would promotional material require
         - how much are credits worth
      - Neha to ask
   - 19.1.0 centos9 container status
      - close to being ready
      - will be building centos 8 and 9 containers simultaneously
      - should test on orch and upgrade suites before publishing RC
      - should first RC be tested on LRC?
         - skip first RC on LRC
      - performance differences to 18.2.2 with cephfs being investigated
   - 18.2.3
      - fix for https://tracker.ceph.com/issues/65733 almost ready
      - will upgrade LRC to version with fix
      - will happen before 19.1.0 most likely
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph recipe for nfs exports

2024-04-24 Thread Adam King
> - Although I can mount the export I can't write on it

What error are you getting trying to do the write? The way you set things
up doesn't look too different from one of our integration tests for ingress
over nfs
(https://github.com/ceph/ceph/blob/main/qa/suites/orch/cephadm/smoke-roleless/2-services/nfs-ingress.yaml)
and that test does a simple read/write to the export after
creating/mounting it.


> - I can't understand how to use the sdc disks for journaling
>
>
you should be able to specify a `journal_devices` section in an OSD spec.
For example

service_type: osd
service_id: foo
placement:
  hosts:
  - vm-00
spec:
  data_devices:
    paths:
    - /dev/vdb
  journal_devices:
    paths:
    - /dev/vdc

that will make non-colocated OSDs where the devices from the
journal_devices section are used as journal devices for the OSDs on the
devices in the data_devices section. Although I'd recommend looking through
https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications
and seeing if there are any other filtering options than the path that can be
used first. It's possible the path a device gets can change on reboot, and you
could end up with cephadm using a device you don't want it to if another
device ends up with the path this one held previously.
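
For example, a sketch of the same spec filtering by device size instead of by
path (the size values are made up; check `ceph orch device ls` for what would
actually match on your hosts):

cat > osd_spec.yaml <<EOF
service_type: osd
service_id: foo
placement:
  hosts:
  - vm-00
spec:
  data_devices:
    size: '1TB:'     # data on the larger disks
  journal_devices:
    size: ':1TB'     # journals on the smaller ones
EOF
ceph orch apply -i osd_spec.yaml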

> - I can't understand the concept of "pseudo path"

I don't know at a low level either, but it seems to just be the path
nfs-ganesha will present to the user. There is another argument to `ceph
nfs export create`, just "path" rather than pseudo-path, that marks
what actual path within the cephfs the export is mounted on. It's optional
and defaults to "/" (so the export you made is mounted at the root of the
fs). I think that's the one that really matters. The pseudo-path seems to
just act as a user-facing name for the path.
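
As a hedged example of the difference (cluster id, fs name and paths are
placeholders; `ceph nfs export create cephfs -h` will show the exact arguments
for your release):

# expose the cephfs subtree /data of filesystem "myfs" at the NFS pseudo path /cephfs-data
ceph nfs export create cephfs --cluster-id mynfs --pseudo-path /cephfs-data --fsname myfs --path /data
# clients mount the pseudo path, not the cephfs path
mount -t nfs <ingress-vip>:/cephfs-data /mnt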

On Wed, Apr 24, 2024 at 3:40 AM Roberto Maggi @ Debian 
wrote:

> Hi you all,
>
> I'm almost new to ceph and I'm understanding, day by day, why the
> official support is so expensive :)
>
>
> I'm setting up a ceph nfs network cluster whose recipe can be found
> below.
>
> ###
>
> --> cluster creation cephadm bootstrap --mon-ip 10.20.20.81
> --cluster-network 10.20.20.0/24 --fsid $FSID --initial-dashboard-user adm
> \
> --initial-dashboard-password 'Hi_guys' --dashboard-password-noupdate
> --allow-fqdn-hostname --ssl-dashboard-port 443 \
> --dashboard-crt /etc/ssl/wildcard.it/wildcard.it.crt --dashboard-key
> /etc/ssl/wildcard.it/wildcard.it.key \
> --allow-overwrite --cleanup-on-failure
> cephadm shell --fsid $FSID -c /etc/ceph/ceph.conf -k
> /etc/ceph/ceph.client.admin.keyring
> cephadm add-repo --release reef && cephadm install ceph-common
> --> adding hosts and set labels
> for IP in $(grep ceph /etc/hosts | awk '{print $1}') ; do ssh-copy-id -f
> -i /etc/ceph/ceph.pub root@$IP ; done
> ceph orch host add cephstage01 10.20.20.81 --labels
> _admin,mon,mgr,prometheus,grafana
> ceph orch host add cephstage02 10.20.20.82 --labels
> _admin,mon,mgr,prometheus,grafana
> ceph orch host add cephstage03 10.20.20.83 --labels
> _admin,mon,mgr,prometheus,grafana
> ceph orch host add cephstagedatanode01 10.20.20.84 --labels
> osd,nfs,prometheus
> ceph orch host add cephstagedatanode02 10.20.20.85 --labels
> osd,nfs,prometheus
> ceph orch host add cephstagedatanode03 10.20.20.86 --labels
> osd,nfs,prometheus
> --> network setup and daemons deploy
> ceph config set mon public_network 10.20.20.0/24,192.168.7.0/24
> ceph orch apply mon
>
> --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83"
> ceph orch apply mgr
>
> --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83"
> ceph orch apply prometheus
>
> --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83,cephstagedatanode01:10.20.20.84,cephstagedatanode02:10.20.20.85,cephstagedatanode03:10.20.20.86"
> ceph orch apply grafana
>
> --placement="cephstage01:10.20.20.81,cephstage02:10.20.20.82,cephstage03:10.20.20.83,cephstagedatanode01:10.20.20.84,cephstagedatanode02:10.20.20.85,cephstagedatanode03:10.20.20.86"
> ceph orch apply node-exporter
> ceph orch apply alertmanager
> ceph config set mgr mgr/cephadm/secure_monitoring_stack true
> --> disks and osd setup
> for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ; do
> ssh root@$IP "hostname && wipefs -a -f /dev/sdb && wipefs -a -f
> /dev/sdc"; done
> ceph config set mgr mgr/cephadm/device_enhanced_scan true
> for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ;
> do ceph orch device ls --hostname=$IP --wide --refresh ; done
> for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ;
> do ceph orch device zap $IP /dev/sdb ; done
> for IP in $(grep cephstagedatanode /etc/hosts | awk '{print $1}') ;
> do ceph orch device zap $IP /dev/sdc ; done
> for IP in $(grep 

[ceph-users] Re: which grafana version to use with 17.2.x ceph version

2024-04-23 Thread Adam King
FWIW, cephadm uses `quay.io/ceph/ceph-grafana:9.4.7` as the default grafana
image in the quincy branch
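
If your cluster has been pointed at some other grafana image, a hedged way to
pin it back to that default (the daemon name in the redeploy is a placeholder):

ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:9.4.7
ceph orch daemon redeploy grafana.<host>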

On Tue, Apr 23, 2024 at 11:59 AM Osama Elswah 
wrote:

> Hi,
>
>
> in quay.io I can find a lot of grafana versions for ceph (
> https://quay.io/repository/ceph/grafana?tab=tags) how can I find out
> which version should be used when I upgrade my cluster to 17.2.x ? Can I
> simply take the latest grafana version? Or is there a specific grafana
> version I need to use?
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.3 QE validation status

2024-04-16 Thread Adam King
>
> Orchestrator:
> 1. https://tracker.ceph.com/issues/64208 - test_cephadm.sh: Container
> version mismatch causes job to fail.
>

Not a blocker issue. Just a problem with the test itself that will be fixed
by https://github.com/ceph/ceph/pull/56714

On Tue, Apr 16, 2024 at 1:39 PM Laura Flores  wrote:

> On behalf of @Radoslaw Zarzynski , rados approved.
>
> Below is the summary of the rados suite failures, divided by component. @Adam
> King  @Venky Shankar  PTAL at the
> orch and cephfs failures to see if they are blockers.
>
> Failures, unrelated:
>
> RADOS:
> 1. https://tracker.ceph.com/issues/65183 - Overriding an EC pool
> needs the "--yes-i-really-mean-it" flag in addition to "force"
> 3. https://tracker.ceph.com/issues/62992 - Heartbeat crash in
> reset_timeout and clear_timeout
> 4. https://tracker.ceph.com/issues/58893 - test_map_discontinuity:
> AssertionError: wait_for_clean: failed before timeout expired
> 5. https://tracker.ceph.com/issues/61774 - centos 9 testing reveals
> rocksdb "Leak_StillReachable" memory leak in mons
> 7. https://tracker.ceph.com/issues/62776 - rados: cluster [WRN]
> overall HEALTH_WARN - do not have an application enabled
> 8. https://tracker.ceph.com/issues/59196 - ceph_test_lazy_omap_stats
> segfault while waiting for active+clean
>
> Orchestrator:
> 1. https://tracker.ceph.com/issues/64208 - test_cephadm.sh: Container
> version mismatch causes job to fail.
>
> CephFS:
> 1. https://tracker.ceph.com/issues/64946 - qa: unable to locate
> package libcephfs1
>
> Teuthology:
> 1. https://tracker.ceph.com/issues/64727 - suites/dbench.sh: Socket
> exception: No route to host (113)
>
> On Tue, Apr 16, 2024 at 9:22 AM Yuri Weinstein 
> wrote:
>
>> And approval is needed for:
>>
>> fs - Venky approved?
>> powercycle - seems fs related, Venky, Brad PTL
>>
>> On Mon, Apr 15, 2024 at 5:55 PM Yuri Weinstein 
>> wrote:
>> >
>> > Still waiting for approvals:
>> >
>> > rados - Radek, Laura approved? Travis?  Nizamudeen?
>> >
>> > ceph-volume issue was fixed by https://github.com/ceph/ceph/pull/56857
>> >
>> > We plan not to upgrade the LRC to 18.2.3 as we are very close to the
>> > first squid RC and will be using it for this purpose.
>> > Please speak up if this may present any issues.
>> >
>> > Thx
>> >
>> > On Fri, Apr 12, 2024 at 11:37 AM Yuri Weinstein 
>> wrote:
>> > >
>> > > Details of this release are summarized here:
>> > >
>> > > https://tracker.ceph.com/issues/65393#note-1
>> > > Release Notes - TBD
>> > > LRC upgrade - TBD
>> > >
>> > > Seeking approvals/reviews for:
>> > >
>> > > smoke - infra issues, still trying, Laura PTL
>> > >
>> > > rados - Radek, Laura approved? Travis?  Nizamudeen?
>> > >
>> > > rgw - Casey approved?
>> > > fs - Venky approved?
>> > > orch - Adam King approved?
>> > >
>> > > krbd - Ilya approved
>> > > powercycle - seems fs related, Venky, Brad PTL
>> > >
>> > > ceph-volume - will require
>> > >
>> https://github.com/ceph/ceph/pull/56857/commits/63fe3921638f1fb7fc065907a9e1a64700f8a600
>> > > Guillaume is fixing it.
>> > >
>> > > TIA
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
>>
>
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage <https://ceph.io>
>
> Chicago, IL
>
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.3 QE validation status

2024-04-14 Thread Adam King
orch approved

On Fri, Apr 12, 2024 at 2:38 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/65393#note-1
> Release Notes - TBD
> LRC upgrade - TBD
>
> Seeking approvals/reviews for:
>
> smoke - infra issues, still trying, Laura PTL
>
> rados - Radek, Laura approved? Travis?  Nizamudeen?
>
> rgw - Casey approved?
> fs - Venky approved?
> orch - Adam King approved?
>
> krbd - Ilya approved
> powercycle - seems fs related, Venky, Brad PTL
>
> ceph-volume - will require
>
> https://github.com/ceph/ceph/pull/56857/commits/63fe3921638f1fb7fc065907a9e1a64700f8a600
> Guillaume is fixing it.
>
> TIA
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm host keeps trying to set osd_memory_target to less than minimum

2024-04-09 Thread Adam King
The same experiment with the mds daemons pulling 4GB instead of the 16GB,
and me fixing the starting total memory (I accidentally used the
memory_available_kb instead of memory_total_kb the first time) gives us

DEBUG cephadm.autotune:autotune.py:35 Autotuning OSD memory with given parameters:
  Total memory: 23530995712
  Daemons: [(crash.a), (grafana.a), (mds.a), (mds.b), (mds.c), (mgr.a), (mon.a),
            (node-exporter.a), (osd.1), (osd.2), (osd.3), (osd.4), (prometheus.a)]
DEBUG cephadm.autotune:autotune.py:50 Subtracting 134217728 from total for crash daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 23396777984
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for grafana daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 22323036160
DEBUG cephadm.autotune:autotune.py:40 Subtracting 4294967296 from total for mds daemon
DEBUG cephadm.autotune:autotune.py:42 new total: 18028068864
DEBUG cephadm.autotune:autotune.py:40 Subtracting 4294967296 from total for mds daemon
DEBUG cephadm.autotune:autotune.py:42 new total: 13733101568
DEBUG cephadm.autotune:autotune.py:40 Subtracting 4294967296 from total for mds daemon
DEBUG cephadm.autotune:autotune.py:42 new total: 9438134272
DEBUG cephadm.autotune:autotune.py:50 Subtracting 4294967296 from total for mgr daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 5143166976
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for mon daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 4069425152
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for node-exporter daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 2995683328
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for prometheus daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 1921941504
DEBUG cephadm.autotune:autotune.py:66 Final total is 1921941504 to be split among 4 OSDs
DEBUG cephadm.autotune:autotune.py:68 Result is 480485376 per OSD

My understanding is, given a starting memory_total_kb of 32827840, we get
33615708160 total bytes. We multiply that by the 0.7 autotune ratio to
get 23530995712 bytes to be split among the daemons (something like 23-24
GB). Then the mgr and mds daemons each get 4GB, the grafana, mon,
node-exporter, and prometheus daemons each take 1GB, and the crash daemon
gets 128MB. That leaves only about 2GB to split among the 4 OSDs. That's how
we arrive at that "480485376" number per OSD from the original error message
you posted.

> Unable to set osd_memory_target on my-ceph01 to 480485376: error parsing
> value: Value '480485376' is below minimum 939524096

As that value is well below the minimum (it's only about half a GB), it
reports that error when trying to set it.
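
If you want to double-check those inputs yourself, roughly (the 2GB value in
the last line is just an example of lowering the mds cache limit to leave more
room for the OSDs):

# what cephadm thinks the host has
cephadm gather-facts | grep memory_total_kb
# what each mds is allowed to cache (this is what gets subtracted per mds)
ceph config get mds mds_cache_memory_limit
# e.g. drop it to 2GB to leave more headroom for the OSDs
ceph config set mds mds_cache_memory_limit 2147483648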

On Tue, Apr 9, 2024 at 12:58 PM Mads Aasted  wrote:

> Hi Adam
>
> Seems like the mds_cache_memory_limit both set globally through cephadm
> and the hosts mds daemons are all set to approx. 4gb
> root@my-ceph01:/# ceph config get mds mds_cache_memory_limit
> 4294967296
> same if query the individual mds daemons running on my-ceph01, or any of
> the other mds daemons on the other hosts.
>
> On Tue, Apr 9, 2024 at 6:14 PM Mads Aasted  wrote:
>
>> Hi Adam
>>
>> Let me just finish tucking in a devlish tyke here and i’ll get to it
>> first thing
>>
>> tirs. 9. apr. 2024 kl. 18.09 skrev Adam King :
>>
>>> I did end up writing a unit test to see what we calculated here, as well
>>> as adding a bunch of debug logging (haven't created a PR yet, but probably
>>> will).  The total memory was set to (19858056 * 1024 * 0.7) (total memory
>>> in bytes * the autotune target ratio) = 14234254540. What ended up getting
>>> logged was (ignore the daemon id for the daemons, they don't affect
>>> anything. Only the types matter)
>>>
>>> DEBUG cephadm.autotune:autotune.py:35 Autotuning OSD memory with given parameters:
>>>   Total memory: 14234254540
>>>   Daemons: [(crash.a), (grafana.a), (mds.a), (mds.b), (mds.c), (mgr.a), (mon.a),
>>>             (node-exporter.a), (osd.1), (osd.2), (osd.3), (osd.4), (prometheus.a)]
>>> DEBUG cephadm.autotune:autotune.py:50 Subtracting 134217728 from total for crash daemon
>>> DEBUG cephadm.autotune:autotune.py:52 new total: 14100036812
>>> DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for grafana daemon
>>> DEBUG cephadm.autotune:autotune.py:52 new total: 13026294988

[ceph-users] Re: Cephadm host keeps trying to set osd_memory_target to less than minimum

2024-04-09 Thread Adam King
I did end up writing a unit test to see what we calculated here, as well as
adding a bunch of debug logging (haven't created a PR yet, but probably
will).  The total memory was set to (19858056 * 1024 * 0.7) (total memory
in bytes * the autotune target ratio) = 14234254540. What ended up getting
logged was (ignore the daemon id for the daemons, they don't affect
anything. Only the types matter)

DEBUG cephadm.autotune:autotune.py:35 Autotuning OSD memory with given parameters:
  Total memory: 14234254540
  Daemons: [(crash.a), (grafana.a), (mds.a), (mds.b), (mds.c), (mgr.a), (mon.a),
            (node-exporter.a), (osd.1), (osd.2), (osd.3), (osd.4), (prometheus.a)]
DEBUG cephadm.autotune:autotune.py:50 Subtracting 134217728 from total for crash daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 14100036812
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for grafana daemon
DEBUG cephadm.autotune:autotune.py:52 new total: 13026294988
DEBUG cephadm.autotune:autotune.py:40 Subtracting 17179869184 from total for mds daemon
DEBUG cephadm.autotune:autotune.py:42 new total: -4153574196
DEBUG cephadm.autotune:autotune.py:40 Subtracting 17179869184 from total for mds daemon
DEBUG cephadm.autotune:autotune.py:42 new total: -21333443380
DEBUG cephadm.autotune:autotune.py:40 Subtracting 17179869184 from total for mds daemon
DEBUG cephadm.autotune:autotune.py:42 new total: -38513312564
DEBUG cephadm.autotune:autotune.py:50 Subtracting 4294967296 from total for mgr daemon
DEBUG cephadm.autotune:autotune.py:52 new total: -42808279860
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for mon daemon
DEBUG cephadm.autotune:autotune.py:52 new total: -43882021684
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for node-exporter daemon
DEBUG cephadm.autotune:autotune.py:52 new total: -44955763508
DEBUG cephadm.autotune:autotune.py:50 Subtracting 1073741824 from total for prometheus daemon
DEBUG cephadm.autotune:autotune.py:52 new total: -46029505332

It looks like it was taking pretty much all the memory away for the mds
daemons. The amount, however, is taken from the "mds_cache_memory_limit"
setting for each mds daemon. The number it was defaulting to for the test
is quite large. I guess I'd need to know what that comes out to for the mds
daemons in your cluster to get a full picture. Also, you can see the total
go well into the negatives here. When that happens cephadm just tries to
remove the osd_memory_target config settings for the OSDs on the host, but
given the error message from your initial post, it must be getting some
positive value when actually running on your system.

On Fri, Apr 5, 2024 at 2:21 AM Mads Aasted  wrote:

> Hi Adam
> No problem, i really appreciate your input :)
> The memory stats returned are as follows
>   "memory_available_kb": 19858056,
>   "memory_free_kb": 277480,
>   "memory_total_kb": 32827840,
>
> On Thu, Apr 4, 2024 at 10:14 PM Adam King  wrote:
>
>> Sorry to keep asking for more info, but can I also get what `cephadm
>> gather-facts` on that host returns for "memory_total_kb". Might end up
>> creating a unit test out of this case if we have a calculation bug here.
>>
>> On Thu, Apr 4, 2024 at 4:05 PM Mads Aasted  wrote:
>>
>>> sorry for the double send, forgot to hit reply all so it would appear on
>>> the page
>>>
>>> Hi Adam
>>>
>>> If we multiply by 0.7, and work through the previous example from that
>>> number, we would still arrive at roughly 2.5 gb for each osd. And the host
>>> in question is trying to set it to less than 500mb.
>>> I have attached a list of the processes running on the host. Currently
>>> you can even see that the OSD's are taking up the most memory by far, and
>>> at least 5x its proposed minimum.
>>> root@my-ceph01:/# ceph orch ps | grep my-ceph01
>>> crash.my-ceph01   my-ceph01   running (3w)
>>>  7m ago  13M9052k-  17.2.6
>>> grafana.my-ceph01 my-ceph01  *:3000   running (3w)
>>>  7m ago  13M95.6M-  8.3.5
>>> mds.testfs.my-ceph01.xjxfzd  my-ceph01   running (3w)
>>>  7m ago  10M 485M-  17.2.6
>>> mds.prodfs.my-ceph01.rplvac   my-ceph01   running (3w)
>>>  7m ago  12M26.9M-  17.2.6
>>> mds.prodfs.my-ceph01.twikzdmy-ceph01   running (3w)
>>>  7m ago  12M26.2M-  17.2.6
>>> mgr.my-ceph01.rxdefe  my-ceph01  *:8443,9283  running (3w)
>>>  7m ago  13M 907M-  17.2.6
>>> mon.my-ceph01 my-ceph01   running (3w)
>>>  7m ago  13M

[ceph-users] Re: Cephadm host keeps trying to set osd_memory_target to less than minimum

2024-04-04 Thread Adam King
Sorry to keep asking for more info, but can I also get what `cephadm
gather-facts` on that host returns for "memory_total_kb". Might end up
creating a unit test out of this case if we have a calculation bug here.

On Thu, Apr 4, 2024 at 4:05 PM Mads Aasted  wrote:

> sorry for the double send, forgot to hit reply all so it would appear on
> the page
>
> Hi Adam
>
> If we multiply by 0.7, and work through the previous example from that
> number, we would still arrive at roughly 2.5 gb for each osd. And the host
> in question is trying to set it to less than 500mb.
> I have attached a list of the processes running on the host. Currently you
> can even see that the OSD's are taking up the most memory by far, and at
> least 5x its proposed minimum.
> root@my-ceph01:/# ceph orch ps | grep my-ceph01
> crash.my-ceph01              my-ceph01               running (3w)  7m ago  13M  9052k  -      17.2.6
> grafana.my-ceph01            my-ceph01  *:3000       running (3w)  7m ago  13M  95.6M  -      8.3.5
> mds.testfs.my-ceph01.xjxfzd  my-ceph01               running (3w)  7m ago  10M  485M   -      17.2.6
> mds.prodfs.my-ceph01.rplvac  my-ceph01               running (3w)  7m ago  12M  26.9M  -      17.2.6
> mds.prodfs.my-ceph01.twikzd  my-ceph01               running (3w)  7m ago  12M  26.2M  -      17.2.6
> mgr.my-ceph01.rxdefe         my-ceph01  *:8443,9283  running (3w)  7m ago  13M  907M   -      17.2.6
> mon.my-ceph01                my-ceph01               running (3w)  7m ago  13M  503M   2048M  17.2.6
> node-exporter.my-ceph01      my-ceph01  *:9100       running (3w)  7m ago  13M  20.4M  -      1.5.0
> osd.3                        my-ceph01               running (3w)  7m ago  11M  2595M  4096M  17.2.6
> osd.5                        my-ceph01               running (3w)  7m ago  11M  2494M  4096M  17.2.6
> osd.6                        my-ceph01               running (3w)  7m ago  11M  2698M  4096M  17.2.6
> osd.9                        my-ceph01               running (3w)  7m ago  11M  3364M  4096M  17.2.6
> prometheus.my-ceph01         my-ceph01  *:9095       running (3w)  7m ago  13M  164M   -      2.42.0
>
>
>
>
> On Thu, Mar 28, 2024 at 2:13 AM Adam King  wrote:
>
>>  I missed a step in the calculation. The total_memory_kb I mentioned
>> earlier is also multiplied by the value of the
>> mgr/cephadm/autotune_memory_target_ratio before doing the subtractions for
>> all the daemons. That value defaults to 0.7. That might explain it seeming
>> like it's getting a value lower than expected. Beyond that, I'd think I'd
>> need a list of the daemon types and count on that host to try and work
>> through what it's doing.
>>
>> On Wed, Mar 27, 2024 at 10:47 AM Mads Aasted  wrote:
>>
>>> Hi Adam.
>>>
>>> So doing the calculations with what you are stating here I arrive at a
>>> total sum for all the listed processes at 13.3 (roughly) gb, for everything
>>> except the osds, leaving well in excess of +4gb for each OSD.
>>> Besides the mon daemon which i can tell on my host has a limit of 2gb ,
>>> none of the other daemons seem to have a limit set according to ceph orch
>>> ps. Then again, they are nowhere near the values stated in min_size_by_type
>>> that you list.
>>> Obviously yes, I could disable the auto tuning, but that would leave me
>>> none the wiser as to why this exact host is trying to do this.
>>>
>>>
>>>
>>> On Tue, Mar 26, 2024 at 10:20 PM Adam King  wrote:
>>>
>>>> For context, the value the autotune goes with takes the value from
>>>> `cephadm gather-facts` on the host (the "memory_total_kb" field) and then
>>>> subtracts from that per daemon on the host according to
>>>>
>>>> min_size_by_type = {
>>>> 'mds': 4096 * 1048576,
>>>> 'mgr': 4096 * 1048576,
>>>> 'mon': 1024 * 1048576,
>>>> 'crash': 128 * 1048576,
>>>> 'keepalived': 128 * 1048576,
>>>> 'haproxy': 128 * 1048576,
>>>> 'nvmeof': 4096 * 1048576,
>>>> }
>>>> default_size = 1024 * 1048576
>>>>
>>>> what's left is then divided by the number of OSDs on the host to arrive
>>>> at the value. I'll also add, since it seems to be an issue on this
>>>> particular host,  if you add the "_no_autotune_memory" label to the host,
>>>> it will stop trying to do this on that 

[ceph-users] Re: CEPHADM_HOST_CHECK_FAILED

2024-04-04 Thread Adam King
First, I guess I would make sure that peon7 and peon12 actually could pass
the host check (you can run "cephadm check-host" on the host directly if
you have a copy of the cephadm binary there). Then I'd try a mgr failover
(ceph mgr fail) to clear out any in-memory host values cephadm might have
and restart the module. If it still reproduces after that, then you might
have to set mgr/cephadm/log_to_cluster_level to debug, do another mgr
failover, wait until the module crashes, and see if "ceph log last 100 debug
cephadm" gives more info on where the crash occurred (it might have an
actual traceback).
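
Roughly, that sequence would be (run check-host on the affected host itself):

cephadm check-host
ceph mgr fail
# if the module still crashes, turn up cephadm's cluster-level logging and look again
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph mgr fail
ceph log last 100 debug cephadm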

On Thu, Apr 4, 2024 at 4:51 AM  wrote:

> Hi,
>
> I’ve added some new nodes to our Ceph cluster. Only did the host add, had
> not added the OSD’s yet.
> Due to a configuration error I had to reinstall some of them. But I forgot
> to remove the nodes from Ceph first. I did a “ceph orch host rm peon7
> --offline —force” before re-adding them to the cluster.
>
> All the nodes are showing up in the host list (all the peons are the new
> ones):
>
> # ceph orch host ls
> HOST ADDR LABELS  STATUS
> ceph110.103.0.71
> ceph210.103.0.72
> ceph310.103.0.73
> ceph410.103.0.74
> compute1 10.103.0.11
> compute2 10.103.0.12
> compute3 10.103.0.13
> compute4 10.103.0.14
> controller1  10.103.0.8
> controller2  10.103.0.9
> controller3  10.103.0.10
> peon110.103.0.41
> peon210.103.0.42
> peon310.103.0.43
> peon410.103.0.44
> peon510.103.0.45
> peon610.103.0.46
> peon710.103.0.47
> peon810.103.0.48
> peon910.103.0.49
> peon10   10.103.0.50
> peon12   10.103.0.52
> peon13   10.103.0.53
> peon14   10.103.0.54
> peon15   10.103.0.55
> peon16   10.103.0.56
>
> But Ceph status still shows an error, which I can’t seem to get rid off.
>
> [WRN] CEPHADM_HOST_CHECK_FAILED: 2 hosts fail cephadm check
> host peon7 (10.103.0.47) failed check: Can't communicate with remote
> host `10.103.0.47`, possibly because python3 is not installed there or you
> are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.47', 22)
> host peon12 (10.103.0.52) failed check: Can't communicate with remote
> host `10.103.0.52`, possibly because python3 is not installed there or you
> are missing NOPASSWD in sudoers. [Errno 113] Connect call failed
> ('10.103.0.52', 22)
> [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'peon7'
> Module 'cephadm' has failed: ‘peon7'
>
> From the mgr log:
>
> Apr 04 08:33:46 controller2 bash[4031857]: debug
> 2024-04-04T08:33:46.876+ 7f2bb5710700 -1 mgr.server reply reply (5)
> Input/output error Module 'cephadm' has experienced an error and cannot
> handle commands: 'peon7'
>
> Any idea how to clear this error?
>
> # ceph --version
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus
> (stable)
>
>
> Regards,
> Arnoud de Jonge.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific Bug?

2024-04-02 Thread Adam King
https://tracker.ceph.com/issues/64428 should be it. Backports are done for
quincy, reef, and squid and the patch will be present in the next release
for each of those versions. There isn't a pacific backport as, afaik, there
are no more pacific releases planned.

On Fri, Mar 29, 2024 at 6:03 PM Alex  wrote:

> Hi again Adam :-)
>
> Would you happen to have the Bug Tracker issue for label bug?
>
> Thanks.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm shell version not consistent across monitors

2024-04-02 Thread Adam King
From what I can see with the most recent cephadm binary on pacific, unless
you have the CEPHADM_IMAGE env variable set, it does a `podman images
--filter label=ceph=True --filter dangling=false` (or docker) and takes the
first image in the list. It seems to be getting sorted by creation time by
default. If you want to guarantee what you get, you can run `cephadm
--image <image-name> shell` and it will try to use the image specified. You
could also try that env variable (although I haven't tried that in a very
long time if I'm honest, so hopefully it works correctly). If nothing else,
just seeing the output of that podman command and removing images that
appear before the 16.2.15 one on the list should work.
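
Concretely, something like the following, assuming you deploy from
quay.io/ceph/ceph (the old image ID is whatever your listing shows, not a
value I know):

podman images --filter label=ceph=True --filter dangling=false
# either pin the image explicitly for this invocation...
cephadm --image quay.io/ceph/ceph:v16.2.15 shell
# ...or export the env variable so plain `cephadm shell` picks it up
CEPHADM_IMAGE=quay.io/ceph/ceph:v16.2.15 cephadm shell
# or simply remove the stale 16.2.9 image so it can't be chosen
podman rmi <old-16.2.9-image-id>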

On Tue, Apr 2, 2024 at 5:03 PM J-P Methot 
wrote:

> Hi,
>
> We are still running ceph Pacific with cephadm and we have run into a
> peculiar issue. When we run the `cephadm shell` command on monitor1, the
> container we get runs ceph 16.2.9. However, when we run the same command
> on monitor2, the container runs 16.2.15, which is the current version of
> the cluster. Why does it do that and is there a way to force it to
> 16.2.15 on monitor1?
>
> Please note that both monitors have the same configuration. Cephadm has
> been pulled from GitHub for both monitors instead of the package
> manager's version.
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Failed adding back a node

2024-03-28 Thread Adam King
No, you can't use the image ID for the upgrade command, it has to be the
image name. So it should start, based on what you have, with
registry.redhat.io/rhceph/. As for the full name, it depends which image
you want to go with. As for trying this on an OSD first, there is `ceph
orch daemon redeploy <daemon-name> --image <image>` you could run on
an OSD with a given image and see if it comes up. I would try the
upgrade before trying to remove the OSD. If it's really only failing
because it can't pull the image, the upgrade should try to make it deploy
with the image passed to the upgrade command, which could fix it as long as
it can pull that image on the host.
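
As a sketch, with osd.3 standing in for whichever OSD you want to test and the
:latest tag being the one your podman image list shows as present:

# try a single OSD first with an image the host can actually pull
ceph orch daemon redeploy osd.3 --image registry.redhat.io/rhceph/rhceph-5-rhel8:latest
# if that comes up, move the whole cluster to the same image
ceph orch upgrade start --image registry.redhat.io/rhceph/rhceph-5-rhel8:latest
ceph orch upgrade status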

On Wed, Mar 27, 2024 at 10:42 PM Alex  wrote:

> Hi Adam!
>
> In addition to my earlier question of is there a way of trying a more
> targeted upgrade first so we don't risk accidentally breaking the
> entire production cluster,
>
> `ceph config dump | grep container_image` shows:
>
> global
>  basic container_image
>
> registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a193b0de114d19d2efd8750046b5d25da07e2c570e3c4eb4bd93e6de4b90a25a
>  *
>   mon.mon01
>  basic container_image
> registry.redhat.io/rhceph/rhceph-5-rhel8:latest
>*
>   mon.mon03
>  basic container_image
> registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
>*
>   mgr
>  advanced  mgr/cephadm/container_image_alertmanager
> registry.redhat.io/openshift4/ose-prometheus-alertmanager:v4.6
>*
>   mgr
>  advanced  mgr/cephadm/container_image_base
> registry.redhat.io/rhceph/rhceph-5-rhel8
>   mgr
>  advanced  mgr/cephadm/container_image_grafana
> registry.redhat.io/rhceph/rhceph-5-dashboard-rhel8:5
>*
>   mgr
>  advanced  mgr/cephadm/container_image_node_exporter
> registry.redhat.io/openshift4/ose-prometheus-node-exporter:v4.6
>*
>   mgr
>  advanced  mgr/cephadm/container_image_prometheus
> registry.redhat.io/openshift4/ose-prometheus:v4.6
>*
>   mgr.mon01
>  basic container_image
> registry.redhat.io/rhceph/rhceph-5-rhel8:latest
>*
>   mgr.mon03
>  basic container_image
>
> registry.redhat.io/rhceph/rhceph-5-rhel8@sha256:a193b0de114d19d2efd8750046b5d25da07e2c570e3c4eb4bd93e6de4b90a25a
>  *
>
> and do you think I'd still need to rm that one osd that i successfully
> created but not added or would that get "pulled in" when I add the
> other 19 osds?
>
> `podman image list shows:
> REPOSITORY  TAG
>  IMAGE ID  CREATEDSIZE
> registry.redhat.io/rhceph/rhceph-5-rhel8latest
>  1d636b23ab3e  8 weeks ago1.02 GB
> `
> so would I be running `ceph orch upgrade start --image  1d636b23ab3e` ?
>
> Thanks again.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Failed adding back a node

2024-03-27 Thread Adam King
From the ceph versions output I can see

"osd": {
"ceph version 16.2.10-160.el8cp
(6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
},

It seems like all the OSD daemons on this cluster are using that
16.2.10-160 image, and I'm guessing most of them are running, so it must
have existed at some point. Curious if `ceph config dump | grep
container_image` will show a different image setting for the OSD. Anyway,
in terms of moving forward it might be best to try to get all the daemons
onto an image you know works. I also see both 16.2.10-208 and 16.2.10-248
listed as versions, which implies there are two different images being used
even between the other daemons. Unless there's a reason for all these
different images, I'd just pick the most up to date one, that you know can
be pulled on all hosts, and do a `ceph orch upgrade start --image
<image-name>`. That would get all the daemons on that single image, and
might fix the broken OSDs that are failing to pull the 16.2.10-160 image.

On Wed, Mar 27, 2024 at 8:56 PM Alex  wrote:

> Hello.
>
> We're rebuilding our OSD nodes.
> Once cluster worked without any issues, this one is being stubborn
>
> I attempted to add one back to the cluster and seeing the error below
> in out logs:
>
> cephadm ['--image',
> 'registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160', 'pull']
> 2024-03-27 19:30:53,901 7f49792ed740 DEBUG /bin/podman: 4.6.1
> 2024-03-27 19:30:53,905 7f49792ed740 INFO Pulling container image
> registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,045 7f49792ed740 DEBUG /bin/podman: Trying to pull
> registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,266 7f49792ed740 DEBUG /bin/podman: Error:
> initializing source
> docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading
> manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8:
> manifest unknown
> 2024-03-27 19:30:54,270 7f49792ed740 INFO Non-zero exit code 125 from
> /bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
> 2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Trying
> to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Error:
> initializing source
> docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading
> manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8:
> manifest unknown
> 2024-03-27 19:30:54,270 7f49792ed740 ERROR ERROR: Failed command:
> /bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
>
> $ ceph versions
> {
> "mon": {
> "ceph version 16.2.10-208.el8cp
> (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
> "ceph version 16.2.10-248.el8cp
> (0edb63afd9bd3edb64f2e0031b77e62f4896) pacific (stable)": 2
> },
> "mgr": {
> "ceph version 16.2.10-208.el8cp
> (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
> "ceph version 16.2.10-248.el8cp
> (0edb63afd9bd3edb64f2e0031b77e62f4896) pacific (stable)": 2
> },
> "osd": {
> "ceph version 16.2.10-160.el8cp
> (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
> },
> "mds": {},
> "rgw": {
> "ceph version 16.2.10-208.el8cp
> (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 3
> },
> "overall": {
> "ceph version 16.2.10-160.el8cp
> (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160,
> "ceph version 16.2.10-208.el8cp
> (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 5,
> "ceph version 16.2.10-248.el8cp
> (0edb63afd9bd3edb64f2e0031b77e62f4896) pacific (stable)": 4
> }
> }
>
> I don't understand why it's trying to pull 16.2.10-160 which doesn't exist.
>
> registry.redhat.io/rhceph/rhceph-5-dashboard-rhel8 5 93b3137e7a65 11
> months ago 696 MB
> registry.redhat.io/rhceph/rhceph-5-rhel8 5-416 838cea16e15c 11 months
> ago 1.02 GB
> registry.redhat.io/openshift4/ose-prometheus v4.6 ec2d358ca73c 17
> months ago 397 MB
>
>
> This happens using cepadm-ansible as well as
> $ ceph orch ls --export --service_name xxx > xxx.yml
> $ sudo ceph orch apply -i xxx.yml
>
> I tried ceph orch daemon add osd host:/dev/sda
> which surprisingly created a volume on host:/dev/sda and created an
> osd i can see in
> $ ceph osd tree
>
> but It did not get added to host I suspect because of the same Podman
> error and now I'm unable remove it.
> $ ceph orch osd rm
> does not work even with the --force flag.
>
> I stopped the removal with
> $ ceph orch osd rm stop
> after 10+ minutes
>
> I'm considering running $ ceph osd purge osd# --force but worried it
> may only make things worse.
> ceph -s shows that osd but not up or in.
>
> Thanks, and looking forward to any advice!
> ___
> ceph-users mailing list -- 

[ceph-users] Re: Cephadm host keeps trying to set osd_memory_target to less than minimum

2024-03-27 Thread Adam King
 I missed a step in the calculation. The total_memory_kb I mentioned
earlier is also multiplied by the value of the
mgr/cephadm/autotune_memory_target_ratio before doing the subtractions for
all the daemons. That value defaults to 0.7. That might explain it seeming
like it's getting a value lower than expected. Beyond that, I'd think I'd
need a list of the daemon types and count on that host to try and work
through what it's doing.
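
If it's useful, that ratio can be read and adjusted like this (0.6 is just an
example value):

ceph config get mgr mgr/cephadm/autotune_memory_target_ratio
# e.g. lower it if you want the autotuner to leave more memory free
ceph config set mgr mgr/cephadm/autotune_memory_target_ratio 0.6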

On Wed, Mar 27, 2024 at 10:47 AM Mads Aasted  wrote:

> Hi Adam.
>
> So doing the calculations with what you are stating here I arrive at a
> total sum for all the listed processes at 13.3 (roughly) gb, for everything
> except the osds, leaving well in excess of +4gb for each OSD.
> Besides the mon daemon which i can tell on my host has a limit of 2gb ,
> none of the other daemons seem to have a limit set according to ceph orch
> ps. Then again, they are nowhere near the values stated in min_size_by_type
> that you list.
> Obviously yes, I could disable the auto tuning, but that would leave me
> none the wiser as to why this exact host is trying to do this.
>
>
>
> On Tue, Mar 26, 2024 at 10:20 PM Adam King  wrote:
>
>> For context, the value the autotune goes with takes the value from
>> `cephadm gather-facts` on the host (the "memory_total_kb" field) and then
>> subtracts from that per daemon on the host according to
>>
>> min_size_by_type = {
>> 'mds': 4096 * 1048576,
>> 'mgr': 4096 * 1048576,
>> 'mon': 1024 * 1048576,
>> 'crash': 128 * 1048576,
>> 'keepalived': 128 * 1048576,
>> 'haproxy': 128 * 1048576,
>> 'nvmeof': 4096 * 1048576,
>> }
>> default_size = 1024 * 1048576
>>
>> what's left is then divided by the number of OSDs on the host to arrive
>> at the value. I'll also add, since it seems to be an issue on this
>> particular host,  if you add the "_no_autotune_memory" label to the host,
>> it will stop trying to do this on that host.
>>
>> On Mon, Mar 25, 2024 at 6:32 PM  wrote:
>>
>>> I have a virtual ceph cluster running 17.2.6 with 4 ubuntu 22.04 hosts
>>> in it, each with 4 OSD's attached. The first 2 servers hosting mgr's have
>>> 32GB of RAM each, and the remaining have 24gb
>>> For some reason i am unable to identify, the first host in the cluster
>>> appears to constantly be trying to set the osd_memory_target variable to
>>> roughly half of what the calculated minimum is for the cluster, i see the
>>> following spamming the logs constantly
>>> Unable to set osd_memory_target on my-ceph01 to 480485376: error parsing
>>> value: Value '480485376' is below minimum 939524096
>>> Default is set to 4294967296.
>>> I did double check and osd_memory_base (805306368) +
>>> osd_memory_cache_min (134217728) adds up to minimum exactly
>>> osd_memory_target_autotune is currently enabled. But i cannot for the
>>> life of me figure out how it is arriving at 480485376 as a value for that
>>> particular host that even has the most RAM. Neither the cluster or the host
>>> is even approaching max utilization on memory, so it's not like there are
>>> processes competing for resources.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm host keeps trying to set osd_memory_target to less than minimum

2024-03-26 Thread Adam King
For context, the value the autotune goes with takes the value from `cephadm
gather-facts` on the host (the "memory_total_kb" field) and then subtracts
from that per daemon on the host according to

min_size_by_type = {
'mds': 4096 * 1048576,
'mgr': 4096 * 1048576,
'mon': 1024 * 1048576,
'crash': 128 * 1048576,
'keepalived': 128 * 1048576,
'haproxy': 128 * 1048576,
'nvmeof': 4096 * 1048576,
}
default_size = 1024 * 1048576

what's left is then divided by the number of OSDs on the host to arrive at
the value. I'll also add, since it seems to be an issue on this particular
host,  if you add the "_no_autotune_memory" label to the host, it will stop
trying to do this on that host.
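
For completeness, with the host name from this thread that would just be:

ceph orch host label add my-ceph01 _no_autotune_memory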

On Mon, Mar 25, 2024 at 6:32 PM  wrote:

> I have a virtual ceph cluster running 17.2.6 with 4 ubuntu 22.04 hosts in
> it, each with 4 OSD's attached. The first 2 servers hosting mgr's have 32GB
> of RAM each, and the remaining have 24gb
> For some reason i am unable to identify, the first host in the cluster
> appears to constantly be trying to set the osd_memory_target variable to
> roughly half of what the calculated minimum is for the cluster, i see the
> following spamming the logs constantly
> Unable to set osd_memory_target on my-ceph01 to 480485376: error parsing
> value: Value '480485376' is below minimum 939524096
> Default is set to 4294967296.
> I did double check and osd_memory_base (805306368) + osd_memory_cache_min
> (134217728) adds up to minimum exactly
> osd_memory_target_autotune is currently enabled. But i cannot for the life
> of me figure out how it is arriving at 480485376 as a value for that
> particular host that even has the most RAM. Neither the cluster or the host
> is even approaching max utilization on memory, so it's not like there are
> processes competing for resources.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Reef v18.2.1 to v18.2.2

2024-03-21 Thread Adam King
>
> Hi,
>
> On 3/21/24 14:50, Michael Worsham wrote:
> >
> > Now that Reef v18.2.2 has come out, is there a set of instructions on
> how to upgrade to the latest version via using Cephadm?
>
> Yes, there is: https://docs.ceph.com/en/reef/cephadm/upgrade/
>

Just a note on that docs section, it references --ceph-version primarily
but I'd recommend using --image, in this case with the image being
quay.io/ceph/ceph:v18.2.2. I had meant to update that. The --ceph-version
flag has occasionally not worked since the move from docker to quay for the
ceph images, whereas --image is much more consistent.
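
So, assuming the default quay images, kicking off and watching the upgrade
would look roughly like:

ceph orch upgrade start --image quay.io/ceph/ceph:v18.2.2
ceph orch upgrade status
# confirm everything ended up on 18.2.2 afterwards
ceph versions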

On Thu, Mar 21, 2024 at 10:06 AM Robert Sander 
wrote:

> Hi,
>
> On 3/21/24 14:50, Michael Worsham wrote:
> >
> > Now that Reef v18.2.2 has come out, is there a set of instructions on
> how to upgrade to the latest version via using Cephadm?
>
> Yes, there is: https://docs.ceph.com/en/reef/cephadm/upgrade/
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-volume fails when adding separate DATA and DATA.DB volumes

2024-03-06 Thread Adam King
If you want to be directly setting up the OSDs using ceph-volume commands
(I'll pretty much always recommend following
https://docs.ceph.com/en/latest/cephadm/services/osd/#dedicated-wal-db over
manual ceph-volume stuff in cephadm deployments unless what you're doing
can't be done with the spec files), you probably actually want to use
`cephadm ceph-volume -- ...` rather than `cephadm shell`. The `cephadm
ceph-volume` command mounts the provided keyring (or whatever keyring it
infers) at `/var/lib/ceph/bootstrap-osd/ceph.keyring` inside the container,
whereas the shell will not. So in theory you could try `cephadm ceph-volume
-- lvm prepare --bluestore  --data ceph-block-0/block-0 --block.db
ceph-db-0/db-0` and that might get you past the keyring issue it seems to
be complaining about.
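
For reference, the spec-file route from that doc link would look roughly like
this; the host pattern and /dev paths are placeholders for whatever actually
backs your ceph-block-0 and ceph-db-0 VGs, and cephadm handles the keyring
itself:

cat > osd_with_db.yaml <<EOF
service_type: osd
service_id: osd_with_dedicated_db
placement:
  host_pattern: 'ceph-uvm2'
spec:
  data_devices:
    paths:
    - /dev/sdb        # or whatever backs ceph-block-0
  db_devices:
    paths:
    - /dev/sdc        # or whatever backs ceph-db-0
EOF
ceph orch apply -i osd_with_db.yaml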

On Wed, Mar 6, 2024 at 2:10 PM  wrote:

> Hi all!
> I;ve faced an issue I couldnt even google.
> Trying to create OSD with two separate LVM for data.db and data, gives me
> intresting error
>
> ```
> root@ceph-uvm2:/# ceph-volume lvm prepare --bluestore  --data
> ceph-block-0/block-0 --block.db ceph-db-0/db-0
> --> Incompatible flags were found, some values may get ignored
> --> Cannot use None (None) with --bluestore (bluestore)
> --> Incompatible flags were found, some values may get ignored
> --> Cannot use --bluestore (bluestore) with --block.db (bluestore)
> Running command: /usr/bin/ceph-authtool --gen-print-key
> Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd
> --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new
> e3b7b9e5-6399-4e61-8634-41a991bb1948
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 auth: unable to find
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or
> directory
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1
> AuthRegistry(0x7f71bc064978) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1 auth: unable to find
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or
> directory
>  stderr: 2024-03-04T11:45:52.263+ 7f71c3a89700 -1
> AuthRegistry(0x7f71bc067fd0) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
>  stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 auth: unable to find
> a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or
> directory
>  stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1
> AuthRegistry(0x7f71c3a87ea0) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
>  stderr: 2024-03-04T11:45:52.267+ 7f71c1024700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>  stderr: 2024-03-04T11:45:52.267+ 7f71c2026700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>  stderr: 2024-03-04T11:45:52.267+ 7f71c1825700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [1]
>  stderr: 2024-03-04T11:45:52.267+ 7f71c3a89700 -1 monclient:
> authenticate NOTE: no keyring found; disabled cephx authentication
>  stderr: [errno 13] RADOS permission denied (error connecting to the
> cluster)
> -->  RuntimeError: Unable to create a new OSD id
> ```
>
> And here is output which shows my created volumes
>
> ```
> root@ceph-uvm2:/# lvs
>   LV  VG   Attr   LSize   Pool Origin Data%  Meta%  Move
> Log Cpy%Sync Convert
>   block-0 ceph-block-0 -wi-a-  16.37t
>
>   db-0ceph-db-0-wi-a- 326.00g
> ```
>
> CEPH was rolled out by using cephadm and all the commands I do under
> cephadm shell
>
> I have no idea what is the reason of that error and that is why I am here.
>
>
> Any help is appreciated.
> Thanks in advance.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph reef mon is not starting after host reboot

2024-03-06 Thread Adam King
When you ran this, was it directly on the host, or did you run `cephadm
shell` first? The two things you generally need to connect to the cluster
are a keyring with the proper permissions and a ceph conf that includes the
locations of the mon daemons (that "RADOS timed out" error is usually what
you get when connecting to the cluster fails; a bunch of different causes
all end with that error). When you run `cephadm shell` on a host that has
the keyring and config present, which the host you bootstrapped the cluster
on should, it starts a bash shell inside a container that mounts the
keyring and the config and has all the ceph packages you need to talk to
the cluster. If you weren't using the shell, or were trying this from a
node other than the bootstrap node, it would be worth trying that
combination.

Otherwise, I see the title of your message says a mon is down. For
debugging that, I'd think we'd need to see the journal logs from when it
failed to start. Running `cephadm ls --no-detail | grep systemd` on the
host where the mon is (NOT from within `cephadm shell`, directly on the
host) will list the systemd units for all the daemons cephadm has deployed
there. You can use that systemd unit name to grab the journal logs.
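A minimal sketch of what that looks like, assuming the mon daemon is named
mon.ceph-mon-01 (take the real unit name from the `cephadm ls` output) and
substituting your cluster's fsid:

```
# run directly on the host, not inside `cephadm shell`
cephadm ls --no-detail | grep systemd
# then follow the unit it reports, for example:
journalctl -u ceph-<fsid>@mon.ceph-mon-01.service --since "1 hour ago"
```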

On Wed, Mar 6, 2024 at 2:09 PM  wrote:

> Hi guys,
>
> i am very new to ceph, but after multiple attempts i was able
> to install a ceph reef cluster on debian-12 with the cephadm tool in a test
> environment with 2 mons and 3 OSDs on VMs. All seemed good and i was
> exploring more, so i rebooted the cluster and found that now i am not
> able to access the ceph dashboard, and i have tried to check this
>
> root@ceph-mon-01:/# ceph orch ls
> 2024-03-01T08:53:05.051+ 7ff7602b8700  0 monclient(hunting):
> authenticate timed out after 300
> [errno 110] RADOS timed out (error connecting to the cluster)
>
> i have not configured RADOS and i have no clue about it. Any help would
> be very much appreciated; i keep hitting the same issue.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgraded 16.2.14 to 16.2.15

2024-03-05 Thread Adam King
There was a bug with this that was fixed by
https://github.com/ceph/ceph/pull/52122 (which also specifically added an
integration test for this case). It looks like it's missing a reef and
quincy backport though unfortunately. I'll try to open one for both.

On Tue, Mar 5, 2024 at 8:26 AM Eugen Block  wrote:

> It seems to be an issue with the service type (in this case "mon"),
> it's not entirely "broken", with the node-exporter it works:
>
> quincy-1:~ # cat node-exporter.yaml
> service_type: node-exporter
> service_name: node-exporter
> placement:
>host_pattern: '*'
> extra_entrypoint_args:
>-
> "--collector.textfile.directory=/var/lib/node_exporter/textfile_collector2"
>
> quincy-1:~ # ceph orch apply -i node-exporter.yaml
> Scheduled node-exporter update...
>
> I'll keep looking... unless one of the devs is reading this thread and
> finds it quicker.
>
>
> Zitat von Eugen Block :
>
> > Oh, you're right. I just checked on Quincy as well at it failed with
> > the same error message. For pacific it still works. I'll check for
> > existing tracker issues.
> >
> > Zitat von Robert Sander :
> >
> >> Hi,
> >>
> >> On 3/5/24 08:57, Eugen Block wrote:
> >>
> >>> extra_entrypoint_args:
> >>>   -
> >>>
> '--mon-rocksdb-options=write_buffer_size=33554432,compression=kLZ4Compression,level_compaction_dynamic_level_bytes=true,bottommost_compression=kLZ4HCCompression,max_background_jobs=4,max_subcompactions=2'
> >>
> >> When I try this on my test cluster with Reef 18.2.1 the
> >> orchestrator tells me:
> >>
> >> # ceph orch apply -i mon.yml
> >> Error EINVAL: ServiceSpec: __init__() got an unexpected keyword
> >> argument 'extra_entrypoint_args'
> >>
> >> It's a documented feature:
> >>
> >>
> https://docs.ceph.com/en/reef/cephadm/services/#cephadm-extra-entrypoint-args
> >>
> >> Regards
> >> --
> >> Robert Sander
> >> Heinlein Consulting GmbH
> >> Schwedter Str. 8/9b, 10119 Berlin
> >>
> >> https://www.heinlein-support.de
> >>
> >> Tel: 030 / 405051-43
> >> Fax: 030 / 405051-19
> >>
> >> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> >> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph orch doesn't execute commands and doesn't report correct status of daemons

2024-03-03 Thread Adam King
Okay, it seems like from what you're saying the RGW image itself isn't
special compared to the other ceph daemons, it's just that you want to use
the image on your local registry. In that case, I would still recommend
just using `ceph orch upgrade start --image ` with the image
from your local registry. It will transition all the ceph daemons rather
than just RGW to that image, but that's generally how cephadm expects
things to be anyway. Assuming that registry is reachable from all the nodes
on the cluster, the upgrade should be able to do it. Side note, but I have
seen some people in the past have issues with cephadm's use of repo digests
when using local registries, so if you're having issues you may want to try
setting the mgr/cephadm/use_repo_digest option to false. Just keep in mind
that means if you want to upgrade to another image, you'd have to make sure
it has a different name (usage of repo digest was added to support
floating tags).
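
As a concrete sketch, using the local registry image name you quoted
(adjust the address and tag to whatever your registry actually serves):

```
# move all ceph daemons to the image from the local registry
ceph orch upgrade start --image 192.168.2.36:4000/ceph/ceph:v17

# only if repo-digest resolution causes problems with the local registry
ceph config set mgr mgr/cephadm/use_repo_digest false
```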

On Fri, Mar 1, 2024 at 11:47 AM wodel youchi  wrote:

> Hi,
>
> I'll try the 'ceph mgr fail' and report back.
>
> In the meantime, my problem with the images...
> I am trying to use my local registry to deploy the different services. I
> don't know how to use the 'apply' and force my cluster to use my local
> registry.
> So basically, what I am doing so far is :
> 1 - ceph orch apply -i rgw-service.yml   < deploy the
> rgw, and this will pull the image from the internet
> 2 - ceph orch daemon redeploy rgw.opsrgw.controllera.gtrttj --image
> 192.168.2.36:4000/ceph/ceph:v17  < Redeploy the daemons of
> that service with my local image.
>
> How may I deploy directly from my local registry?
>
> Regards.
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Quincy] NFS ingress mode haproxy-protocol not recognized

2024-03-03 Thread Adam King
According to https://tracker.ceph.com/issues/58933, that was only
backported as far as reef. If I remember correctly, the reason was that the
ganesha version we were including in our quincy containers wasn't new
enough to support the feature on that end, so backporting the
nfs/orchestration side of it wouldn't have been useful.

On Sun, Mar 3, 2024 at 8:25 AM wodel youchi  wrote:

> Hi;
>
> I tried to create an NFS cluster using this command :
> [root@controllera ceph]# ceph nfs cluster create mynfs "3 controllera
> controllerb controllerc" --ingress --virtual_ip 20.1.0.201 --ingress-mode
> haproxy-protocol
> Invalid command: haproxy-protocol not in default|keepalive-only
>
> And I got this error : Invalid command haproxy-protocol
> I am using Quincy : ceph version 17.2.7 (...) quincy (stable)
>
> Is it not supported yet?
>
> Regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph orch doesn't execute commands and doesn't report correct status of daemons

2024-03-01 Thread Adam King
There have been bugs in the past where things have gotten "stuck". Usually
I'd say check the REFRESHED column in the output of `ceph orch ps`. It
should refresh the daemons on each host roughly every 10 minutes, so if you
see some value much larger than that, things are probably actually stuck.
If they are, the first thing to try is usually a mgr failover (`ceph mgr
fail`).
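
In practice that check and the failover are just these two commands,
neither of which needs anything cluster-specific:

```
# REFRESHED values much larger than ~10 minutes suggest things are stuck
ceph orch ps
# fail over to a standby mgr so the orchestrator module re-initializes
ceph mgr fail
```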

Are you using a special RGW image different from the other ceph daemons in
the cluster? The typical case is that all the ceph daemons use the same
image and then get updated using the `ceph orch upgrade start ...` command.

On Fri, Mar 1, 2024 at 4:26 AM wodel youchi  wrote:

> Hi,
>
> I have finished the conversion from ceph-ansible to cephadm yesterday.
> Everything seemed to be working until this morning, I wanted to redeploy
> rgw service to specify the network to be used.
>
> So I deleted the rgw services with ceph orch rm, then I prepared a yml file
> with the new conf. I applied the file and the new rgw service was started
> but it was launched with an external image, so I wanted to redeploy using
> my local image so I did a redeploy ... and then nothing happened, I get the
> rescheduled message but nothing happened, then I restarted one of the
> controllers, the orchestrator doesn't seem to be aware that some service
> have restarted???
>
> PS : I don't fully master the cephadm command line and use.
>
> Regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Migration from ceph-ansible to Cephadm

2024-02-29 Thread Adam King
>
> - I still have the ceph-crash container, what should I do with it?
>

If it's the old one, I think you can remove it. Cephadm can deploy its own
crash service (`ceph orch apply crash` if it hasn't). You can check if
`crash` is listed under `ceph orch ls` and if it is there you can do `ceph
orch ps --daemon-type crash` to check what crash daemons have been
deployed. There should generally be one on each host.

- The new rgw and mds daemons have some random string in their names (like
> rgw.opsrgw.controllera.*pkajqw*), is this correct ?
>

Yes, cephadm appends 6 random characters to the names of daemon types that
it might have to co-locate on the same node.

- How should I proceed with the monitoring stack (grafana, prometheus,
> alermanager and node-exporter)? should I stop then delete the old ones,
> then deploy the new ones with ceph orch?
>

 Cephadm has a `cephadm adopt` command that can be used to put prometheus,
grafana, and alertmanager daemons under its control. You can do it sort of
like how its done for the mon/mgr in
https://docs.ceph.com/en/quincy/cephadm/adoption/#adoption-process but just
give the name for the prometheus, grafana, or alertmanager daemon. That's
all assuming these things have data you don't want to lose. If you just
want monitoring under cephadm going forward and don't care about losing
historical data, it could be easier to just remove them and deploy the
monitoring stack with cephadm. (`ceph orch apply node-exporter`, `ceph orch
apply grafana`, `ceph orch apply alertmanager`, `ceph orch apply
prometheus`). Note that for node-exporter you'll have to do it this way
anyway as there is no adoption process for node-exporter.
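
Roughly, the two paths look like this. The daemon names after `--name` are
only examples and need to match what `cephadm ls` reports on that host:

```
# adopt the existing legacy monitoring daemons (run on the host that has them)
cephadm adopt --style legacy --name prometheus.controllera
cephadm adopt --style legacy --name grafana.controllera
cephadm adopt --style legacy --name alertmanager.controllera

# or drop the old ones and let cephadm deploy a fresh monitoring stack
ceph orch apply node-exporter
ceph orch apply grafana
ceph orch apply alertmanager
ceph orch apply prometheus
```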

On Thu, Feb 29, 2024 at 7:52 AM wodel youchi  wrote:

> Hi,
>
> I am in the middle of migration from ceph-ansible to cephadm (version
> quincy), so far so good ;-). And I have some questions :
> - I still have the ceph-crash container, what should I do with it?
> - The new rgw and mds daemons have some random string in their names (like
> rgw.opsrgw.controllera.*pkajqw*), is this correct ?
> - How should I proceed with the monitoring stack (grafana, prometheus,
> alermanager and node-exporter)? should I stop then delete the old ones,
> then deploy the new ones with ceph orch?
>
> Regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Some questions about cephadm

2024-02-26 Thread Adam King
In regards to
>
> From the reading you gave me I have understood the following :
> 1 - Set osd_memory_target_autotune to true then set
> autotune_memory_target_ratio to 0.2
> 2 - Or do the math. For my setup I have 384Go per node, each node has 4
> nvme disks of 7.6To, 0.2 of memory is 19.5G. So each OSD will have 19G of
> memory.
>
> Question : Should I take into account the size of the disk when calculating
> the required memory for an OSD?
>
The memory in question is RAM, not disk space. To see the exact value
cephadm will see for the amount of memory (in kb, we multiply by 1024 when
actually using it) when doing this autotuning, you can run

[root@vm-00 ~]# cephadm gather-facts | grep memory_total
  "memory_total_kb": 40802184,

on your machine. Then it multiplies that by the ratio and subtracts out an
amount for every non-OSD daemon on the node. Specifically (taking this from
the code)

min_size_by_type = {
'mds': 4096 * 1048576,
'mgr': 4096 * 1048576,
'mon': 1024 * 1048576,
'crash': 128 * 1048576,
'keepalived': 128 * 1048576,
'haproxy': 128 * 1048576,
}
default_size = 1024 * 1048576

so 1 GB for most daemons, with mgr and mds requiring extra (although for
mds it also uses the `mds_cache_memory_limit` config option if it's set)
and some others requiring less. What's left after all that is done is then
divided by the number of OSDs deployed on the host. If that number ends up
too small, however, there is some floor that it won't set below, but I
can't remember off the top of my head what that is. Maybe 4 GB.
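
As a rough worked example for the nodes you describe (384 GB of RAM, 4 NVMe
OSDs per node), assuming a ratio of 0.2 and one mon, one mgr and one crash
daemon co-located on the host; the exact figures on your hosts will differ:

```
# memory_total * ratio:           384 GiB * 0.2                      ~ 76.8 GiB
# minus co-located daemons:       1 GiB (mon) + 4 GiB (mgr) + 128 MiB (crash)
# left for OSDs:                  ~ 71.7 GiB
# per-OSD target (4 OSDs/host):   71.7 GiB / 4                       ~ 17.9 GiB
```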

On Mon, Feb 26, 2024 at 5:10 AM wodel youchi  wrote:

> Thank you all for your help.
>
> @Adam
> From the reading you gave me I have understood the following :
> 1 - Set osd_memory_target_autotune to true then set
> autotune_memory_target_ratio to 0.2
> 2 - Or do the math. For my setup I have 384Go per node, each node has 4
> nvme disks of 7.6To, 0.2 of memory is 19.5G. So each OSD will have 19G of
> memory.
>
> Question : Should I take into account the size of the disk when calculating
> the required memory for an OSD?
>
>
> I have another problem, the local registry. I deployed a local registry
> with the required images, then I used cephadm-ansible to prepare my hosts
> and inject the local registry url into /etc/container/registry.conf file
>
> Then I tried to deploy using this command on the admin node:
> cephadm --image 192.168.2.36:4000/ceph/ceph:v17 bootstrap --mon-ip
> 10.1.0.23 --cluster-network 10.2.0.0/16
>
> After the boot strap I found that it still downloads the images from the
> internet, even the ceph image itself, I see two images one from my registry
> the second from quay.
>
> There is a section that talks about using a local registry here
>
> https://docs.ceph.com/en/reef/cephadm/install/#deployment-in-an-isolated-environment
> ,
> but it's not clear especially about the other images. It talks about
> preparing a temporary file named initial-ceph.conf, then it does not use
> it???!!!
>
> Could you help?
>
> Regards.
>
> Le jeu. 22 févr. 2024 à 11:10, Eugen Block  a écrit :
>
> > Hi,
> >
> > just responding to the last questions:
> >
> > >- After the bootstrap, the Web interface was accessible :
> > >   - How can I access the wizard page again? If I don't use it the
> > first
> > >   time I could not find another way to get it.
> >
> > I don't know how to recall the wizard, but you should be able to
> > create a new dashboard user with your desired role (e. g.
> > administrator) from the CLI:
> >
> > ceph dashboard ac-user-create <username> [<rolename>] -i <password-file>
> >
> > >   - I had a problem with telemetry, I did not configure telemetry,
> > then
> > >   when I clicked the button, the web gui became
> > inaccessible.!!!
> >
> > You can see what happened in the active MGR log.
> >
> > Zitat von wodel youchi :
> >
> > > Hi,
> > >
> > > I have some questions about ceph using cephadm.
> > >
> > > I used to deploy ceph using ceph-ansible, now I have to move to
> cephadm,
> > I
> > > am in my learning journey.
> > >
> > >
> > >- How can I tell my cluster that it's a part of an HCI deployment?
> > With
> > >ceph-ansible it was easy using is_hci : yes
> > >- The documentation of ceph does not indicate what versions of
> > grafana,
> > >prometheus, ...etc should be used with a certain version.
> > >   - I am trying to deploy Quincy, I did a bootstrap to see what
> > >   containers were downloaded and their version.
> > >   - I am asking because I need to use a local registry to deploy
> > those
> > >   images.
> > >- After the bootstrap, the Web interface was accessible :
> > >   - How can I access the wizard page again? If I don't use it the
> > first
> > >   time I could not find another way to get it.
> > >   - I had a problem with telemetry, I did not configure telemetry,
> > then
> > >   when I clicked the button, the web gui became
> > inaccessible.!!!
> 

[ceph-users] Re: Some questions about cephadm

2024-02-21 Thread Adam King
Cephadm does not have some variable that explicitly says it's an HCI
deployment. However, the HCI variable in ceph ansible I believe only
controlled the osd_memory_target attribute, which would automatically set
it to 20% (HCI) or 70% (non-HCI) of the memory on the node divided by the
number of OSDs on the node. Cephadm
doesn't have that exactly, but has a similar feature of osd memory
autotuning which has some docs here
https://docs.ceph.com/en/latest/cephadm/services/osd/#automatically-tuning-osd-memory.
The warning indicates that it isn't ideal for HCI to use this, but I think
if you set the mgr/cephadm/autotune_memory_target_ratio to a value closer
to 0.2 rather than the default 0.7, it might end up working out close to
how ceph-ansible worked with the is_hci option set to true. Otherwise, you
can set the option yourself after doing a similar calculation to what
ceph-ansible did with something like `ceph config set osd/host:
osd_memory_target ` where that amount is the per OSD memory target
you want for OSDs on that host.

I don't think we have documentation of what version to use for the
monitoring stack daemons, but the assumption is the default version defined
in cephadm should be okay to use unless you have some specific use case
that requires a different one. There are docs on how to change it to use a
different image if you'd like to do so
https://docs.ceph.com/en/latest/cephadm/services/monitoring/#using-custom-images
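
If you do end up needing different images (for example to pull them from a
local registry), the knobs are per-service image options set before
deploying; the registry address and tags below are placeholders, not
recommendations:

```
ceph config set mgr mgr/cephadm/container_image_prometheus <registry>/prometheus/prometheus:<tag>
ceph config set mgr mgr/cephadm/container_image_grafana <registry>/ceph/ceph-grafana:<tag>
ceph config set mgr mgr/cephadm/container_image_alertmanager <registry>/prometheus/alertmanager:<tag>
ceph config set mgr mgr/cephadm/container_image_node_exporter <registry>/prometheus/node-exporter:<tag>
```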

Can't speak much to the setup wizard in the dashboard and whether it's
possible to get it going again after closing out of it, or the telemetry
related dashboard issue.

On Wed, Feb 21, 2024 at 11:08 AM wodel youchi 
wrote:

> Hi,
>
> I have some questions about ceph using cephadm.
>
> I used to deploy ceph using ceph-ansible, now I have to move to cephadm, I
> am in my learning journey.
>
>
>- How can I tell my cluster that it's a part of an HCI deployment? With
>ceph-ansible it was easy using is_hci : yes
>- The documentation of ceph does not indicate what versions of grafana,
>prometheus, ...etc should be used with a certain version.
>   - I am trying to deploy Quincy, I did a bootstrap to see what
>   containers were downloaded and their version.
>   - I am asking because I need to use a local registry to deploy those
>   images.
>- After the bootstrap, the Web interface was accessible :
>   - How can I access the wizard page again? If I don't use it the first
>   time I could not find another way to get it.
>   - I had a problem with telemetry, I did not configure telemetry, then
>   when I clicked the button, the web gui became
> inaccessible.!!!
>
>
>
> Regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: first_virtual_router_id not allowed in ingress manifest

2024-02-21 Thread Adam King
It seems the quincy backport for that feature (
https://github.com/ceph/ceph/pull/53098) was merged Oct 1st 2023. According
to the quincy part of
https://docs.ceph.com/en/latest/releases/#release-timeline it looks like
that would mean it would only be present in 17.2.7, but not 17.2.6.

On Wed, Feb 21, 2024 at 8:52 AM Ramon Orrù  wrote:

> Hello,
> I deployed RGW and NFSGW services over a ceph (version 17.2.6) cluster.
> Both services are being accessed using 2 (separated) ingresses, actually
> working as expected when contacted by clients.
> Besides, I’m experiencing some problem while letting the ingresses work on
> the same cluster.
>
> keepalived logs are full of  "(VI_0) received an invalid passwd!”  lines,
> because both ingresses are using the same virtualrouter id, so I’m trying
> to introduce some additional parameter in service definition manifests to
> workaround the problem (first_virtual_router_id, default value is 50),
>  below are the manifest content:
>
> service_type: ingress
> service_id: ingress.rgw
> service_name: ingress.rgw
> placement:
>   hosts:
>   - c00.domain.org
>   - c01.domain.org
>   - c02.domain.org
> spec:
>   backend_service: rgw.rgw
>   frontend_port: 8080
>   monitor_port: 1967
>   virtual_ips_list:
> - X.X.X.200/24
>   first_virtual_router_id: 60
>
> service_type: ingress
> service_id: nfs.nfsgw
> service_name: ingress.nfs.nfsgw
> placement:
>   count: 2
> spec:
>   backend_service: nfs.nfsgw
>   frontend_port: 2049
>   monitor_port: 9049
>   virtual_ip: X.X.X.222/24
>   first_virtual_router_id: 70
>
>
> When I apply the manifests I’m getting the error, for both ingress
> definitions:
>
> Error EINVAL: ServiceSpec: __init__() got an unexpected keyword argument
> ‘first_virtual_router_id'
>
> even the documentation for quincy version describes the option and
> includes some similar example at:
> https://docs.ceph.com/en/quincy/cephadm/services/rgw
>
> Both manifests are working smoothly if I remove the
> first_virtual_router_id line.
>
> Any ideas on how I can troubleshoot the issue?
>
> Thanks in advance
>
> Ramon
>
> --
> Ramon Orrù
> Servizio di Calcolo
> Laboratori Nazionali di Frascati
> Istituto Nazionale di Fisica Nucleare
> Via E. Fermi, 54 - 00044 Frascati (RM) Italy
> Tel. +39 06 9403 2345
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific Bug?

2024-02-14 Thread Adam King
Does seem like a bug, actually in more than just this command. The `ceph
orch host ls` with the --label and/or --host-pattern flag just piggybacks
off of the existing filtering done for placements in service specs. I've
just taken a look and you actually can create the same behavior with the
placement of an actual service, for example, with

[ceph: root@vm-00 /]# ceph orch host ls
HOST   ADDR LABELS  STATUS
vm-00  192.168.122.7_admin
vm-01  192.168.122.171  foo
vm-02  192.168.122.147  foo
3 hosts in cluster

and spec

[ceph: root@vm-00 /]# cat ne.yaml
service_type: node-exporter
service_name: node-exporter
placement:
  host_pattern: 'vm-0[0-1]'

you get the expected placement on vm-00 and vm-01

[ceph: root@vm-00 /]# ceph orch ps --daemon-type node-exporter
NAME HOST   PORTS   STATUS REFRESHED  AGE  MEM USE
 MEM LIM  VERSION  IMAGE ID  CONTAINER ID
node-exporter.vm-00  vm-00  *:9100  running (23s)17s ago  23s3636k
   -  1.5.00da6a335fe13  f83e88caa7e0
node-exporter.vm-01  vm-01  *:9100  running (21h) 2m ago  21h16.1M
   -  1.5.00da6a335fe13  a5153c378449

but if I add label to the placement, while still leaving in the host pattern

[ceph: root@vm-00 /]# cat ne.yaml
service_type: node-exporter
service_name: node-exporter
placement:
  label: foo
  host_pattern: 'vm-0[0-1]'

you would expect to only get vm-01 at this point, as it's the only host
that matches both pieces of the placement, but instead you get both vm-01
and vm-02

[ceph: root@vm-00 /]# ceph orch ps --daemon-type node-exporter
NAME HOST   PORTS   STATUS REFRESHED  AGE  MEM USE
 MEM LIM  VERSION  IMAGE ID  CONTAINER ID
node-exporter.vm-01  vm-01  *:9100  running (21h) 4m ago  21h16.1M
   -  1.5.00da6a335fe13  a5153c378449
node-exporter.vm-02  vm-02  *:9100  running (23s)18s ago  23s5410k
   -  1.5.00da6a335fe13  ddd1e643e341

Looking at the scheduling implementation, it seems currently it selects
candidates based on attributes in this order: Explicit host list, label,
host pattern (with some additional handling for count that happens in all
cases). When it finds the first attribute from that list that is present in
the placement (in this case the label), it uses that to select the
candidates and then bails out without any additional filtering on the host
pattern attribute.
with both host_pattern/label and an explicit host list, this case with the
label and host pattern is the only one you can hit where this is an issue,
and I guess was just overlooked. Will take a look at making a patch to fix
this.
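
Until a fix lands, a crude workaround is to take the label-based listing
and filter it yourself, e.g. (assuming the `osds` label and `host1` from
your example):

```
ceph orch host ls --label osds | grep host1
```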

On Tue, Feb 13, 2024 at 7:09 PM Alex  wrote:

> Hello Ceph Gurus!
>
> I'm running Ceph Pacific version.
> if I run
> ceph orch host ls --label osds
> shows all hosts label osds
> or
> ceph orch host ls --host-pattern host1
> shows just host1
> it works as expected
>
> But combining the two the label tag seems to "take over"
>
> ceph orch host ls --label osds --host-pattern host1
> 6 hosts in cluster who had label osds whose hostname matched host1
> shows all host with the label osds instead of only host1.
> So at first the flags seem to act like an OR instead of an AND.
>
> ceph orch host ls --label osds --host-pattern foo
> 6 hosts in cluster who had label osds whose hostname matched foo
> even though "foo" doesn't even exist
>
> ceph orch host ls --label bar --host-pattern host1
> 0 hosts in cluster who had label bar whose hostname matched host1
> if the label and host combo was an OR this should have worked
> there is no label bar but host1 exists so it just disregards the
> host-pattern.
>
> This started because the osd deployment task had both label and
> host_pattern.
> The cluster was attempting to deploy OSDS on all the servers with the
> given tag instead of the one host we needed,
> which caused it to go into warning state.
> If I ran
> ceph orch ls --export --service_name host1
> it also showed both tags and host_pattern.
> unmanaged: false
> placement:
>   host_pattern:
>   label:
> The issue persisted until I removed the label tag.
>
> Thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Pacific: Drain hosts does not remove mgr daemon

2024-01-31 Thread Adam King
If you just manually run `ceph orch daemon rm <mgr-daemon-name>` does it
get removed? I know there's
some logic in host drain that does some ok-to-stop checks that can cause
things to be delayed or stuck if it doesn't think it's safe to remove the
daemon for some reason. I wonder if it's being overly cautious here. The
manual command to remove the mgr doesn't do all those checks I believe.
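
Concretely, that would be something like the following, where the daemon
name is whatever `ceph orch ps` reports for the old mgr on the drained host
(the name below is only a placeholder):

```
ceph orch ps --daemon-type mgr
ceph orch daemon rm mgr.oldhost.xyzabc
```

If cephadm refuses because it still considers the daemon part of a managed
service, adding --force to the daemon rm command may be needed.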

On Wed, Jan 31, 2024 at 10:16 AM Mevludin Blazevic 
wrote:

> Hi all,
>
> after performing "ceph orch host drain" on one of our host with only the
> mgr container left, I encounter that another mgr daemon is indeed
> deployed on another host, but the "old" does not get removed from the
> drain command. The same happens if I edit the mgr service via UI to
> define different hosts for the daemon and again the old mgr daemons are
> not getting removed. Any recommendations? I am using a setup with podman
> and RHEL.
>
> Best,
> Mevludin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT meeting notes January 24th 2024

2024-01-24 Thread Adam King
   - Build/package PRs - who is best to review these?


   - Example: https://github.com/ceph/ceph/pull/55218


   - Idea: create a GitHub team specifically for these types of PRs
   https://github.com/orgs/ceph/teams


   - Laura will try to organize people for the group


   - Pacific 16.2.15 status


   - Handful of PRs left in 16.2.15 tag
   https://github.com/ceph/ceph/pulls?q=is%3Apr+is%3Aopen+milestone%3Av16.2.15
that still need to be tested and merged


   - Yuri will begin testing RC after that
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: nfs export over RGW issue in Pacific

2023-12-07 Thread Adam King
Handling of nfs exports over rgw, including the `ceph nfs export create
rgw` command, wasn't added to the nfs module in pacific until 16.2.7.

On Thu, Dec 7, 2023 at 1:35 PM Adiga, Anantha 
wrote:

> Hi,
>
>
> oot@a001s016:~# cephadm version
>
> Using recent ceph image ceph/daemon@sha256
> :261bbe628f4b438f5bf10de5a8ee05282f2697a5a2cb7ff7668f776b61b9d586
>
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific
> (stable)
>
> root@a001s016:~#
>
>
>
> root@a001s016:~# cephadm shell
>
> Inferring fsid 604d56db-2fab-45db-a9ea-c418f9a8cca8
>
> Inferring config
> /var/lib/ceph/604d56db-2fab-45db-a9ea-c418f9a8cca8/mon.a001s016/config
>
> Using recent ceph image ceph/daemon@sha256
> :261bbe628f4b438f5bf10de5a8ee05282f2697a5a2cb7ff7668f776b61b9d586
>
>
>
> root@a001s016:~# ceph version
>
> ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific
> (stable)
>
> -
> But, Cephadm does not show "nfs export create rgw"
>
>
> nfs export create cephfs[--readonly]
> []
>
> nfs export rm  
>
> nfs export delete  
>
> nfs export ls  [--detailed]
>
> nfs export get  
>
> nfs export update
>
> -
>
> However, Ceph Dashboard allows to create the export see below:
>
> Access Type RW
> Cluster nfs-1
> Daemons nfs-1.0.0.zp3110b001a0101.uckows,
> nfs-1.1.0.zp3110b001a0102.hhpebb, nfs-1.2.0.zp3110b001a0103.bbkpcb,
> nfs-1.3.0.zp3110b001a0104.zujkso
> NFS Protocol NFSv4
> Object Gateway User admin
> Path buc-cluster-inventory
> Pseudo /rgwnfs_cluster_inventory
> Squash no_root_squash
> Storage Backend Object Gateway
> Transport TCP
> -
>
> While nfs export is created, pseudo "/rgwnfs_cluster_inventory",   cephadm
> is not listing it
>
> # ceph nfs export ls nfs-1
> [
>   "/cluster_inventory",
>   "/oob_crashdump"
> ]
> #
>
> Anantha
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: error deploying ceph

2023-11-30 Thread Adam King
That message in the `ceph orch device ls` output is just why the device is
unavailable for an OSD. The reason it now has insufficient space in this
case is that you've already put an OSD on it, so it's really just telling you
you can't place another one. So you can expect to see something like that
for each device you place an OSD on and it's nothing to worry about. It's
useful information if, for example, you remove the OSD associated with the
device but forget to zap the device after, and are wondering why you can't
put another OSD on it later.
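
If you ever do want to reuse such a device after removing its OSD, zapping
it clears those reject reasons. A sketch using the host and device from the
output below:

```
ceph orch device zap node1-ceph /dev/xvdb --force
```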

On Thu, Nov 30, 2023 at 8:00 AM Francisco Arencibia Quesada <
arencibia.franci...@gmail.com> wrote:

> Thanks again guys,
>
> The cluster is healthy now, is this normal?  all looks look except for
> this output
> *Has a FileSystem, Insufficient space (<10 extents) on vgs, LVM detected  *
>
> root@node1-ceph:~# cephadm shell -- ceph status
> Inferring fsid 209a7bf0-8f6d-11ee-8828-23977d76b74f
> Inferring config
> /var/lib/ceph/209a7bf0-8f6d-11ee-8828-23977d76b74f/mon.node1-ceph/config
> Using ceph image with id '921993c4dfd2' and tag 'v17' created on
> 2023-11-22 16:03:22 + UTC
>
> quay.io/ceph/ceph@sha256:dad2876c2916b732d060b71320f97111bc961108f9c249f4daa9540957a2b6a2
>   cluster:
> id: 209a7bf0-8f6d-11ee-8828-23977d76b74f
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum node1-ceph,node2-ceph,node3-ceph (age 2h)
> mgr: node1-ceph.peedpx(active, since 2h), standbys: node2-ceph.ykkvho
> osd: 3 osds: 3 up (since 2h), 3 in (since 2h)
>
>   data:
> pools:   2 pools, 33 pgs
> objects: 7 objects, 449 KiB
> usage:   873 MiB used, 299 GiB / 300 GiB avail
> pgs: 33 active+clean
>
> root@node1-ceph:~# cephadm shell -- ceph orch device ls --wide
> Inferring fsid 209a7bf0-8f6d-11ee-8828-23977d76b74f
> Inferring config
> /var/lib/ceph/209a7bf0-8f6d-11ee-8828-23977d76b74f/mon.node1-ceph/config
> Using ceph image with id '921993c4dfd2' and tag 'v17' created on
> 2023-11-22 16:03:22 + UTC
>
> quay.io/ceph/ceph@sha256:dad2876c2916b732d060b71320f97111bc961108f9c249f4daa9540957a2b6a2
> HOSTPATH   TYPE  TRANSPORT  RPM  DEVICE ID   SIZE  HEALTH
>  IDENT  FAULT  AVAILABLE  REFRESHED  REJECT REASONS
>
> node1-ceph  /dev/xvdb  ssd   100G  N/A
>N/ANo 27m agoHas a FileSystem, Insufficient space (<10
> extents) on vgs, LVM detected
> node2-ceph  /dev/xvdb  ssd   100G  N/A
>N/ANo 27m agoHas a FileSystem, Insufficient space (<10
> extents) on vgs, LVM detected
> node3-ceph  /dev/xvdb  ssd   100G  N/A
>N/A    No     27m agoHas a FileSystem, Insufficient space (<10
> extents) on vgs, LVM detected
> root@node1-ceph:~#
>
> On Wed, Nov 29, 2023 at 10:38 PM Adam King  wrote:
>
>> To run a `ceph orch...` (or really any command to the cluster) you should
>> first open a shell with `cephadm shell`. That will put you in a bash shell
>> inside a container that has the ceph packages matching the ceph version in
>> your cluster. If you just want a single command rather than an interactive
>> shell, you can also do `cephadm shell -- ceph orch...`. Also, this might
>> not turn out to be an issue, but just thinking ahead, the devices cephadm
>> will typically allow you to put an OSD on should match what's output by
>> `ceph orch device ls` (which is populated by `cephadm ceph-volume --
>> inventory --format=json-pretty` if you want to look further). So I'd
>> generally say to always check that before making any OSDs through the
>> orchestrator. I also generally like to recommend setting up OSDs through
>> drive group specs (
>> https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications)
>> over using `ceph orch daemon add osd...` although that's a tangent to what
>> you're trying to do now.
>>
>> On Wed, Nov 29, 2023 at 4:14 PM Francisco Arencibia Quesada <
>> arencibia.franci...@gmail.com> wrote:
>>
>>> Thanks so much Adam, that worked great, however I can not add any
>>> storage with:
>>>
>>> sudo cephadm ceph orch daemon add osd node2-ceph:/dev/nvme1n1
>>>
>>> root@node1-ceph:~# ceph status
>>>   cluster:
>>> id: 9d8f1112-8ef9-11ee-838e-a74e679f7866
>>> health: HEALTH_WARN
>>> Failed to apply 1 service(s): osd.all-available-devices
>>> 2 failed cephadm daemon(s)
>>> OSD count 0 < osd_pool_default_size 3
>>>
>>>   services:
>>> mon: 1 daemons, quorum nod

[ceph-users] Re: error deploying ceph

2023-11-29 Thread Adam King
To run a `ceph orch...` (or really any command to the cluster) you should
first open a shell with `cephadm shell`. That will put you in a bash shell
inside a container that has the ceph packages matching the ceph version in
your cluster. If you just want a single command rather than an interactive
shell, you can also do `cephadm shell -- ceph orch...`.

Also, this might not turn out to be an issue, but just thinking ahead, the
devices cephadm will typically allow you to put an OSD on should match
what's output by `ceph orch device ls` (which is populated by `cephadm
ceph-volume -- inventory --format=json-pretty` if you want to look
further). So I'd generally say to always check that before making any OSDs
through the orchestrator.

I also generally like to recommend setting up OSDs through drive group
specs (
https://docs.ceph.com/en/latest/cephadm/services/osd/#advanced-osd-service-specifications)
over using `ceph orch daemon add osd...` although that's a tangent to what
you're trying to do now.
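
As a sketch of what that drive-group route can look like; the service_id
and the `all: true` filter are just illustrative, see the linked docs for
the full set of filters:

```
service_type: osd
service_id: default_osd_spec
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
```

Saved to a file, that would be applied with `ceph orch apply -i
osd_spec.yaml --dry-run` first to preview, then again without --dry-run.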

On Wed, Nov 29, 2023 at 4:14 PM Francisco Arencibia Quesada <
arencibia.franci...@gmail.com> wrote:

> Thanks so much Adam, that worked great, however I can not add any storage
> with:
>
> sudo cephadm ceph orch daemon add osd node2-ceph:/dev/nvme1n1
>
> root@node1-ceph:~# ceph status
>   cluster:
> id: 9d8f1112-8ef9-11ee-838e-a74e679f7866
> health: HEALTH_WARN
> Failed to apply 1 service(s): osd.all-available-devices
> 2 failed cephadm daemon(s)
> OSD count 0 < osd_pool_default_size 3
>
>   services:
> mon: 1 daemons, quorum node1-ceph (age 18m)
> mgr: node1-ceph.jitjfd(active, since 17m)
> osd: 0 osds: 0 up, 0 in (since 6m)
>
>   data:
> pools:   0 pools, 0 pgs
> objects: 0 objects, 0 B
> usage:   0 B used, 0 B / 0 B avail
> pgs:
>
> root@node1-ceph:~#
>
> Regards
>
>
>
> On Wed, Nov 29, 2023 at 5:45 PM Adam King  wrote:
>
>> I think I remember a bug that happened when there was a small mismatch
>> between the cephadm version being used for bootstrapping and the container.
>> In this case, the cephadm binary used for bootstrap knows about the
>> ceph-exporter service and the container image being used does not. The
>> ceph-exporter was removed from quincy between 17.2.6 and 17.2.7 so I'd
>> guess the cephadm binary here is a bit older and it's pulling the 17.2.7
>> image. For now, I'd say just workaround this by running bootstrap with
>> `--skip-monitoring-stack` flag. If you want the other services in the
>> monitoring stack after bootstrap you can just run `ceph orch apply
>> <service>` for services alertmanager, prometheus, node-exporter, and
>> grafana and it would get you in the same spot as if you didn't provide the
>> flag and weren't hitting the issue.
>>
>> For an extra note, this failed bootstrap might be leaving things around
>> that could cause subsequent bootstraps to fail. If you run `cephadm ls` and
>> see things listed, you can grab the fsid from the output of that command
>> and run `cephadm rm-cluster --force --fsid ` to clean up the env
>> before bootstrapping again.
>>
>> On Wed, Nov 29, 2023 at 11:32 AM Francisco Arencibia Quesada <
>> arencibia.franci...@gmail.com> wrote:
>>
>>> Hello guys,
>>>
>>> This situation is driving me crazy, I have tried to deploy a ceph
>>> cluster,
>>> in all ways possible, even with ansible and at some point it breaks. I'm
>>> using Ubuntu 22.0.4.  This is one of the errors I'm having, some problem
>>> with ceph-exporter.  Please could you help me, I have been dealing with
>>> this for like 5 days.
>>> Kind regards
>>>
>>>  root@node1-ceph:~# cephadm bootstrap --mon-ip 10.0.0.52
>>> Verifying podman|docker is present...
>>> Verifying lvm2 is present...
>>> Verifying time synchronization is in place...
>>> Unit systemd-timesyncd.service is enabled and running
>>> Repeating the final host check...
>>> docker (/usr/bin/docker) is present
>>> systemctl is present
>>> lvcreate is present
>>> Unit systemd-timesyncd.service is enabled and running
>>> Host looks OK
>>> Cluster fsid: 4ce3a92a-8ddd-11ee-9b23-6341187f70c1
>>> Verifying IP 10.0.0.52 port 3300 ...
>>> Verifying IP 10.0.0.52 port 6789 ...
>>> Mon IP `10.0.0.52` is in CIDR network `10.0.0.0/24` <http://10.0.0.0/24>
>>> Mon IP `10.0.0.52` is in CIDR network `10.0.0.0/24` <http://10.0.0.0/24>
>>> Mon IP `10.0.0.52` is in CIDR network `10.0.0.1/32` <http://10.0.0.1/32>
>>> Mon IP `10.0.0.52` is in CIDR network `10.0.0.1/32` <http://10.

[ceph-users] Re: error deploying ceph

2023-11-29 Thread Adam King
I think I remember a bug that happened when there was a small mismatch
between the cephadm version being used for bootstrapping and the container.
In this case, the cephadm binary used for bootstrap knows about the
ceph-exporter service and the container image being used does not. The
ceph-exporter was removed from quincy between 17.2.6 and 17.2.7 so I'd
guess the cephadm binary here is a bit older and it's pulling the 17.2.7
image. For now, I'd say just workaround this by running bootstrap with
`--skip-monitoring-stack` flag. If you want the other services in the
monitoring stack after bootstrap you can just run `ceph orch apply
<service>` for services alertmanager, prometheus, node-exporter, and
grafana and it would get you in the same spot as if you didn't provide the
flag and weren't hitting the issue.
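
Putting that together, the workaround would look roughly like this (using
the mon IP from your bootstrap output; everything else stays as before):

```
cephadm bootstrap --mon-ip 10.0.0.52 --skip-monitoring-stack
# then, once the cluster is up:
ceph orch apply prometheus
ceph orch apply alertmanager
ceph orch apply node-exporter
ceph orch apply grafana
```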

For an extra note, this failed bootstrap might be leaving things around
that could cause subsequent bootstraps to fail. If you run `cephadm ls` and
see things listed, you can grab the fsid from the output of that command
and run `cephadm rm-cluster --force --fsid ` to clean up the env
before bootstrapping again.

On Wed, Nov 29, 2023 at 11:32 AM Francisco Arencibia Quesada <
arencibia.franci...@gmail.com> wrote:

> Hello guys,
>
> This situation is driving me crazy, I have tried to deploy a ceph cluster,
> in all ways possible, even with ansible and at some point it breaks. I'm
> using Ubuntu 22.0.4.  This is one of the errors I'm having, some problem
> with ceph-exporter.  Please could you help me, I have been dealing with
> this for like 5 days.
> Kind regards
>
>  root@node1-ceph:~# cephadm bootstrap --mon-ip 10.0.0.52
> Verifying podman|docker is present...
> Verifying lvm2 is present...
> Verifying time synchronization is in place...
> Unit systemd-timesyncd.service is enabled and running
> Repeating the final host check...
> docker (/usr/bin/docker) is present
> systemctl is present
> lvcreate is present
> Unit systemd-timesyncd.service is enabled and running
> Host looks OK
> Cluster fsid: 4ce3a92a-8ddd-11ee-9b23-6341187f70c1
> Verifying IP 10.0.0.52 port 3300 ...
> Verifying IP 10.0.0.52 port 6789 ...
> Mon IP `10.0.0.52` is in CIDR network `10.0.0.0/24` 
> Mon IP `10.0.0.52` is in CIDR network `10.0.0.0/24` 
> Mon IP `10.0.0.52` is in CIDR network `10.0.0.1/32` 
> Mon IP `10.0.0.52` is in CIDR network `10.0.0.1/32` 
> Internal network (--cluster-network) has not been provided, OSD replication
> will default to the public_network
> Pulling container image quay.io/ceph/ceph:v17...
> Ceph version: ceph version 17.2.7
> (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> Extracting ceph user uid/gid from container image...
> Creating initial keys...
> Creating initial monmap...
> Creating mon...
> Waiting for mon to start...
> Waiting for mon...
> mon is available
> Assimilating anything we can from ceph.conf...
> Generating new minimal ceph.conf...
> Restarting the monitor...
> Setting mon public_network to 10.0.0.1/32,10.0.0.0/24
> Wrote config to /etc/ceph/ceph.conf
> Wrote keyring to /etc/ceph/ceph.client.admin.keyring
> Creating mgr...
> Verifying port 9283 ...
> Waiting for mgr to start...
> Waiting for mgr...
> mgr not available, waiting (1/15)...
> mgr not available, waiting (2/15)...
> mgr not available, waiting (3/15)...
> mgr not available, waiting (4/15)...
> mgr not available, waiting (5/15)...
> mgr is available
> Enabling cephadm module...
> Waiting for the mgr to restart...
> Waiting for mgr epoch 5...
> mgr epoch 5 is available
> Setting orchestrator backend to cephadm...
> Generating ssh key...
> Wrote public SSH key to /etc/ceph/ceph.pub
> Adding key to root@localhost authorized_keys...
> Adding host node1-ceph...
> Deploying mon service with default placement...
> Deploying mgr service with default placement...
> Deploying crash service with default placement...
> Deploying ceph-exporter service with default placement...
> Non-zero exit code 22 from /usr/bin/docker run --rm --ipc=host
> --stop-signal=SIGTERM --net=host --entrypoint /usr/bin/ceph --init -e
> CONTAINER_IMAGE=quay.io/ceph/ceph:v17 -e NODE_NAME=node1-ceph -e
> CEPH_USE_RANDOM_NONCE=1 -v
> /var/log/ceph/4ce3a92a-8ddd-11ee-9b23-6341187f70c1:/var/log/ceph:z -v
> /tmp/ceph-tmp6yz3vt5s:/etc/ceph/ceph.client.admin.keyring:z -v
> /tmp/ceph-tmpfhd01qwu:/etc/ceph/ceph.conf:z quay.io/ceph/ceph:v17 orch
> apply ceph-exporter
> /usr/bin/ceph: stderr Error EINVAL: Usage:
> /usr/bin/ceph: stderr   ceph orch apply -i  [--dry-run]
> /usr/bin/ceph: stderr   ceph orch apply 
> [--placement=] [--unmanaged]
> /usr/bin/ceph: stderr
> Traceback (most recent call last):
>   File "/usr/sbin/cephadm", line 9653, in 
> main()
>   File "/usr/sbin/cephadm", line 9641, in main
> r = ctx.func(ctx)
>   File "/usr/sbin/cephadm", line 2205, in _default_image
> return func(ctx)
>   File "/usr/sbin/cephadm", line 5774, in command_bootstrap
> 

[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-16 Thread Adam King
Guillaume ran that patch through the orch suite earlier today before
merging. I think we should be okay on that front. The issue it's fixing was
also particular to rook iirc, which teuthology doesn't cover.

On Thu, Nov 16, 2023 at 10:18 AM Yuri Weinstein  wrote:

> OK I will start building.
>
> Travis, Adam King - any need to rerun any suites?
>
> On Thu, Nov 16, 2023 at 7:14 AM Guillaume Abrioux 
> wrote:
> >
> > Hi Yuri,
> >
> >
> >
> > Backport PR [2] for reef has been merged.
> >
> >
> >
> > Thanks,
> >
> >
> >
> > [2] https://github.com/ceph/ceph/pull/54514/files
> >
> >
> >
> > --
> >
> > Guillaume Abrioux
> >
> > Software Engineer
> >
> >
> >
> > From: Guillaume Abrioux 
> > Date: Wednesday, 15 November 2023 at 21:02
> > To: Yuri Weinstein , Nizamudeen A ,
> Guillaume Abrioux , Travis Nielsen <
> tniel...@redhat.com>
> > Cc: Adam King , Redouane Kachach ,
> dev , ceph-users 
> > Subject: Re: [EXTERNAL] [ceph-users] Re: reef 18.2.1 QE Validation status
> >
> > Hi Yuri, (thanks)
> >
> >
> >
> > Indeed, we had a regression in ceph-volume impacting rook scenarios
> which was supposed to be fixed by [1].
> >
> > It turns out rook's CI didn't catch that fix wasn't enough for some
> reason (I believe the CI run wasn't using the right image, Travis might
> confirm or give more details).
> >
> > Another patch [2] is needed in order to fix this regression.
> >
> >
> >
> > Let me know if more details are needed.
> >
> >
> >
> > Thanks,
> >
> >
> >
> > [1]
> https://github.com/ceph/ceph/pull/54429/commits/ee26074a5e7e90b4026659bf3adb1bc973595e91
> >
> > [2] https://github.com/ceph/ceph/pull/54514/files
> >
> >
> >
> >
> >
> > --
> >
> > Guillaume Abrioux
> >
> > Software Engineer
> >
> >
> >
> > 
> >
> > From: Yuri Weinstein 
> > Sent: 15 November 2023 20:23
> > To: Nizamudeen A ; Guillaume Abrioux <
> gabri...@redhat.com>; Travis Nielsen 
> > Cc: Adam King ; Redouane Kachach ;
> dev ; ceph-users 
> > Subject: [EXTERNAL] [ceph-users] Re: reef 18.2.1 QE Validation status
> >
> >
> >
> > This is on behalf of Guillaume.
> >
> > We have one more last mites issue that may have to be included
> > https://tracker.ceph.com/issues/63545
> https://github.com/ceph/ceph/pull/54514
> >
> > Travis, Redo, Guillaume will provide more context and details.
> >
> > We are assessing the situation as 18.2.1 has been built and signed.
> >
> > On Tue, Nov 14, 2023 at 11:07 AM Yuri Weinstein 
> wrote:
> > >
> > > OK thx!
> > >
> > > We have completed the approvals.
> > >
> > > On Tue, Nov 14, 2023 at 9:13 AM Nizamudeen A  wrote:
> > > >
> > > > dashboard approved. Failure known and unrelated!
> > > >
> > > > On Tue, Nov 14, 2023, 22:34 Adam King  wrote:
> > > >>
> > > >> orch approved.  After reruns, orch/cephadm was just hitting two
> known (nonblocker) issues and orch/rook teuthology suite is known to not be
> functional currently.
> > > >>
> > > >> On Tue, Nov 14, 2023 at 10:33 AM Yuri Weinstein <
> ywein...@redhat.com> wrote:
> > > >>>
> > > >>> Build 4 with https://github.com/ceph/ceph/pull/54224  was built
> and I
> > > >>> ran the tests below and asking for approvals:
> > > >>>
> > > >>> smoke - Laura
> > > >>> rados/mgr - PASSED
> > > >>> rados/dashboard - Nizamudeen
> > > >>> orch - Adam King
> > > >>>
> > > >>> See Build 4 runs - https://tracker.ceph.com/issues/63443#note-1
> > > >>>
> > > >>> On Tue, Nov 14, 2023 at 12:21 AM Redouane Kachach <
> rkach...@redhat.com> wrote:
> > > >>> >
> > > >>> > Yes, cephadm has some tests for monitoring that should be enough
> to ensure basic functionality is working properly. The rest of the changes
> in the PR are for rook orchestrator.
> > > >>> >
> > > >>> > On Tue, Nov 14, 2023 at 5:04 AM Nizamudeen A 
> wrote:
> > > >>> >>
> > > >>> >> dashboard changes are minimal and approved. and since the
> dashboard change is related to the

[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-14 Thread Adam King
orch approved.  After reruns, orch/cephadm was just hitting two known
(nonblocker) issues and orch/rook teuthology suite is known to not be
functional currently.

On Tue, Nov 14, 2023 at 10:33 AM Yuri Weinstein  wrote:

> Build 4 with https://github.com/ceph/ceph/pull/54224 was built and I
> ran the tests below and asking for approvals:
>
> smoke - Laura
> rados/mgr - PASSED
> rados/dashboard - Nizamudeen
> orch - Adam King
>
> See Build 4 runs - https://tracker.ceph.com/issues/63443#note-1
>
> On Tue, Nov 14, 2023 at 12:21 AM Redouane Kachach 
> wrote:
> >
> > Yes, cephadm has some tests for monitoring that should be enough to
> ensure basic functionality is working properly. The rest of the changes in
> the PR are for rook orchestrator.
> >
> > On Tue, Nov 14, 2023 at 5:04 AM Nizamudeen A  wrote:
> >>
> >> dashboard changes are minimal and approved. and since the dashboard
> change is related to the
> >> monitoring stack (prometheus..) which is something not covered in the
> dashboard test suites, I don't think running it is necessary.
> >> But maybe the cephadm suite has some monitoring stack related testings
> written?
> >>
> >> On Tue, Nov 14, 2023 at 1:10 AM Yuri Weinstein 
> wrote:
> >>>
> >>> Ack Travis.
> >>>
> >>> Since it touches a dashboard, Nizam - please reply/approve.
> >>>
> >>> I assume that rados/dashboard tests will be sufficient, but expecting
> >>> your recommendations.
> >>>
> >>> This addition will make the final release likely to be pushed.
> >>>
> >>> On Mon, Nov 13, 2023 at 11:30 AM Travis Nielsen 
> wrote:
> >>> >
> >>> > I'd like to see these changes for much improved dashboard
> integration with Rook. The changes are to the rook mgr orchestrator module,
> and supporting test changes. Thus, this should be very low risk to the ceph
> release. I don't know the details of the tautology suites, but I would
> think suites involving the mgr modules would only be necessary.
> >>> >
> >>> > Travis
> >>> >
> >>> > On Mon, Nov 13, 2023 at 12:14 PM Yuri Weinstein 
> wrote:
> >>> >>
> >>> >> Redouane
> >>> >>
> >>> >> What would be a sufficient level of testing (tautology suite(s))
> >>> >> assuming this PR is approved to be added?
> >>> >>
> >>> >> On Mon, Nov 13, 2023 at 9:13 AM Redouane Kachach <
> rkach...@redhat.com> wrote:
> >>> >> >
> >>> >> > Hi Yuri,
> >>> >> >
> >>> >> > I've just backported to reef several fixes that I introduced in
> the last months for the rook orchestrator. Most of them are fixes for
> dashboard issues/crashes that only happen on Rook environments. The PR [1]
> has all the changes and it was merged into reef this morning. We really
> need these changes to be part of the next reef release as the upcoming Rook
> stable version will be based on it.
> >>> >> >
> >>> >> > Please, can you include those changes in the upcoming reef 18.2.1
> release?
> >>> >> >
> >>> >> > [1] https://github.com/ceph/ceph/pull/54224
> >>> >> >
> >>> >> > Thanks a lot,
> >>> >> > Redouane.
> >>> >> >
> >>> >> >
> >>> >> > On Mon, Nov 13, 2023 at 6:03 PM Yuri Weinstein <
> ywein...@redhat.com> wrote:
> >>> >> >>
> >>> >> >> -- Forwarded message -
> >>> >> >> From: Venky Shankar 
> >>> >> >> Date: Thu, Nov 9, 2023 at 11:52 PM
> >>> >> >> Subject: Re: [ceph-users] Re: reef 18.2.1 QE Validation status
> >>> >> >> To: Yuri Weinstein 
> >>> >> >> Cc: dev , ceph-users 
> >>> >> >>
> >>> >> >>
> >>> >> >> Hi Yuri,
> >>> >> >>
> >>> >> >> On Fri, Nov 10, 2023 at 4:55 AM Yuri Weinstein <
> ywein...@redhat.com> wrote:
> >>> >> >> >
> >>> >> >> > I've updated all approvals and merged PRs in the tracker and
> it looks
> >>> >> >> > like we are ready for gibba, LRC upgrades pending
> approval/update from
> >>> >> >> > Venky.
> >>> >> >>
> 

[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-08 Thread Adam King
>
> https://tracker.ceph.com/issues/63151 - Adam King do we need anything for
> this?
>

Yes, but not an actual code change in the main ceph repo. I'm looking into
a ceph-container change to alter the ganesha version in the container as a
solution.

On Wed, Nov 8, 2023 at 11:10 AM Yuri Weinstein  wrote:

> We merged 3 PRs and rebuilt "reef-release" (Build 2)
>
> Seeking approvals/reviews for:
>
> smoke - Laura, Radek 2 jobs failed in "objectstore/bluestore" tests
> (see Build 2)
> rados - Neha, Radek, Travis, Ernesto, Adam King
> rgw - Casey reapprove on Build 2
> fs - Venky, approve on Build 2
> orch - Adam King
> upgrade/quincy-x (reef) - Laura PTL
> powercycle - Brad (known issues)
>
> We need to close
> https://tracker.ceph.com/issues/63391
> (https://github.com/ceph/ceph/pull/54392) - Travis, Guillaume
> https://tracker.ceph.com/issues/63151 - Adam King do we need anything for
> this?
>
> On Wed, Nov 8, 2023 at 6:33 AM Travis Nielsen  wrote:
> >
> > Yuri, we need to add this issue as a blocker for 18.2.1. We discovered
> this issue after the release of 17.2.7, and don't want to hit the same
> blocker in 18.2.1 where some types of OSDs are failing to be created in new
> clusters, or failing to start in upgraded clusters.
> > https://tracker.ceph.com/issues/63391
> >
> > Thanks!
> > Travis
> >
> > On Wed, Nov 8, 2023 at 4:41 AM Venky Shankar 
> wrote:
> >>
> >> Hi Yuri,
> >>
> >> On Wed, Nov 8, 2023 at 2:32 AM Yuri Weinstein 
> wrote:
> >> >
> >> > 3 PRs above mentioned were merged and I am returning some tests:
> >> >
> https://pulpito.ceph.com/?sha1=55e3239498650453ff76a9b06a37f1a6f488c8fd
> >> >
> >> > Still seeing approvals.
> >> > smoke - Laura, Radek, Prashant, Venky in progress
> >> > rados - Neha, Radek, Travis, Ernesto, Adam King
> >> > rgw - Casey in progress
> >> > fs - Venky
> >>
> >> There's a failure in the fs suite
> >>
> >>
> https://pulpito.ceph.com/vshankar-2023-11-07_05:14:36-fs-reef-release-distro-default-smithi/7450325/
> >>
> >> Seems to be related to nfs-ganesha. I've reached out to Frank Filz
> >> (#cephfs on ceph slack) to have a look. WIll update as soon as
> >> possible.
> >>
> >> > orch - Adam King
> >> > rbd - Ilya approved
> >> > krbd - Ilya approved
> >> > upgrade/quincy-x (reef) - Laura PTL
> >> > powercycle - Brad
> >> > perf-basic - in progress
> >> >
> >> >
> >> > On Tue, Nov 7, 2023 at 8:38 AM Casey Bodley 
> wrote:
> >> > >
> >> > > On Mon, Nov 6, 2023 at 4:31 PM Yuri Weinstein 
> wrote:
> >> > > >
> >> > > > Details of this release are summarized here:
> >> > > >
> >> > > > https://tracker.ceph.com/issues/63443#note-1
> >> > > >
> >> > > > Seeking approvals/reviews for:
> >> > > >
> >> > > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE
> failures)
> >> > > > rados - Neha, Radek, Travis, Ernesto, Adam King
> >> > > > rgw - Casey
> >> > >
> >> > > rgw results are approved. https://github.com/ceph/ceph/pull/54371
> >> > > merged to reef but is needed on reef-release
> >> > >
> >> > > > fs - Venky
> >> > > > orch - Adam King
> >> > > > rbd - Ilya
> >> > > > krbd - Ilya
> >> > > > upgrade/quincy-x (reef) - Laura PTL
> >> > > > powercycle - Brad
> >> > > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
> >> > > >
> >> > > > Please reply to this email with approval and/or trackers of known
> >> > > > issues/PRs to address them.
> >> > > >
> >> > > > TIA
> >> > > > YuriW
> >> > > > ___
> >> > > > ceph-users mailing list -- ceph-users@ceph.io
> >> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> > > >
> >> > >
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> >>
> >> --
> >> Cheers,
> >> Venky
> >> ___
> >> Dev mailing list -- d...@ceph.io
> >> To unsubscribe send an email to dev-le...@ceph.io
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef 18.2.1 QE Validation status

2023-11-07 Thread Adam King
I think the orch code itself is doing fine, but a bunch of tests are
failing due to https://tracker.ceph.com/issues/63151. I think that's likely
related to the ganesha build we have included in the container and if we
want nfs over rgw to work properly in this release I think we'll have to
update it. From previous notes in the tracker, it looks like 5.5-2 is
currently in there (specifically the nfs-ganesha-rgw-5.5-2.el8s.x86_64
package probably has an issue).

On Tue, Nov 7, 2023 at 4:02 PM Yuri Weinstein  wrote:

> 3 PRs above mentioned were merged and I am returning some tests:
> https://pulpito.ceph.com/?sha1=55e3239498650453ff76a9b06a37f1a6f488c8fd
>
> Still seeing approvals.
> smoke - Laura, Radek, Prashant, Venky in progress
> rados - Neha, Radek, Travis, Ernesto, Adam King
> rgw - Casey in progress
> fs - Venky
> orch - Adam King
> rbd - Ilya approved
> krbd - Ilya approved
> upgrade/quincy-x (reef) - Laura PTL
> powercycle - Brad
> perf-basic - in progress
>
>
> On Tue, Nov 7, 2023 at 8:38 AM Casey Bodley  wrote:
> >
> > On Mon, Nov 6, 2023 at 4:31 PM Yuri Weinstein 
> wrote:
> > >
> > > Details of this release are summarized here:
> > >
> > > https://tracker.ceph.com/issues/63443#note-1
> > >
> > > Seeking approvals/reviews for:
> > >
> > > smoke - Laura, Radek, Prashant, Venky (POOL_APP_NOT_ENABLE failures)
> > > rados - Neha, Radek, Travis, Ernesto, Adam King
> > > rgw - Casey
> >
> > rgw results are approved. https://github.com/ceph/ceph/pull/54371
> > merged to reef but is needed on reef-release
> >
> > > fs - Venky
> > > orch - Adam King
> > > rbd - Ilya
> > > krbd - Ilya
> > > upgrade/quincy-x (reef) - Laura PTL
> > > powercycle - Brad
> > > perf-basic - Laura, Prashant (POOL_APP_NOT_ENABLE failures)
> > >
> > > Please reply to this email with approval and/or trackers of known
> > > issues/PRs to address them.
> > >
> > > TIA
> > > YuriW
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.7 QE Validation status

2023-10-17 Thread Adam King
orch approved

On Mon, Oct 16, 2023 at 2:52 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/63219#note-2
> Release Notes - TBD
>
> Issue https://tracker.ceph.com/issues/63192 appears to be failing several
> runs.
> Should it be fixed for this release?
>
> Seeking approvals/reviews for:
>
> smoke - Laura
> rados - Laura, Radek, Travis, Ernesto, Adam King
>
> rgw - Casey
> fs - Venky
> orch - Adam King
>
> rbd - Ilya
> krbd - Ilya
>
> upgrade/quincy-p2p - Known issue IIRC, Casey pls confirm/approve
>
> client-upgrade-quincy-reef - Laura
>
> powercycle - Brad pls confirm
>
> ceph-volume - Guillaume pls take a look
>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> Josh, Neha - gibba and LRC upgrades -- N/A for quincy now after reef
> release.
>
> Thx
> YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT weekly notes October 11th 2023

2023-10-11 Thread Adam King
Here are the notes from this week's CLT call. The call focused heavily on
release process, specifically around figuring out which patches are
required for a release.


   - 17.2.7 status


   - A few more FS PRs and one core PR then we can start release process


   - Trying to finalize list of PRs needed for 18.2.1


   - General discussion about the process for getting the list of required
   PRs for a given release
  - Using per-release github milestones. E.g. a milestone specifically
  for 18.2.1 rather than just reef


   - Would require fixing some scripts that refer to the milestone


   -
  https://github.com/ceph/ceph/blob/main/src/script/backport-resolve-issue


   - For now, continue using etherpad until something more automated exists


   - Create pads a lot earlier


   - Could use existing clt call to try to finalize required PRs for
   releases


   - should be on agenda for every clt call


   - couple of build related PRs that were stalled.


   - for a while, it's not possible to build w/FIO


   - PR https://github.com/ceph/ceph/pull/53346


   - for a while, it's not possible to "make (or ninja) install" with
   dashboard disabled


   - PR https://github.com/ceph/ceph/pull/52313


   - Some more general discussion of how to get more attention for build PRs
  - Laura will start grouping some build PRs with RADOS PRs for
  build/testing in the ci


   - Can make CI builds with CMAKE_BUILD_TYPE=Debug


   - https://github.com/ceph/ceph-build/pull/2167


   - https://github.com/ceph/ceph/pull/53855#issuecomment-1751367302


   -
   
https://shaman.ceph.com/builds/ceph/wip-batrick-testing-20231006.014828-debug/cfbdc475a5ca4098c0330e42cd978c9fd647e012/
   - relies on us removing centos 8 from all testing suites and dropping
   that as a build target


   - Last Pacific?


   - Yes, 17.2.7, then 18.2.1, then 16.2.15 (final)


   - PTLs will need to go through and find what backports still need to get
   into pacific


   - A lot of open pacific backports right now


Thanks,
  - Adam King
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm, cannot use ECDSA key with quincy

2023-10-10 Thread Adam King
The CA signed keys working in pacific was sort of accidental. We found out
that it was a working use case in pacific but not in quincy earlier this
year, which resulted in this tracker https://tracker.ceph.com/issues/62009.
That has since been implemented in main, and backported to the reef branch
(but wasn't in the initial reef release, will be in 18.2.1). It hasn't been
backported to quincy yet though. I think it was decided no more PRs for
17.2.7 which is about to come out so the earliest the support could get to
quincy is 17.2.8. I don't know of any workaround unfortunately.

On Tue, Oct 10, 2023 at 7:57 AM Paul JURCO  wrote:

> Hi!
> If is because old ssh client was replaced with asyncssh (
> https://github.com/ceph/ceph/pull/51899) and only ported to reef, when
> will
> be added to quincy?
> For us is a blocker as we cannot move to cephadm anymore, as we planned for
> Q4.
> Is there a workarround?
>
> Thank you for your efforts!
> Paul
>
>
> On Sat, Oct 7, 2023 at 12:03 PM Paul JURCO  wrote:
>
> > Resent due to moderation when using web interface.
> >
> > Hi ceph users,
> > We have a few clusters with quincy 17.2.6 and we are preparing to migrate
> > from ceph-deploy to cephadm for better management.
> > We are using Ubuntu20 with latest updates (latest openssh).
> > While testing the migration to cephadm on a test cluster with octopus
> (v16
> > latest) we had no issues replacing ceph generated cert/key with our own
> CA
> > signed certs (ECDSA).
> > After upgrading to quincy the test cluster and test again the migration
> we
> > cannot add hosts due to the errors below, ssh access errors specified a
> > while ago in a tracker.
> > We use the following type of certs:
> > Type: ecdsa-sha2-nistp384-cert-...@openssh.com user certificate
> > The certificate works everytime when using ssh client from shell to
> > connect to all hosts in the cluster.
> > We do a ceph mgr fail every time we replace cert/key so they are
> restarted.
> >
> > - cephadm logs from mgr --
> > Oct 06 09:23:27 ceph-m2 bash[1363]: Log: Opening SSH connection to
> > 10.10.10.232, port 22
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connected to SSH server at
> > 10.10.10.232, port 22
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3]   Local address:
> > 10.10.12.160, port 51870
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3]   Peer address:
> 10.10.10.232,
> > port 22
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Beginning auth for user root
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Auth failed for user root
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Connection failure:
> > Permission denied
> > Oct 06 09:23:27 ceph-m2 bash[1363]: [conn=3] Aborting connection
> > Oct 06 09:23:27 ceph-m2 bash[1363]: Traceback (most recent call last):
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/cephadm/ssh.py", line 111, in redirect_log
> > Oct 06 09:23:27 ceph-m2 bash[1363]: yield
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/cephadm/ssh.py", line 90, in _remote_connection
> > Oct 06 09:23:27 ceph-m2 bash[1363]: preferred_auth=['publickey'],
> > options=ssh_options)
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in
> connect
> > Oct 06 09:23:27 ceph-m2 bash[1363]: 'Opening SSH connection to')
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/lib/python3.6/site-packages/asyncssh/connection.py", line 303, in
> _connect
> > Oct 06 09:23:27 ceph-m2 bash[1363]: await conn.wait_established()
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/lib/python3.6/site-packages/asyncssh/connection.py", line 2243, in
> > wait_established
> > Oct 06 09:23:27 ceph-m2 bash[1363]: await self._waiter
> > Oct 06 09:23:27 ceph-m2 bash[1363]: asyncssh.misc.PermissionDenied:
> > Permission denied
> > Oct 06 09:23:27 ceph-m2 bash[1363]: During handling of the above
> > exception, another exception occurred:
> > Oct 06 09:23:27 ceph-m2 bash[1363]: Traceback (most recent call last):
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
> > Oct 06 09:23:27 ceph-m2 bash[1363]: return OrchResult(f(*args,
> > **kwargs))
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/cephadm/module.py", line 2810, in apply
> > Oct 06 09:23:27 ceph-m2 bash[1363]: results.append(self._apply(spec))
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/cephadm/module.py", line 2558, in _apply
> > Oct 06 09:23:27 ceph-m2 bash[1363]: return
> > self._add_host(cast(HostSpec, spec))
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/cephadm/module.py", line 1434, in _add_host
> > Oct 06 09:23:27 ceph-m2 bash[1363]: ip_addr =
> > self._check_valid_addr(spec.hostname, spec.addr)
> > Oct 06 09:23:27 ceph-m2 bash[1363]:   File
> > "/usr/share/ceph/mgr/cephadm/module.py", line 1415, in 

[ceph-users] Re: ceph orch osd data_allocate_fraction does not work

2023-09-21 Thread Adam King
Looks like the orchestration-side support for this got brought into pacific
with the rest of the drive group stuff, but the actual underlying feature
in ceph-volume (from https://github.com/ceph/ceph/pull/40659) never got a
pacific backport. I've opened the backport now
https://github.com/ceph/ceph/pull/53581 and I think another pacific release
is planned so we can hopefully have it fixed there eventually, but it's
definitely broken as of now.  Sorry about that.

On Thu, Sep 21, 2023 at 7:54 AM Boris Behrens  wrote:

> I have a use case where I want to only use a small portion of the disk for
> the OSD and the documentation states that I can use
> data_allocation_fraction [1]
>
> But cephadm can not use this and throws this error:
> /usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized
> arguments: --data-allocate-fraction 0.1
>
> So, what I actually want to achieve:
> Split up a single SSD into:
> 3-5x block.db for spinning disks (5x 320GB or 3x 500GB regarding if I have
> 8TB HDDs or 16TB HDDs)
> 1x SSD OSD (100G) for RGW index / meta pools
> 1x SSD OSD (100G) for RGW gc pool because of this bug [2]
>
> My service definition looks like this:
>
> service_type: osd
> service_id: hdd-8tb
> placement:
>   host_pattern: '*'
> crush_device_class: hdd
> spec:
>   data_devices:
> rotational: 1
> size: ':9T'
>   db_devices:
> rotational: 0
> limit: 5
> size: '1T:2T'
>   encrypted: true
>   block_db_size: 3200
> ---
> service_type: osd
> service_id: hdd-16tb
> placement:
>   host_pattern: '*'
> crush_device_class: hdd
> spec:
>   data_devices:
> rotational: 1
> size: '14T:'
>   db_devices:
> rotational: 0
> limit: 1
> size: '1T:2T'
>   encrypted: true
>   block_db_size: 5000
> ---
> service_type: osd
> service_id: gc
> placement:
>   host_pattern: '*'
> crush_device_class: gc
> spec:
>   data_devices:
> rotational: 0
> size: '1T:2T'
>   encrypted: true
>   data_allocate_fraction: 0.05
> ---
> service_type: osd
> service_id: ssd
> placement:
>   host_pattern: '*'
> crush_device_class: ssd
> spec:
>   data_devices:
> rotational: 0
> size: '1T:2T'
>   encrypted: true
>   data_allocate_fraction: 0.05
>
>
> [1]
>
> https://docs.ceph.com/en/pacific/cephadm/services/osd/#ceph.deployment.drive_group.DriveGroupSpec.data_allocate_fraction
> [2] https://tracker.ceph.com/issues/53585
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.14 pacific QE validation status

2023-08-28 Thread Adam King
cephadm piece of rados can be approved. Failures all look known to me.

On Fri, Aug 25, 2023 at 4:06 PM Radoslaw Zarzynski 
wrote:

> rados approved
>
> On Thu, Aug 24, 2023 at 12:33 AM Laura Flores  wrote:
>
>> Rados summary is here:
>> https://tracker.ceph.com/projects/rados/wiki/PACIFIC#Pacific-v16214-httpstrackercephcomissues62527note-1
>>
>> Most are known, except for two new trackers I raised:
>>
>>1. https://tracker.ceph.com/issues/62557 - rados/dashboard:
>>Teuthology test failure due to "MDS_CLIENTS_LAGGY" warning - Ceph - RADOS
>>2. https://tracker.ceph.com/issues/62559 - rados/cephadm/dashboard:
>>test times out due to host stuck in maintenance mode - Ceph - Orchestrator
>>
>> #1 is related to a similar issue we saw where the MDS_CLIENTS_LAGGY
>> warning was coming up in the Jenkins api check, where these kinds of
>> conditions are expected. In that case, I would call #1 more of a test
>> issue, and say that the fix is to whitelist the warning for that test.
>> Would be good to have someone from CephFS weigh in though-- @Patrick
>> Donnelly  @Dhairya Parmar 
>>
>> #2 looks new to me. @Adam King  can you take a look
>> and see if it's something to be concerned about? The same test failed for a
>> different reason in the rerun, so the failure did not reproduce.
>>
>> On Wed, Aug 23, 2023 at 1:08 PM Laura Flores  wrote:
>>
>>> Thanks Yuri! I will take a look for rados and get back to this thread.
>>>
>>> On Wed, Aug 23, 2023 at 9:41 AM Yuri Weinstein 
>>> wrote:
>>>
>>>> Details of this release are summarized here:
>>>>
>>>> https://tracker.ceph.com/issues/62527#note-1
>>>> Release Notes - TBD
>>>>
>>>> Seeking approvals for:
>>>>
>>>> smoke - Venky
>>>> rados - Radek, Laura
>>>>   rook - Sébastien Han
>>>>   cephadm - Adam K
>>>>   dashboard - Ernesto
>>>>
>>>> rgw - Casey
>>>> rbd - Ilya
>>>> krbd - Ilya
>>>> fs - Venky, Patrick
>>>>
>>>> upgrade/pacific-p2p - Laura
>>>> powercycle - Brad (SELinux denials)
>>>>
>>>>
>>>> Thx
>>>> YuriW
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>
>>>
>>>
>>> --
>>>
>>> Laura Flores
>>>
>>> She/Her/Hers
>>>
>>> Software Engineer, Ceph Storage <https://ceph.io>
>>>
>>> Chicago, IL
>>>
>>> lflo...@ibm.com | lflo...@redhat.com 
>>> M: +17087388804
>>>
>>>
>>>
>>
>> --
>>
>> Laura Flores
>>
>> She/Her/Hers
>>
>> Software Engineer, Ceph Storage <https://ceph.io>
>>
>> Chicago, IL
>>
>> lflo...@ibm.com | lflo...@redhat.com 
>> M: +17087388804
>>
>>
>> ___
>> Dev mailing list -- d...@ceph.io
>> To unsubscribe send an email to dev-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm to setup wal/db on nvme

2023-08-23 Thread Adam King
this should be possible by specifying "data_devices" and "db_devices"
fields in the OSD spec file, each with different filters. There are some
examples in the docs
https://docs.ceph.com/en/latest/cephadm/services/osd/#the-simple-case that
show roughly how that's done, and some other sections (
https://docs.ceph.com/en/latest/cephadm/services/osd/#filters) that go more
in depth on the different filtering options available so you can try and
find one that works for your disks. You can check the output of "ceph orch
device ls --format json | jq" to see what cephadm considers the model,
size, etc. of each device to be, for use in the filtering.
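
To make it concrete, a rough sketch for your layout (the 2.9TB SSDs as data
devices, the 1TB NVMe for the DBs) could look something like the below.
The service_id and the size filters are placeholders, so adjust them to
whatever "ceph orch device ls" reports for your drives:

  service_type: osd
  service_id: ssd_with_nvme_db
  placement:
    host_pattern: '*'
  spec:
    data_devices:
      size: '2TB:'    # matches only the larger SSDs
    db_devices:
      size: ':1.5TB'  # matches only the 1TB NVMe

With a spec like that, cephadm (via ceph-volume) should carve the NVMe into
one DB LV per OSD on its own, and the WAL will live on the DB device unless
you specify separate wal_devices, so no manual partitioning should be
needed.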

On Wed, Aug 23, 2023 at 1:13 PM Satish Patel  wrote:

> Folks,
>
> I have 3 nodes with each having 1x NvME (1TB) and 3x 2.9TB SSD. Trying to
> build ceph storage using cephadm on Ubuntu 22.04 distro.
>
> If I want to use NvME for Journaling (WAL/DB) for my SSD based OSDs then
> how does cephadm handle it?
>
> Trying to find a document where I can tell cephadm to deploy wal/db on nvme
> so it can speed up write optimization. Do I need to create or cephadm will
> create each partition for the number of OSD?
>
> Help me to understand how it works and is it worth doing?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osdspec_affinity error in the Cephadm module

2023-08-16 Thread Adam King
it looks like you've hit https://tracker.ceph.com/issues/58946 which has a
candidate fix open, but nothing merged. The description on the PR with the
candidate fix says "When osdspec_affinity is not set, the drive selection
code will fail. This can happen when a device has multiple LVs where some
of are used by Ceph and at least one LV isn't used by Ceph." so maybe you
can start there in terms of finding a potential workaround for now.
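
If it helps narrow it down, a rough way to see which LVs on the device
carry ceph tags (and which don't) is something like:

  cephadm ceph-volume lvm list
  lvs -o lv_name,vg_name,lv_tags

and then looking for LVs that are missing the ceph.* tags (including the
ceph.osdspec_affinity one). The exact tag names here are from memory, so
treat this as a sketch rather than something definitive.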

On Wed, Aug 16, 2023 at 12:05 PM Adam Huffman 
wrote:

> I've been having fun today trying to invite a new disk that replaced a
> failing one into a cluster.
>
> One of my attempts to apply an OSD spec was clearly wrong, because I now
> have this error:
>
> Module 'cephadm' has failed: 'osdspec_affinity'
>
> and this was the traceback in the mgr logs:
>
>  Traceback (most recent call last):
>File "/usr/share/ceph/mgr/cephadm/utils.py", line 77, in do_work
>  return f(*arg)
>File "/usr/share/ceph/mgr/cephadm/serve.py", line 224, in refresh
>  r = self._refresh_host_devices(host)
>File "/usr/share/ceph/mgr/cephadm/serve.py", line 396, in
> _refresh_host_devices
>  self.update_osdspec_previews(host)
>File "/usr/share/ceph/mgr/cephadm/serve.py", line 412, in
> update_osdspec_previews
>  previews.extend(self.mgr.osd_service.get_previews(search_host))
>File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 258, in
> get_previews
>  return self.generate_previews(osdspecs, host)
>File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 291, in
> generate_previews
>  for host, ds in self.prepare_drivegroup(osdspec):
>File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 225, in
> prepare_drivegroup
>  existing_daemons=len(dd_for_spec_and_host))
>File
> "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py",
> line 35, in __init__
>  self._data = self.assign_devices('data_devices',
> self.spec.data_devices)
>File
> "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py",
> line 19, in wrapper
>  return f(self, ds)
>File
> "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py",
> line 134, in assign_devices
>  if lv['osdspec_affinity'] != self.spec.service_id:
>  KeyError: 'osdspec_affinity'
>
> This cluster is running 16.2.13.
>
> The exported service spec is:
>
> service_type: osd
> service_id: osd_spec-0.3
> service_name: osd.osd_spec-0.3
> placement:
>   host_pattern: cepho-*
> spec:
>   data_devices:
> rotational: true
>   db_devices:
> model: SSDPE2KE032T8L
>   encrypted: true
>   filter_logic: AND
>   objectstore: bluestore
>
> Best Wishes,
> Adam
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm orchestrator does not restart daemons [was: ceph orch upgrade stuck between 16.2.7 and 16.2.13]

2023-08-16 Thread Adam King
I've seen this before where the ceph-volume process hanging causes the
whole serve loop to get stuck (we have a patch to get it to timeout
properly in reef and are backporting to quincy but nothing for pacific
unfortunately). That's why I was asking about the REFRESHED column in the
orch ps/ orch device ls output. Typically when this happens it presents as
the REFRESHED column reporting not having refreshed anything since the
ceph-volume process started hanging. Either way, if you killed those
ceph-volume processes and any new ones aren't hanging and the serve loop is
running okay I'd expect the issues to clear up. This could (and most likely
did) cause both the daemon restarts to not happen and the upgrade to not
progress.
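
For reference, a rough way to check for (and clear) the hung processes on
an affected host and then confirm things are refreshing again:

  ps aux | grep ceph-volume       # look for long-running ceph-volume processes spawned by cephadm
  kill <pid>                      # placeholder pid, for any that are clearly hung
  ceph orch device ls --refresh   # then watch the REFRESHED columns recover

(which processes are actually stuck obviously depends on the host, so the
pid above is just illustrative).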

On Wed, Aug 16, 2023 at 8:50 AM Robert Sander 
wrote:

> On 8/16/23 12:10, Eugen Block wrote:
> > I don't really have a good idea right now, but there was a thread [1]
> > about ssh sessions that are not removed, maybe that could have such an
> > impact? And if you crank up the debug level to 30, do you see anything
> > else?
>
> It was something similar. There were leftover ceph-volume processes
> running on some of the OSD nodes. After killing them the cephadm
> orchestrator is now able to resume the upgrade.
>
> As we also restarted the MGR processes (with systemctl restart
> CONTAINER) there were no leftover SSH sessions.
>
> But the still running ceph-volume processes must have used a lock that
> blocked new cephadm commands.
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm adoption - service reconfiguration changes container image

2023-08-15 Thread Adam King
you could maybe try running "ceph config set global container_image
quay.io/ceph/ceph:v16.2.9" before running the adoption. It seems it still
thinks it should be deploying mons with the default image (
docker.io/ceph/daemon-base:latest-pacific-devel ) for some reason and maybe
that config option is why.
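
Something along these lines, i.e. set and verify the image before adopting
(the tag is just whatever you're adopting with):

  ceph config set global container_image quay.io/ceph/ceph:v16.2.9
  ceph config dump | grep container_image
  cephadm --image quay.io/ceph/ceph:v16.2.9 adopt --style legacy --name mon.$(hostname -s)

so the adopted daemons and anything cephadm later reconfigures should end
up on the same image.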

On Tue, Aug 15, 2023 at 7:29 AM Iain Stott 
wrote:

> Hi Everyone,
>
> We are looking at migrating all our production clusters from ceph-ansible
> to cephadm. We are currently experiencing an issue where when reconfiguring
> a service through ceph orch, it will change the running container image for
> that service which has led to the mgr services running an earlier version
> than the rest of the cluster, this has caused cephadm/ceph orch to not be
> able to manage services in the cluster.
>
> Does anyone have any help with this as I cannot find anything in the docs
> that would correct this.
>
> Cheers
> Iain
>
> [root@de1-ceph-mon-ceph-site-a-1 ~]# cephadm --image
> quay.io/ceph/ceph:v16.2.9 adopt --style legacy --name mon.$(hostname -s)
>
> [root@de1-ceph-mon-ceph-site-a-1 ~]# cat ./mon.yaml
> service_type: mon
> service_name: mon
> placement:
>   host_pattern: '*mon*'
> extra_container_args:
> - -v
> - /etc/ceph/ceph.client.admin.keyring:/etc/ceph/ceph.client.admin.keyring
>
>
> [root@de1-ceph-mon-ceph-site-a-1 ~]# cephadm shell -- ceph config get mgr
> mgr/cephadm/container_image_base
> quay.io/ceph/ceph
>
> [root@de1-ceph-mon-ceph-site-a-1 ~]# podman ps -a | grep mon
> 55fe8bc10476  quay.io/ceph/ceph:v16.2.9
>   -n mgr.de1-ceph-m...  5 minutes ago  Up 5 minutes
> ceph-da9e9837-a3cf-4482-9a13-790a721598cd-mgr-de1-ceph-mon-ceph-site-a-1
>
> [root@de1-ceph-mon-ceph-site-a-1 ~]# cat ./mon.yaml | cephadm --image
> quay.io/ceph/ceph:v16.2.9 shell -- ceph orch apply -i -
> Inferring fsid da9e9837-a3cf-4482-9a13-790a721598cd
> Scheduled mon update...
>
> [root@de1-ceph-mon-ceph-site-a-1 ~]# podman ps -a | grep mon
> ecec1d62c719  docker.io/ceph/daemon-base:latest-pacific-devel
>   -n mon.de1-ceph-m...  25 seconds ago  Up 26 seconds
> ceph-da9e9837-a3cf-4482-9a13-790a721598cd-mon-de1-ceph-mon-ceph-site-a-1
>
> [root@de1-ceph-mon-ceph-site-a-2 ~]# ceph versions
> {
> "mon": {
> "ceph version 16.2.5-387-g7282d81d
> (7282d81d2c500b5b0e929c07971b72444c6ac424) pacific (stable)": 3
> },
> "mgr": {
> "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830)
> pacific (stable)": 3
> },
> "osd": {
> "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830)
> pacific (stable)": 8
> },
> "mds": {},
> "rgw": {
> "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830)
> pacific (stable)": 3
> },
> "overall": {
> "ceph version 16.2.5-387-g7282d81d
> (7282d81d2c500b5b0e929c07971b72444c6ac424) pacific (stable)": 3,
> "ceph version 16.2.9 (4c3647a322c0ff5a1dd2344e039859dcbd28c830)
> pacific (stable)": 14
> }
> }
>
>
> Iain Stott
> OpenStack Engineer
> iain.st...@thehutgroup.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph orch upgrade stuck between 16.2.7 and 16.2.13

2023-08-15 Thread Adam King
with the log to cluster level already on debug, if you do a "ceph mgr fail"
what does cephadm log to the cluster before it reports sleeping? It should
at least be doing something if it's responsive at all. Also, in "ceph orch
ps" and "ceph orch device ls", are the REFRESHED columns reporting that
they've refreshed the info recently (last 10 minutes for daemons, last 30
minutes for devices)?
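
To be explicit about the sequence I mean:

  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph mgr fail
  ceph -W cephadm --watch-debug   # watch what the module logs right after the failover
  ceph orch ps
  ceph orch device ls             # check the REFRESHED column in both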

On Tue, Aug 15, 2023 at 3:46 AM Robert Sander 
wrote:

> Hi,
>
> A healthy 16.2.7 cluster should get an upgrade to 16.2.13.
>
> ceph orch upgrade start --ceph-version 16.2.13
>
> did upgrade MONs, MGRs and 25% of the OSDs and is now stuck.
>
> We tried several "ceph orch upgrade stop" and starts again.
> We "failed" the active MGR but no progress.
> We set the debug logging with "ceph config set mgr
> mgr/cephadm/log_to_cluster_level debug" but it only tells that it starts:
>
> 2023-08-15T09:05:58.548896+0200 mgr.cephmon01 [INF] Upgrade: Started with
> target quay.io/ceph/ceph:v16.2.13
>
> How can we check what is happening (or not happening) here?
> How do we get cephadm to complete the task?
>
> Current status is:
>
> # ceph orch upgrade status
> {
>  "target_image": "quay.io/ceph/ceph:v16.2.13",
>  "in_progress": true,
>  "which": "Upgrading all daemon types on all hosts",
>  "services_complete": [],
>  "progress": "",
>  "message": "",
>  "is_paused": false
> }
>
> # ceph -s
>cluster:
>  id: 3098199a-c7f5-4baf-901c-f178131be6f4
>  health: HEALTH_WARN
>  There are daemons running an older version of ceph
>
>services:
>  mon: 5 daemons, quorum
> cephmon02,cephmon01,cephmon03,cephmon04,cephmon05 (age 4d)
>  mgr: cephmon03(active, since 8d), standbys: cephmon01, cephmon02
>  mds: 2/2 daemons up, 1 standby, 2 hot standby
>  osd: 202 osds: 202 up (since 11d), 202 in (since 13d)
>  rgw: 2 daemons active (2 hosts, 1 zones)
>
>data:
>  volumes: 2/2 healthy
>  pools:   11 pools, 4961 pgs
>  objects: 98.84M objects, 347 TiB
>  usage:   988 TiB used, 1.3 PiB / 2.3 PiB avail
>  pgs: 4942 active+clean
>   19   active+clean+scrubbing+deep
>
>io:
>  client:   89 MiB/s rd, 598 MiB/s wr, 25 op/s rd, 157 op/s wr
>
>progress:
>  Upgrade to quay.io/ceph/ceph:v16.2.13 (0s)
>[]
>
> # ceph versions
> {
>  "mon": {
>  "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e)
> pacific (stable)": 5
>  },
>  "mgr": {
>  "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e)
> pacific (stable)": 3
>  },
>  "osd": {
>  "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e)
> pacific (stable)": 48,
>  "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> pacific (stable)": 154
>  },
>  "mds": {
>  "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> pacific (stable)": 5
>  },
>  "rgw": {
>  "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> pacific (stable)": 2
>  },
>  "overall": {
>  "ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e)
> pacific (stable)": 56,
>  "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> pacific (stable)": 161
>  }
> }
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ref v18.2.0 QE Validation status

2023-07-31 Thread Adam King
orch approved. We have an issue with the "upgrade ls" (which just lists
images that could be upgraded to, not actually necessary for upgrading) and
one known issue in the jaeger-tracing deployment test, but nothing to block
release.

On Sun, Jul 30, 2023 at 11:46 AM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/62231#note-1
>
> Seeking approvals/reviews for:
>
> smoke - Laura, Radek
> rados - Neha, Radek, Travis, Ernesto, Adam King
> rgw - Casey
> fs - Venky
> orch - Adam King
> rbd - Ilya
> krbd - Ilya
> upgrade-clients:client-upgrade* - in progress
> powercycle - Brad
>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> bookworm distro support is an outstanding issue.
>
> TIA
> YuriW
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm logs

2023-07-28 Thread Adam King
Not currently. Those logs aren't generated by any daemons, they come
directly from anything done by the cephadm binary on the host, which tends
to be quite a bit since the cephadm mgr module runs most of its operations
on the host through a copy of the cephadm binary. It doesn't log to journal
because it doesn't have a systemd unit or anything, it's just a python
script being run directly and nothing has been implemented to make it
possible for that to log to journald.
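
So the best you can do for now is read the file directly; rotation should
be handled by a logrotate snippet if your install has one (hedging a bit
here, as that depends on how the host was set up):

  tail -f /var/log/ceph/cephadm.log
  cat /etc/logrotate.d/cephadm   # assuming that file exists on your hosts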

On Fri, Jul 28, 2023 at 9:43 AM Luis Domingues 
wrote:

> Hi,
>
> Quick question about cephadm and its logs. On my cluster I have every logs
> that goes to journald. But on each machine, I still have
> /var/log/ceph/cephadm.log that is alive.
>
> Is there a way to make cephadm log to journald instead of a file? If yes
> did I miss it on the documentation? Of if not is there any reason to log
> into a file while everything else logs to journald?
>
> Thanks
>
> Luis Domingues
> Proton AG
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Failing to restart mon and mgr daemons on Pacific

2023-07-25 Thread Adam King
okay, not much info on the mon failure. The other one at least seems to be
a simple port conflict. What does `sudo netstat -tulpn` give you on that
host?
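
For example:

  sudo netstat -tulpn | grep 9100
  # or, if netstat isn't available:
  sudo ss -tulpn | grep 9100

should show which process is already holding port 9100 (the node-exporter
default) on that host.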

On Tue, Jul 25, 2023 at 12:00 PM Renata Callado Borges <
renato.call...@incor.usp.br> wrote:

> Hi Adam!
>
>
> Thank you for your response, but I am still trying to figure out the
> issue. I am pretty sure the problem occurs "inside" the container, and I
> don´t  know how to get logs from there.
>
> Just in case, this is what systemd sees:
>
>
> Jul 25 12:36:32 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for
> 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:32 darkside1 systemd[1]: Starting Ceph mon.darkside1 for
> 920740ee-cf2d-11ed-9097-08c0eb320eda...
> Jul 25 12:36:33 darkside1 bash[52271]:
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.32233695
> -0300 -03 m=+0.131005321 container create
> 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
> (image=quay.io/ceph/ceph:v15,
> name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darkside1)
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.526241218
> -0300 -03 m=+0.334909578 container init 7cf1d340e0a9658b2c9da
> c3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
> (image=quay.io/ceph/ceph:v15,
> name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darksid
> e1)
> Jul 25 12:36:33 darkside1 podman[52311]: 2023-07-25 12:36:33.556646854
> -0300 -03 m=+0.365315225 container start 7cf1d340e0a9658b2c9d
> ac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
> (image=quay.io/ceph/ceph:v15,
> name=ceph-920740ee-cf2d-11ed-9097-08c0eb320eda-mon.darksi
> de1)
> Jul 25 12:36:33 darkside1 bash[52271]:
> 7cf1d340e0a9658b2c9dac3a039a6f5b4fd2a5581bb60213b0ab708d24b69f54
> Jul 25 12:36:33 darkside1 systemd[1]: Started Ceph mon.darkside1 for
> 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:43 darkside1 systemd[1]:
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service: main
> process exited, code=exi
> ted, status=1/FAILURE
> Jul 25 12:36:43 darkside1 systemd[1]: Unit
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered
> failed state.
> Jul 25 12:36:43 darkside1 systemd[1]:
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
> Jul 25 12:36:53 darkside1 systemd[1]:
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service holdoff
> time over, scheduling
> restart.
> Jul 25 12:36:53 darkside1 systemd[1]: Stopped Ceph mon.darkside1 for
> 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:53 darkside1 systemd[1]: start request repeated too quickly
> for ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1
> .service
> Jul 25 12:36:53 darkside1 systemd[1]: Failed to start Ceph mon.darkside1
> for 920740ee-cf2d-11ed-9097-08c0eb320eda.
> Jul 25 12:36:53 darkside1 systemd[1]: Unit
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service entered
> failed state.
> Jul 25 12:36:53 darkside1 systemd[1]:
> ceph-920740ee-cf2d-11ed-9097-08c0eb320eda@mon.darkside1.service failed.
>
>
> Also, I get the following error every 10 minutes or so on "ceph -W
> cephadm --watch-debug":
>
>
> 2023-07-25T12:35:38.115146-0300 mgr.darkside3.ujjyun [INF] Deploying
> daemon node-exporter.darkside1 on darkside1
> 2023-07-25T12:35:38.612569-0300 mgr.darkside3.ujjyun [ERR] cephadm
> exited with an error code: 1, stderr:Deploy daemon node-exporter.
> darkside1 ...
> Verifying port 9100 ...
> Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
> ERROR: TCP Port(s) '9100' required for node-exporter already in use
> Traceback (most recent call last):
>File "/usr/share/ceph/mgr/cephadm/module.py", line 1029, in
> _remote_connection
>  yield (conn, connr)
>File "/usr/share/ceph/mgr/cephadm/module.py", line 1185, in _run_cephadm
>  code, '\n'.join(err)))
> orchestrator._interface.OrchestratorError: cephadm exited with an error
> code: 1, stderr:Deploy daemon node-exporter.darkside1 ...
> Verifying port 9100 ...
> Cannot bind to IP 0.0.0.0 port 9100: [Errno 98] Address already in use
> ERROR: TCP Port(s) '9100' required for node-exporter already in use
>
> And finally I get this error on the first line of output for my "ceph
> mon dump":
>
> 2023-07-25T12:46:17.008-0300 7f145f59e700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only suppor
> t [2,1]
>
>
> Cordially,
>
> Renata.
>
> On 7/24/23 10:57, Adam King wrote:
> > The logs you probably really want to look at here are the journal logs
> > from the mgr and mon. If you have a copy of the ce

[ceph-users] Re: Failing to restart mon and mgr daemons on Pacific

2023-07-24 Thread Adam King
The logs you probably really want to look at here are the journal logs from
the mgr and mon. If you have a copy of the cephadm tool on the host, you
can do a "cephadm ls --no-detail | grep systemd" to list out the systemd
unit names for the ceph daemons on the host, or just find the systemd
unit names in the standard way you would for any other systemd unit (e.g.
"systemctl -l | grep mgr" will probably include the mgr one) and then take
a look at "journalctl -eu " for the systemd unit for
both the mgr and the mon. I'd expect near the end of the log it would
include a reason for going down.

As for the debug_ms (I think that's what you want over "debug mon") stuff,
I think that would need to be a command line option for the mgr/mon
process. For cephadm deployments, the systemd unit is run through a
"unit.run" file in /var/lib/ceph///unit.run. If
you go to the very end of that file, which will be a very long podman or
docker run command, add in the "--debug_ms 20" and then restart the systemd
unit for that daemon, it should cause the extra debug logging to happen
from that daemon. I would say first check if there are useful errors in the
journal logs mentioned above before trying that though.

On Mon, Jul 24, 2023 at 9:47 AM Renata Callado Borges <
renato.call...@incor.usp.br> wrote:

> Dear all,
>
>
> How are you?
>
> I have a cluster on Pacific with 3 hosts, each one with 1 mon,  1 mgr
> and 12 OSDs.
>
> One of the hosts, darkside1, has been out of quorum according to ceph
> status.
>
> Systemd showed 4 services dead, two mons and two mgrs.
>
> I managed to systemctl restart one mon and one mgr, but even after
> several attempts, the remaining mon and mgr services, when asked to
> restart, keep returning to a failed state after a few seconds. They try
> to auto-restart and then go into a failed state where systemd requires
> me to manually set them to "reset-failed" before trying to start again.
> But they never stay up. There are no clear messages about the issue in
> /var/log/ceph/cephadm.log.
>
> The host is still out of quorum.
>
>
> I have failed to "turn on debug" as per
> https://docs.ceph.com/en/pacific/rados/troubleshooting/log-and-debug/.
> It seems I do not know the proper incantantion for "ceph daemon X config
> show", no string for X seems to satisfy this command. I have tried
> adding this:
>
> [mon]
>
>   debug mon = 20
>
>
> To my ceph.conf, but no additional lines of log are sent to
> /var/log/cephadm.log
>
>
>   so I'm sorry I can´t provide more details.
>
>
> Could someone help me debug this situation? I am sure that if just
> reboot the machine, it will start up the services properly, as it always
> has done, but I would prefer to fix this without this action.
>
>
> Cordially,
>
> Renata.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm does not redeploy OSD

2023-07-19 Thread Adam King
>
> When looking on the very verbous cephadm logs, it seemed that cephadm was
> just skipping my node, with a message saying that a node was already part
> of another spec.
>

If you have it, would you mind sharing what this message was? I'm still not
totally sure what happened here.

On Wed, Jul 19, 2023 at 10:15 AM Luis Domingues 
wrote:

> So good news, I was not hit by the bug you mention on this thread.
>
> What happened, (apparently, I did not tried to replicated it yet) is that
> I had another OSD (let call it OSD.1) using the db device, but that was
> part of an old spec. (let call it spec-a). And the OSD (OSD.2) I removed
> should be detected as part of spec-b. The difference between them was just
> the name and the placement, using labels instead of hostname.
>
> When looking on the very verbous cephadm logs, it seemed that cephadm was
> just skipping my node, with a message saying that a node was already part
> of another spec.
>
> I purged OSD.1 with --replace and --zap, and once disks where empty and
> ready to go, cephamd just added back OSD.1 and OSD.2 using the db_device as
> specified.
>
> I do not know if this is the intended behavior, or if I was just lucky,
> but all my OSDs are back to the cluster.
>
> Luis Domingues
> Proton AG
>
>
> --- Original Message ---
> On Tuesday, July 18th, 2023 at 18:32, Luis Domingues <
> luis.doming...@proton.ch> wrote:
>
>
> > That part looks quite good:
> >
> > "available": false,
> > "ceph_device": true,
> > "created": "2023-07-18T16:01:16.715487Z",
> > "device_id": "SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600354",
> > "human_readable_type": "ssd",
> > "lsm_data": {},
> > "lvs": [
> > {
> > "cluster_fsid": "11b47c57-5e7f-44c0-8b19-ddd801a89435",
> > "cluster_name": "ceph",
> > "db_uuid": "CUMgp7-Uscn-ASLo-bh14-7Sxe-80GE-EcywDb",
> > "name": "osd-block-db-5cb8edda-30f9-539f-b4c5-dbe420927911",
> > "osd_fsid": "089894cf-1782-4a3a-8ac0-9dd043f80c71",
> > "osd_id": "7",
> > "osdspec_affinity": "",
> > "type": "db"
> > },
> > {
> >
> > I forgot to mention that the cluster was initially deployed with
> ceph-ansible and adopted by cephadm.
> >
> > Luis Domingues
> > Proton AG
> >
> >
> >
> >
> > --- Original Message ---
> > On Tuesday, July 18th, 2023 at 18:15, Adam King adk...@redhat.com wrote:
> >
> >
> >
> > > in the "ceph orch device ls --format json-pretty" output, in the blob
> for
> > > that specific device, is the "ceph_device" field set? There was a bug
> where
> > > it wouldn't be set at all (https://tracker.ceph.com/issues/57100) and
> it
> > > would make it so you couldn't use a device serving as a db device for
> any
> > > further OSDs, unless the device was fully cleaned out (so it is no
> longer
> > > serving as a db device). The "ceph_device" field is meant to be our
> way of
> > > knowing "yes there are LVM partitions here, but they're our partitions
> for
> > > ceph stuff, so we can still use the device" and without it (or with it
> just
> > > being broken, as in the tracker) redeploying OSDs that used the device
> for
> > > its DB wasn't working as we don't know if those LVs imply its our
> device or
> > > has LVs for some other purpose. I had thought this was fixed already in
> > > 16.2.13 but it sounds too similar to what you're seeing not to
> consider it.
> > >
> > > On Tue, Jul 18, 2023 at 10:53 AM Luis Domingues
> luis.doming...@proton.ch
> > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > We are running a ceph cluster managed with cephadm v16.2.13.
> Recently we
> > > > needed to change a disk, and we replaced it with:
> > > >
> > > > ceph orch osd rm 37 --replace.
> > > >
> > > > It worked fine, the disk was drained and the OSD marked as destroy.
> > > >
> > > > However, after changing the disk, no OSD was created. Looking to the
> db
> > > > device, the partition for db for OSD 37 was still there. So we
> destroyed it
> > > > using:
> > > > ceph-volume lvm zap --osd-id=37 --destroy.
> > > >
> > > > But we still have no OSD redeployed.
&

[ceph-users] Re: cephadm does not redeploy OSD

2023-07-18 Thread Adam King
in the "ceph orch device ls --format json-pretty" output, in the blob for
that specific device, is the "ceph_device" field set? There was a bug where
it wouldn't be set at all (https://tracker.ceph.com/issues/57100) and it
would make it so you couldn't use a device serving as a db device for any
further OSDs, unless the device was fully cleaned out (so it is no longer
serving as a db device). The "ceph_device" field is meant to be our way of
knowing "yes there are LVM partitions here, but they're our partitions for
ceph stuff, so we can still use the device" and without it (or with it just
being broken, as in the tracker) redeploying OSDs that used the device for
its DB wasn't working, as we don't know if those LVs imply it's our device
or whether the device has LVs for some other purpose. I had thought this
was fixed already in
16.2.13 but it sounds too similar to what you're seeing not to consider it.
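
A quick way to check that field for the db device (treat this as a sketch,
since the exact json layout can differ a bit between versions):

  ceph orch device ls --format json-pretty | \
    jq '.. | objects | select(.path? == "/dev/nvme2n1") | {path, available, ceph_device}'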

On Tue, Jul 18, 2023 at 10:53 AM Luis Domingues 
wrote:

> Hi,
>
> We are running a ceph cluster managed with cephadm v16.2.13. Recently we
> needed to change a disk, and we replaced it with:
>
> ceph orch osd rm 37 --replace.
>
> It worked fine, the disk was drained and the OSD marked as destroy.
>
> However, after changing the disk, no OSD was created. Looking to the db
> device, the partition for db for OSD 37 was still there. So we destroyed it
> using:
> ceph-volume lvm zap --osd-id=37 --destroy.
>
> But we still have no OSD redeployed.
> Here we have our spec:
>
> ---
> service_type: osd
> service_id: osd-hdd
> placement:
> label: osds
> spec:
> data_devices:
> rotational: 1
> encrypted: true
> db_devices:
> size: '1TB:2TB' db_slots: 12
>
> And the disk looks good:
>
> HOST PATH TYPE DEVICE ID SIZE AVAILABLE REFRESHED REJECT REASONS
> node05 /dev/nvme2n1 ssd SAMSUNG MZPLJ1T6HBJR-7_S55JNG0R600357 1600G
> 12m ago LVM detected, locked
>
> node05 /dev/sdk hdd SEAGATE_ST1NM0206_ZA21G217C7240KPF 10.0T Yes
> 12m ago
>
> And VG on db_device looks to have enough space:
> ceph-33b06f1a-f6f6-57cf-9ca8-6e4aa81caae0 1 11 0 wz--n- <1.46t 173.91g
>
> If I remove the db_devices and db_slots from the specs, and do a dry run,
> the orchestrator seems to see the new disk as available:
>
> ceph orch apply -i osd_specs.yml --dry-run
> WARNING! Dry-Runs are snapshots of a certain point in time and are bound
> to the current inventory setup. If any of these conditions change, the
> preview will be invalid. Please make sure to have a minimal
> timeframe between planning and applying the specs.
> 
> SERVICESPEC PREVIEWS
> 
> +-+--++-+
> |SERVICE |NAME |ADD_TO |REMOVE_FROM |
> +-+--++-+
> +-+--++-+
> 
> OSDSPEC PREVIEWS
> 
> +-+-+-+--++-+
> |SERVICE |NAME |HOST |DATA |DB |WAL |
> +-+-+-+--++-+
> |osd |osd-hdd |node05 |/dev/sdk |- |- |
> +-+-+-+--++-+
>
> But as soon as I add db_devices back, the orchestrator is happy as it is,
> like there is nothing to do:
>
> ceph orch apply -i osd_specs.yml --dry-run
> WARNING! Dry-Runs are snapshots of a certain point in time and are bound
> to the current inventory setup. If any of these conditions change, the
> preview will be invalid. Please make sure to have a minimal
> timeframe between planning and applying the specs.
> 
> SERVICESPEC PREVIEWS
> 
> +-+--++-+
> |SERVICE |NAME |ADD_TO |REMOVE_FROM |
> +-+--++-+
> +-+--++-+
> 
> OSDSPEC PREVIEWS
> 
> +-+--+--+--++-+
> |SERVICE |NAME |HOST |DATA |DB |WAL |
> +-+--+--+--++-+
>
> I do not know why ceph will not use this disk, and I do not know where to
> look. It seems logs are not saying anything. And the weirdest thing,
> another disk was replaced on the same machine, and it went without any
> issues.
>
> Luis Domingues
> Proton AG
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPHADM_FAILED_SET_OPTION

2023-07-18 Thread Adam King
Someone hit what I think is this same issue the other day. Do you have a
"config" section in your rgw spec that sets the
"rgw_keystone_implicit_tenants" option to "True" or "true"? For them,
changing the value to be 1 (which should be equivalent to "true" here)
instead of "true" fixed it. Likely an bug with the handling of
"True"/"true" by the cephadm mgr module when setting the config options.

On Tue, Jul 18, 2023 at 11:38 AM Arnoud de Jonge 
wrote:

> Hi,
>
> After having set up RadosGW with keystone authentication the cluster shows
> this warning:
>
> # ceph health detail
> HEALTH_WARN Failed to set 1 option(s)
> [WRN] CEPHADM_FAILED_SET_OPTION: Failed to set 1 option(s)
> Failed to set rgw.fra option rgw_keystone_implicit_tenants: config set
> failed: error parsing value: 'True' is not one of the permitted values:
> false, true, swift, s3, both, 0, 1, none retval: -22
>
> I might have made a typo but that has been corrected.
>
> # ceph config dump |grep rgw_keystone_implicit_tenants
> client.rgw.fra.controller1.lushdcadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller1.pznmufadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller1.tdrqotadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller2.ndxaetadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller2.rodqxhadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller2.wyhjukadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller3.boasgradvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller3.lkczbladvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller3.qxcteeadvanced
> rgw_keystone_implicit_tenants  true
>*
>
> # ceph --version
> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
> (stable)
>
> Any idea on how to fix this issue?
>
>
> Thanks,
> Arnoud.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPHADM_FAILED_SET_OPTION

2023-07-13 Thread Adam King
The `config` section tells cephadm to try to set the given config options.
So it will try something equivalent to "ceph config set rgw.fra
rgw_keystone_implicit_tenants true" and what it reported in the health
warning "'True' is not one of the permitted values: false, true, swift, s3,
both, 0, 1, none" is the error message it got back when it attempted that.
The error message makes me think the option probably isn't deprecated, but
just didn't like the value that was being passed.

On Thu, Jul 13, 2023 at 9:21 AM  wrote:

> Hi Adam,
>
> That section is indeed pretty short. I’m not sure what the difference is
> between the config and the spec section. Most of the settings I put under
> config give an error when I put then under spec.
>
> I have this spec now.
>
> service_type: rgw
> service_id: fra
> placement:
>   label: rgw
>   count_per_host: 3
> networks:
> - 10.103.4.0/24
> config:
>   rgw enable usage log: true
>   rgw keystone accepted roles: Member, _member_, admin
>   rgw keystone admin domain: default
>   rgw keystone admin password: secret
>   rgw keystone admin project: service
>   rgw keystone admin user: swift
>   rgw keystone api version: 3
>   rgw keystone implicit tenants: true
>   rgw keystone url: http://10.103.4.254:35357
>   rgw s3 auth use keystone: true
>   rgw swift account in url: true
>   rgw usage log flush threshold: 1024
>   rgw usage log tick interval: 30
>   rgw usage max shards: 32
>   rgw usage max user shards: 1
> spec:
>   rgw_frontend_port: 8100
>
> I deleted the 'rgw keystone implicit tenants’ settings now, and the
> warning disappeared. Seems like it has been deprecated? The warning message
> is very misleading.
>
>
> Thanks,
> Arnoud.
>
>
> On 13 Jul 2023, at 14:07, Adam King  wrote:
>
> Do you have a `config` section in your RGW spec? That health warning is
> from cephadm trying to set options from a spec section like that. There's a
> short bit about it at the top of
> https://docs.ceph.com/en/latest/cephadm/services/#service-specification.
>
> On Thu, Jul 13, 2023 at 3:39 AM  wrote:
>
>> Hi,
>>
>> After having set up RadosGW with keystone authentication the cluster
>> shows this warning:
>>
>> # ceph health detail
>> HEALTH_WARN Failed to set 1 option(s)
>> [WRN] CEPHADM_FAILED_SET_OPTION: Failed to set 1 option(s)
>>Failed to set rgw.fra option rgw_keystone_implicit_tenants: config set
>> failed: error parsing value: 'True' is not one of the permitted values:
>> false, true, swift, s3, both, 0, 1, none retval: -22
>>
>> I might have made a typo but that has been corrected.
>>
>> # ceph config dump |grep rgw_keystone_implicit_tenants
>> client.rgw.fra.controller1.lushdcadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller1.pznmufadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller1.tdrqotadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller2.ndxaetadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller2.rodqxhadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller2.wyhjukadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller3.boasgradvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller3.lkczbladvanced
>> rgw_keystone_implicit_tenants  true
>>*
>> client.rgw.fra.controller3.qxcteeadvanced
>> rgw_keystone_implicit_tenants  true
>>*
>>
>> # ceph --version
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> (stable)
>>
>> Any idea on how to fix this issue?
>>
>>
>> Thanks,
>> Arnoud.
>>
>>
>> --
>> Arnoud de Jonge
>> DevOps Engineer
>>
>> FUGA | CLOUD
>> https://fuga.cloud
>> arnoud@fuga.cloud
>>
>> Phone +31727513408
>> Registration ID 64988767
>> VAT ID NL855935984B01
>> Twitter @fugacloud
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPHADM_FAILED_SET_OPTION

2023-07-13 Thread Adam King
Do you have a `config` section in your RGW spec? That health warning is
from cephadm trying to set options from a spec section like that. There's a
short bit about it at the top of
https://docs.ceph.com/en/latest/cephadm/services/#service-specification.

On Thu, Jul 13, 2023 at 3:39 AM  wrote:

> Hi,
>
> After having set up RadosGW with keystone authentication the cluster shows
> this warning:
>
> # ceph health detail
> HEALTH_WARN Failed to set 1 option(s)
> [WRN] CEPHADM_FAILED_SET_OPTION: Failed to set 1 option(s)
>Failed to set rgw.fra option rgw_keystone_implicit_tenants: config set
> failed: error parsing value: 'True' is not one of the permitted values:
> false, true, swift, s3, both, 0, 1, none retval: -22
>
> I might have made a typo but that has been corrected.
>
> # ceph config dump |grep rgw_keystone_implicit_tenants
> client.rgw.fra.controller1.lushdcadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller1.pznmufadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller1.tdrqotadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller2.ndxaetadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller2.rodqxhadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller2.wyhjukadvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller3.boasgradvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller3.lkczbladvanced
> rgw_keystone_implicit_tenants  true
>*
> client.rgw.fra.controller3.qxcteeadvanced
> rgw_keystone_implicit_tenants  true
>*
>
> # ceph --version
> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
> (stable)
>
> Any idea on how to fix this issue?
>
>
> Thanks,
> Arnoud.
>
>
> --
> Arnoud de Jonge
> DevOps Engineer
>
> FUGA | CLOUD
> https://fuga.cloud 
> arnoud@fuga.cloud 
>
> Phone +31727513408 
> Registration ID 64988767
> VAT ID NL855935984B01
> Twitter @fugacloud
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT Meeting Notes June 28th, 2023

2023-06-28 Thread Adam King
Reef RC linking failure on Alpine Linux. Do we worry about that?

   1. https://tracker.ceph.com/issues/61718
   2. Nice to fix, but not a requirement
   3. If there are patches available, we should accept them, but probably
   don't put too much work into it currently

debian bullseye build failure on reef rc:

   1. https://tracker.ceph.com/issues/61845
   2. want to fix before final release

clean-up in AuthMonitor – CephFS and core are fine. Any other component
interested in?

   1. https://github.com/ceph/ceph/pull/52008#issuecomment-1606581139
   2. Already tested for rados and cephfs
   3. No other components requesting testing

Reef rc v18.1.2 in progress

   1. Next rc build in progress
   2. build issue to be looked at
      1. Failure on a jammy arm build, not a platform we test on meaningfully
      2. Think this is an infrastructure issue
      3. Generally, this shouldn't be a release blocker
      4. Priority of arm builds might rise in the future though
      5. investigate today, if we can't figure it out quickly, publish rc
      with a known issue
      6. NOTE: Rook expects arm builds to be present
   3. Would like to release next rc later this week if things work out
   4. Would also like to upgrade lrc

CDS agenda https://pad.ceph.com/p/cds-squid

   1. leads should add topics
   2. plan is for this to happen week of July 17th

mempool monitoring in teuthology tests
https://github.com/ceph/ceph/pull/51853

   1. Just an FYI
   2. ceph task will now have ability to dump memtools
   3. might add a bit of delay to how long tests take
   4. expected to be merged soon
   5. will follow up in performance meeting

iSCSI packages old/not signed -- want to fix before final release

   1. https://tracker.ceph.com/issues/57388
   2. tcmu-runner, since containerization of ceph, is being pulled from our
   build system
   3. This tcmu-runner package is not signed
   4. ceph-iscsi package is signed, but outdated (seems to be because this
   is the newest one that is signed and pushed to download.ceph.com)
   5. someone with access to tools to sign the packages would have to help
   fix this
   6. been like this for a long time and nobody noticed
   7. only ceph-iscsi package, not tcmu-runner, is distributed through
   download.ceph.com
   8. getting updated ceph-iscsi package on download.ceph.com should be
   done before reef release
   9. tcmu-runner inside the container being unsigned is not as big of a
   deal (was this way in quincy/pacific as well)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-24 Thread Adam King
Reminds me of https://tracker.ceph.com/issues/57007 which wasn't fixed in
pacific until 16.2.11, so this is probably just the result of a cephadm bug
unfortunately.
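
For anyone hitting the same thing, a rough sketch of refreshing those per-OSD config
files by hand (this assumes the cluster's current mon map is correct; <fsid> and the
OSD id are placeholders, and the paths follow the usual cephadm layout):

# Generate a minimal ceph.conf containing the current mon addresses
ceph config generate-minimal-conf > /tmp/minimal.conf
# On each affected host, replace the stale per-OSD config and restart the OSD
cp /tmp/minimal.conf /var/lib/ceph/<fsid>/osd.<id>/config
systemctl restart ceph-<fsid>@osd.<id>.service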

On Fri, Jun 23, 2023 at 5:16 PM Malte Stroem  wrote:

> Hello Eugen,
>
> thanks.
>
> We found the cause.
>
> Somehow all
>
> /var/lib/ceph/fsid/osd.XX/config
>
> files on every host were still filled with expired information about the
> mons.
>
> So refreshing the files helped to bring the osds up again. Damn.
>
> All other configs for the mons, mds', rgws and so on were up to date.
>
> I do not know why the osd config files did not get refreshed however I
> guess something went wrong draining the nodes we removed from the cluster.
>
> Best regards,
> Malte
>
> > On 21.06.23 at 22:11, Eugen Block wrote:
> > I still can’t really grasp what might have happened here. But could you
> > please clarify which of the down OSDs (or Hosts) are supposed to be down
> > and which you’re trying to bring back online? Obviously osd.40 is one of
> > your attempts. But what about the hosts cephx01 and cephx08? Are those
> > the ones refusing to start their OSDs? And the remaining up OSDs you
> > haven’t touched yet, correct?
> > And regarding debug logs, you should set it with ceph config set because
> > the local ceph.conf won’t have an effect. It could help to have the
> > startup debug logs from one of the OSDs.
> >
> > Quoting Malte Stroem :
> >
> >> Hello Eugen,
> >>
> >> recovery and rebalancing was finished however now all PGs show missing
> >> OSDs.
> >>
> >> Everything looks like the PGs are missing OSDs although it finished
> >> correctly.
> >>
> >> As if we shut down the servers immediately.
> >>
> >> But we removed the nodes the way it is described in the documentation.
> >>
> >> We just added new disks and they join the cluster immediately.
> >>
> >> So the old OSDs removed from the cluster are available, I restored
> >> OSD.40 but it does not want to join the cluster.
> >>
> >> Following are the outputs of the mentioned commands:
> >>
> >> ceph -s
> >>
> >>   cluster:
> >> id: X
> >> health: HEALTH_WARN
> >> 1 failed cephadm daemon(s)
> >> 1 filesystem is degraded
> >> 1 MDSs report slow metadata IOs
> >> 19 osds down
> >> 4 hosts (50 osds) down
> >> Reduced data availability: 1220 pgs inactive
> >> Degraded data redundancy: 132 pgs undersized
> >>
> >>   services:
> >> mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
> >> mgr: cephx02.xx(active, since 92s), standbys: cephx04.yy,
> >> cephx06.zz mds: 2/2 daemons up, 2 standby
> >> osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped
> pgs
> >> rgw: 1 daemon active (1 hosts, 1 zones)
> >>
> >>   data:
> >> volumes: 1/2 healthy, 1 recovering
> >> pools:   12 pools, 1345 pgs
> >> objects: 11.02k objects, 1.9 GiB
> >> usage:   145 TiB used, 669 TiB / 814 TiB avail
> >> pgs: 86.617% pgs unknown
> >>  4.089% pgs not active
> >>  39053/33069 objects misplaced (118.095%)
> >>  1165 unknown
> >>  77   active+undersized+remapped
> >>  55   undersized+remapped+peered
> >>  38   active+clean+remapped
> >>  10   active+clean
> >>
> >> ceph osd tree
> >>
> >> ID   CLASS  WEIGHT      TYPE NAME             STATUS  REWEIGHT  PRI-AFF
> >> -21         4.36646     root ssds
> >> -61         0.87329       host cephx01-ssd
> >> 186  ssd    0.87329         osd.186          down     1.0       1.0
> >> -76         0.87329       host cephx02-ssd
> >> 263  ssd    0.87329         osd.263          up       1.0       1.0
> >> -85         0.87329       host cephx04-ssd
> >> 237  ssd    0.87329         osd.237          up       1.0       1.0
> >> -88         0.87329       host cephx06-ssd
> >> 236  ssd    0.87329         osd.236          up       1.0       1.0
> >> -94         0.87329       host cephx08-ssd
> >> 262  ssd    0.87329         osd.262          down     1.0       1.0
> >>  -1         1347.07397  root default
> >> -62         261.93823     host cephx01
> >> 139  hdd    10.91409        osd.139          down     0         1.0
> >> 140  hdd    10.91409        osd.140          down     0         1.0
> >> 142  hdd    10.91409        osd.142          down     0         1.0
> >> 144  hdd    10.91409        osd.144          down     0         1.0
> >> 146  hdd    10.91409        osd.146          down     0         1.0
> >> 148  hdd    10.91409        osd.148          down     0         1.0
> >> 150  hdd    10.91409        osd.150          down     0         1.0
> >> 152  hdd    10.91409        osd.152          down     0         1.0
> >> 154  hdd    10.91409        osd.154          down     1.0
> >> 

[ceph-users] Re: Error while adding host : Error EINVAL: Traceback (most recent call last): File /usr/share/ceph/mgr/mgr_module.py, line 1756, in _handle_command

2023-06-20 Thread Adam King
There was a cephadm bug that wasn't fixed by the time 17.2.6 came out (I'm
assuming that's the version being used here, although it may have been
present in some slightly earlier quincy versions) that caused this
misleading error to be printed out when adding a host failed. There's a
tracker for it here https://tracker.ceph.com/issues/59081 that has roughly
the same traceback. The real issue is likely a connectivity or permission
issue from the active mgr trying to ssh to the host. In the case I saw from
the tracker, it was caused by the ssh pub key not being set up on the host.
If you check the cephadm cluster logs ("ceph log last 50 debug cephadm")
after trying to add the host I'm guessing you'll see some error like the
second set of output in the tracker that will hopefully give some more info
on why adding the host failed.
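
As a sketch, the usual connectivity checks look something like this (the hostname is a
placeholder; the commands are the standard cephadm ssh-errors troubleshooting steps):

# See what cephadm logged about the failed "host add"
ceph log last 50 debug cephadm
# Make sure the cluster's ssh public key is authorized on the new host
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@<new-host>
# Or connect the same way the active mgr does
ceph cephadm get-ssh-config > /tmp/ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
chmod 0600 /tmp/cephadm_key
ssh -F /tmp/ssh_config -i /tmp/cephadm_key root@<new-host>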

On Tue, Jun 20, 2023 at 6:38 PM Adiga, Anantha 
wrote:

> Hi,
>
> I am seeing this error after an offline host was deleted and while adding the
> host again. Thereafter, I have removed the /var/lib/cep  folder and removed
> the ceph quincy image on the offline host. What is the cause of this issue,
> and what is the solution?
>
> root@fl31ca104ja0201:/home/general# cephadm shell
> Inferring fsid d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
> Using recent ceph image
> quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e
>  :af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e>
> root@fl31ca104ja0201:/#
>
> root@fl31ca104ja0201:/# ceph orch host rm fl31ca104ja0302 --offline
> --force
>
> Removed offline host 'fl31ca104ja0302'
>
> root@fl31ca104ja0201:/# ceph -s
>
>   cluster:
>
> id: d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
>
> health: HEALTH_OK
>
>
>
>   services:
>
> mon: 3 daemons, quorum fl31ca104ja0201,fl31ca104ja0202,fl31ca104ja0203
> (age 28h)
>
> mgr: fl31ca104ja0203(active, since 6d), standbys: fl31ca104ja0202,
> fl31ca104ja0201
>
> mds: 1/1 daemons up, 2 standby
>
> osd: 33 osds: 33 up (since 28h), 33 in (since 28h)
>
> rgw: 3 daemons active (3 hosts, 1 zones)
>
>
>
>   data:
>
> volumes: 1/1 healthy
>
> pools:   24 pools, 737 pgs
>
> objects: 613.56k objects, 1.9 TiB
>
> usage:   2.9 TiB used, 228 TiB / 231 TiB avail
>
> pgs: 737 active+clean
>
>
>
>   io:
>
> client:   161 MiB/s rd, 75 op/s rd, 0 op/s wr
>
>
> root@fl31ca104ja0201:/# ceph orch host add fl31ca104ja0302 10.45.219.5
> Error EINVAL: Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/mgr_module.py", line 1756, in _handle_command
> return self.handle_command(inbuf, cmd)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in
> handle_command
> return dispatch[cmd['prefix']].call(self, cmd, inbuf)
>   File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call
> return self.func(mgr, **kwargs)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in
> 
> wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args,
> **l_kwargs)  # noqa: E731
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in
> wrapper
> return func(*args, **kwargs)
>   File "/usr/share/ceph/mgr/orchestrator/module.py", line 356, in _add_host
> return self._apply_misc([s], False, Format.plain)
>  File "/usr/share/ceph/mgr/orchestrator/module.py", line 1092, in
> _apply_misc
> raise_if_exception(completion)
>   File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 225, in
> raise_if_exception
> e = pickle.loads(c.serialized_exception)
> TypeError: __init__() missing 2 required positional arguments: 'hostname'
> and 'addr'
>
> Thank you,
> Anantha
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: stray daemons not managed by cephadm

2023-06-12 Thread Adam King
if you do a mgr failover ("ceph mgr fail") and wait a few minutes do the
issues clear out? I know there's a bug where removed mons get marked as
stray daemons while downsizing by multiple mons at once (cephadm might be
removing them too quickly, not totally sure of the cause) but doing a mgr
failover has always cleared the stray daemon notifications for me. For some
context, what it's listing as stray daemons are roughly what is being
reported in "ceph node ls" that doesn't show up in "ceph orch ps". The idea
being the orch ps output shows all the daemons cephadm is aware of and
managing while "ceph node ls" are ceph daemons the cluster, but not
necessarily cephadm itself, is aware of. For me, the mon daemons marked
stray were still showing up in that "ceph node ls" output, but doing a mgr
failover would clean that up and then the stray daemon warnings would also
disappear.
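
A quick way to compare the two views and then clear the warning, as a sketch:

# Daemons the cluster itself knows about, per host
ceph node ls
# Daemons cephadm is actually managing
ceph orch ps
# Fail over the mgr, wait a few minutes, then re-check the warning
ceph mgr fail
ceph health detail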

On Mon, Jun 12, 2023 at 8:54 AM farhad kh 
wrote:

>  I deployed the ceph cluster with 8 nodes (v17.2.6) and, after adding all of
> the hosts, ceph created 5 mon daemon instances.
> I tried to decrease that to 3 instances with `ceph orch apply mon
> --placement=label:mon,count:3`. It worked, but after that I got the error "2
> stray daemons not managed by cephadm".
> But every time I tried to deploy and delete other instances, this number
> increased. Now I have 7 daemons that are not managed by cephadm.
> How do I deal with this issue?
>
> 
> [root@opcsdfpsbpp0201 ~]# ceph -s
>   cluster:
> id: 79a2627c-0821-11ee-a494-00505695c58c
> health: HEALTH_WARN
> 16 stray daemon(s) not managed by cephadm
>
>   services:
> mon: 3 daemons, quorum opcsdfpsbpp0201,opcsdfpsbpp0205,opcsdfpsbpp0203
> (age 2m)
> mgr: opcsdfpsbpp0201.vttwxa(active, since 27h), standbys:
> opcsdfpsbpp0207.kzxepm
> mds: 1/1 daemons up, 2 standby
> osd: 74 osds: 74 up (since 26h), 74 in (since 26h)
>
>   data:
> volumes: 1/1 healthy
> pools:   6 pools, 6 pgs
> objects: 2.10k objects, 8.1 GiB
> usage:   28 GiB used, 148 TiB / 148 TiB avail
> pgs: 6 active+clean
>
>   io:
> client:   426 B/s rd, 0 op/s rd, 0 op/s wr
>
> [root@opcsdfpsbpp0201 ~]# ceph health detail
> HEALTH_WARN 16 stray daemon(s) not managed by cephadm
> [WRN] CEPHADM_STRAY_DAEMON: 16 stray daemon(s) not managed by cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0203 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0205 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0207 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0207 on host opcsdfpsbpp0209 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0209 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0211 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0215 on host opcsdfpsbpp0211 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0211 on host opcsdfpsbpp0213 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0209 on host opcsdfpsbpp0215 not managed by
> cephadm
> stray daemon mon.opcsdfpsbpp0213 on host opcsdfpsbpp0215 not managed by
> cephadm
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: change user root to non-root after deploy cluster by cephadm

2023-06-07 Thread Adam King
When you try to change the user using "ceph cephadm set-user" (or any of
the other commands that change ssh settings) it will attempt a connection
to a random host with the new settings, and run the "cephadm check-host"
command on that host. If that fails, it will change the setting back and
report an error message. It does seem like there's a bit of a bug here,
where it's changing the user back before reporting the error, so it's
reporting it attempted a connection with root@hostname instead of
@hostname, but either way it seems to think it can't connect and run
that command on whatever host it's attempting to connect to. You could
follow https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors and
see if you catch anything there (substitute your user for "root" there).
You could also force it to update the user by directly running "ceph
config-key set mgr/cephadm/ssh_user" and then restarting the cephadm module
or doing a mgr failover for it to pick up the change. Then if there are
connection issues using that ssh user it should end up popping up as health
warnings when it tries to refresh metadata on each host.
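
Roughly, the two approaches look like this ("deployer" stands in for whatever non-root
user is being switched to; sketch only):

# Normal path: cephadm test-connects with the new settings before accepting them
ceph cephadm set-user deployer
# Forcing it past the validation (only if you are sure ssh access works)
ceph config-key set mgr/cephadm/ssh_user deployer
ceph mgr fail          # fail over so the cephadm module picks up the new key
ceph health detail     # any connection problems will surface here afterwards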

On Wed, Jun 7, 2023 at 6:24 AM farhad kh  wrote:

>  Hi guys
> I deployed the ceph cluster with cephadm and root user, but I need to
> change the user to a non-root user
> And I did these steps:
> 1- Created a non-root user on all hosts with access without password and
> sudo
> `$USER_NAME ALL = (root) NOPASSWD:ALL`
> 2- Generated an SSH key pair and used ssh-copy-id to copy it to all hosts
> `
> ssh-keygen (accept the default file name and leave the passphrase empty)
> ssh-copy-id USER_NAME@HOST_NAME
> `
> 3 - Ran ceph cephadm set-user  but I get the error "Error EINVAL: ssh
> connection to root@hostname failed".
> How do I deal with this issue?
> What should be done to change the user to non-root?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reef v18.1.0 QE Validation status

2023-05-31 Thread Adam King
Orch approved. The orch/cephadm tests looked good and the orch/rook tests
are known to not work currently.

On Tue, May 30, 2023 at 12:54 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/61515#note-1
> Release Notes - TBD
>
> Seeking approvals/reviews for:
>
> rados - Neha, Radek, Travis, Ernesto, Adam King (we still have to
> merge https://github.com/ceph/ceph/pull/51788 for
> the core)
> rgw - Casey
> fs - Venky
> orch - Adam King
> rbd - Ilya
> krbd - Ilya
> upgrade/octopus-x - deprecated
> upgrade/pacific-x - known issues, Ilya, Laura?
> upgrade/reef-p2p - N/A
> clients upgrades - not run yet
> powercycle - Brad
> ceph-volume - in progress
>
> Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
>
> gibba upgrade was done and will need to be done again this week.
> LRC upgrade TBD
>
> TIA
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Adam King
13] Connect call failed ('192.168.122.201', 22)


By trying this for connecting to each host in the cluster, given how close
it is to how cephadm is operating, it should help verify with relative
certainty if this is connection related or not. Will add the important bit
that I'm using the user "root". If you're using a non-root user, the script
takes a "--user" option.


On Mon, May 15, 2023 at 3:36 PM Thomas Widhalm 
wrote:

> I just checked every single host. The only processes of cephadm running
> where "cephadm shell" from debugging. I closed all of them, so now I can
> verify, there's not a single cephadm process running on any of my ceph
> hosts. (and since I found the shell processes, I can verify I didn't
> have a typo ;-) )
>
> Regarding broken record: I'm extremely thankful for your support. And I
> should have checked that earlier. We all know that sometimes it's the
> least probable things that go sideways. So checking the things you're
> sure to be ok is always a good idea. Thanks for being adamant about
> that. But now we can be sure, at least.
>
> On 15.05.23 21:27, Adam King wrote:
> > If it persisted through a full restart, it's possible the conditions
> > that caused the hang are still present after the fact. The two known
> > causes I'm aware of are lack of space in the root partition and hanging
> > mount points. Both would show up as processes in "ps aux | grep cephadm"
> > though. The latter could possibly be related to cephfs pool issues if
> > you have something mounted on one of the host hosts. Still hard to say
> > without knowing what exactly got stuck. For clarity, without restarting
> > or changing anything else, can you verify  if "ps aux | grep cephadm"
> > shows anything on the nodes. I know I'm a bit of a broken record on
> > mentioning the hanging processes stuff, but outside of module crashes
> > which don't appear to be present here, 100% of other cases of this type
> > of thing happening I've looked at before have had those processes
> > sitting around.
> >
> > On Mon, May 15, 2023 at 3:10 PM Thomas Widhalm  > <mailto:widha...@widhalm.or.at>> wrote:
> >
> > This is why I even tried a full cluster shutdown. All Hosts were
> > out, so
> > there's not a possibility that there's any process hanging. After I
> > started the nodes, it's just the same as before. All refresh times
> show
> > "4 weeks". Like it stopped simoultanously on all nodes.
> >
> > Some time ago we had a small change in name resolution so I thought,
> > maybe the orchestrator can't connect via ssh anymore. But I tried all
> > the steps in
> >
> https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors <
> https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors> .
> > The only thing that's slightly suspicous is that, it said, it added
> the
> > host key to known hosts. But since I tried via "cephadm shell" I
> guess,
> > the known hosts are just not replicated to these containers. ssh
> works,
> > too. (And I would have suspected that I get a warning if that failed)
> >
> > I don't see any information about the orchestrator module having
> > crashed. It's running as always.
> >
> >   From the the prior problem I had some issues in my cephfs pools.
> So,
> > maybe there's something broken in the .mgr pool? Could that be a
> reason
> > for this behaviour? I googled a while but didn't find any way how to
> > check that explicitly.
> >
> > On 15.05.23 19:15, Adam King wrote:
> >  > This is sort of similar to what I said in a previous email, but
> > the only
> >  > way I've seen this happen in other setups is through hanging
> cephadm
> >  > commands. The debug process has been, do a mgr failover, wait a
> few
> >  > minutes, see in "ceph orch ps" and "ceph orch device ls" which
> > hosts have
> >  > and have not been refreshed (the REFRESHED column should be some
> > lower
> >  > value on the hosts where it refreshed), go to the hosts where it
> > did not
> >  > refresh and check "ps aux | grep cephadm" looking for long
> > running (and
> >  > therefore most likely hung) processes. I would still expect
> > that's the most
> >  > likely thing you're experiencing here. I haven't seen any other
> > causes for
> >  > cephadm to not refresh unless the module crashed, but that would
> be
> > 

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Adam King
If it persisted through a full restart, it's possible the conditions that
caused the hang are still present after the fact. The two known causes I'm
aware of are lack of space in the root partition and hanging mount points.
Both would show up as processes in "ps aux | grep cephadm" though. The
latter could possibly be related to cephfs pool issues if you have
something mounted on one of the host hosts. Still hard to say without
knowing what exactly got stuck. For clarity, without restarting or changing
anything else, can you verify  if "ps aux | grep cephadm" shows anything on
the nodes. I know I'm a bit of a broken record on mentioning the hanging
processes stuff, but outside of module crashes which don't appear to be
present here, 100% of other cases of this type of thing happening I've
looked at before have had those processes sitting around.
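
For completeness, a sketch of those checks, to be run on each host that is not
refreshing:

# Long-running cephadm processes usually indicate a hang
ps aux | grep cephadm
# Common culprit 1: no space left in the root partition
df -h /
# Common culprit 2: a hanging mount point; listing mounts will often stall
# on the affected entry
findmnt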

On Mon, May 15, 2023 at 3:10 PM Thomas Widhalm 
wrote:

> This is why I even tried a full cluster shutdown. All Hosts were out, so
> there's not a possibility that there's any process hanging. After I
> started the nodes, it's just the same as before. All refresh times show
> "4 weeks". Like it stopped simoultanously on all nodes.
>
> Some time ago we had a small change in name resolution so I thought,
> maybe the orchestrator can't connect via ssh anymore. But I tried all
> the steps in
> https://docs.ceph.com/docs/master/cephadm/troubleshooting/#ssh-errors .
> The only thing that's slightly suspicious is that, it said, it added the
> host key to known hosts. But since I tried via "cephadm shell" I guess,
> the known hosts are just not replicated to these containers. ssh works,
> too. (And I would have suspected that I get a warning if that failed)
>
> I don't see any information about the orchestrator module having
> crashed. It's running as always.
>
>  From the prior problem I had some issues in my cephfs pools. So,
> maybe there's something broken in the .mgr pool? Could that be a reason
> for this behaviour? I googled a while but didn't find any way how to
> check that explicitly.
>
> On 15.05.23 19:15, Adam King wrote:
> > This is sort of similar to what I said in a previous email, but the only
> > way I've seen this happen in other setups is through hanging cephadm
> > commands. The debug process has been, do a mgr failover, wait a few
> > minutes, see in "ceph orch ps" and "ceph orch device ls" which hosts have
> > and have not been refreshed (the REFRESHED column should be some lower
> > value on the hosts where it refreshed), go to the hosts where it did not
> > refresh and check "ps aux | grep cephadm" looking for long running (and
> > therefore most likely hung) processes. I would still expect that's the
> most
> > likely thing you're experiencing here. I haven't seen any other causes
> for
> > cephadm to not refresh unless the module crashed, but that would be
> > explicitly stated in the cluster health.
> >
> > On Mon, May 15, 2023 at 11:44 AM Thomas Widhalm 
> > wrote:
> >
> >> Hi,
> >>
> >> I tried a lot of different approaches but I didn't have any success so
> far.
> >>
> >> "ceph orch ps" still doesn't get refreshed.
> >>
> >> Some examples:
> >>
> >> mds.mds01.ceph06.huavsw  ceph06   starting  -
> >> --- 
> >> mds.mds01.ceph06.rrxmks  ceph06   error4w ago
> >> 3M-- 
> >> mds.mds01.ceph07.omdisd  ceph07   error4w ago
> >> 4M-- 
> >> mds.mds01.ceph07.vvqyma  ceph07   starting  -
> >> --- 
> >> mgr.ceph04.qaexpvceph04  *:8443,9283  running (4w) 4w ago
> >> 10M 551M-  17.2.6 9cea3956c04b  33df84e346a0
> >> mgr.ceph05.jcmkbbceph05  *:8443,9283  running (4w) 4w ago
> >> 4M 441M-  17.2.6 9cea3956c04b  1ad485df4399
> >> mgr.ceph06.xbduufceph06  *:8443,9283  running (4w) 4w ago
> >> 4M 432M-  17.2.6 9cea3956c04b  5ba5fd95dc48
> >> mon.ceph04   ceph04   running (4w) 4w ago
> >> 4M 223M2048M  17.2.6 9cea3956c04b  8b6116dd216f
> >> mon.ceph05   ceph05   running (4w) 4w ago
> >> 4M 326M2048M  17.2.6 9cea3956c04b  70520d737f29
> >>
> >> Debug Log doesn't show anything that could help me, either.
> >>
> >> 2023-05-15T14:48:40.852088+ mgr.ceph05.jcmkbb (mgr.83897390) 1376 :
> >> c

[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD

2023-05-15 Thread Adam King
As you've already seem to have figured out, "ceph orch device ls" is
populated with the results from "ceph-volume inventory". My best guess to
try and debug this would be to manually run "cephadm ceph-volume --
inventory" (the same as "cephadm ceph-volume inventory", I just like to
separate the ceph-volume command from cephadm itself with the " -- ") and
then check /var/log/ceph//ceph-volume.log from when you ran the
command onward to try and see why it isn't seeing your devices. For example
I can see a line  like

[2023-05-15 19:11:58,048][ceph_volume.main][INFO  ] Running command:
ceph-volume  inventory

in there. Then if I look onward from there I can see it ran things like

lsblk -P -o
NAME,KNAME,PKNAME,MAJ:MIN,FSTYPE,MOUNTPOINT,LABEL,UUID,RO,RM,MODEL,SIZE,STATE,OWNER,GROUP,MODE,ALIGNMENT,PHY-SEC,LOG-SEC,ROTA,SCHED,TYPE,DISC-ALN,DISC-GRAN,DISC-MAX,DISC-ZERO,PKNAME,PARTLABEL

as part of getting my device list. So if I was having issues I would try
running that directly and see what I got. Will note that ceph-volume on
certain more recent versions (not sure about octopus) runs commands through
nsenter, so you'd have to look past that part in the log lines to the
underlying command being used, typically something with lsblk, blkid,
udevadm, lvs, or pvs.

Also, if you want to see if it's an issue with a certain version of
ceph-volume, you can use different versions by passing the image flag to
cephadm. E.g.

cephadm --image quay.io/ceph/ceph:v17.2.6 ceph-volume -- inventory

would use the 17.2.6 version of ceph-volume for the inventory. It works by
running ceph-volume through the container, so you don't have to have to
worry about installing different packages to try them and it should pull
the container image on its own if it isn't on the machine already (but note
that means the command will take longer as it pulls the image the first
time).
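
Pulling the steps above together into one sequence (the fsid in the log path is a
placeholder):

cephadm ceph-volume -- inventory
# See which lsblk/blkid/lvs/pvs commands ceph-volume ran for the inventory
grep -A 2 'Running command' /var/log/ceph/<fsid>/ceph-volume.log | tail -n 40
# Repeat the inventory with a different ceph-volume version
cephadm --image quay.io/ceph/ceph:v17.2.6 ceph-volume -- inventory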



On Sat, May 13, 2023 at 4:34 AM Patrick Begou <
patrick.be...@univ-grenoble-alpes.fr> wrote:

> Hi Joshua,
>
> I've tried these commands but it looks like CEPH is unable to see and
> configure these HDDs.
> [root@mostha1 ~]# cephadm ceph-volume inventory
>
> Inferring fsid 4b7a6504-f0be-11ed-be1a-00266cf8869c
> Using recent ceph image
>
> quay.io/ceph/ceph@sha256:e6919776f0ff8331a8e9c4b18d36c5e9eed31e1a80da62ae8454e42d10e95544
>
> Device Path   Size Device nodesrotates
> available Model name
>
> [root@mostha1 ~]# cephadm shell
>
> [ceph: root@mostha1 /]# ceph orch apply osd --all-available-devices
>
> Scheduled osd.all-available-devices update...
>
> [ceph: root@mostha1 /]# ceph orch device ls
> [ceph: root@mostha1 /]# ceph-volume lvm zap /dev/sdb
>
> --> Zapping: /dev/sdb
> --> --destroy was not specified, but zapping a whole device will
> remove the partition table
> Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10
> conv=fsync
>   stderr: 10+0 records in
> 10+0 records out
> 10485760 bytes (10 MB, 10 MiB) copied, 0.10039 s, 104 MB/s
> --> Zapping successful for: 
>
> I can check that /dev/sdb1 has been erased, so previous command is
> successful
> [ceph: root@mostha1 ceph]# lsblk
> NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda8:01 232.9G  0 disk
> |-sda1 8:11   3.9G  0 part /rootfs/boot
> |-sda2 8:21  78.1G  0 part
> | `-osvg-rootvol 253:00  48.8G  0 lvm  /rootfs
> |-sda3 8:31   3.9G  0 part [SWAP]
> `-sda4 8:41 146.9G  0 part
>|-secretvg-homevol 253:10   9.8G  0 lvm  /rootfs/home
>|-secretvg-tmpvol  253:20   9.8G  0 lvm  /rootfs/tmp
>`-secretvg-varvol  253:30   9.8G  0 lvm  /rootfs/var
> sdb8:16   1 465.8G  0 disk
> sdc8:32   1 232.9G  0 disk
>
> But still no visible HDD:
>
> [ceph: root@mostha1 ceph]# ceph orch apply osd --all-available-devices
>
> Scheduled osd.all-available-devices update...
>
> [ceph: root@mostha1 ceph]# ceph orch device ls
> [ceph: root@mostha1 ceph]#
>
> May be I have done something bad at install time as in the container
> I've unintentionally run:
>
> dnf -y install
>
> https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
>
> (an awful copy/paste launching the command). Can this break The
> container ? I do not know what should be available as ceph packages in
> the container to remove properly this install (no dnf.log file in the
> container)
>
> Patrick
>
>
> On 12/05/2023 at 21:38, Beaman, Joshua wrote:
> > The most significant point I see there, is you have no OSD service
> > spec to tell orchestrator how to deploy OSDs.  The easiest fix for
> > that would be “cephorchapplyosd--all-available-devices”
> >
> > This will create a simple spec that should work for a test
> > environment.  Most likely it will collocate the block, block.db, and
> > WAL all on the same device.  Not ideal for prod environments, but 

[ceph-users] Re: Orchestration seems not to work

2023-05-15 Thread Adam King
This is sort of similar to what I said in a previous email, but the only
way I've seen this happen in other setups is through hanging cephadm
commands. The debug process has been, do a mgr failover, wait a few
minutes, see in "ceph orch ps" and "ceph orch device ls" which hosts have
and have not been refreshed (the REFRESHED column should be some lower
value on the hosts where it refreshed), go to the hosts where it did not
refresh and check "ps aux | grep cephadm" looking for long running (and
therefore most likely hung) processes. I would still expect that's the most
likely thing you're experiencing here. I haven't seen any other causes for
cephadm to not refresh unless the module crashed, but that would be
explicitly stated in the cluster health.

On Mon, May 15, 2023 at 11:44 AM Thomas Widhalm 
wrote:

> Hi,
>
> I tried a lot of different approaches but I didn't have any success so far.
>
> "ceph orch ps" still doesn't get refreshed.
>
> Some examples:
>
> mds.mds01.ceph06.huavsw  ceph06   starting  -
> --- 
> mds.mds01.ceph06.rrxmks  ceph06   error4w ago
> 3M-- 
> mds.mds01.ceph07.omdisd  ceph07   error4w ago
> 4M-- 
> mds.mds01.ceph07.vvqyma  ceph07   starting  -
> --- 
> mgr.ceph04.qaexpvceph04  *:8443,9283  running (4w) 4w ago
> 10M 551M-  17.2.6 9cea3956c04b  33df84e346a0
> mgr.ceph05.jcmkbbceph05  *:8443,9283  running (4w) 4w ago
> 4M 441M-  17.2.6 9cea3956c04b  1ad485df4399
> mgr.ceph06.xbduufceph06  *:8443,9283  running (4w) 4w ago
> 4M 432M-  17.2.6 9cea3956c04b  5ba5fd95dc48
> mon.ceph04   ceph04   running (4w) 4w ago
> 4M 223M2048M  17.2.6 9cea3956c04b  8b6116dd216f
> mon.ceph05   ceph05   running (4w) 4w ago
> 4M 326M2048M  17.2.6 9cea3956c04b  70520d737f29
>
> Debug Log doesn't show anything that could help me, either.
>
> 2023-05-15T14:48:40.852088+ mgr.ceph05.jcmkbb (mgr.83897390) 1376 :
> cephadm [INF] Schedule start daemon mds.mds01.ceph04.hcmvae
> 2023-05-15T14:48:43.620700+ mgr.ceph05.jcmkbb (mgr.83897390) 1380 :
> cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.hcmvae
> 2023-05-15T14:48:45.124822+ mgr.ceph05.jcmkbb (mgr.83897390) 1392 :
> cephadm [INF] Schedule start daemon mds.mds01.ceph04.krxszj
> 2023-05-15T14:48:46.493902+ mgr.ceph05.jcmkbb (mgr.83897390) 1394 :
> cephadm [INF] Schedule redeploy daemon mds.mds01.ceph04.krxszj
> 2023-05-15T15:05:25.637079+ mgr.ceph05.jcmkbb (mgr.83897390) 2629 :
> cephadm [INF] Saving service mds.mds01 spec with placement count:2
> 2023-05-15T15:07:27.625773+ mgr.ceph05.jcmkbb (mgr.83897390) 2780 :
> cephadm [INF] Saving service mds.fs_name spec with placement count:3
> 2023-05-15T15:07:42.120912+ mgr.ceph05.jcmkbb (mgr.83897390) 2795 :
> cephadm [INF] Saving service mds.mds01 spec with placement count:3
>
> I'm seeing all the commands I give but I don't get any more information
> on why it's not actually happening.
>
> I tried to change different scheduling mechanisms. Host, Tag, unmanaged
> and back again. I turned off orchestration and resumed. I failed mgr. I
> even had full cluster stops (in the past). I made sure all daemons run
> the same version. (If you remember, upgrade failed underway).
>
> So the only way I can get daemons running is manually. I added two more
> hosts and tagged them. But not a single daemon has been started there.
>
> Could you help me again with how to debug orchestration not working?
>
>
> On 04.05.23 15:12, Thomas Widhalm wrote:
> > Thanks.
> >
> > I set the log level to debug, try a few steps and then come back.
> >
> > On 04.05.23 14:48, Eugen Block wrote:
> >> Hi,
> >>
> >> try setting debug logs for the mgr:
> >>
> >> ceph config set mgr mgr/cephadm/log_level debug
> >>
> >> This should provide more details what the mgr is trying and where it's
> >> failing, hopefully. Last week this helped to identify an issue between
> >> a lower pacific issue for me.
> >> Do you see anything in the cephadm.log pointing to the mgr actually
> >> trying something?
> >>
> >>
> >> Quoting Thomas Widhalm :
> >>
> >>> Hi,
> >>>
> >>> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6 but
> >>> the following problem existed when I was still everywhere on 17.2.5 .
> >>>
> >>> I had a major issue in my cluster which could be solved with a lot of
> >>> your help and even more trial and error. Right now it seems that most
> >>> is already fixed but I can't rule out that there's still some problem
> >>> hidden. The very issue I'm asking about started during the repair.
> >>>
> >>> When I want to orchestrate the cluster, it logs the command but it
> >>> doesn't do anything. No matter if I use ceph dashboard or "ceph orch"
> >>> in "cephadm 

[ceph-users] Re: cephadm does not honor container_image default value

2023-05-15 Thread Adam King
I think with the `config set` commands there is logic to notify the
relevant mgr modules and update their values. That might not exist with
`config rm`, so it's still using the last set value. Looks like a real bug.
Curious what happens if the mgr restarts after the `config rm`, i.e. whether it
goes back to the default image in that case or not. Might take a look later.
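
Until that is fixed, a possible workaround sketch (same commands as in the report,
just with a mgr failover in between; whether the failover alone is enough is exactly
the open question above):

ceph config rm mgr mgr/cephadm/container_image_alertmanager
ceph mgr fail                      # let the cephadm module re-read its options
ceph orch redeploy alertmanager
# Or simply pin the default explicitly, which is confirmed to work above
ceph config set mgr mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager:v0.23.0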

On Mon, May 15, 2023 at 7:37 AM Daniel Krambrock <
krambr...@hrz.uni-marburg.de> wrote:

> Hello.
>
> I think i found a bug in cephadm/ceph orch:
> Redeploying a container image (tested with alertmanager) after removing
> a custom `mgr/cephadm/container_image_alertmanager` value, deploys the
> previous container image and not the default container image.
>
> I'm running `cephadm` from ubuntu 22.04 pkg 17.2.5-0ubuntu0.22.04.3 and
> `ceph` version 17.2.6.
>
> Here is an example. Node clrz20-08 is the node altermanager is running
> on, clrz20-01 the node I'm controlling ceph from:
>
> * Get alertmanager version
> ```
> root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
> "alertmanager")| .container_image_name'
> "quay.io/prometheus/alertmanager:v0.23.0"
> ```
>
> * Set alertmanager image
> ```
> root@clrz20-01:~# ceph config set mgr
> mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager
> root@clrz20-01:~# ceph config get mgr
> mgr/cephadm/container_image_alertmanager
> quay.io/prometheus/alertmanager
> ```
>
> * redeploy altermanager
> ```
> root@clrz20-01:~# ceph orch redeploy alertmanager
> Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
> ```
>
> * Get alertmanager version
> ```
> root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
> "alertmanager")| .container_image_name'
> "quay.io/prometheus/alertmanager:latest"
> ```
>
> * Remove alertmanager image setting, revert to default:
> ```
> root@clrz20-01:~# ceph config rm mgr
> mgr/cephadm/container_image_alertmanager
> root@clrz20-01:~# ceph config get mgr
> mgr/cephadm/container_image_alertmanager
> quay.io/prometheus/alertmanager:v0.23.0
> ```
>
> * redeploy altermanager
> ```
> root@clrz20-01:~# ceph orch redeploy alertmanager
> Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
> ```
>
> * Get alertmanager version
> ```
> root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
> "alertmanager")| .container_image_name'
> "quay.io/prometheus/alertmanager:latest"
> ```
> -> `mgr/cephadm/container_image_alertmanager` is set to
> `quay.io/prometheus/alertmanager:v0.23.0`
> , but redeploy uses
> `quay.io/prometheus/alertmanager:latest`
> . This looks like a bug.
>
> * Set alertmanager image explicitly to the default value
> ```
> root@clrz20-01:~# ceph config set mgr
> mgr/cephadm/container_image_alertmanager
> quay.io/prometheus/alertmanager:v0.23.0
> root@clrz20-01:~# ceph config get mgr
> mgr/cephadm/container_image_alertmanager
> quay.io/prometheus/alertmanager:v0.23.0
> ```
>
> * redeploy altermanager
> ```
> root@clrz20-01:~# ceph orch redeploy alertmanager
> Scheduled to redeploy alertmanager.clrz20-08 on host 'clrz20-08'
> ```
>
> * Get alertmanager version
> ```
> root@clrz20-08:~# cephadm ls | jq '.[] | select(.service_name ==
> "alertmanager")| .container_image_name'
> "quay.io/prometheus/alertmanager:v0.23.0"
> ```
> -> Setting `mgr/cephadm/container_image_alertmanager` to the default
> setting fixes the issue.
>
>
>
> Bests,
> Daniel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: docker restarting lost all managers accidentally

2023-05-10 Thread Adam King
in /var/lib/ceph// on the host with that mgr
reporting the error, there should be a unit.run file that shows what is
being done to start the mgr as well as a few files that get mounted into
the mgr on startup, notably the "config" and "keyring" files. That config
file should include the mon host addresses. E.g.

[root@vm-01 ~]# cat
/var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/config
# minimal ceph.conf for 5a72983c-ef57-11ed-a389-525400e42d74
[global]
fsid = 5a72983c-ef57-11ed-a389-525400e42d74
mon_host = [v2:192.168.122.75:3300/0,v1:192.168.122.75:6789/0] [v2:
192.168.122.246:3300/0,v1:192.168.122.246:6789/0] [v2:
192.168.122.97:3300/0,v1:192.168.122.97:6789/0]

The first thing I'd do is probably make sure that array of addresses is
correct.

Then you could probably check the keyring file as well and see if it
matches up with what you get running "ceph auth get ".
E.g. here

[root@vm-01 ~]# cat
/var/lib/ceph/5a72983c-ef57-11ed-a389-525400e42d74/mgr.vm-01.ilfvis/keyring
[mgr.vm-01.ilfvis]
key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==

the key matches with

[ceph: root@vm-00 /]# ceph auth get mgr.vm-01.ilfvis
[mgr.vm-01.ilfvis]
key = AQDf01tk7mn/IRAAvZ+ZhUgT77uZsFBSzLGPyQ==
caps mds = "allow *"
caps mon = "profile mgr"
caps osd = "allow *"

I wouldn't post them for obvious reasons (these are just on a test cluster
I'll tear back down so it's fine for me) but those are the first couple
things I'd check. You could also try to make adjustments directly to the
unit.run file if you have other things you'd like to try.
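
If the mon addresses or the keyring do turn out to be wrong, a repair sketch
(<fsid> and the mgr name are placeholders following the usual cephadm layout; this
assumes the ceph CLI still works from some other node):

# On a node that can still reach the mons, fetch the mgr's key
ceph auth get mgr.<name>
# On the broken mgr host, fix the files that get mounted into the container
vi /var/lib/ceph/<fsid>/mgr.<name>/config    # correct the mon_host entries
vi /var/lib/ceph/<fsid>/mgr.<name>/keyring   # paste the key from "ceph auth get"
# Then restart the container through its systemd unit
systemctl restart ceph-<fsid>@mgr.<name>.service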

On Wed, May 10, 2023 at 11:09 AM Ben  wrote:

> Hi,
> This cluster is deployed by cephadm 17.2.5, containerized.
> It ends up in this state (no active mgr):
> [root@8cd2c0657c77 /]# ceph -s
>   cluster:
> id: ad3a132e-e9ee-11ed-8a19-043f72fb8bf9
> health: HEALTH_WARN
> 6 hosts fail cephadm check
> no active mgr
> 1/3 mons down, quorum h18w,h19w
> Degraded data redundancy: 781908/2345724 objects degraded
> (33.333%), 101 pgs degraded, 209 pgs undersized
>
>   services:
> mon: 3 daemons, quorum h18w,h19w (age 19m), out of quorum: h15w
> mgr: no daemons active (since 5h)
> mds: 1/1 daemons up, 1 standby
> osd: 9 osds: 6 up (since 5h), 6 in (since 5h)
> rgw: 2 daemons active (2 hosts, 1 zones)
>
>   data:
> volumes: 1/1 healthy
> pools:   8 pools, 209 pgs
> objects: 781.91k objects, 152 GiB
> usage:   312 GiB used, 54 TiB / 55 TiB avail
> pgs: 781908/2345724 objects degraded (33.333%)
>  108 active+undersized
>  101 active+undersized+degraded
>
> I checked h20w; there is a manager container running with this log:
>
> debug 2023-05-10T12:43:23.315+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T12:48:23.318+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T12:53:23.318+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T12:58:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T13:03:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T13:08:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
> debug 2023-05-10T13:13:23.319+ 7f5e152ec000  0 monclient(hunting):
> authenticate timed out after 300
>
>
> any idea to get a mgr up running again through cephadm?
>
> Thanks,
> Ben
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: non root deploy ceph 17.2.5 failed

2023-05-09 Thread Adam King
which I think was merged too late* (as in the patch wouldn't be in 17.2.6)

On Tue, May 9, 2023 at 5:52 PM Adam King  wrote:

> What's the umask for the "deployer" user? We saw an instance of someone
> hitting something like this, but for them it seemed to only happen when
> they had changed the umask to 027. We had patched in
> https://github.com/ceph/ceph/pull/50736 to address it, which I don't
> think was merged too late for the 17.2.6 release.
>
> On Mon, May 8, 2023 at 5:24 AM Ben  wrote:
>
>> Hi,
>>
>> with following command:
>>
>> sudo cephadm  --docker bootstrap --mon-ip 10.1.32.33
>> --skip-monitoring-stack
>>   --ssh-user deployer
>> the user deployer has passwordless sudo configuration.
>> I can see the error below:
>>
>> debug 2023-05-04T12:46:43.268+ 7fc5ddc2e700  0 [cephadm ERROR
>> cephadm.ssh] Unable to write
>>
>> szhyf-xx1d002-hx15w:/var/lib/ceph/ad3a132e-e9ee-11ed-8a19-043f72fb8bf9/cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e:
>> scp:
>>
>> /tmp/var/lib/ceph/ad3a132e-e9ee-11ed-8a19-043f72fb8bf9/cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e.new:
>> Permission denied
>>
>> Traceback (most recent call last):
>>
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 222, in
>> _write_remote_file
>>
>> await asyncssh.scp(f.name, (conn, tmp_path))
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
>>
>> await source.run(srcpath)
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
>>
>> self.handle_error(exc)
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
>> handle_error
>>
>> raise exc from None
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
>>
>> await self._send_files(path, b'')
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in
>> _send_files
>>
>> self.handle_error(exc)
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
>> handle_error
>>
>> raise exc from None
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in
>> _send_files
>>
>> await self._send_file(srcpath, dstpath, attrs)
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in
>> _send_file
>>
>> await self._make_cd_request(b'C', attrs, size, srcpath)
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in
>> _make_cd_request
>>
>> self._fs.basename(path))
>>
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in
>> make_request
>>
>> raise exc
>>
>> Any ideas on this?
>>
>> Thanks,
>> Ben
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: non root deploy ceph 17.2.5 failed

2023-05-09 Thread Adam King
What's the umask for the "deployer" user? We saw an instance of someone
hitting something like this, but for them it seemed to only happen when
they had changed the umask to 027. We had patched in
https://github.com/ceph/ceph/pull/50736 to address it, which I don't think
was merged too late for the 17.2.6 release.
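
A quick way to check and, if needed, fix that on the hosts, as a sketch (assuming the
ssh user is "deployer"):

# 0027 would match the problematic setup described above
ssh deployer@<host> umask
# Restore the common default for that user, e.g. via the shell profile
echo 'umask 022' >> /home/deployer/.bashrc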

On Mon, May 8, 2023 at 5:24 AM Ben  wrote:

> Hi,
>
> with following command:
>
> sudo cephadm  --docker bootstrap --mon-ip 10.1.32.33
> --skip-monitoring-stack
>   --ssh-user deployer
> the user deployer has passwordless sudo configuration.
> I can see the error below:
>
> debug 2023-05-04T12:46:43.268+ 7fc5ddc2e700  0 [cephadm ERROR
> cephadm.ssh] Unable to write
>
> szhyf-xx1d002-hx15w:/var/lib/ceph/ad3a132e-e9ee-11ed-8a19-043f72fb8bf9/cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e:
> scp:
>
> /tmp/var/lib/ceph/ad3a132e-e9ee-11ed-8a19-043f72fb8bf9/cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e.new:
> Permission denied
>
> Traceback (most recent call last):
>
>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 222, in
> _write_remote_file
>
> await asyncssh.scp(f.name, (conn, tmp_path))
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
>
> await source.run(srcpath)
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
>
> self.handle_error(exc)
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
> handle_error
>
> raise exc from None
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
>
> await self._send_files(path, b'')
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in
> _send_files
>
> self.handle_error(exc)
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in
> handle_error
>
> raise exc from None
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in
> _send_files
>
> await self._send_file(srcpath, dstpath, attrs)
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in
> _send_file
>
> await self._make_cd_request(b'C', attrs, size, srcpath)
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in
> _make_cd_request
>
> self._fs.basename(path))
>
>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in
> make_request
>
> raise exc
>
> Any ideas on this?
>
> Thanks,
> Ben
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2023-05-04 Thread Adam King
For setting the user, the `ceph cephadm set-user` command should do it. I'm a bit
surprised by the second part of that, though. With passwordless sudo access
I would have expected that to start working.
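
One way to verify exactly what cephadm needs, as a sketch (this reuses the cluster's
own ssh key; "cephadmin" is the user mentioned above and the host is a placeholder):

# cephadm effectively runs "sudo which python3" over ssh; "sudo -n" makes it
# fail immediately instead of prompting if a password would be required
ceph config-key get mgr/cephadm/ssh_identity_key > /tmp/cephadm_key
chmod 0600 /tmp/cephadm_key
ssh -i /tmp/cephadm_key cephadmin@<host> sudo -n which python3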

On Thu, May 4, 2023 at 11:27 AM Reza Bakhshayeshi 
wrote:

> Thank you.
> I don't see any more errors rather than:
>
> 2023-05-04T15:07:38.003+ 7ff96cbe0700  0 log_channel(cephadm) log
> [DBG] : Running command: sudo which python3
> 2023-05-04T15:07:38.025+ 7ff96cbe0700  0 log_channel(cephadm) log
> [DBG] : Connection to host1 failed. Process exited with non-zero exit
> status 3
> 2023-05-04T15:07:38.025+ 7ff96cbe0700  0 log_channel(cephadm) log
> [DBG] : _reset_con close host1
>
> What is the best way to safely change the cephadm user to root for the
> existing cluster? It seems "ceph cephadm set-ssh-config" is not effective
> (BTW, my cephadmin user can run "sudo which python3" without prompting
> password on other hosts now, but nothing has been solved)
>
> Best regards,
> Reza
>
> On Tue, 2 May 2023 at 19:00, Adam King  wrote:
>
>> The number of mgr daemons thing is expected. The way it works is it first
>> upgrades all the standby mgrs (which will be all but one) and then fails
>> over so the previously active mgr can be upgraded as well. After that
>> failover is when it's first actually running the newer cephadm code, which
>> is when you're hitting this issue. Are the logs still saying something
>> similar about how "sudo which python3" is failing? I'm thinking this
>> might just be a general issue with the user being used not having
>> passwordless sudo access, that sort of accidentally working in pacific, but
>> now not working any more in quincy. If the log lines confirm the same, we
>> might have to work on something in order to handle this case (making the
>> sudo optional somehow). As mentioned in the previous email, that setup
>> wasn't intended to be supported even in pacific, although if it did work,
>> we could bring something in to make it usable in quincy onward as well.
>>
>> On Tue, May 2, 2023 at 10:58 AM Reza Bakhshayeshi 
>> wrote:
>>
>>> Hi Adam,
>>>
>>> I'm still struggling with this issue. I also checked it one more time
>>> with newer versions, upgrading the cluster from 16.2.11 to 16.2.12 was
>>> successful but from 16.2.12 to 17.2.6 failed again with the same ssh errors
>>> (I checked
>>> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors a
>>> couple of times and all keys/access are fine).
>>>
>>> [root@host1 ~]# ceph health detail
>>> HEALTH_ERR Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
>>> [ERR] UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host2 at
>>> addr (x.x.x.x)
>>> SSH connection failed to host2 at addr (x.x.x.x): Host(s) were
>>> marked offline: {'host2', 'host6', 'host9', 'host4', 'host3', 'host5',
>>> 'host1', 'host7', 'host8'}
>>>
>>> The interesting thing is that always (total number of mgrs) - 1 is
>>> upgraded, If I provision 5 MGRs then 4 of them, and for 3, 2 of them!
>>>
>>> As long as I'm in an internal environment, I also checked the process
>>> with Quincy cephadm binary file. FYI I'm using stretch mode on this cluster.
>>>
>>> I don't understand why Quincy MGRs cannot ssh into Pacific nodes, if you
>>> have any more hints I would be really glad to hear.
>>>
>>> Best regards,
>>> Reza
>>>
>>>
>>>
>>> On Wed, 12 Apr 2023 at 17:18, Adam King  wrote:
>>>
>>>> Ah, okay. Someone else had opened an issue about the same thing after
>>>> the 17.2.5 release I believe. It's changed in 17.2.6 at least to only use
>>>> sudo for non-root users
>>>> https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L148-L153.
>>>> But it looks like you're also using a non-root user anyway. We've required
>>>> passwordless sudo access for custom ssh users for a long time I think (e.g.
>>>> it's in pacific docs
>>>> https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about-cephadm-bootstrap,
>>>> see the point on "--ssh-user"). Did this actually work for you before in
>>>> pacific with a non-root user that doesn't have sudo privileges? I had
>>>> assumed that had never worked.
>>>>
>>>> On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi <
>>>> reza.b2...@gmail.com> wrote:
>>>>
>>>>> Thank you Adam for your

[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Adam King
What specifically does `ceph log last 200 debug cephadm` spit out? The log
lines you've posted so far I don't think are generated by the orchestrator,
so I'm curious what the last actions it took were (and how long ago).

On Thu, May 4, 2023 at 10:35 AM Thomas Widhalm 
wrote:

> To completely rule out hung processes, I managed to get another short
> shutdown.
>
> Now I'm seeing lots of:
>
> mgr.server handle_open ignoring open from mds.mds01.ceph01.usujbi
> v2:192.168.23.61:6800/2922006253; not ready for session (expect reconnect)
> mgr finish mon failed to return metadata for mds.mds01.ceph02.otvipq:
> (2) No such file or directory
>
> log lines. It seems like it now realises that some of this information
> is stale. But it looks like it's just waiting for it to come back and
> not doing anything about it.
>
> On 04.05.23 14:48, Eugen Block wrote:
> > Hi,
> >
> > try setting debug logs for the mgr:
> >
> > ceph config set mgr mgr/cephadm/log_level debug
> >
> > This should provide more details what the mgr is trying and where it's
> > failing, hopefully. Last week this helped to identify an issue between a
> > lower pacific issue for me.
> > Do you see anything in the cephadm.log pointing to the mgr actually
> > trying something?
> >
> >
> > Quoting Thomas Widhalm :
> >
> >> Hi,
> >>
> >> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6 but
> >> the following problem existed when I was still everywhere on 17.2.5 .
> >>
> >> I had a major issue in my cluster which could be solved with a lot of
> >> your help and even more trial and error. Right now it seems that most
> >> is already fixed but I can't rule out that there's still some problem
> >> hidden. The very issue I'm asking about started during the repair.
> >>
> >> When I want to orchestrate the cluster, it logs the command but it
> >> doesn't do anything. No matter if I use ceph dashboard or "ceph orch"
> >> in "cephadm shell". I don't get any error message when I try to deploy
> >> new services, redeploy them etc. The log only says "scheduled" and
> >> that's it. Same when I change placement rules. Usually I use tags. But
> >> since they don't work anymore, too, I tried host and umanaged. No
> >> success. The only way I can actually start and stop containers is via
> >> systemctl from the host itself.
> >>
> >> When I run "ceph orch ls" or "ceph orch ps" I see services I deployed
> >> for testing being deleted (for weeks now). Ans especially a lot of old
> >> MDS are listed as "error" or "starting". The list doesn't match
> >> reality at all because I had to start them by hand.
> >>
> >> I tried "ceph mgr fail" and even a complete shutdown of the whole
> >> cluster with all nodes including all mgs, mds even osd - everything
> >> during a maintenance window. Didn't change anything.
> >>
> >> Could you help me? To be honest I'm still rather new to Ceph and since
> >> I didn't find anything in the logs that caught my eye I would be
> >> thankful for hints how to debug.
> >>
> >> Cheers,
> >> Thomas
> >> --
> >> http://www.widhalm.or.at
> >> GnuPG : 6265BAE6 , A84CB603
> >> Threema: H7AV7D33
> >> Telegram, Signal: widha...@widhalm.or.at
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Orchestration seems not to work

2023-05-04 Thread Adam King
First thing I always check when it seems like orchestrator commands aren't
doing anything is "ceph orch ps" and "ceph orch device ls" and check the
REFRESHED column. If it's well above 10 minutes for orch ps or 30 minutes
for orch device ls, then it means the orchestrator is most likely hanging
on some command to refresh the host information. If that's the case, you
can follow up with a "ceph mgr fail", wait a few minutes and check the orch
ps and device ls REFRESHED column again. If only certain hosts are not
having their daemon/device information refreshed, you can go to the hosts
that aren't having their info refreshed and check for hanging "cephadm"
commands (I just check for "ps aux | grep cephadm").

On Thu, May 4, 2023 at 8:38 AM Thomas Widhalm 
wrote:

> Hi,
>
> I'm in the process of upgrading my cluster from 17.2.5 to 17.2.6 but the
> following problem existed when I was still everywhere on 17.2.5 .
>
> I had a major issue in my cluster which could be solved with a lot of
> your help and even more trial and error. Right now it seems that most is
> already fixed but I can't rule out that there's still some problem
> hidden. The very issue I'm asking about started during the repair.
>
> When I want to orchestrate the cluster, it logs the command but it
> doesn't do anything. No matter if I use ceph dashboard or "ceph orch" in
> "cephadm shell". I don't get any error message when I try to deploy new
> services, redeploy them etc. The log only says "scheduled" and that's
> it. Same when I change placement rules. Usually I use tags. But since
> they don't work anymore, too, I tried host and unmanaged. No success. The
> only way I can actually start and stop containers is via systemctl from
> the host itself.
>
> When I run "ceph orch ls" or "ceph orch ps" I see services I deployed
> for testing being deleted (for weeks now). And especially a lot of old
> MDS are listed as "error" or "starting". The list doesn't match reality
> at all because I had to start them by hand.
>
> I tried "ceph mgr fail" and even a complete shutdown of the whole
> cluster with all nodes including all mgs, mds even osd - everything
> during a maintenance window. Didn't change anything.
>
> Could you help me? To be honest I'm still rather new to Ceph and since I
> didn't find anything in the logs that caught my eye I would be thankful
> for hints how to debug.
>
> Cheers,
> Thomas
> --
> http://www.widhalm.or.at
> GnuPG : 6265BAE6 , A84CB603
> Threema: H7AV7D33
> Telegram, Signal: widha...@widhalm.or.at
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2023-05-02 Thread Adam King
The number of mgr daemons thing is expected. The way it works is it first
upgrades all the standby mgrs (which will be all but one) and then fails
over so the previously active mgr can be upgraded as well. After that
failover is when it's first actually running the newer cephadm code, which
is when you're hitting this issue. Are the logs still saying something
similar about how "sudo which python3" is failing? I'm thinking this might
just be a general issue with the user being used not having passwordless
sudo access, that sort of accidentally working in pacific, but now not
working any more in quincy. If the log lines confirm the same, we might
have to work on something in order to handle this case (making the sudo
optional somehow). As mentioned in the previous email, that setup wasn't
intended to be supported even in pacific, although if it did work, we could
bring something in to make it usable in quincy onward as well.
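
If you want to sanity check that part up front, a quick test from the node
running the active mgr would be something like this (the user and host names
below are just examples, substitute your --ssh-user and one of your hosts):

# should print a python3 path without ever prompting for a password
ssh cephadmin@host1 sudo which python3

# if it prompts or exits non-zero, the user is missing passwordless sudo;
# a sudoers rule along these lines is what cephadm effectively relies on
#   cephadmin ALL=(ALL) NOPASSWD: ALL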

On Tue, May 2, 2023 at 10:58 AM Reza Bakhshayeshi 
wrote:

> Hi Adam,
>
> I'm still struggling with this issue. I also checked it one more time with
> newer versions, upgrading the cluster from 16.2.11 to 16.2.12 was
> successful but from 16.2.12 to 17.2.6 failed again with the same ssh errors
> (I checked
> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors a
> couple of times and all keys/access are fine).
>
> [root@host1 ~]# ceph health detail
> HEALTH_ERR Upgrade: Failed to connect to host host2 at addr (x.x.x.x)
> [ERR] UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host host2 at
> addr (x.x.x.x)
> SSH connection failed to host2 at addr (x.x.x.x): Host(s) were marked
> offline: {'host2', 'host6', 'host9', 'host4', 'host3', 'host5', 'host1',
> 'host7', 'host8'}
>
> The interesting thing is that always (total number of mgrs) - 1 is
> upgraded: if I provision 5 MGRs then 4 of them, and for 3, 2 of them!
>
> Since I'm in an internal environment, I also checked the process with
> Quincy cephadm binary file. FYI I'm using stretch mode on this cluster.
>
> I don't understand why Quincy MGRs cannot ssh into Pacific nodes, if you
> have any more hints I would be really glad to hear.
>
> Best regards,
> Reza
>
>
>
> On Wed, 12 Apr 2023 at 17:18, Adam King  wrote:
>
>> Ah, okay. Someone else had opened an issue about the same thing after
>> the 17.2.5 release I believe. It's changed in 17.2.6 at least to only use
>> sudo for non-root users
>> https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L148-L153.
>> But it looks like you're also using a non-root user anyway. We've required
>> passwordless sudo access for custom ssh users for a long time I think (e.g.
>> it's in pacific docs
>> https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about-cephadm-bootstrap,
>> see the point on "--ssh-user"). Did this actually work for you before in
>> pacific with a non-root user that doesn't have sudo privileges? I had
>> assumed that had never worked.
>>
>> On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi 
>> wrote:
>>
>>> Thank you Adam for your response,
>>>
>>> I tried all your comments and the troubleshooting link you sent. From
>>> the Quincy mgrs containers, they can ssh into all other Pacific nodes
>>> successfully by running the exact command in the log output and vice versa.
>>>
>>> Here are some debug logs from the cephadm while updating:
>>>
>>> 2023-04-12T11:35:56.260958+ mgr.host8.jukgqm (mgr.4468627) 103 :
>>> cephadm [DBG] Opening connection to cephadmin@x.x.x.x with ssh options
>>> '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
>>> 2023-04-12T11:35:56.525091+ mgr.host8.jukgqm (mgr.4468627) 144 :
>>> cephadm [DBG] _run_cephadm : command = ls
>>> 2023-04-12T11:35:56.525406+ mgr.host8.jukgqm (mgr.4468627) 145 :
>>> cephadm [DBG] _run_cephadm : args = []
>>> 2023-04-12T11:35:56.525571+ mgr.host8.jukgqm (mgr.4468627) 146 :
>>> cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256
>>> :1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
>>> 2023-04-12T11:35:56.525619+ mgr.host8.jukgqm (mgr.4468627) 147 :
>>> cephadm [DBG] args: --image 
>>> my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
>>> ls
>>> 2023-04-12T11:35:56.525738+ mgr.host8.jukgqm (mgr.4468627) 148 :
>>> cephadm [DBG] Running command: sudo which python3
>>> 2023-04-12T11:35:56.534227+ mgr.host8.jukgqm (mgr.4468627) 149 :
>>> cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit status 3

[ceph-users] Re: 16.2.13 pacific QE validation status

2023-05-01 Thread Adam King
approved for the rados/cephadm stuff

On Thu, Apr 27, 2023 at 5:21 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/59542#note-1
> Release Notes - TBD
>
> Seeking approvals for:
>
> smoke - Radek, Laura
> rados - Radek, Laura
>   rook - Sébastien Han
>   cephadm - Adam K
>   dashboard - Ernesto
>
> rgw - Casey
> rbd - Ilya
> krbd - Ilya
> fs - Venky, Patrick
> upgrade/octopus-x (pacific) - Laura (look the same as in 16.2.8)
> upgrade/pacific-p2p - Laura
> powercycle - Brad (SELinux denials)
> ceph-volume - Guillaume, Adam K
>
> Thx
> YuriW
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: pacific v16.2.1 (hot-fix) QE Validation status

2023-04-12 Thread Adam King
Obviously the issue installing EPEL makes the runs look pretty bad. But,
given the ubuntu based tests look alright, the EPEL stuff is likely not on
our side (so who knows when it will be resolved), and this is only
16.2.11 + a handful of ceph-volume patches, I'm willing to approve in the
interest of time.

On Wed, Apr 12, 2023 at 11:28 AM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/59426#note-3
> Release Notes - TBD
>
> Seeking approvals/reviews for:
>
> smoke - Josh approved?
> orch - Adam King approved?
>
> (there are infrastructure issues in the runs, but we want to release this
> ASAP)
>
> Thx
> YuriW
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2023-04-12 Thread Adam King
Ah, okay. Someone else had opened an issue about the same thing after
the 17.2.5 release I believe. It's changed in 17.2.6 at least to only use
sudo for non-root users
https://github.com/ceph/ceph/blob/v17.2.6/src/pybind/mgr/cephadm/ssh.py#L148-L153.
But it looks like you're also using a non-root user anyway. We've required
passwordless sudo access for custom ssh users for a long time I think (e.g.
it's in pacific docs
https://docs.ceph.com/en/pacific/cephadm/install/#further-information-about-cephadm-bootstrap,
see the point on "--ssh-user"). Did this actually work for you before in
pacific with a non-root user that doesn't have sudo privileges? I had
assumed that had never worked.

On Wed, Apr 12, 2023 at 10:38 AM Reza Bakhshayeshi 
wrote:

> Thank you Adam for your response,
>
> I tried all your comments and the troubleshooting link you sent. From the
> Quincy mgrs containers, they can ssh into all other Pacific nodes
> successfully by running the exact command in the log output and vice versa.
>
> Here are some debug logs from the cephadm while updating:
>
> 2023-04-12T11:35:56.260958+ mgr.host8.jukgqm (mgr.4468627) 103 :
> cephadm [DBG] Opening connection to cephadmin@x.x.x.x with ssh options
> '-F /tmp/cephadm-conf-2bbfubub -i /tmp/cephadm-identity-7x2m8gvr'
> 2023-04-12T11:35:56.525091+ mgr.host8.jukgqm (mgr.4468627) 144 :
> cephadm [DBG] _run_cephadm : command = ls
> 2023-04-12T11:35:56.525406+ mgr.host8.jukgqm (mgr.4468627) 145 :
> cephadm [DBG] _run_cephadm : args = []
> 2023-04-12T11:35:56.525571+ mgr.host8.jukgqm (mgr.4468627) 146 :
> cephadm [DBG] mon container image my-private-repo/quay-io/ceph/ceph@sha256
> :1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
> 2023-04-12T11:35:56.525619+ mgr.host8.jukgqm (mgr.4468627) 147 :
> cephadm [DBG] args: --image 
> my-private-repo/quay-io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add
> ls
> 2023-04-12T11:35:56.525738+ mgr.host8.jukgqm (mgr.4468627) 148 :
> cephadm [DBG] Running command: sudo which python3
> 2023-04-12T11:35:56.534227+ mgr.host8.jukgqm (mgr.4468627) 149 :
> cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit
> status 3
> 2023-04-12T11:35:56.534275+ mgr.host8.jukgqm (mgr.4468627) 150 :
> cephadm [DBG] _reset_con close host1
> 2023-04-12T11:35:56.540135+ mgr.host8.jukgqm (mgr.4468627) 158 :
> cephadm [DBG] Host "host1" marked as offline. Skipping gather facts refresh
> 2023-04-12T11:35:56.540178+ mgr.host8.jukgqm (mgr.4468627) 159 :
> cephadm [DBG] Host "host1" marked as offline. Skipping network refresh
> 2023-04-12T11:35:56.540408+ mgr.host8.jukgqm (mgr.4468627) 160 :
> cephadm [DBG] Host "host1" marked as offline. Skipping device refresh
> 2023-04-12T11:35:56.540490+ mgr.host8.jukgqm (mgr.4468627) 161 :
> cephadm [DBG] Host "host1" marked as offline. Skipping osdspec preview
> refresh
> 2023-04-12T11:35:56.540527+ mgr.host8.jukgqm (mgr.4468627) 162 :
> cephadm [DBG] Host "host1" marked as offline. Skipping autotune
> 2023-04-12T11:35:56.540978+ mgr.host8.jukgqm (mgr.4468627) 163 :
> cephadm [DBG] Connection to host1 failed. Process exited with non-zero exit
> status 3
> 2023-04-12T11:35:56.796966+ mgr.host8.jukgqm (mgr.4468627) 728 :
> cephadm [ERR] Upgrade: Paused due to UPGRADE_OFFLINE_HOST: Upgrade: Failed
> to connect to host host1 at addr (x.x.x.x)
>
> As I can see here, it turns out sudo is added to the code to be able to
> continue:
>
>
> https://github.com/ceph/ceph/blob/v17.2.5/src/pybind/mgr/cephadm/ssh.py#L143
>
> I cannot privilege the cephadmin user to run sudo commands for some policy
> reasons, could this be the root cause of the issue?
>
> Best regards,
> Reza
>
> On Thu, 6 Apr 2023 at 14:59, Adam King  wrote:
>
>> Does "ceph health detail" give any insight into what the unexpected
>> exception was? If not, I'm pretty confident some traceback would end up
>> being logged. Could maybe still grab it with "ceph log last 200 info
>> cephadm" if not a lot else has happened. Also, probably need to find out if
>> the check-host is failing due to the check on the host actually failing or
>> failing to connect to the host. Could try putting a copy of the cephadm
>> binary on one and running "cephadm check-host --expect-hostname "
>> where the hostname is the name cephadm knows the host by. If that's not an
>> issue I'd expect it's a connection thing. Could maybe try going through
>>  https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors
>> <https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors>.
>> Cephadm changed

[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Adam King
It seems like it maybe didn't actually do the redeploy as it should log
something saying it's actually doing it on top of the line saying it
scheduled it. To confirm, is the upgrade paused ("ceph orch upgrade status"
reports is_paused as true)? If so, maybe try doing a mgr failover ("ceph
mgr fail") and then check "ceph orch ps"  and "ceph orch device ls" a few
minutes later and look at the REFRESHED column. If any of those are giving
amounts of time farther back then when you did the failover, there's
probably something going on on the host(s) where it says it hasn't
refreshed recently that's sticking things up (you'd have to go on that host
and look for hanging cephadm commands). Lastly, you could look at the
/var/lib/ceph///unit.run file on the hosts where the
mds daemons are deployed. The (very long) last podman/docker run line in
that file should have the image name of the image the daemon is being
deployed with. So you could use that to confirm if cephadm ever actually
tried a redeploy of the mds with the new image. You could also check the
journal logs for the mds. Cephadm reports the systemd unit name for the
daemon as part of "cephadm ls" output. If you put a copy of the cephadm
binary on the host, run "cephadm ls" with it, and grab the systemd unit name
for the mds daemon from that output, you can use that to check the journal
logs, which should tell you the last restart time and why it's gone down.
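
As a rough sketch of those last checks (fsid, host and daemon id below are
placeholders, take the real names from "cephadm ls" / "ceph orch ps"):

# the image the daemon was last deployed with is in the final
# podman/docker run line of this file
cat /var/lib/ceph/<fsid>/mds.mds01.<host>.<id>/unit.run

# list daemons as cephadm sees them and note the systemd unit name
cephadm ls

# then see when the daemon last restarted and why it went down
journalctl -u ceph-<fsid>@mds.mds01.<host>.<id> --since "2 hours ago"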

On Mon, Apr 10, 2023 at 4:25 PM Thomas Widhalm 
wrote:

> I did what you told me.
>
> I also see in the log, that the command went through:
>
> 2023-04-10T19:58:46.522477+ mgr.ceph04.qaexpv [INF] Schedule
> redeploy daemon mds.mds01.ceph06.rrxmks
> 2023-04-10T20:01:03.360559+ mgr.ceph04.qaexpv [INF] Schedule
> redeploy daemon mds.mds01.ceph05.pqxmvt
> 2023-04-10T20:01:21.787635+ mgr.ceph04.qaexpv [INF] Schedule
> redeploy daemon mds.mds01.ceph07.omdisd
>
>
> But the MDS never start. They stay in error state. I tried to redeploy
> and start them a few times. Even restarted one host where a MDS should run.
>
> mds.mds01.ceph03.xqwdjy  ceph03   error   32m ago
> 2M-- 
> mds.mds01.ceph04.hcmvae  ceph04   error   31m ago
> 2h-- 
> mds.mds01.ceph05.pqxmvt  ceph05   error   32m ago
> 9M-- 
> mds.mds01.ceph06.rrxmks  ceph06   error   32m ago
> 10w-- 
> mds.mds01.ceph07.omdisd  ceph07   error   32m ago
> 2M-- 
>
>
> And other ideas? Or am I missing something.
>
> Cheers,
> Thomas
>
> On 10.04.23 21:53, Adam King wrote:
> > Will also note that the normal upgrade process scales down the mds
> > service to have only 1 mds per fs before upgrading it, so maybe
> > something you'd want to do as well if the upgrade didn't do it already.
> > It does so by setting the max_mds to 1 for the fs.
> >
> > On Mon, Apr 10, 2023 at 3:51 PM Adam King  > <mailto:adk...@redhat.com>> wrote:
> >
> > You could try pausing the upgrade and manually "upgrading" the mds
> > daemons by redeploying them on the new image. Something like "ceph
> > orch daemon redeploy  --image <17.2.6 image>"
> > (daemon names should match those in "ceph orch ps" output). If you
> > do that for all of them and then get them into an up state you
> > should be able to resume the upgrade and have it complete.
> >
> > On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm  wrote:
> >
> > Hi,
> >
> > If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I
> > was very relieved when 17.2.6 was released and started to update
> > immediately.
> >
> > But now I'm stuck again with my broken MDS. MDS won't get into
> > up:active
> > without the update but the update waits for them to get into
> > up:active
> > state. Seems like a deadlock / chicken-egg problem to me.
> >
> > Since I'm still relatively new to Ceph, could you help me?
> >
> > What I see when watching the update status:
> >
> > {
> >   "target_image":
> > "
> quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
> <
> http://quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
> >",
> > 

[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Adam King
Will also note that the normal upgrade process scales down the mds service
to have only 1 mds per fs before upgrading it, so maybe something you'd
want to do as well if the upgrade didn't do it already. It does so by
setting the max_mds to 1 for the fs.
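
If the upgrade didn't already do that, the manual step is just (filesystem
name is a placeholder):

ceph fs set <fs_name> max_mds 1
# and to confirm:
ceph fs get <fs_name> | grep max_mds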

On Mon, Apr 10, 2023 at 3:51 PM Adam King  wrote:

> You could try pausing the upgrade and manually "upgrading" the mds daemons
> by redeploying them on the new image. Something like "ceph orch daemon
> redeploy  --image <17.2.6 image>" (daemon names should
> match those in "ceph orch ps" output). If you do that for all of them and
> then get them into an up state you should be able to resume the upgrade and
> have it complete.
>
> On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm 
> wrote:
>
>> Hi,
>>
>> If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I
>> was very relieved when 17.2.6 was released and started to update
>> immediately.
>>
>> But now I'm stuck again with my broken MDS. MDS won't get into up:active
>> without the update but the update waits for them to get into up:active
>> state. Seems like a deadlock / chicken-egg problem to me.
>>
>> Since I'm still relatively new to Ceph, could you help me?
>>
>> What I see when watching the update status:
>>
>> {
>>  "target_image":
>> "
>> quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
>> ",
>>  "in_progress": true,
>>  "which": "Upgrading all daemon types on all hosts",
>>  "services_complete": [
>>  "crash",
>>  "mgr",
>> "mon",
>> "osd"
>>  ],
>>  "progress": "18/40 daemons upgraded",
>>  "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect
>> to host ceph01 at addr (192.168.23.61)",
>>  "is_paused": false
>> }
>>
>> (The offline host was one host that broke during the upgrade. I fixed
>> that in the meantime and the update went on.)
>>
>> And in the log:
>>
>> 2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: Waiting
>> for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
>> 2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No mds
>> is up; continuing upgrade procedure to poke things in the right direction
>>
>>
>> Please give me a hint what I can do.
>>
>> Cheers,
>> Thomas
>> --
>> http://www.widhalm.or.at
>> GnuPG : 6265BAE6 , A84CB603
>> Threema: H7AV7D33
>> Telegram, Signal: widha...@widhalm.or.at
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from 17.2.5 to 17.2.6 stuck at MDS

2023-04-10 Thread Adam King
You could try pausing the upgrade and manually "upgrading" the mds daemons
by redeploying them on the new image. Something like "ceph orch daemon
redeploy  --image <17.2.6 image>" (daemon names should
match those in "ceph orch ps" output). If you do that for all of them and
then get them into an up state you should be able to resume the upgrade and
have it complete.
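
As a concrete sketch with one of the daemon names from your output (the
image tag is only an example, any valid 17.2.6 image reference works):

ceph orch upgrade pause

# repeat for each mds listed by "ceph orch ps --daemon-type mds"
ceph orch daemon redeploy mds.mds01.ceph04.hcmvae --image quay.io/ceph/ceph:v17.2.6

# once they are back up
ceph orch upgrade resume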

On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm 
wrote:

> Hi,
>
> If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I
> was very relieved when 17.2.6 was released and started to update
> immediately.
>
> But now I'm stuck again with my broken MDS. MDS won't get into up:active
> without the update but the update waits for them to get into up:active
> state. Seems like a deadlock / chicken-egg problem to me.
>
> Since I'm still relatively new to Ceph, could you help me?
>
> What I see when watching the update status:
>
> {
>  "target_image":
> "
> quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635
> ",
>  "in_progress": true,
>  "which": "Upgrading all daemon types on all hosts",
>  "services_complete": [
>  "crash",
>  "mgr",
> "mon",
> "osd"
>  ],
>  "progress": "18/40 daemons upgraded",
>  "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect
> to host ceph01 at addr (192.168.23.61)",
>  "is_paused": false
> }
>
> (The offline host was one host that broke during the upgrade. I fixed
> that in the meantime and the update went on.)
>
> And in the log:
>
> 2023-04-10T19:23:48.750129+ mgr.ceph04.qaexpv [INF] Upgrade: Waiting
> for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
> 2023-04-10T19:23:58.758141+ mgr.ceph04.qaexpv [WRN] Upgrade: No mds
> is up; continuing upgrade procedure to poke things in the right direction
>
>
> Please give me a hint what I can do.
>
> Cheers,
> Thomas
> --
> http://www.widhalm.or.at
> GnuPG : 6265BAE6 , A84CB603
> Threema: H7AV7D33
> Telegram, Signal: widha...@widhalm.or.at
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"

2023-04-06 Thread Adam King
Does "ceph health detail" give any insight into what the unexpected
exception was? If not, I'm pretty confident some traceback would end up
being logged. Could maybe still grab it with "ceph log last 200 info
cephadm" if not a lot else has happened. Also, probably need to find out if
the check-host is failing due to the check on the host actually failing or
failing to connect to the host. Could try putting a copy of the cephadm
binary on one and running "cephadm check-host --expect-hostname "
where the hostname is the name cephadm knows the host by. If that's not an
issue I'd expect it's a connection thing. Could maybe try going through
 https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#ssh-errors
.
Cephadm changed the backend ssh library from pacific to quincy due to the
one used in pacific no longer being supported so it's possible some general
ssh error has popped up in your env as a result.
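
In practice the checks on that page boil down to something like this
(hostname and ssh user are placeholders):

# on the host itself, with a copy of the cephadm binary
cephadm check-host --expect-hostname <hostname as cephadm knows it>

# reproduce the exact ssh path the mgr uses
ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > key
chmod 0600 key
ssh -F ssh_config -i key <ssh-user>@<host>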

On Thu, Apr 6, 2023 at 8:38 AM Reza Bakhshayeshi 
wrote:

> Hi all,
>
> I have a problem regarding upgrading Ceph cluster from Pacific to Quincy
> version with cephadm. I have successfully upgraded the cluster to the
> latest Pacific (16.2.11). But when I run the following command to upgrade
> the cluster to 17.2.5, After upgrading 3/4 mgrs, the upgrade process stops
> with "Unexpected error". (everything is on a private network)
>
> ceph orch upgrade start my-private-repo/quay-io/ceph/ceph:v17.2.5
>
> I also tried the 17.2.4 version.
>
> cephadm fails to check the hosts' status and marks them as offline:
>
> cephadm 2023-04-06T10:19:59.998510+ mgr.host9.arhpnd (mgr.4516356) 5782
> : cephadm [DBG]  host host4 (x.x.x.x) failed check
> cephadm 2023-04-06T10:19:59.998553+ mgr.host9.arhpnd (mgr.4516356) 5783
> : cephadm [DBG] Host "host4" marked as offline. Skipping daemon refresh
> cephadm 2023-04-06T10:19:59.998581+ mgr.host9.arhpnd (mgr.4516356) 5784
> : cephadm [DBG] Host "host4" marked as offline. Skipping gather facts
> refresh
> cephadm 2023-04-06T10:19:59.998609+ mgr.host9.arhpnd (mgr.4516356) 5785
> : cephadm [DBG] Host "host4" marked as offline. Skipping network refresh
> cephadm 2023-04-06T10:19:59.998633+ mgr.host9.arhpnd (mgr.4516356) 5786
> : cephadm [DBG] Host "host4" marked as offline. Skipping device refresh
> cephadm 2023-04-06T10:19:59.998659+ mgr.host9.arhpnd (mgr.4516356) 5787
> : cephadm [DBG] Host "host4" marked as offline. Skipping osdspec preview
> refresh
> cephadm 2023-04-06T10:19:59.998682+ mgr.host9.arhpnd (mgr.4516356) 5788
> : cephadm [DBG] Host "host4" marked as offline. Skipping autotune
> cluster 2023-04-06T10:20:00.000151+ mon.host8 (mon.0) 158587 : cluster
> [ERR] Health detail: HEALTH_ERR 9 hosts fail cephadm check; Upgrade: failed
> due to an unexpected exception
> cluster 2023-04-06T10:20:00.000191+ mon.host8 (mon.0) 158588 : cluster
> [ERR] [WRN] CEPHADM_HOST_CHECK_FAILED: 9 hosts fail cephadm check
> cluster 2023-04-06T10:20:00.000202+ mon.host8 (mon.0) 158589 : cluster
> [ERR] host host7 (x.x.x.x) failed check: Unable to reach remote host
> host7. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000213+ mon.host8 (mon.0) 158590 : cluster
> [ERR] host host2 (x.x.x.x) failed check: Unable to reach remote host
> host2. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000220+ mon.host8 (mon.0) 158591 : cluster
> [ERR] host host8 (x.x.x.x) failed check: Unable to reach remote host
> host8. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000228+ mon.host8 (mon.0) 158592 : cluster
> [ERR] host host4 (x.x.x.x) failed check: Unable to reach remote host
> host4. Process exited with non-zero exit status 3
> cluster 2023-04-06T10:20:00.000240+ mon.host8 (mon.0) 158593 : cluster
> [ERR] host host3 (x.x.x.x) failed check: Unable to reach remote host
> host3. Process exited with non-zero exit status 3
>
> and here are some outputs of the commands:
>
> [root@host8 ~]# ceph -s
>   cluster:
> id: xxx
> health: HEALTH_ERR
> 9 hosts fail cephadm check
> Upgrade: failed due to an unexpected exception
>
>   services:
> mon: 5 daemons, quorum host8,host1,host7,host2,host9 (age 2w)
> mgr: host9.arhpnd(active, since 105m), standbys: host8.jowfih,
> host1.warjsr, host2.qyavjj
> mds: 1/1 daemons up, 3 standby
> osd: 37 osds: 37 up (since 8h), 37 in (since 3w)
>
>   data:
>
>
>   io:
> client:
>
>   progress:
> Upgrade to 17.2.5 (0s)
>   []
>
> [root@host8 ~]# ceph orch upgrade status
> {
> "target_image": "my-private-repo/quay-io/ceph/ceph@sha256
> :34c763383e3323c6bb35f3f2229af9f466518d9db926111277f5e27ed543c427",
> "in_progress": true,
> "which": "Upgrading all daemon types on all hosts",
> "services_complete": [],
> "progress": "3/59 daemons upgraded",
> "message": 

[ceph-users] Re: ceph orch ps mon, mgr, osd shows for version, image and container id

2023-03-31 Thread Adam King
>
> The unknown is only for osd, mon and mgr  services. It is across all
> nodes.
>
> Also  some other items missing are  PORTS, STATUS (time),  MEM USE,
>
> NAME
> HOST
> PORTSSTATUS REFRESHED  AGE  MEM USE  MEM LIM
> VERSION IMAGE ID  CONTAINER ID
>
> rgw.default.default.zp3110b001a0103.ftizjg   zp3110b001a0103
> *:8080   running (12h) 5m ago   8M 145M-
> 16.2.5  6e73176320aa  bd6c4d4262b3
>
> alertmanager.zp3110b001a0101
> zp3110b001a0101  running   3m ago
> 8M--   
> 
>
> mds.cephfs.zp3110b001a0102.sihibe
> zp3110b001a0102   stopped   9m ago
> 4M--  
>
> mgr.zp3110b001a0101
> zp3110b001a0101   running   3m ago   8M
> --  
>
> mgr.zp3110b001a0102
> zp3110b001a0102   running   9m ago   8M
> --  
>
> mon.zp3110b001a0101
>  zp3110b001a0101   running   3m ago   8M
> -2048M  
>
> mon.zp3110b001a0102
>  zp3110b001a0102   running   9m ago   8M
> -2048M  
>
>
>
> Thank you,
>
> Anantha
>
>
>
> *From:* Adam King 
> *Sent:* Thursday, March 30, 2023 8:08 AM
> *To:* Adiga, Anantha 
> *Cc:* ceph-users@ceph.io
> *Subject:* Re: [ceph-users] ceph orch ps mon, mgr, osd shows 
> for version, image and container id
>
>
>
> if you put a copy of the cephadm binary onto one of these hosts (e.g.
> a002s002) and run "cephadm ls" what does it give for the OSDs? That's where
> the orch ps information comes from.
>
>
>
> On Thu, Mar 30, 2023 at 10:48 AM  wrote:
>
> Hi ,
>
> Why is ceph orch ps showing unknown version, image and container id?
>
> root@a002s002:~# cephadm shell ceph mon versions
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> {
> "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a)
> pacific (stable)": 3
> }
> root@a002s002:~# cephadm shell ceph mgr versions
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> {
> "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a)
> pacific (stable)": 3
> }
>
> root@a002s002:~# cephadm shell ceph orch ps --daemon-type mgr
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> NAME  HOST  PORTS  STATUS   REFRESHED  AGE  MEM USE  MEM LIM
> VERSIONIMAGE ID
> mgr.a002s002  a002s002 running 4m ago  11M--
>   
> mgr.a002s003  a002s003 running87s ago  11M--
>   
> mgr.a002s004  a002s004 running 4m ago  11M--
>   
> root@a002s002:~# cephadm shell ceph orch ps --daemon-type mon
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> NAME  HOST  PORTS  STATUSREFRESHED  AGE  MEM USE  MEM
> LIM  VERSIONIMAGE ID  CONTAINER ID
> mon.a002s002  a002s002 running  4m ago  11M-
> 2048M 
> mon.a002s003  a002s003 running 95s ago  11M-
> 2048M 
> mon.a002s004  a002s004 running (4w) 4m ago   5M1172M
> 2048M  16.2.5 6e73176320aa  d38b94e00d28
> root@a002s002:~# cephadm shell ceph orch ps --daemon-type osd
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> NAMEHOST  PORTS  STATUS   REFRESHED  AGE  MEM USE  MEM LIM
> VERSIONIMAGE ID
> osd.0   a002s002 running 8m ago  11M-10.9G
>   
> osd.1   a002s003 running 5m ago  11M-10.9G
>   
> osd.10  a002s004 running 8m ago  11M-10.9G
>   
> osd.11  a002s003 running 5m ago  11M-10.9G
>   
> osd.12  a002s002 running 8m ago  11M-10.9G
>   
> osd.13  a002s004 running 8m ago  11M-10.9G
>   
> osd.14  a002s003 running 5m ago  11M-10.9G
>   
> osd.15  a002s002 running 8m ago  11M-10.9G
>   
> osd.16  a002s004 running 8m ago  11M-10.9G
>   
> osd.17  a002s003 running 5m ago  11M-10.9G
>   
> osd.18  a002s002 running 8m ago  11M-10.9G
>   
> osd.19  a002s004 running 8m ago  11M-10.9G
>   
> osd.2   a002s004 running 8m ago  11M-10.9G
>   
> osd.20  a002s003 running 5m ago  11M-10.9G
>   
> osd.21  a002s002 running 8m ago  11M-10.9G
>   
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephadm - Error ENOENT: Module not found

2023-03-30 Thread Adam King
for the specific issue with that traceback, you can probably resolve that
by removing the stored upgrade state. We put it at
`mgr/cephadm/upgrade_state` I believe (can check "ceph config-key ls" and
look for something related to upgrade state if that doesn't work) so
running "ceph config-key rm mgr/cephadm/upgrade_state" should remove the
old one. Then I'd say manually downgrade the mgr daemons to avoid this
happening again (process is roughly the same as
https://docs.ceph.com/en/quincy/cephadm/upgrade/#upgrading-to-a-version-that-supports-staggered-upgrade-from-one-that-doesn-t)
and at that point you should be able to try using an upgrade command again.

On Thu, Mar 30, 2023 at 11:07 AM  wrote:

> Hello,
> After a successful upgrade of a Ceph cluster from 16.2.7 to 16.2.11, I
> needed to downgrade it back to 16.2.7 as I found an issue with the new
> version.
>
> I expected that running the downgrade with:`ceph orch upgrade start
> --ceph-version 16.2.7` should have worked fine. However, it blocked right
> after the downgrade of the first MGR daemon. In fact, the downgraded daemon
> is not able to use the cephadm module anymore. Any `ceph orch` command
> fails with the following error:
>
> ```
> $ ceph orch ps
> Error ENOENT: Module not found
> ```
> And the downgrade process is therefore blocked.
>
> These are the logs of the MGR when issuing the command:
>
> ```
> Mar 28 12:13:15 astano03
> ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]:
> debug 2023-03-28T10:13:15.557+ 7f828fe8c700  0 log_channel(audit) log
> [DBG] : from='client.3136173 -' entity='client.admin' cmd=[{"prefix": "orch
> ps", "target": ["mon-mgr", ""]}]: dispatch
> Mar 28 12:13:15 astano03
> ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]:
> debug 2023-03-28T10:13:15.558+ 7f829068d700  0 [orchestrator DEBUG
> root] _oremote orchestrator -> cephadm.list_daemons(*(None, None),
> **{'daemon_id': None, 'host': None, 'refresh': False})
> Mar 28 12:13:15 astano03
> ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]:
> debug 2023-03-28T10:13:15.558+ 7f829068d700 -1 no module 'cephadm'
> Mar 28 12:13:15 astano03
> ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]:
> debug 2023-03-28T10:13:15.558+ 7f829068d700  0 [orchestrator DEBUG
> root] _oremote orchestrator -> cephadm.get_feature_set(*(), **{})
> Mar 28 12:13:15 astano03
> ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]:
> debug 2023-03-28T10:13:15.558+ 7f829068d700 -1 no module 'cephadm'
> Mar 28 12:13:15 astano03
> ceph-c57586c4-8e44-11eb-a116-248a07aa8d2e-mgr-astano03-qtzccn[2232770]:
> debug 2023-03-28T10:13:15.558+ 7f829068d700 -1 mgr.server reply reply
> (2) No such file or directory Module not found
> ```
>
> Other interesting MGR logs are:
> ```
>  2023-03-28T11:05:59.519+ 7fcd16314700  4 mgr get_store get_store key:
> mgr/cephadm/upgrade_state
>  2023-03-28T11:05:59.519+ 7fcd16314700 -1 mgr load Failed to construct
> class in 'cephadm'
>  2023-03-28T11:05:59.519+ 7fcd16314700 -1 mgr load Traceback (most
> recent call last):
> e "/usr/share/ceph/mgr/cephadm/module.py", line 450, in __init__
> elf.upgrade = CephadmUpgrade(self)
> e "/usr/share/ceph/mgr/cephadm/upgrade.py", line 111, in __init__
> elf.upgrade_state: Optional[UpgradeState] =
> UpgradeState.from_json(json.loads(t))
> e "/usr/share/ceph/mgr/cephadm/upgrade.py", line 92, in from_json
> eturn cls(**c)
> rror: __init__() got an unexpected keyword argument 'daemon_types'
>
>  2023-03-28T11:05:59.521+ 7fcd16314700 -1 mgr operator() Failed to run
> module in active mode ('cephadm')
> ```
> Which seem to relate to the new feature of staggered upgrades.
>
> Please note that before, everything was working fine with version 16.2.7.
>
> I am currently stuck in this situation with only one MGR daemon on version
> 16.2.11 which is the only one still working fine:
>
> ```
> [root@astano01 ~]# ceph orch ps | grep mgr
> mgr.astano02.mzmewnastano02  *:8443,9283  running
> (5d) 43s ago   2y 455M-  16.2.11  7a63bce27215  e2d7806acf16
> mgr.astano03.qtzccnastano03  *:8443,9283  running
> (3m) 22s ago  95m 383M-  16.2.7   463ec4b1fdc0  cc0d88864fa1
> ```
>
> Does anyone already faced this issue or knows how can I make the 16.2.7
> MGR load the cephadm module correctly?
>
> Thanks in advance for any help!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph orch ps mon, mgr, osd shows for version, image and container id

2023-03-30 Thread Adam King
if you put a copy of the cephadm binary onto one of these hosts (e.g.
a002s002) and run "cephadm ls" what does it give for the OSDs? That's where
the orch ps information comes from.
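
E.g. on a002s002 (needs root):

sudo ./cephadm ls
# each osd entry should show "version", "container_image_name" and
# "container_id"; empty values there would line up with the orch ps output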

On Thu, Mar 30, 2023 at 10:48 AM  wrote:

> Hi ,
>
> Why is ceph orch ps showing unknown version, image and container id?
>
> root@a002s002:~# cephadm shell ceph mon versions
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> {
> "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a)
> pacific (stable)": 3
> }
> root@a002s002:~# cephadm shell ceph mgr versions
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> {
> "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a)
> pacific (stable)": 3
> }
>
> root@a002s002:~# cephadm shell ceph orch ps --daemon-type mgr
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> NAME  HOST  PORTS  STATUS   REFRESHED  AGE  MEM USE  MEM LIM
> VERSIONIMAGE ID
> mgr.a002s002  a002s002 running 4m ago  11M--
>   
> mgr.a002s003  a002s003 running87s ago  11M--
>   
> mgr.a002s004  a002s004 running 4m ago  11M--
>   
> root@a002s002:~# cephadm shell ceph orch ps --daemon-type mon
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> NAME  HOST  PORTS  STATUSREFRESHED  AGE  MEM USE  MEM
> LIM  VERSIONIMAGE ID  CONTAINER ID
> mon.a002s002  a002s002 running  4m ago  11M-
> 2048M 
> mon.a002s003  a002s003 running 95s ago  11M-
> 2048M 
> mon.a002s004  a002s004 running (4w) 4m ago   5M1172M
> 2048M  16.2.5 6e73176320aa  d38b94e00d28
> root@a002s002:~# cephadm shell ceph orch ps --daemon-type osd
> Inferring fsid 682863c2-812e-41c5-8d72-28fd3d228598
> Using recent ceph image
> quay.io/ceph/daemon@sha256:9889075a79f425c2f5f5a59d03c8d5bf823856ab661113fa17a8a7572b16a997
> NAMEHOST  PORTS  STATUS   REFRESHED  AGE  MEM USE  MEM LIM
> VERSIONIMAGE ID
> osd.0   a002s002 running 8m ago  11M-10.9G
>   
> osd.1   a002s003 running 5m ago  11M-10.9G
>   
> osd.10  a002s004 running 8m ago  11M-10.9G
>   
> osd.11  a002s003 running 5m ago  11M-10.9G
>   
> osd.12  a002s002 running 8m ago  11M-10.9G
>   
> osd.13  a002s004 running 8m ago  11M-10.9G
>   
> osd.14  a002s003 running 5m ago  11M-10.9G
>   
> osd.15  a002s002 running 8m ago  11M-10.9G
>   
> osd.16  a002s004 running 8m ago  11M-10.9G
>   
> osd.17  a002s003 running 5m ago  11M-10.9G
>   
> osd.18  a002s002 running 8m ago  11M-10.9G
>   
> osd.19  a002s004 running 8m ago  11M-10.9G
>   
> osd.2   a002s004 running 8m ago  11M-10.9G
>   
> osd.20  a002s003 running 5m ago  11M-10.9G
>   
> osd.21  a002s002 running 8m ago  11M-10.9G
>   
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: quincy v17.2.6 QE Validation status

2023-03-22 Thread Adam King
orch approved.

After reruns, the only failed jobs in the orch run were orch/rook tests
(which are broken currently) and 2 instances of "Test failure:
test_non_existent_cluster". That failure is just a command expecting a zero
return code and an error message instead of a nonzero return code in a
failure case. I think the test got backported without the change of the
error code; either way, it's not a big deal.

I also took a brief look at the orchestrator failure from the upgrade
tests  (https://tracker.ceph.com/issues/59121) that Laura saw. In the
instance of it I looked at, It seems like the test is running "orch upgrade
start" and then not running "orch upgrade pause" until about 20 minutes
later, at which point the upgrade has already completed (and I can see all
the daemons got upgraded to the new image in the logs). It looks like it
was waiting on a loop to see a minority of the mons had been upgraded
before pausing the upgrade, but even starting that loop took over 2
minutes, despite the only actions in between being a "ceph orch ps" call
and echoing out a value. Really not sure why it was so slow in running
those commands or why it happened 3 times in the initial run but never in
the reruns, but the failure came from that, and the upgrade itself seems to
still work fine.

 - Adam King
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade 16.2.11 -> 17.2.0 failed

2023-03-14 Thread Adam King
That's very odd, I haven't seen this before. What container image is the
upgraded mgr running on (to know for sure, can check the podman/docker run
command at the end of the /var/lib/ceph//mgr./unit.run file
on the mgr's host)? Also, could maybe try "ceph mgr module enable cephadm"
to see if it does anything?
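
For reference, the two checks would look something like this (fsid and mgr
daemon name are placeholders):

# which image did the upgraded mgr actually get deployed on?
grep -E 'podman|docker' /var/lib/ceph/<fsid>/mgr.<name>/unit.run | tail -n 1

# and try turning the module back on
ceph mgr module enable cephadm
ceph orch set backend cephadm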

On Tue, Mar 14, 2023 at 9:23 AM bbk  wrote:

> Dear List,
>
> Today i was sucessfully upgrading with cephadm from 16.2.8 -> 16.2.9 ->
> 16.2.10 -> 16.2.11
>
> Now i wanted to upgrade to 17.2.0 but after starting the upgrade with
>
> ```
> # ceph orch upgrade start --ceph-version 17.2.0
> ```
>
> The orch manager module seems to be gone now and the upgrade don't seem to
> run.
>
>
> ```
> # ceph orch upgrade status
> Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
>
> # ceph orch set backend cephadm
> Error ENOENT: Module not found
> ```
>
> During the failed upgrade all nodes had the 16.2.11 cephadm installed.
>
> Fortunately the cluster is still running... somehow. I installed the
> latest 17.2.X cephadm on all
> nodes and rebooted them nodes, but this didn't help.
>
> Does someone have a hint?
>
> Yours,
> bbk
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-10 Thread Adam King
The things in "ceph orch ps" output are gathered by checking the contents
of the /var/lib/ceph// directory on the host. Those
"cephadm." files get deployed normally though, and aren't usually
reported in "ceph orch ps" as it should only report things that are
directories rather than files. You could maybe try going and removing them
anyway to see what happens (cephadm should just deploy another one though).
Would be interested anyway in what the contents of
/var/lib/ceph// are on that srvcephprod07 node and also what
"cephadm ls" spits out on that node (you would have to put a copy of the
cephadm tool on the host to run that).

As for the logs, the "cephadm.log" on the host is only the log of what the
cephadm tool has done on that host, not what the cephadm mgr module is
running. Could maybe try "ceph mgr fail; ceph -W cephadm" and let it sit
for a bit to see if you get a traceback printout that way.
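
i.e. something like this on srvcephprod07 (fsid is a placeholder, and the
cephadm binary has to be copied onto the host first):

ls -l /var/lib/ceph/<fsid>/
sudo ./cephadm ls

# and from an admin node, watch the module live for a traceback
ceph mgr fail; ceph -W cephadm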

On Fri, Mar 10, 2023 at 10:41 AM  wrote:

> looking at ceph orch upgrade check
> I find out
> },
>
> "cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2":
> {
> "current_id": null,
> "current_name": null,
> "current_version": null
> },
>
>
> Could this lead to the issue?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue upgrading 17.2.0 to 17.2.5

2023-03-07 Thread Adam King
>
> Current cluster status says healthy but I cannot deploy new daemons, the
>> mgr information isn't refreshing (5 days old info) under hosts and services
>> but the main dashboard is accurate like ceph -s
>> Ceph -s will show accurate information but things like ceph orch ps
>> --daemon-type mgr will say that I have 5 MGRs running, which is inaccurate,
>> nor will it let me remove them manually as it says they're not found
>>
>
Can you try a mgr failover (ceph mgr fail), wait ~5 minutes and then see
what actually gets refreshed (as in check the refreshed column in "ceph
orch ps" and "ceph orch device ls"). Typically when it's having issues like
this where it's "stuck" and not refreshing there is an issue blocking the
refresh on one specific host, so would be good to see if most hosts refresh
and there is only specific host(s) where the refresh doesn't occur.

osd.11  basic
>  container_imagestop
>
osd.47  basic
>  container_image17.2.5
>*

osd.49  basic
>  container_image17.2.5
>*


That looks bad.  Might be worth trying just a "ceph config set osd
container_image
quay.io/ceph/ceph@sha256:2b73ccc9816e0a1ee1dfbe21ba9a8cc085210f1220f597b5050ebfcac4bdd346"
to get all the osd config options onto a valid image. With those options it
will try to use the image "stop" or "17.2.5" when redeploying or upgrading
those OSDs.
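
Roughly (the digest is the one from your config dump; removing the
per-daemon entries just stops them overriding the osd-level option):

# see everything that has a container_image set
ceph config dump | grep container_image

# put the osd-level option back on a real image
ceph config set osd container_image quay.io/ceph/ceph@sha256:2b73ccc9816e0a1ee1dfbe21ba9a8cc085210f1220f597b5050ebfcac4bdd346

# drop the per-daemon overrides that point at "stop" / "17.2.5"
ceph config rm osd.11 container_image
ceph config rm osd.47 container_image
ceph config rm osd.49 container_image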

On Tue, Mar 7, 2023 at 11:40 AM  wrote:

> Hello at this point I've tried to upgrade a few times so I believe the
> command is long gone. On another forum someone was suggesting that I
> accidentally set the image to "stop" instead of running a proper upgrade
> stop command, but I couldn't find anything like that on the hosts I ran
> commands from, though I wouldn't be surprised if I accidentally pasted then wrote
> additional commands to it.
>
> The failing OSD was interesting: ceph didn't report it as a stray daemon,
> but I noticed it was showing as a daemon but not as an actual OSD for
> storage in ceph, so I attempted to remove it and it would eventually come
> back.
>
> It had upgraded all the managers, mons to 17.2.5. Some OSDs had upgraded
> as well.
> Current cluster status says healthy but I cannot deploy new daemons, the
> mgr information isn't refreshing (5 days old info) under hosts and services
> but the main dashboard is accurate like ceph -s
> Ceph -s will show accurate information but things like ceph orch ps
> --daemon-type mgr will say that I have 5 MGRs running, which is inaccurate,
> nor will it let me remove them manually as it says they're not found
>
> ERROR: Failed command: /usr/bin/docker pull 17.2.5
> 2023-03-06T09:26:55.925386-0700 mgr.mgr.idvkbw [DBG] serve loop sleep
> 2023-03-06T09:26:55.925507-0700 mgr.mgr.idvkbw [DBG] Sleeping for 60
> seconds
> 2023-03-06T09:27:55.925847-0700 mgr.mgr.idvkbw [DBG] serve loop wake
> 2023-03-06T09:27:55.925959-0700 mgr.mgr.idvkbw [DBG] serve loop start
> 2023-03-06T09:27:55.929849-0700 mgr.mgr.idvkbw [DBG] mon_command: 'config
> dump' -> 0 in 0.004s
> 2023-03-06T09:27:55.931625-0700 mgr.mgr.idvkbw [DBG] _run_cephadm :
> command = pull
> 2023-03-06T09:27:55.932025-0700 mgr.mgr.idvkbw [DBG] _run_cephadm : args =
> []
> 2023-03-06T09:27:55.932469-0700 mgr.mgr.idvkbw [DBG] args: --image 17.2.5
> --no-container-init pull
> 2023-03-06T09:27:55.932925-0700 mgr.mgr.idvkbw [DBG] Running command:
> which python3
> 2023-03-06T09:27:55.968793-0700 mgr.mgr.idvkbw [DBG] Running command:
> /usr/bin/python3
> /var/lib/ceph/5058e342-dac7-11ec-ada3-01065e90228d/cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e
> --image 17.2.5 --no-container-init pull
> 2023-03-06T09:27:57.278932-0700 mgr.mgr.idvkbw [DBG] code: 1
> 2023-03-06T09:27:57.279045-0700 mgr.mgr.idvkbw [DBG] err: Pulling
> container image 17.2.5...
> Non-zero exit code 1 from /usr/bin/docker pull 17.2.5
> /usr/bin/docker: stdout Using default tag: latest
> /usr/bin/docker: stderr Error response from daemon: pull access denied for
> 17.2.5, repository does not exist or may require 'docker login': denied:
> requested access to the resource is denied
> ERROR: Failed command: /usr/bin/docker pull 17.2.5
>
> 2023-03-06T09:27:57.280517-0700 mgr.mgr.idvkbw [DBG] serve loop
>
> I had stopped the upgrade before so its at
> neteng@mon:~$ ceph orch upgrade status
> {
> "target_image": null,
> "in_progress": false,
> "which": "",
> "services_complete": [],
> "progress": null,
> "message": "",
> "is_paused": false
> }
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___

[ceph-users] Re: upgrading from 15.2.17 to 16.2.11 - Health ERROR

2023-03-07 Thread Adam King
that looks like it was expecting a json structure somewhere and got a blank
string. Is there anything in the logs (ceph log last 100 info cephadm)? If
not, might be worth trying a couple mgr failovers (I'm assuming only one
got upgraded, so first failover would go back to the 15.2.17 one and then a
second failover would go back to the 16.2.11 one) and then rechecking the
logs. I'd expect this to generate a traceback there and it's hard to say
what happened without that.
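
i.e. roughly:

ceph log last 100 info cephadm

# fail back to the 15.2.17 mgr, wait a minute, then fail again so the
# 16.2.11 one is active again
ceph mgr fail
ceph mgr fail

# then look for the traceback
ceph log last 200 info cephadm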

On Tue, Mar 7, 2023 at 11:42 AM  wrote:

> hi , starting upgrade from 15.2.17 i got this error
> Module 'cephadm' has failed: Expecting value: line 1 column 1 (char 0)
>
> Cluster was in health ok before starting.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue upgrading 17.2.0 to 17.2.5

2023-03-06 Thread Adam King
Can I see the output of `ceph orch upgrade status` and `ceph config dump |
grep image`? The "Pulling container image stop" implies somehow (as Eugen
pointed out) that cephadm thinks the image to pull is named "stop" which
means it is likely set as either the image to upgrade to or as one of the
config options.

On Sat, Mar 4, 2023 at 2:06 AM  wrote:

> I initially ran the upgrade fine but it failed @ around 40/100 on an osd,
> so after waiting for  along time i thought I'd try restarting it and then
> restarting the upgrade.
> I am stuck with the below debug error, I have tested docker pull from
> other servers and they dont fail for the ceph images but on ceph it does.
> If i even try to redeploy or add or remove mon damons for example it comes
> up with the same error related to the images.
>
> The error that ceph is giving me is:
> 2023-03-02T07:22:45.063976-0700 mgr.mgr-node.idvkbw [DBG] _run_cephadm :
> args = []
> 2023-03-02T07:22:45.070342-0700 mgr.mgr-node.idvkbw [DBG] args: --image
> stop --no-container-init pull
> 2023-03-02T07:22:45.081086-0700 mgr.mgr-node.idvkbw [DBG] Running command:
> which python3
> 2023-03-02T07:22:45.180052-0700 mgr.mgr-node.idvkbw [DBG] Running command:
> /usr/bin/python3
> /var/lib/ceph/5058e342-dac7-11ec-ada3-01065e90228d/cephadm.059bfc99f5cf36ed881f2494b104711faf4cbf5fc86a9594423cc105cafd9b4e
> --image stop --no-container-init pull
> 2023-03-02T07:22:46.500561-0700 mgr.mgr-node.idvkbw [DBG] code: 1
> 2023-03-02T07:22:46.500787-0700 mgr.mgr-node.idvkbw [DBG] err: Pulling
> container image stop...
> Non-zero exit code 1 from /usr/bin/docker pull stop
> /usr/bin/docker: stdout Using default tag: latest
> /usr/bin/docker: stderr Error response from daemon: pull access denied for
> stop, repository does not exist or may require 'docker login': denied:
> requested access to the resource is denied
> ERROR: Failed command: /usr/bin/docker pull stop
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Missing keyrings on upgraded cluster

2023-02-20 Thread Adam King
so "ceph osd tree destroyed -f json-pretty" shows the nautilus2 host with
the osd id you're trying to replace here? And there are disks marked
available that match the spec (20G rotational disk in this case I guess) in
"ceph orch device ls nautilus2"?

On Mon, Feb 20, 2023 at 10:16 AM Eugen Block  wrote:

> I stumbled upon this option 'osd_id_claims' [2], so I tried to apply a
> replace.yaml to redeploy only the one destroyed disk, but still
> nothing happens with that disk. This is my replace.yaml:
>
> ---snip---
> nautilus:~ # cat replace-osd-7.yaml
> service_type: osd
> service_name: osd
> placement:
>hosts:
>- nautilus2
> spec:
>data_devices:
>  rotational: 1
>  size: '20G:'
>db_devices:
>  rotational: 0
>  size: '13G:16G'
>filter_logic: AND
>objectstore: bluestore
> osd_id_claims:
>nautilus2: ['7']
> ---snip---
>
> I see these lines in the mgr.log:
>
> Feb 20 16:09:03 nautilus3 ceph-mgr[2994]: log_channel(cephadm) log
> [INF] : Found osd claims -> {'nautilus2': ['7']}
> Feb 20 16:09:03 nautilus3 ceph-mgr[2994]: [cephadm INFO
> cephadm.services.osd] Found osd claims for drivegroup None ->
> {'nautilus2': ['7']}
> Feb 20 16:09:03 nautilus3 ceph-mgr[2994]: log_channel(cephadm) log
> [INF] : Found osd claims for drivegroup None -> {'nautilus2': ['7']}
>
> But I see no attempt to actually deploy the OSD.
>
> [2]
>
> https://docs.ceph.com/en/quincy/mgr/orchestrator_modules/#orchestrator-osd-replace
>
> Zitat von Adam King :
>
> > For reference, a stray daemon from cephadm POV is roughly just something
> > that shows up in "ceph node ls" that doesn't have a directory in
> > /var/lib/ceph/. I guess manually making the OSD as you did means
> that
> > didn't end up getting made. I remember the manual osd creation process
> (by
> > manual just meaning not using an orchestrator/cephadm mgr module command)
> coming up at one point and we ended up manually running "cephadm
> > deploy" to make sure those directories get created correctly, but I don't
> > think any docs ever got made about it (yet, anyway). Also, is there a
> > tracker issue for it not correctly handling the drivegroup?
> >
> > On Mon, Feb 20, 2023 at 8:58 AM Eugen Block  wrote:
> >
> >> Thanks, Adam.
> >>
> >> Providing the keyring to the cephadm command worked, but the unwanted
> >> (but expected) side effect is that from cephadm perspective it's a
> >> stray daemon. For some reason the orchestrator did apply the desired
> >> drivegroup when I tried to reproduce this morning, but then again
> >> failed just now when I wanted to get rid of the stray daemon. This is
> >> one of the most annoying things with cephadm, I still don't fully
> >> understand when it will correctly apply the identical drivegroup.yml
> >> and when not. Anyway, the conclusion is to not interfere with cephadm
> >> (nothing new here), but since the drivegroup was not applied correctly
> >> I assumed I had to "help out" a bit by manually deploying an OSD.
> >>
> >> Thanks,
> >> Eugen
> >>
> >> Zitat von Adam King :
> >>
> >> > Going off of
> >> >
> >> > ceph --cluster ceph --name client.bootstrap-osd --keyring
> >> > /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
> >> >
> >> > you could try passing "--keyring  cephadm
> >> > ceph-volume command. Something like  'cephadm ceph-volume --keyring
> >> >  -- lvm create'. I'm guessing it's trying to
> run
> >> the
> >> > osd tree command within a container and I know cephadm mounts keyrings
> >> > passed to the ceph-volume command as
> >> > "/var/lib/ceph/bootstrap-osd/ceph.keyring" inside the container.
> >> >
> >> > On Mon, Feb 20, 2023 at 6:35 AM Eugen Block  wrote:
> >> >
> >> >> Hi *,
> >> >>
> >> >> I was playing around on an upgraded test cluster (from N to Q),
> >> >> current version:
> >> >>
> >> >>  "overall": {
> >> >>  "ceph version 17.2.5
> >> >> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 18
> >> >>  }
> >> >>
> >> >> I tried to replace an OSD after destroying it with 'ceph orch osd rm
> >> >> osd.5 --replace'. The OSD was drained successfully and marked as
> >> >> "destroyed" as expected, the zapping

[ceph-users] Re: Missing keyrings on upgraded cluster

2023-02-20 Thread Adam King
For reference, a stray daemon from cephadm POV is roughly just something
that shows up in "ceph node ls" that doesn't have a directory in
/var/lib/ceph/. I guess manually making the OSD as you did means that
didn't end up getting made. I remember the manual osd creation process (by
manual just meaning not using an orchestrator/cephadm mgr module command)
coming up at one point and the we ended up manually running "cephadm
deploy" to make sure those directories get created correctly, but I don't
think any docs ever got made about it (yet, anyway). Also, is there a
tracker issue for it not correctly handling the drivegroup?
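
For anyone hitting the same thing, the quick way to see whether a manually
created OSD ended up in that state is something like (fsid and osd id are
placeholders):

# daemons the cluster itself knows about
ceph node ls

# daemon dirs cephadm knows about on the host; a missing osd.<id>
# directory here is what makes it show up as a stray daemon
ls /var/lib/ceph/<fsid>/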

On Mon, Feb 20, 2023 at 8:58 AM Eugen Block  wrote:

> Thanks, Adam.
>
> Providing the keyring to the cephadm command worked, but the unwanted
> (but expected) side effect is that from cephadm perspective it's a
> stray daemon. For some reason the orchestrator did apply the desired
> drivegroup when I tried to reproduce this morning, but then again
> failed just now when I wanted to get rid of the stray daemon. This is
> one of the most annoying things with cephadm, I still don't fully
> understand when it will correctly apply the identical drivegroup.yml
> and when not. Anyway, the conclusion is to not interfere with cephadm
> (nothing new here), but since the drivegroup was not applied correctly
> I assumed I had to "help out" a bit by manually deploying an OSD.
>
> Thanks,
> Eugen
>
> Zitat von Adam King :
>
> > Going off of
> >
> > ceph --cluster ceph --name client.bootstrap-osd --keyring
> > /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
> >
> > you could try passing "--keyring <path-to-keyring>" to the cephadm
> > ceph-volume command. Something like 'cephadm ceph-volume --keyring
> > <path-to-keyring> -- lvm create'. I'm guessing it's trying to run the
> > osd tree command within a container and I know cephadm mounts keyrings
> > passed to the ceph-volume command as
> > "/var/lib/ceph/bootstrap-osd/ceph.keyring" inside the container.
> >
> > On Mon, Feb 20, 2023 at 6:35 AM Eugen Block  wrote:
> >
> >> Hi *,
> >>
> >> I was playing around on an upgraded test cluster (from N to Q),
> >> current version:
> >>
> >>  "overall": {
> >>  "ceph version 17.2.5
> >> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 18
> >>  }
> >>
> >> I tried to replace an OSD after destroying it with 'ceph orch osd rm
> >> osd.5 --replace'. The OSD was drained successfully and marked as
> >> "destroyed" as expected, the zapping also worked. At this point I
> >> didn't have an osd spec in place because all OSDs were adopted during
> >> the upgrade process. So I created a new spec which was not applied
> >> successfully (I'm wondering if there's another/new issue with
> >> ceph-volume, but that's not the focus here), so I tried it manually
> >> with 'cephadm ceph-volume lvm create'. I'll add the output at the end
> >> for a better readability. Apparently, there's no boostrap-osd keyring
> >> for cephadm so it can't search the desired osd_id in the osd tree, the
> >> command it tries is this:
> >>
> >> ceph --cluster ceph --name client.bootstrap-osd --keyring
> >> /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
> >>
> >> In the local filesystem the required keyring is present, though:
> >>
> >> nautilus:~ # cat /var/lib/ceph/bootstrap-osd/ceph.keyring
> >> [client.bootstrap-osd]
> >>  key = AQBOCbpgixIsOBAAgBzShsFg/l1bOze4eTZHug==
> >>  caps mgr = "allow r"
> >>  caps mon = "profile bootstrap-osd"
> >>
> >> Is there something missing during the adoption process? Or are the
> >> docs lacking some upgrade info? I found a section about putting
> >> keyrings under management [1], but I'm not sure if that's what's
> >> missing here.
> >> Any insights are highly appreciated!
> >>
> >> Thanks,
> >> Eugen
> >>
> >> [1]
> >>
> >>
> https://docs.ceph.com/en/quincy/cephadm/operations/#putting-a-keyring-under-management
> >>
> >>
> >> ---snip---
> >> nautilus:~ # cephadm ceph-volume lvm create --osd-id 5 --data /dev/sde
> >> --block.db /dev/sdb --block.db-size 5G
> >> Inferring fsid 
> >> Using recent ceph image
> >> /ceph/ceph@sha256
> >> :af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92
> >> Non-zero exit code 1 from /usr/bin/podman r

[ceph-users] Re: Missing keyrings on upgraded cluster

2023-02-20 Thread Adam King
Going off of

ceph --cluster ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json

you could try passing "--keyring <path-to-keyring>" to the cephadm
ceph-volume command. Something like 'cephadm ceph-volume --keyring
<path-to-keyring> -- lvm create'. I'm guessing it's trying to run the
osd tree command within a container and I know cephadm mounts keyrings
passed to the ceph-volume command as
"/var/lib/ceph/bootstrap-osd/ceph.keyring" inside the container.

On Mon, Feb 20, 2023 at 6:35 AM Eugen Block  wrote:

> Hi *,
>
> I was playing around on an upgraded test cluster (from N to Q),
> current version:
>
>  "overall": {
>  "ceph version 17.2.5
> (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 18
>  }
>
> I tried to replace an OSD after destroying it with 'ceph orch osd rm
> osd.5 --replace'. The OSD was drained successfully and marked as
> "destroyed" as expected, the zapping also worked. At this point I
> didn't have an osd spec in place because all OSDs were adopted during
> the upgrade process. So I created a new spec which was not applied
> successfully (I'm wondering if there's another/new issue with
> ceph-volume, but that's not the focus here), so I tried it manually
> with 'cephadm ceph-volume lvm create'. I'll add the output at the end
> for a better readability. Apparently, there's no boostrap-osd keyring
> for cephadm so it can't search the desired osd_id in the osd tree, the
> command it tries is this:
>
> ceph --cluster ceph --name client.bootstrap-osd --keyring
> /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
>
> In the local filesystem the required keyring is present, though:
>
> nautilus:~ # cat /var/lib/ceph/bootstrap-osd/ceph.keyring
> [client.bootstrap-osd]
>  key = AQBOCbpgixIsOBAAgBzShsFg/l1bOze4eTZHug==
>  caps mgr = "allow r"
>  caps mon = "profile bootstrap-osd"
>
> Is there something missing during the adoption process? Or are the
> docs lacking some upgrade info? I found a section about putting
> keyrings under management [1], but I'm not sure if that's what's
> missing here.
> Any insights are highly appreciated!
>
> Thanks,
> Eugen
>
> [1]
>
> https://docs.ceph.com/en/quincy/cephadm/operations/#putting-a-keyring-under-management
>
>
> ---snip---
> nautilus:~ # cephadm ceph-volume lvm create --osd-id 5 --data /dev/sde
> --block.db /dev/sdb --block.db-size 5G
> Inferring fsid 
> Using recent ceph image
> /ceph/ceph@sha256
> :af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92
> Non-zero exit code 1 from /usr/bin/podman run --rm --ipc=host
> --stop-signal=SIGTERM --authfile=/etc/ceph/podman-auth.json --net=host
> --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk
> --init -e
> CONTAINER_IMAGE=/ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92
> -e NODE_NAME=nautilus -e CEPH_USE_RANDOM_NONCE=1 -e
> CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v
> /var/run/ceph/:/var/run/ceph:z -v
> /var/log/ceph/:/var/log/ceph:z -v
> /var/lib/ceph//crash:/var/lib/ceph/crash:z -v /dev:/dev -v
> /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v
> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v
> /tmp/ceph-tmpuydvbhuk:/etc/ceph/ceph.conf:z
> /ceph/ceph@sha256:af50ec26db7ee177e1ec1b553a0d6a9dbad2c3cc0da2f8f46d012184a79d4f92
> lvm create --osd-id 5 --data /dev/sde --block.db /dev/sdb --block.db-size
> 5G
> /usr/bin/podman: stderr time="2023-02-20T09:02:49+01:00" level=warning
> msg="Path \"/etc/SUSEConnect\" from \"/etc/containers/mounts.conf\"
> doesn't exist, skipping"
> /usr/bin/podman: stderr time="2023-02-20T09:02:49+01:00" level=warning
> msg="Path \"/etc/zypp/credentials.d/SCCcredentials\" from
> \"/etc/containers/mounts.conf\" doesn't exist, skipping"
> /usr/bin/podman: stderr Running command: /usr/bin/ceph-authtool
> --gen-print-key
> /usr/bin/podman: stderr Running command: /usr/bin/ceph --cluster ceph
> --name client.bootstrap-osd --keyring
> /var/lib/ceph/bootstrap-osd/ceph.keyring osd tree -f json
> /usr/bin/podman: stderr  stderr: 2023-02-20T08:02:50.848+
> 7fd255e30700 -1 auth: unable to find a keyring on
> /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin:
> (2) No such file or
> directory
> /usr/bin/podman: stderr  stderr: 2023-02-20T08:02:50.848+
> 7fd255e30700 -1 AuthRegistry(0x7fd250060d50) no keyring found at
> /etc/ceph/ceph.client.bootstrap-osd.keyring,/etc/ceph/ceph.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,
> disabling
> cephx
> /usr/bin/podman: stderr  stderr: 2023-02-20T08:02:50.852+
> 7fd255e30700 -1 auth: unable to find a keyring on
> /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory
> /usr/bin/podman: stderr  stderr: 2023-02-20T08:02:50.852+
> 7fd255e30700 -1 AuthRegistry(0x7fd250060d50) no keyring found at
> /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx
> /usr/bin/podman: stderr  stderr: 2023-02-20T08:02:50.856+
> 7fd255e30700 -1 auth: unable to find a keyring on
> 

[ceph-users] Re: Any issues with podman 4.2 and Quincy?

2023-02-13 Thread Adam King
That table is definitely a bit out of date. We've been doing some testing
with more recent podman versions and the only issues I'm aware of specific
to the podman version are https://tracker.ceph.com/issues/58532 and
https://tracker.ceph.com/issues/57018 (which are really the same issue
affecting two different things). If you aren't using iscsi or grafana
deployed by cephadm, there have been no problems that I know of.
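
(A quick way to check whether a given cluster is affected at all -- just the
installed podman version plus whether grafana or iscsi services are deployed;
nothing here is specific to those trackers:

    podman --version
    ceph orch ls | grep -Ei 'grafana|iscsi'

If neither service shows up, those two issues shouldn't apply.)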

On Mon, Feb 13, 2023 at 1:53 PM Vladimir Brik <
vladimir.b...@icecube.wisc.edu> wrote:

> Has anybody run into issues with Quincy and podman 4.2?
>
> 4x podman series are not mentioned in
> https://docs.ceph.com/en/quincy/cephadm/compatibility/ but
> podman 3x is no longer available in Alma Linux
>
>
>
> Vlad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPHADM_STRAY_DAEMON does not exist, how do I remove knowledge of it from ceph?

2023-02-01 Thread Adam King
I know there's a bug where, when downsizing by multiple mons at once through
cephadm, this kind of ghost stray mon daemon can end up being left behind (I
think it's something about cephadm removing them too quickly in succession,
not totally sure). In those cases, just doing a mgr failover ("ceph mgr
fail") always cleared the warnings after a couple of minutes. That might be
worth a try if you haven't done so already and you have at least two mgr
daemons in the cluster.
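
(Roughly, that would be something like the following; the stat/ps calls are
just optional sanity checks I'm assuming you have available, not anything
required for the failover itself:

    ceph mgr stat                     # note the currently active mgr
    ceph mgr fail                     # hand over to a standby mgr
    # give the new active mgr a couple of minutes, then re-check
    ceph health detail
    ceph orch ps --daemon-type mon

The CEPHADM_STRAY_DAEMON warning should clear once the new active mgr has
refreshed its view of the hosts.)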

On Wed, Feb 1, 2023 at 3:56 PM  wrote:

> Hi All,
>
> I'm getting this error while setting up a ceph cluster. I'm relatively new
> to ceph, so there is no telling what kind of mistakes I've been making. I'm
> using cephadm, ceph v16 and I apparently have a stray daemon. But it also
> doesn't seem to exist and I can't get ceph to forget about it.
>
> $ ceph health detail
> [WRN] CEPHADM_STRAY_DAEMON: 1 stray daemon(s) not managed by cephadm
> stray daemon mon.cmon01 on host cmgmt01 not managed by cephadm
>
> mon.cmon01 also shows up in dashboard->hosts as running on cmgmt01. It
> does not show up in the monitors section though.
>
> But, there isn't a monitor daemon running on that machine at all (no
> podman container, not in process list, not listening on a port).
>
> On that host in cephadm shell,
> # ceph orch daemon rm mon.cmon01 --force
> Error EINVAL: Unable to find daemon(s) ['mon.cmon01']
>
> I don't currently have any real data on the cluster, so I've also tried
> deleting the existing pools (except device_health_metrics) in case ceph was
> connecting that monitor to one of the pools.
>
> I'm not sure what to try next in order to get ceph to forget about that
> daemon.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.11 pacific QE validation status

2023-01-20 Thread Adam King
cephadm approved. Known failures.

On Fri, Jan 20, 2023 at 11:39 AM Yuri Weinstein  wrote:

> The overall progress on this release is looking much better and if we
> can approve it we can plan to publish it early next week.
>
> Still seeking approvals
>
> rados - Neha, Laura
> rook - Sébastien Han
> cephadm - Adam
> dashboard - Ernesto
> rgw - Casey
> rbd - Ilya (full rbd run in progress now)
> krbd - Ilya
> fs - Venky, Patrick
> upgrade/nautilus-x (pacific) - passed thx Adam Kraitman!
> upgrade/octopus-x (pacific) - almost passed, still running 1 job
> upgrade/pacific-p2p - Neha (same as in 16.2.8)
> powercycle - Brad (see new SELinux denials)
>
> On Tue, Jan 17, 2023 at 10:45 AM Yuri Weinstein 
> wrote:
> >
> > OK I will rerun failed jobs filtering rhel in
> >
> > Thx!
> >
> > On Tue, Jan 17, 2023 at 10:43 AM Adam Kraitman 
> wrote:
> > >
> > > Hey the satellite issue was fixed
> > >
> > > Thanks
> > >
> > > On Tue, Jan 17, 2023 at 7:43 PM Laura Flores 
> wrote:
> > >>
> > >> This was my summary of rados failures. There was nothing new or amiss,
> > >> although it is important to note that runs were done with filtering
> out
> > >> rhel 8.
> > >>
> > >> I will leave it to Neha for final approval.
> > >>
> > >> Failures:
> > >> 1. https://tracker.ceph.com/issues/58258
> > >> 2. https://tracker.ceph.com/issues/58146
> > >> 3. https://tracker.ceph.com/issues/58458
> > >> 4. https://tracker.ceph.com/issues/57303
> > >> 5. https://tracker.ceph.com/issues/54071
> > >>
> > >> Details:
> > >> 1. rook: kubelet fails from connection refused - Ceph -
> Orchestrator
> > >> 2. test_cephadm.sh: Error: Error initializing source docker://
> > >> quay.ceph.io/ceph-ci/ceph:master - Ceph - Orchestrator
> > >> 3. qa/workunits/post-file.sh: postf...@drop.ceph.com: Permission
> denied
> > >> - Ceph
> > >> 4. rados/cephadm: Failed to fetch package version from
> > >>
> https://shaman.ceph.com/api/search/?status=ready=ceph=default=ubuntu%2F22.04%2Fx86_64=b34ca7d1c2becd6090874ccda56ef4cd8dc64bf7
> > >> - Ceph - Orchestrator
> > >> 5. rados/cephadm/osds: Invalid command: missing required parameter
> > >> hostname() - Ceph - Orchestrator
> > >>
> > >> On Tue, Jan 17, 2023 at 9:48 AM Yuri Weinstein 
> wrote:
> > >>
> > >> > Please see the test results on the rebased RC 6.6 in this comment:
> > >> >
> > >> > https://tracker.ceph.com/issues/58257#note-2
> > >> >
> > >> > We're still having infrastructure issues making testing difficult.
> > >> > Therefore all reruns were done excluding the rhel 8 distro
> > >> > ('--filter-out rhel_8')
> > >> >
> > >> > Also, the upgrades failed and Adam is looking into this.
> > >> >
> > >> > Seeking new approvals
> > >> >
> > >> > rados - Neha, Laura
> > >> > rook - Sébastien Han
> > >> > cephadm - Adam
> > >> > dashboard - Ernesto
> > >> > rgw - Casey
> > >> > rbd - Ilya
> > >> > krbd - Ilya
> > >> > fs - Venky, Patrick
> > >> > upgrade/nautilus-x (pacific) - Adam Kraitman
> > >> > upgrade/octopus-x (pacific) - Adam Kraitman
> > >> > upgrade/pacific-p2p - Neha - Adam Kraitman
> > >> > powercycle - Brad
> > >> >
> > >> > Thx
> > >> >
> > >> > On Fri, Jan 6, 2023 at 8:37 AM Yuri Weinstein 
> wrote:
> > >> > >
> > >> > > Happy New Year all!
> > >> > >
> > >> > > This release remains to be in "progress"/"on hold" status as we
> are
> > >> > > sorting all infrastructure-related issues.
> > >> > >
> > >> > > Unless I hear objections, I suggest doing a full rebase/retest QE
> > >> > > cycle (adding PRs merged lately) since it's taking much longer
> than
> > >> > > anticipated when sepia is back online.
> > >> > >
> > >> > > Objections?
> > >> > >
> > >> > > Thx
> > >> > > YuriW
> > >> > >
> > >> > > On Thu, Dec 15, 2022 at 9:14 AM Yuri Weinstein <
> ywein...@redhat.com>
> > >> > wrote:
> > >> > > >
> > >> > > > Details of this release are summarized here:
> > >> > > >
> > >> > > > https://tracker.ceph.com/issues/58257#note-1
> > >> > > > Release Notes - TBD
> > >> > > >
> > >> > > > Seeking approvals for:
> > >> > > >
> > >> > > > rados - Neha (https://github.com/ceph/ceph/pull/49431 is still
> being
> > >> > > > tested and will be merged soon)
> > >> > > > rook - Sébastien Han
> > >> > > > cephadm - Adam
> > >> > > > dashboard - Ernesto
> > >> > > > rgw - Casey (rwg will be rerun on the latest SHA1)
> > >> > > > rbd - Ilya, Deepika
> > >> > > > krbd - Ilya, Deepika
> > >> > > > fs - Venky, Patrick
> > >> > > > upgrade/nautilus-x (pacific) - Neha, Laura
> > >> > > > upgrade/octopus-x (pacific) - Neha, Laura
> > >> > > > upgrade/pacific-p2p - Neha - Neha, Laura
> > >> > > > powercycle - Brad
> > >> > > > ceph-volume - Guillaume, Adam K
> > >> > > >
> > >> > > > Thx
> > >> > > > YuriW
> > >> > ___
> > >> > Dev mailing list -- d...@ceph.io
> > >> > To unsubscribe send an email to dev-le...@ceph.io
> > >> >
> > >>
> > >>
> > >> --
> > >>
> > >> Laura Flores
> > >>
> > >> She/Her/Hers
> > >>
> > >> Software Engineer, Ceph 

[ceph-users] Re: 16.2.11 pacific QE validation status

2022-12-19 Thread Adam King
cephadm approved. rados/cephadm failures are mostly caused by
https://github.com/ceph/ceph/pull/49285 not being merged (which just
touches tests and docs so wouldn't block a release).

Thanks
 - Adam King

On Thu, Dec 15, 2022 at 12:15 PM Yuri Weinstein  wrote:

> Details of this release are summarized here:
>
> https://tracker.ceph.com/issues/58257#note-1
> Release Notes - TBD
>
> Seeking approvals for:
>
> rados - Neha (https://github.com/ceph/ceph/pull/49431 is still being
> tested and will be merged soon)
> rook - Sébastien Han
> cephadm - Adam
> dashboard - Ernesto
> rgw - Casey (rwg will be rerun on the latest SHA1)
> rbd - Ilya, Deepika
> krbd - Ilya, Deepika
> fs - Venky, Patrick
> upgrade/nautilus-x (pacific) - Neha, Laura
> upgrade/octopus-x (pacific) - Neha, Laura
> upgrade/pacific-p2p - Neha - Neha, Laura
> powercycle - Brad
> ceph-volume - Guillaume, Adam K
>
> Thx
> YuriW
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

