[ceph-users] Re: Snapshot automation/scheduling for rbd?
Thanks. I think the only issue with doing snapshots via CloudStack is potentially having to pause an instance for an extended period of time. I haven’t tested this yet, but based on the docs I think KVM has to be paused regardless. What about added volumes? Does an instance have to pause if you’re only snapshotting added volumes and not the root disk?

A couple of questions: if I snapshot an rbd image from the Ceph side, does that require an instance pause, and is there a graceful way, perhaps through the API, to do the full mapping of instance volumes -> Ceph block image names? I’d like to understand which block images belong to which CloudStack instance; I never understood how to properly trace a volume from instance to Ceph image. Thanks!

> On Saturday, Feb 03, 2024 at 10:47 AM, Jayanth Reddy mailto:jayanthreddy5...@gmail.com)> wrote:
> Hi,
> For CloudStack with RBD, you should be able to control the snapshot placement using the global setting "snapshot.backup.to.secondary". Setting this to false makes snapshots be placed directly on Ceph instead of secondary storage. See if you can perform recurring snapshots. I know that there are limitations with KVM and disk snapshots but good to give it a try.
>
> Thanks
>
> Get Outlook for Android (https://aka.ms/AAb9ysg)
> From: Jeremy Hansen
> Sent: Saturday, February 3, 2024 11:39:19 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Snapshot automation/scheduling for rbd?
>
> Am I just off base here or missing something obvious?
>
> Thanks
>
> > On Thursday, Feb 01, 2024 at 2:13 AM, Jeremy Hansen (mailto:jer...@skidrow.la)> wrote:
> > Can rbd image snapshotting be scheduled like CephFS snapshots? Maybe I missed it in the documentation but it looked like scheduling snapshots wasn’t a feature for block images. I’m still running Pacific. We’re trying to devise a sufficient backup plan for CloudStack and other things residing in Ceph.
> >
> > Thanks.
> > -jeremy

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
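On the mapping question: in RBD-backed CloudStack deployments the volume's `path` field returned by the API is typically the RBD image name in the primary-storage pool. The pool name and the CloudMonkey/jq invocation below are assumptions for illustration, not a documented CloudStack recipe; a ceph-side `rbd snap create` is crash-consistent and does not pause the guest:

```shell
#!/bin/sh
# Assumptions: pool named "cloudstack", CloudMonkey (cmk) configured,
# and jq installed. In RBD-backed CloudStack, a volume's "path" field
# is usually the RBD image name in the primary storage pool.
POOL=cloudstack

# Print instance name, volume name, and RBD image name per volume
# (hypothetical jq filter over the listVolumes response):
cmk list volumes listall=true |
    jq -r '.volume[] | "\(.vmname)\t\(.name)\t\(.path)"'

# Snapshot one of those images from the Ceph side. This is taken by
# the cluster, not the hypervisor, so the instance keeps running:
rbd snap create "$POOL/<volume-path>@manual-$(date +%Y%m%d)"
rbd snap ls "$POOL/<volume-path>"
```

Replace `<volume-path>` with a path value from the listing; for application-level consistency you would still want to freeze the guest filesystem first (e.g. via the qemu guest agent).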
[ceph-users] Re: Snapshot automation/scheduling for rbd?
Can you share your script? Thanks!

> On Saturday, Feb 03, 2024 at 10:35 AM, Marc (mailto:m...@f1-outsourcing.eu)> wrote:
> I have a script that checks on each node which vm's are active, and then the script makes a snapshot of their rbd's. It first issues a command to the vm to freeze the fs, if the vm supports it.
>
> > Am I just off base here or missing something obvious?
> >
> > Thanks
> >
> > > On Thursday, Feb 01, 2024 at 2:13 AM, Jeremy Hansen <mailto:jer...@skidrow.la> wrote:
> > > Can rbd image snapshotting be scheduled like CephFS snapshots? Maybe I missed it in the documentation but it looked like scheduling snapshots wasn’t a feature for block images. I’m still running Pacific. We’re trying to devise a sufficient backup plan for Cloudstack and other things residing in Ceph.
> > >
> > > Thanks.
> > > -jeremy
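Not Marc's actual script, but a minimal sketch of the approach he describes. It assumes the qemu-guest-agent is running in the guests and that each libvirt domain maps to an RBD image of the same name (that naming is an assumption; adapt the mapping, e.g. by parsing `virsh domblklist`):

```shell
#!/bin/sh
# Per-node: freeze each running VM's filesystems (if the guest agent
# responds), snapshot its RBD image, then thaw. Illustrative only.
POOL=rbd
SNAP="auto-$(date +%Y%m%d-%H%M)"

for dom in $(virsh list --name); do
    # Freeze is best-effort: guests without qemu-guest-agent keep
    # running and simply get a crash-consistent snapshot instead.
    if virsh domfsfreeze "$dom" >/dev/null 2>&1; then
        frozen=1
    else
        frozen=0
    fi

    # Assumption: image named after the domain, in $POOL.
    rbd snap create "$POOL/$dom@$SNAP"

    [ "$frozen" -eq 1 ] && virsh domfsthaw "$dom"
done
```

Run from cron on each hypervisor; keeping the freeze window to just the `rbd snap create` call keeps the pause per guest very short.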
[ceph-users] Re: Snapshot automation/scheduling for rbd?
Am I just off base here or missing something obvious?

Thanks

> On Thursday, Feb 01, 2024 at 2:13 AM, Jeremy Hansen (mailto:jer...@skidrow.la)> wrote:
> Can rbd image snapshotting be scheduled like CephFS snapshots? Maybe I missed it in the documentation but it looked like scheduling snapshots wasn’t a feature for block images. I’m still running Pacific. We’re trying to devise a sufficient backup plan for Cloudstack and other things residing in Ceph.
>
> Thanks.
> -jeremy
[ceph-users] Snapshot automation/scheduling for rbd?
Can rbd image snapshotting be scheduled like CephFS snapshots? Maybe I missed it in the documentation but it looked like scheduling snapshots wasn’t a feature for block images. I’m still running Pacific. We’re trying to devise a sufficient backup plan for Cloudstack and other things residing in Ceph.

Thanks.
-jeremy
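As far as I know, Pacific has no built-in scheduler for plain RBD snapshots (the mgr `rbd_support` module only schedules mirror-snapshots), so a cron job is the usual workaround. A minimal sketch, assuming a host with a client keyring that can write to the pool and jq installed; pool name and retention are placeholders:

```shell
#!/bin/sh
# Date-stamped snapshot of every image in a pool; run nightly from
# cron. Keeps the last $KEEP "auto-" snapshots per image.
POOL=${1:-rbd}
KEEP=7
SNAP="auto-$(date +%Y%m%d)"

for img in $(rbd -p "$POOL" ls); do
    rbd snap create "$POOL/$img@$SNAP"

    # Prune: date-stamped names sort chronologically, so everything
    # before the last $KEEP entries is old enough to remove.
    rbd snap ls "$POOL/$img" --format json |
        jq -r '.[].name' | grep '^auto-' | sort | head -n -"$KEEP" |
        while read -r old; do
            rbd snap rm "$POOL/$img@$old"
        done
done
```

These are crash-consistent snapshots; combine with a guest filesystem freeze (as described elsewhere in this thread) if you need application consistency.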
[ceph-users] Upgrading from 16.2.11?
I’d like to upgrade from 16.2.11 to the latest version. Is it possible to do this in one jump, or do I need to go from 16.2.11 -> 16.2.14 -> 17.1.0 -> 17.2.7 -> 18.1.0 -> 18.2.1? I’m using cephadm.

Thanks
-jeremy
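For what it's worth, the Reef release notes state that upgrades are supported from Pacific or Quincy, i.e. you can skip one major release, so the full chain of intermediate versions above shouldn't be needed. A hedged sketch; verify the supported paths in the release notes for your actual target version first:

```shell
# 1. Step to the latest Pacific point release first (recommended):
ceph orch upgrade start --ceph-version 16.2.14
ceph orch upgrade status        # poll until the upgrade completes

# 2. Then jump directly to Reef (Pacific -> Reef is documented as a
#    supported path; going via Quincy 17.2.x also works):
ceph orch upgrade start --ceph-version 18.2.1
ceph -s                         # progress shows in the status output
```

Release-candidate versions like 17.1.0 and 18.1.0 are not upgrade targets for a production cluster; only the stable x.2.z releases are.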
[ceph-users] Ceph as rootfs?
Is it possible to use Ceph as the root filesystem for a PXE-booted host?

Thanks
[ceph-users] Re: Stray host/daemon
Found my previous post regarding this issue. Fixed by restarting the mgr daemons.

-jeremy

> On Friday, Dec 01, 2023 at 3:04 AM, Me (mailto:jer...@skidrow.la)> wrote:
> I think I ran into this before but I forget the fix:
>
> HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
> [WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not managed by cephadm
> stray host cn06.ceph.fu.intra has 1 stray daemons: ['mon.cn03']
>
> Pacific 16.2.11
>
> How do I clear this?
>
> Thanks
> -jeremy
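For anyone landing here later: a stray-host/daemon warning that refers to an already-removed daemon is often just stale cephadm state cached in the active mgr, and failing over the mgr, the fix described above, rebuilds it:

```shell
# Fail over to a standby mgr; cephadm re-inventories daemons after
# the new active mgr starts.
ceph mgr fail

# Confirm the warning cleared and see what cephadm now tracks:
ceph health detail
ceph orch ps --refresh
```

If the warning persists after the failover, the daemon really does exist somewhere outside cephadm's control and needs to be found and removed on the host itself.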
[ceph-users] Stray host/daemon
I think I ran into this before but I forget the fix:

HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_HOST: 1 stray host(s) with 1 daemon(s) not managed by cephadm
stray host cn06.ceph.fu.intra has 1 stray daemons: ['mon.cn03']

Pacific 16.2.11

How do I clear this?

Thanks
-jeremy
[ceph-users] Re: Removed host still active, sort of?
Got around this issue by restarting the mgr daemons. -jeremy > On Saturday, Jun 10, 2023 at 11:26 PM, Me (mailto:jer...@skidrow.la)> wrote: > I see this in the web interface in Hosts and under cn03’s devices tab > > SAMSUNG_HD502HI_S1VFJ9ASB08190 > Unknown > n/a > sdg > mon.cn04 > > > 1 total > > > > > Which doesn’t make sense. There is no daemons running on this host and I > noticed the daemon lists looks like its one that should be on another node. > There is already a mon.cn04 running on the cn04 node. > > -jeremy > > > > > On Saturday, Jun 10, 2023 at 11:10 PM, Me > (mailto:jer...@skidrow.la)> wrote: > > I also see this error in the logs: > > > > 6/10/23 11:09:01 PM[ERR]host cn03.ceph does not exist Traceback (most > > recent call last): File "/usr/share/ceph/mgr/orchestrator/_interface.py", > > line 125, in wrapper return OrchResult(f(*args, **kwargs)) File > > "/usr/share/ceph/mgr/cephadm/module.py", line 1625, in remove_host > > self.inventory.rm_host(host) File > > "/usr/share/ceph/mgr/cephadm/inventory.py", line 108, in rm_host > > self.assert_host(host) File "/usr/share/ceph/mgr/cephadm/inventory.py", > > line 93, in assert_host raise OrchestratorError('host %s does not exist' % > > host) orchestrator._interface.OrchestratorError: host cn03.ceph does not > > exist > > > > > > > > > On Saturday, Jun 10, 2023 at 10:41 PM, Me > > (mailto:jer...@skidrow.la)> wrote: > > > I’m going through the process of transitioning to new hardware. Pacific > > > 16.2.11. > > > > > > I drained the host, all daemons were removed. 
Did the ceph orch host rm
> > >
> > > [ceph: root@cn01 /]# ceph orch host rm cn03.ceph
> > > Error EINVAL: host cn03.ceph does not exist
> > >
> > > Yet I see it here:
> > >
> > > ceph osd crush tree |grep cn03
> > > -10 0 host cn03
> > >
> > > Web interface says:
> > >
> > > ceph health
> > > HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
> > >
> > > ceph orch host rm cn03.ceph --force
> > > Error EINVAL: host cn03.ceph does not exist
> > >
> > > No daemons are running. Something has a bad state.
> > >
> > > What can I do to clear this up? The previous host went without a problem and when all services were drained and I did the remove, it just completely disappeared as expected.
> > >
> > > Thanks
> > > -jeremy
[ceph-users] Re: Removed host still active, sort of?
I see this in the web interface in Hosts, under cn03’s devices tab:

SAMSUNG_HD502HI_S1VFJ9ASB08190 Unknown n/a sdg mon.cn04
1 total

Which doesn’t make sense. There are no daemons running on this host, and the daemon list looks like one that should be on another node. There is already a mon.cn04 running on the cn04 node.

-jeremy

> On Saturday, Jun 10, 2023 at 11:10 PM, Me (mailto:jer...@skidrow.la)> wrote:
> I also see this error in the logs:
>
> 6/10/23 11:09:01 PM[ERR]host cn03.ceph does not exist Traceback (most recent call last): File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper return OrchResult(f(*args, **kwargs)) File "/usr/share/ceph/mgr/cephadm/module.py", line 1625, in remove_host self.inventory.rm_host(host) File "/usr/share/ceph/mgr/cephadm/inventory.py", line 108, in rm_host self.assert_host(host) File "/usr/share/ceph/mgr/cephadm/inventory.py", line 93, in assert_host raise OrchestratorError('host %s does not exist' % host) orchestrator._interface.OrchestratorError: host cn03.ceph does not exist
>
> > On Saturday, Jun 10, 2023 at 10:41 PM, Me (mailto:jer...@skidrow.la)> wrote:
> > I’m going through the process of transitioning to new hardware. Pacific 16.2.11.
> >
> > I drained the host, all daemons were removed. Did the ceph orch host rm
> >
> > [ceph: root@cn01 /]# ceph orch host rm cn03.ceph
> > Error EINVAL: host cn03.ceph does not exist
> >
> > Yet I see it here:
> >
> > ceph osd crush tree |grep cn03
> > -10 0 host cn03
> >
> > Web interface says:
> >
> > ceph health
> > HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
> >
> > ceph orch host rm cn03.ceph --force
> > Error EINVAL: host cn03.ceph does not exist
> >
> > No daemons are running. Something has a bad state.
> >
> > What can I do to clear this up?
> > The previous host went without a problem and when all services were drained and I did the remove, it just completely disappeared as expected.
> >
> > Thanks
> > -jeremy
[ceph-users] Re: Removed host still active, sort of?
I also see this error in the logs:

6/10/23 11:09:01 PM[ERR]host cn03.ceph does not exist Traceback (most recent call last): File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper return OrchResult(f(*args, **kwargs)) File "/usr/share/ceph/mgr/cephadm/module.py", line 1625, in remove_host self.inventory.rm_host(host) File "/usr/share/ceph/mgr/cephadm/inventory.py", line 108, in rm_host self.assert_host(host) File "/usr/share/ceph/mgr/cephadm/inventory.py", line 93, in assert_host raise OrchestratorError('host %s does not exist' % host) orchestrator._interface.OrchestratorError: host cn03.ceph does not exist

> On Saturday, Jun 10, 2023 at 10:41 PM, Me (mailto:jer...@skidrow.la)> wrote:
> I’m going through the process of transitioning to new hardware. Pacific 16.2.11.
>
> I drained the host, all daemons were removed. Did the ceph orch host rm
>
> [ceph: root@cn01 /]# ceph orch host rm cn03.ceph
> Error EINVAL: host cn03.ceph does not exist
>
> Yet I see it here:
>
> ceph osd crush tree |grep cn03
> -10 0 host cn03
>
> Web interface says:
>
> ceph health
> HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm
>
> ceph orch host rm cn03.ceph --force
> Error EINVAL: host cn03.ceph does not exist
>
> No daemons are running. Something has a bad state.
>
> What can I do to clear this up? The previous host went without a problem and when all services were drained and I did the remove, it just completely disappeared as expected.
>
> Thanks
> -jeremy
[ceph-users] Removed host still active, sort of?
I’m going through the process of transitioning to new hardware. Pacific 16.2.11.

I drained the host, all daemons were removed. Did the ceph orch host rm:

[ceph: root@cn01 /]# ceph orch host rm cn03.ceph
Error EINVAL: host cn03.ceph does not exist

Yet I see it here:

ceph osd crush tree |grep cn03
-10 0 host cn03

Web interface says:

ceph health
HEALTH_WARN 1 stray host(s) with 1 daemon(s) not managed by cephadm

ceph orch host rm cn03.ceph --force
Error EINVAL: host cn03.ceph does not exist

No daemons are running. Something has a bad state.

What can I do to clear this up? The previous host went without a problem and when all services were drained and I did the remove, it just completely disappeared as expected.

Thanks
-jeremy
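Two loose ends in the state above can be cleaned up separately: the leftover CRUSH bucket and cephadm's stale host record. A sketch; the `--offline` flag exists in recent Pacific releases, so check `ceph orch host rm -h` on your version first:

```shell
# Remove the now-empty host bucket from the CRUSH map:
ceph osd crush rm cn03

# cephadm may know the host under a slightly different name than the
# one you are removing -- check its exact spelling on record:
ceph orch host ls

# Then remove it, forcing past the reachability check if needed:
ceph orch host rm cn03.ceph --offline --force
```

If cephadm still reports the stray host afterwards, restarting the active mgr (`ceph mgr fail`) clears the cached inventory, which is what resolved this elsewhere in the archive.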
[ceph-users] Re: Ceph drain not removing daemons
Figured out how to cleanly relocate daemons via the interface. All is good.

-jeremy

> On Friday, Jun 09, 2023 at 2:04 PM, Me (mailto:jer...@skidrow.la)> wrote:
> I’m doing a drain on a host using cephadm, Pacific, 16.2.11.
>
> ceph orch host drain
>
> removed all the OSDs, but these daemons remain:
>
> grafana.cn06 cn06.ceph.la1 *:3000 stopped 5m ago 18M - -
> mds.btc.cn06.euxhdu cn06.ceph.la1 running (2d) 5m ago 17M 29.4M - 16.2.11 de4b0b384ad4 017f7ef441ff
> mgr.cn06.rpkpwg cn06.ceph.la1 *:8443,9283 running (2d) 5m ago 10M 223M - 16.2.11 de4b0b384ad4 f1b89b453ef3
>
> I manually stopped grafana.
>
> I expected these daemons to be removed as well. Is there an extra step I need to do here so I can remove the host cleanly?
>
> Thanks!
> -jeremy
[ceph-users] Ceph drain not removing daemons
I’m doing a drain on a host using cephadm, Pacific, 16.2.11.

ceph orch host drain

removed all the OSDs, but these daemons remain:

grafana.cn06 cn06.ceph.la1 *:3000 stopped 5m ago 18M - -
mds.btc.cn06.euxhdu cn06.ceph.la1 running (2d) 5m ago 17M 29.4M - 16.2.11 de4b0b384ad4 017f7ef441ff
mgr.cn06.rpkpwg cn06.ceph.la1 *:8443,9283 running (2d) 5m ago 10M 223M - 16.2.11 de4b0b384ad4 f1b89b453ef3

I manually stopped grafana.

I expected these daemons to be removed as well. Is there an extra step I need to do here so I can remove the host cleanly?

Thanks!
-jeremy
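One way this can be handled: mgr, mds, and monitoring daemons are placed by their service specs rather than by the host's OSDs, so after updating those specs (or placement labels) to exclude the host, any stragglers can be removed by daemon name. A sketch using the names from the listing above; removing an mgr fails over to a standby, so make sure one exists:

```shell
# Remove the leftover daemons individually (names from 'ceph orch ps'):
ceph orch daemon rm grafana.cn06 --force
ceph orch daemon rm mds.btc.cn06.euxhdu --force
ceph orch daemon rm mgr.cn06.rpkpwg --force   # standby mgr takes over

# Then the drain/removal should complete:
ceph orch host drain cn06.ceph.la1
ceph orch host rm cn06.ceph.la1
```

If the daemons come back, the service spec still targets the host; adjust the spec's placement (count, labels, or explicit hosts) before removing them again.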
[ceph-users] unable to calc client keyring client.admin placement PlacementSpec(label='_admin'): Cannot place : No matching hosts for label _admin
3/3/23 2:13:53 AM[WRN] unable to calc client keyring client.admin placement PlacementSpec(label='_admin'): Cannot place : No matching hosts for label _admin

I keep seeing this warning in the logs. I’m not really sure what action to take to resolve this issue.

Thanks
-jeremy
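That warning means cephadm wants to maintain the admin keyring and ceph.conf on hosts labelled `_admin`, but no host currently carries the label. Adding it to at least one host (typically the bootstrap node) should silence it; the hostname below is a placeholder:

```shell
# Label a host so cephadm can place the client.admin keyring there:
ceph orch host label add cn01 _admin   # substitute your actual hostname

# Verify the label is now present:
ceph orch host ls
```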
[ceph-users] Re: Upgrade not doing anything...
I’m not exactly sure what I did, but it’s going through now. I did a ceph orch upgrade check --ceph-version 16.2.7 my current version…. and I did a pause and resume. Now daemons are upgrading to 16.2.11. -jeremy > On Monday, Feb 27, 2023 at 11:07 PM, Me (mailto:jer...@skidrow.la)> wrote: > [ceph: root@cn01 /]# ceph -W cephadm, > cluster: > id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d > health: HEALTH_OK > > services: > mon: 5 daemons, quorum cn05,cn02,cn03,cn04,cn01 (age 111m) > mgr: cn06.rpkpwg(active, since 7h), standbys: cn02.arszct, cn03.elmwhu > mds: 2/2 daemons up, 2 standby > osd: 35 osds: 35 up (since 111m), 35 in (since 5h) > > data: > volumes: 2/2 healthy > pools: 8 pools, 545 pgs > objects: 8.13M objects, 7.7 TiB > usage: 31 TiB used, 95 TiB / 126 TiB avail > pgs: 545 active+clean > > io: > client: 4.1 MiB/s rd, 885 KiB/s wr, 128 op/s rd, 14 op/s wr > > progress: > Upgrade to quay.io/ceph/ceph:v16.2.11 (0s) > [] > > Cluster is healthy. > > Is there an easy way to see if anything was upgraded through the orchestrator? > > -jeremy > > > > > On Monday, Feb 27, 2023 at 10:58 PM, Curt > (mailto:light...@gmail.com)> wrote: > > Did any of your cluster get partial upgrade? What about ceph -W cephadm, > > does that return anything or just hang, also what about ceph health detail? > > You can always try ceph orch upgrade pause and then orch upgrade resume, > > might kick something loose, so to speak. > > On Tue, Feb 28, 2023, 10:39 Jeremy Hansen > (mailto:jer...@skidrow.la)> wrote: > > > { > > > "target_image": "quay.io/ceph/ceph:v16.2.11 > > > (http://quay.io/ceph/ceph:v16.2.11)", > > > "in_progress": true, > > > "services_complete": [], > > > "progress": "", > > > "message": "" > > > } > > > > > > Hasn’t changed in the past two hours. > > > > > > -jeremy > > > > > > > > > > > > > On Monday, Feb 27, 2023 at 10:22 PM, Curt > > > (mailto:light...@gmail.com)> wrote: > > > > What does Ceph orch upgrade status return? 
> > > > On Tue, Feb 28, 2023, 10:16 Jeremy Hansen (mailto:jer...@skidrow.la)> wrote:
> > > > > I’m trying to upgrade from 16.2.7 to 16.2.11. Reading the documentation, I cut and paste the orchestrator command to begin the upgrade, but I mistakenly pasted directly from the docs and it initiated an “upgrade” to 16.2.6. I stopped the upgrade per the docs and reissued the command specifying 16.2.11 but now I see no progress in ceph -s. Cluster is healthy but it feels like the upgrade process is just paused for some reason.
> > > > >
> > > > > Thanks!
> > > > > -jeremy
[ceph-users] Re: Upgrade not doing anything...
[ceph: root@cn01 /]# ceph -W cephadm, cluster: id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d health: HEALTH_OK services: mon: 5 daemons, quorum cn05,cn02,cn03,cn04,cn01 (age 111m) mgr: cn06.rpkpwg(active, since 7h), standbys: cn02.arszct, cn03.elmwhu mds: 2/2 daemons up, 2 standby osd: 35 osds: 35 up (since 111m), 35 in (since 5h) data: volumes: 2/2 healthy pools: 8 pools, 545 pgs objects: 8.13M objects, 7.7 TiB usage: 31 TiB used, 95 TiB / 126 TiB avail pgs: 545 active+clean io: client: 4.1 MiB/s rd, 885 KiB/s wr, 128 op/s rd, 14 op/s wr progress: Upgrade to quay.io/ceph/ceph:v16.2.11 (0s) [] Cluster is healthy. Is there an easy way to see if anything was upgraded through the orchestrator? -jeremy > On Monday, Feb 27, 2023 at 10:58 PM, Curt (mailto:light...@gmail.com)> wrote: > Did any of your cluster get partial upgrade? What about ceph -W cephadm, does > that return anything or just hang, also what about ceph health detail? You > can always try ceph orch upgrade pause and then orch upgrade resume, might > kick something loose, so to speak. > On Tue, Feb 28, 2023, 10:39 Jeremy Hansen (mailto:jer...@skidrow.la)> wrote: > > { > > "target_image": "quay.io/ceph/ceph:v16.2.11 > > (http://quay.io/ceph/ceph:v16.2.11)", > > "in_progress": true, > > "services_complete": [], > > "progress": "", > > "message": "" > > } > > > > Hasn’t changed in the past two hours. > > > > -jeremy > > > > > > > > > On Monday, Feb 27, 2023 at 10:22 PM, Curt > > (mailto:light...@gmail.com)> wrote: > > > What does Ceph orch upgrade status return? > > > On Tue, Feb 28, 2023, 10:16 Jeremy Hansen > > (mailto:jer...@skidrow.la)> wrote: > > > > I’m trying to upgrade from 16.2.7 to 16.2.11. Reading the > > > > documentation, I cut and paste the orchestrator command to begin the > > > > upgrade, but I mistakenly pasted directly from the docs and it > > > > initiated an “upgrade” to 16.2.6. 
I stopped the upgrade per the docs and reissued the command specifying 16.2.11 but now I see no progress in ceph -s. Cluster is healthy but it feels like the upgrade process is just paused for some reason.
> > > >
> > > > Thanks!
> > > > -jeremy
[ceph-users] Re: Upgrade not doing anything...
{
  "target_image": "quay.io/ceph/ceph:v16.2.11",
  "in_progress": true,
  "services_complete": [],
  "progress": "",
  "message": ""
}

Hasn’t changed in the past two hours.

-jeremy

> On Monday, Feb 27, 2023 at 10:22 PM, Curt (mailto:light...@gmail.com)> wrote:
> What does Ceph orch upgrade status return?
> On Tue, Feb 28, 2023, 10:16 Jeremy Hansen (mailto:jer...@skidrow.la)> wrote:
> > I’m trying to upgrade from 16.2.7 to 16.2.11. Reading the documentation, I cut and paste the orchestrator command to begin the upgrade, but I mistakenly pasted directly from the docs and it initiated an “upgrade” to 16.2.6. I stopped the upgrade per the docs and reissued the command specifying 16.2.11 but now I see no progress in ceph -s. Cluster is healthy but it feels like the upgrade process is just paused for some reason.
> >
> > Thanks!
> > -jeremy
[ceph-users] Upgrade not doing anything...
I’m trying to upgrade from 16.2.7 to 16.2.11. Reading the documentation, I cut and pasted the orchestrator command to begin the upgrade, but I mistakenly pasted directly from the docs and it initiated an “upgrade” to 16.2.6. I stopped the upgrade per the docs and reissued the command specifying 16.2.11, but now I see no progress in ceph -s. The cluster is healthy, but it feels like the upgrade process is just paused for some reason.

Thanks!
-jeremy
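A few standard cephadm commands that help when an upgrade appears stalled, shown with the versions from this thread (the pause/resume sequence is what eventually kicked it loose later in the archive):

```shell
ceph orch upgrade status    # target image, progress, any error message
ceph -W cephadm             # stream the orchestrator's log live
ceph health detail          # surfaces any UPGRADE_* health checks

# If it is wedged, stop and restart cleanly at the intended version:
ceph orch upgrade stop
ceph orch upgrade start --ceph-version 16.2.11

# Or nudge it without restarting:
ceph orch upgrade pause
ceph orch upgrade resume
```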
[ceph-users] 1 stray daemon(s) not managed by cephadm
How do I track down which daemon is the stray one?

Thanks
-jeremy
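`ceph health detail` names the stray daemon directly; beyond that, comparing what the cluster sees against what cephadm manages pinpoints it. A sketch:

```shell
# The health check lists the stray daemon and its host explicitly:
ceph health detail | grep -A2 CEPHADM_STRAY

# Daemons the cluster itself knows about, grouped per host:
ceph node ls

# Daemons cephadm is actually managing:
ceph orch ps

# Anything in 'ceph node ls' that is missing from 'ceph orch ps'
# is the stray daemon.
```

Common culprits are a daemon started manually on a host, or stale mgr state; in the latter case `ceph mgr fail` clears the warning.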
[ceph-users] Two osd's assigned to one device
I have a situation (not sure how it happened) where Ceph believes I have two OSDs assigned to a single device. I tried to delete osd.2 and osd.3, but it just hangs. I'm also trying to zap sdc, which claims it does not have an OSD, but I'm unable to zap it. Any suggestions?

/dev/sdb HDD TOSHIBA MG04SCA40EE 3.6 TiB osd.2 osd.3
/dev/sdc SSD SAMSUNG MZILT3T8HBLS/007 3.5 TiB
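A possible cleanup path, assuming osd.2 and osd.3 are safe to destroy (their data has already rebalanced elsewhere); the purge and zap commands are standard, but double-check the OSD IDs against `ceph osd tree` before running anything destructive:

```shell
# See where the two OSDs actually live and whether they are up:
ceph osd tree | grep -E 'osd\.(2|3)'

# If 'ceph orch osd rm' hangs, take them out and purge directly.
# Purge removes the OSD from the CRUSH map, auth, and the OSD map:
ceph osd out 2 3
ceph osd purge 2 --yes-i-really-mean-it
ceph osd purge 3 --yes-i-really-mean-it

# Then wipe the device so the orchestrator can redeploy onto it
# (substitute the actual hostname):
ceph orch device zap <hostname> /dev/sdb --force
```

The zap refusing to run on sdc can also be a stale mgr inventory; refreshing it (`ceph orch device ls --refresh`, or `ceph mgr fail`) is worth trying before anything else.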
[ceph-users] Re: Issues after a shutdown
I use Ubiquiti equipment, mainly because I'm not a network admin... I rebooted the 10G switches and now everything is working and recovering. I hate when there's not a definitive answer but that's kind of the deal when you use Ubiquiti stuff. Thank you Sean and Frank. Frank, you were right. It made no sense because from a very basic point of view the network seemed fine, but Sean's ping revealed that it clearly wasn't. Thank you! -jeremy On Mon, Jul 25, 2022 at 3:08 PM Sean Redmond wrote: > Yea, assuming you can ping with a lower MTU, check the MTU on your > switching. > > On Mon, 25 Jul 2022, 23:05 Jeremy Hansen, > wrote: > >> That results in packet loss: >> >> [root@cn01 ~]# ping -M do -s 8972 192.168.30.14 >> PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data. >> ^C >> --- 192.168.30.14 ping statistics --- >> 3 packets transmitted, 0 received, 100% packet loss, time 2062ms >> >> That's very weird... but this gives me something to figure out. Hmmm. >> Thank you. >> >> On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond >> wrote: >> >>> Looks good, just confirm it with a large ping with don't fragment flag >>> set between each host. 
>>> >>> ping -M do -s 8972 [destination IP] >>> >>> >>> On Mon, 25 Jul 2022, 22:56 Jeremy Hansen, >>> wrote: >>> >>>> MTU is the same across all hosts: >>>> >>>> - cn01.ceph.la1.clx.corp- >>>> enp2s0: flags=4163 mtu 9000 >>>> inet 192.168.30.11 netmask 255.255.255.0 broadcast >>>> 192.168.30.255 >>>> inet6 fe80::3e8c:f8ff:feed:728d prefixlen 64 scopeid >>>> 0x20 >>>> ether 3c:8c:f8:ed:72:8d txqueuelen 1000 (Ethernet) >>>> RX packets 3163785 bytes 213625 (1.9 GiB) >>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>> TX packets 6890933 bytes 40233267272 (37.4 GiB) >>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>> >>>> - cn02.ceph.la1.clx.corp- >>>> enp2s0: flags=4163 mtu 9000 >>>> inet 192.168.30.12 netmask 255.255.255.0 broadcast >>>> 192.168.30.255 >>>> inet6 fe80::3e8c:f8ff:feed:ff0c prefixlen 64 scopeid >>>> 0x20 >>>> ether 3c:8c:f8:ed:ff:0c txqueuelen 1000 (Ethernet) >>>> RX packets 3976256 bytes 2761764486 (2.5 GiB) >>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>> TX packets 9270324 bytes 56984933585 (53.0 GiB) >>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>> >>>> - cn03.ceph.la1.clx.corp- >>>> enp2s0: flags=4163 mtu 9000 >>>> inet 192.168.30.13 netmask 255.255.255.0 broadcast >>>> 192.168.30.255 >>>> inet6 fe80::3e8c:f8ff:feed:feba prefixlen 64 scopeid >>>> 0x20 >>>> ether 3c:8c:f8:ed:fe:ba txqueuelen 1000 (Ethernet) >>>> RX packets 13081847 bytes 93614795356 (87.1 GiB) >>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>> TX packets 4001854 bytes 2536322435 (2.3 GiB) >>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>> >>>> - cn04.ceph.la1.clx.corp- >>>> enp2s0: flags=4163 mtu 9000 >>>> inet 192.168.30.14 netmask 255.255.255.0 broadcast >>>> 192.168.30.255 >>>> inet6 fe80::3e8c:f8ff:feed:6f89 prefixlen 64 scopeid >>>> 0x20 >>>> ether 3c:8c:f8:ed:6f:89 txqueuelen 1000 (Ethernet) >>>> RX packets 60018 bytes 5622542 (5.3 MiB) >>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>> TX packets 59889 bytes 17463794 (16.6 MiB) 
>>>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >>>> >>>> - cn05.ceph.la1.clx.corp- >>>> enp2s0: flags=4163 mtu 9000 >>>> inet 192.168.30.15 netmask 255.255.255.0 broadcast >>>> 192.168.30.255 >>>> inet6 fe80::3e8c:f8ff:feed:7245 prefixlen 64 scopeid >>>> 0x20 >>>> ether 3c:8c:f8:ed:72:45 txqueuelen 1000 (Ethernet) >>>> RX packets 69163 bytes 8085511 (7.7 MiB) >>>> RX errors 0 dropped 0 overruns 0 frame 0 >>>> TX packets 73539 bytes 17069869 (16.2
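The 8972 in Sean's ping is not arbitrary: it is the 9000-byte MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. A small sketch for checking path MTU to every host in one pass (the host IPs are the ones from this thread; adjust for your network):

```shell
#!/bin/sh
# ICMP payload that exactly fills an MTU-sized frame:
# MTU minus 20 bytes IPv4 header minus 8 bytes ICMP header.
mtu_payload() {
    echo $(( $1 - 28 ))
}

# Ping each host with don't-fragment set; a drop means some hop
# (often a switch) is not passing jumbo frames end to end.
check_path_mtu() {
    payload=$(mtu_payload "$1"); shift
    for ip in "$@"; do
        if ping -M do -c 3 -s "$payload" "$ip" >/dev/null 2>&1; then
            echo "OK   $ip carries ${payload}-byte payloads"
        else
            echo "FAIL $ip drops ${payload}-byte payloads (check switch MTU)"
        fi
    done
}

mtu_payload 9000   # prints 8972, matching the ping used in the thread
# check_path_mtu 9000 192.168.30.11 192.168.30.12 192.168.30.13
```

Interface MTU looking right on every host (as in the ifconfig output above) is not enough; the don't-fragment ping is what proves the switches in between forward jumbo frames.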
[ceph-users] Re: Issues after a shutdown
That results in packet loss: [root@cn01 ~]# ping -M do -s 8972 192.168.30.14 PING 192.168.30.14 (192.168.30.14) 8972(9000) bytes of data. ^C --- 192.168.30.14 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2062ms That's very weird... but this gives me something to figure out. Hmmm. Thank you. On Mon, Jul 25, 2022 at 3:01 PM Sean Redmond wrote: > Looks good, just confirm it with a large ping with don't fragment flag set > between each host. > > ping -M do -s 8972 [destination IP] > > > On Mon, 25 Jul 2022, 22:56 Jeremy Hansen, > wrote: > >> MTU is the same across all hosts: >> >> - cn01.ceph.la1.clx.corp- >> enp2s0: flags=4163 mtu 9000 >> inet 192.168.30.11 netmask 255.255.255.0 broadcast >> 192.168.30.255 >> inet6 fe80::3e8c:f8ff:feed:728d prefixlen 64 scopeid 0x20 >> ether 3c:8c:f8:ed:72:8d txqueuelen 1000 (Ethernet) >> RX packets 3163785 bytes 213625 (1.9 GiB) >> RX errors 0 dropped 0 overruns 0 frame 0 >> TX packets 6890933 bytes 40233267272 (37.4 GiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> - cn02.ceph.la1.clx.corp- >> enp2s0: flags=4163 mtu 9000 >> inet 192.168.30.12 netmask 255.255.255.0 broadcast >> 192.168.30.255 >> inet6 fe80::3e8c:f8ff:feed:ff0c prefixlen 64 scopeid 0x20 >> ether 3c:8c:f8:ed:ff:0c txqueuelen 1000 (Ethernet) >> RX packets 3976256 bytes 2761764486 (2.5 GiB) >> RX errors 0 dropped 0 overruns 0 frame 0 >> TX packets 9270324 bytes 56984933585 (53.0 GiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> - cn03.ceph.la1.clx.corp- >> enp2s0: flags=4163 mtu 9000 >> inet 192.168.30.13 netmask 255.255.255.0 broadcast >> 192.168.30.255 >> inet6 fe80::3e8c:f8ff:feed:feba prefixlen 64 scopeid 0x20 >> ether 3c:8c:f8:ed:fe:ba txqueuelen 1000 (Ethernet) >> RX packets 13081847 bytes 93614795356 (87.1 GiB) >> RX errors 0 dropped 0 overruns 0 frame 0 >> TX packets 4001854 bytes 2536322435 (2.3 GiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> - cn04.ceph.la1.clx.corp- 
>> enp2s0: flags=4163 mtu 9000 >> inet 192.168.30.14 netmask 255.255.255.0 broadcast >> 192.168.30.255 >> inet6 fe80::3e8c:f8ff:feed:6f89 prefixlen 64 scopeid 0x20 >> ether 3c:8c:f8:ed:6f:89 txqueuelen 1000 (Ethernet) >> RX packets 60018 bytes 5622542 (5.3 MiB) >> RX errors 0 dropped 0 overruns 0 frame 0 >> TX packets 59889 bytes 17463794 (16.6 MiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> - cn05.ceph.la1.clx.corp- >> enp2s0: flags=4163 mtu 9000 >> inet 192.168.30.15 netmask 255.255.255.0 broadcast >> 192.168.30.255 >> inet6 fe80::3e8c:f8ff:feed:7245 prefixlen 64 scopeid 0x20 >> ether 3c:8c:f8:ed:72:45 txqueuelen 1000 (Ethernet) >> RX packets 69163 bytes 8085511 (7.7 MiB) >> RX errors 0 dropped 0 overruns 0 frame 0 >> TX packets 73539 bytes 17069869 (16.2 MiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> - cn06.ceph.la1.clx.corp- >> enp2s0: flags=4163 mtu 9000 >> inet 192.168.30.16 netmask 255.255.255.0 broadcast >> 192.168.30.255 >> inet6 fe80::3e8c:f8ff:feed:feab prefixlen 64 scopeid 0x20 >> ether 3c:8c:f8:ed:fe:ab txqueuelen 1000 (Ethernet) >> RX packets 23570 bytes 2251531 (2.1 MiB) >> RX errors 0 dropped 0 overruns 0 frame 0 >> TX packets 22268 bytes 16186794 (15.4 MiB) >> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 >> >> 10G. >> >> On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond >> wrote: >> >>> Is the MTU in n the new rack set correctly? >>> >>> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, >>> wrote: >>> >>>> I transitioned some servers to a new rack and now I'm having major >>>> issues >>>> with Ceph upon bringing things back up. >>>> >>>> I believe the issue may be related to the ceph nodes coming back up with >>>> different IPs before VLANs were set. That's just a guess because I >>>> can't >>>> think of any other reason this would happen. &
[ceph-users] Re: Issues after a shutdown
Does ceph do any kind of io fencing if it notices an anomaly? Do I need to do something to re-enable these hosts if they get marked as bad? On Mon, Jul 25, 2022 at 2:56 PM Jeremy Hansen wrote: > MTU is the same across all hosts: > > - cn01.ceph.la1.clx.corp- > enp2s0: flags=4163 mtu 9000 > inet 192.168.30.11 netmask 255.255.255.0 broadcast 192.168.30.255 > inet6 fe80::3e8c:f8ff:feed:728d prefixlen 64 scopeid 0x20 > ether 3c:8c:f8:ed:72:8d txqueuelen 1000 (Ethernet) > RX packets 3163785 bytes 213625 (1.9 GiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 6890933 bytes 40233267272 (37.4 GiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > - cn02.ceph.la1.clx.corp- > enp2s0: flags=4163 mtu 9000 > inet 192.168.30.12 netmask 255.255.255.0 broadcast 192.168.30.255 > inet6 fe80::3e8c:f8ff:feed:ff0c prefixlen 64 scopeid 0x20 > ether 3c:8c:f8:ed:ff:0c txqueuelen 1000 (Ethernet) > RX packets 3976256 bytes 2761764486 (2.5 GiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 9270324 bytes 56984933585 (53.0 GiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > - cn03.ceph.la1.clx.corp- > enp2s0: flags=4163 mtu 9000 > inet 192.168.30.13 netmask 255.255.255.0 broadcast 192.168.30.255 > inet6 fe80::3e8c:f8ff:feed:feba prefixlen 64 scopeid 0x20 > ether 3c:8c:f8:ed:fe:ba txqueuelen 1000 (Ethernet) > RX packets 13081847 bytes 93614795356 (87.1 GiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 4001854 bytes 2536322435 (2.3 GiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > - cn04.ceph.la1.clx.corp- > enp2s0: flags=4163 mtu 9000 > inet 192.168.30.14 netmask 255.255.255.0 broadcast 192.168.30.255 > inet6 fe80::3e8c:f8ff:feed:6f89 prefixlen 64 scopeid 0x20 > ether 3c:8c:f8:ed:6f:89 txqueuelen 1000 (Ethernet) > RX packets 60018 bytes 5622542 (5.3 MiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 59889 bytes 17463794 (16.6 MiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > - 
cn05.ceph.la1.clx.corp- > enp2s0: flags=4163 mtu 9000 > inet 192.168.30.15 netmask 255.255.255.0 broadcast 192.168.30.255 > inet6 fe80::3e8c:f8ff:feed:7245 prefixlen 64 scopeid 0x20 > ether 3c:8c:f8:ed:72:45 txqueuelen 1000 (Ethernet) > RX packets 69163 bytes 8085511 (7.7 MiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 73539 bytes 17069869 (16.2 MiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > - cn06.ceph.la1.clx.corp- > enp2s0: flags=4163 mtu 9000 > inet 192.168.30.16 netmask 255.255.255.0 broadcast 192.168.30.255 > inet6 fe80::3e8c:f8ff:feed:feab prefixlen 64 scopeid 0x20 > ether 3c:8c:f8:ed:fe:ab txqueuelen 1000 (Ethernet) > RX packets 23570 bytes 2251531 (2.1 MiB) > RX errors 0 dropped 0 overruns 0 frame 0 > TX packets 22268 bytes 16186794 (15.4 MiB) > TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 > > 10G. > > On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond > wrote: > >> Is the MTU in n the new rack set correctly? >> >> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, >> wrote: >> >>> I transitioned some servers to a new rack and now I'm having major issues >>> with Ceph upon bringing things back up. >>> >>> I believe the issue may be related to the ceph nodes coming back up with >>> different IPs before VLANs were set. That's just a guess because I can't >>> think of any other reason this would happen. >>> >>> Current state: >>> >>> Every 2.0s: ceph -s >>>cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022 >>> >>> cluster: >>> id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d >>> health: HEALTH_WARN >>> 1 filesystem is degraded >>> 2 MDSs report slow metadata IOs >>> 2/5 mons down, quorum cn02,cn03,cn01 >>> 9 osds down >>> 3 hosts (17 osds) down >>> Reduced data availability: 97 pgs inactive, 9 pgs down >>> Degraded data redundancy: 13860144/30824413 objects degraded >>> (44.965%), 411 pgs degraded, 482 pgs undersized >>> >>> servi
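Eyeballing the per-host ifconfig output above for a mismatched MTU is error-prone. A small sketch to automate the comparison (hostnames, interface name `enp2s0`, and root ssh access are assumptions taken from this thread; adjust for your environment):

```shell
# Hosts and NIC name as they appear in the thread (assumptions).
HOSTS="cn01 cn02 cn03 cn04 cn05 cn06"

# Emit "<host> <mtu>" pairs; not invoked here because it needs ssh access.
collect_mtus() {
    for h in $HOSTS; do
        printf '%s %s\n' "$h" "$(ssh "root@$h" cat /sys/class/net/enp2s0/mtu)"
    done
}

# Succeeds only if every MTU value on stdin is identical.
mtu_consistent() {
    [ "$(awk '{print $2}' | sort -u | wc -l)" -eq 1 ]
}

# Usage: collect_mtus | mtu_consistent && echo "MTU consistent across hosts"
```

Note that a mismatch can also hide in the switch fabric: the host MTUs can all read 9000 while a switch port in the new rack still drops jumbo frames, so a `ping -M do -s 8972` between racks is worth trying as well.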
[ceph-users] Re: Issues after a shutdown
MTU is the same across all hosts:

- cn01.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.11  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:728d  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:72:8d  txqueuelen 1000  (Ethernet)
        RX packets 3163785  bytes 213625 (1.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6890933  bytes 40233267272 (37.4 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn02.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.12  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:ff0c  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:ff:0c  txqueuelen 1000  (Ethernet)
        RX packets 3976256  bytes 2761764486 (2.5 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9270324  bytes 56984933585 (53.0 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn03.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.13  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:feba  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:fe:ba  txqueuelen 1000  (Ethernet)
        RX packets 13081847  bytes 93614795356 (87.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4001854  bytes 2536322435 (2.3 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn04.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.14  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:6f89  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:6f:89  txqueuelen 1000  (Ethernet)
        RX packets 60018  bytes 5622542 (5.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 59889  bytes 17463794 (16.6 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn05.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.15  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:7245  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:72:45  txqueuelen 1000  (Ethernet)
        RX packets 69163  bytes 8085511 (7.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 73539  bytes 17069869 (16.2 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

- cn06.ceph.la1.clx.corp -
enp2s0: flags=4163  mtu 9000
        inet 192.168.30.16  netmask 255.255.255.0  broadcast 192.168.30.255
        inet6 fe80::3e8c:f8ff:feed:feab  prefixlen 64  scopeid 0x20
        ether 3c:8c:f8:ed:fe:ab  txqueuelen 1000  (Ethernet)
        RX packets 23570  bytes 2251531 (2.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22268  bytes 16186794 (15.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

10G.

On Mon, Jul 25, 2022 at 2:51 PM Sean Redmond wrote:

> Is the MTU in the new rack set correctly?
>
> On Mon, 25 Jul 2022, 11:30 Jeremy Hansen, wrote:
>
>> I transitioned some servers to a new rack and now I'm having major issues
>> with Ceph upon bringing things back up.
>>
>> I believe the issue may be related to the ceph nodes coming back up with
>> different IPs before VLANs were set. That's just a guess because I can't
>> think of any other reason this would happen.
>>
>> Current state:
>>
>> Every 2.0s: ceph -s    cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
>>
>> cluster:
>>   id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>>   health: HEALTH_WARN
>>           1 filesystem is degraded
>>           2 MDSs report slow metadata IOs
>>           2/5 mons down, quorum cn02,cn03,cn01
>>           9 osds down
>>           3 hosts (17 osds) down
>>           Reduced data availability: 97 pgs inactive, 9 pgs down
>>           Degraded data redundancy: 13860144/30824413 objects degraded
>>           (44.965%), 411 pgs degraded, 482 pgs undersized
>>
>> services:
>>   mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
>>   mgr: cn02.arszct(active, since 5m)
>>   mds: 2/2 daemons up, 2 standby
>>   osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
>>
>> data:
>>   volumes: 1/2 healthy, 1 recovering
>>   pools:   8 pools, 545 pgs
>>   objects: 7.71M objects, 6.7 TiB
>>   usage:   15 TiB used, 39 TiB / 54 TiB avail
>>   pgs:     0.367% pgs unknown
>>            17.431% pgs not active
>>            13860144/30824413 objects degraded (44.965%)
>>
[ceph-users] Re: [Warning Possible spam] Re: Issues after a shutdown
ate active+undersized+remapped, last acting [9,6] pg 9.8f is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,26,17] pg 9.90 is stuck undersized for 62m, current state active+undersized+remapped, last acting [35,26] pg 9.91 is stuck undersized for 62m, current state active+undersized+degraded, last acting [17,5] pg 9.92 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,26] pg 9.93 is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,26,5] pg 9.94 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,11] pg 9.95 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,19] pg 9.96 is stuck undersized for 62m, current state active+undersized+degraded, last acting [17,6] pg 9.97 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,9,16] pg 9.98 is stuck undersized for 62m, current state active+undersized+degraded, last acting [6,21] pg 9.99 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,9] pg 9.9a is stuck undersized for 61m, current state active+undersized+remapped, last acting [4,16,10] pg 9.9b is stuck undersized for 61m, current state active+undersized+degraded, last acting [12,4,11] pg 9.9c is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,16] pg 9.9d is stuck undersized for 62m, current state active+undersized+degraded, last acting [26,35] pg 9.9f is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,17,26] pg 12.70 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,35] pg 12.71 is active+undersized+degraded, acting [6,12] pg 12.72 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,14,4] pg 12.73 is stuck undersized for 62m, current state active+undersized+remapped, last acting [5,17,11] 
pg 12.78 is stuck undersized for 61m, current state active+undersized+degraded, last acting [5,8,35]
pg 12.79 is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,17]
pg 12.7a is stuck undersized for 62m, current state active+undersized+degraded, last acting [10,21]
pg 12.7b is stuck undersized for 62m, current state active+undersized+remapped, last acting [17,21,11]
pg 12.7c is stuck undersized for 62m, current state active+undersized+degraded, last acting [32,21,16]
pg 12.7d is stuck undersized for 61m, current state active+undersized+degraded, last acting [35,6,9]
pg 12.7e is stuck undersized for 61m, current state active+undersized+degraded, last acting [26,4]
pg 12.7f is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,14]

It's no longer giving me the ssh key issues, but that hasn't done anything to improve my situation. When the machines came up with a different IP, did this somehow throw off some kind of ssh known_hosts file or pub key exchange? It's very strange that a momentary bad IP could wreak so much havoc.

Thank you
-jeremy

On Mon, Jul 25, 2022 at 1:44 PM Frank Schilder wrote:

> I don't use ceph-adm and I also don't know how you got the "some more
> info". However, I did notice that it contains instructions, starting at
> "Please make sure that the host is reachable ...". How about starting to
> follow those?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Jeremy Hansen
> Sent: 25 July 2022 22:32:32
> To: ceph-users@ceph.io
> Subject: [Warning Possible spam] [ceph-users] Re: Issues after a shutdown
>
> Here's some more info:
>
> HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2
> filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down,
> quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data
> availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy:
> 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs
> undersized
> [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
>     daemon osd.3 on cn01.ceph is in error state
>     daemon osd.2 on cn01.ceph is in error state
> [WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check
>     host cn04.ceph (192.168.30.14) failed check: Failed to connect to
>     cn04.ceph (192.168.30.14).
> Please make sure that the host is reachable and accepts connections using
> the cephadm SSH key
>
> To add the cephadm SSH key to the host:
>
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14
>
> To check that the hos
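The instructions quoted in the health output above amount to re-seeding the cephadm SSH key on each unreachable host. A sketch that loops them together (host IPs are the ones from this thread and root ssh is assumed; `ceph cephadm check-host` re-runs the reachability check per host):

```shell
# Re-distribute the cephadm public key to each unreachable host and re-check.
# Not invoked here; it needs a live cluster and ssh access.
redistribute_cephadm_key() {
    ceph cephadm get-pub-key > ~/ceph.pub
    for h in "$@"; do
        ssh-copy-id -f -i ~/ceph.pub "root@$h"   # seed the orchestrator's key
        ceph cephadm check-host "$h"             # confirm cephadm can now connect
    done
}

# Usage: redistribute_cephadm_key 192.168.30.14 192.168.30.15 192.168.30.16
```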
[ceph-users] Re: Issues after a shutdown
s stuck undersized for 34m, current state active+undersized+remapped, last acting [32,35,4] pg 9.78 is stuck undersized for 35m, current state active+undersized+degraded, last acting [14,10] pg 9.79 is stuck undersized for 35m, current state active+undersized+degraded, last acting [21,32] pg 9.7b is stuck undersized for 34m, current state active+undersized+degraded, last acting [8,12,5] pg 9.7c is stuck undersized for 34m, current state active+undersized+degraded, last acting [4,35,10] pg 9.7d is stuck undersized for 35m, current state active+undersized+degraded, last acting [5,19,10] pg 9.7e is stuck undersized for 35m, current state active+undersized+remapped, last acting [21,10,17] pg 9.80 is stuck undersized for 34m, current state active+undersized+degraded, last acting [8,4,17] pg 9.81 is stuck undersized for 35m, current state active+undersized+degraded, last acting [14,26] pg 9.82 is stuck undersized for 35m, current state active+undersized+degraded, last acting [26,16] pg 9.83 is stuck undersized for 34m, current state active+undersized+degraded, last acting [8,4] pg 9.84 is stuck undersized for 34m, current state active+undersized+degraded, last acting [4,35,6] pg 9.85 is stuck undersized for 35m, current state active+undersized+degraded, last acting [32,12,9] pg 9.86 is stuck undersized for 34m, current state active+undersized+degraded, last acting [35,5,8] pg 9.87 is stuck undersized for 35m, current state active+undersized+degraded, last acting [9,12] pg 9.88 is stuck undersized for 35m, current state active+undersized+remapped, last acting [19,32,35] pg 9.89 is stuck undersized for 34m, current state active+undersized+degraded, last acting [10,14,4] pg 9.8a is stuck undersized for 35m, current state active+undersized+degraded, last acting [21,19] pg 9.8b is stuck undersized for 34m, current state active+undersized+degraded, last acting [8,35] pg 9.8c is stuck undersized for 31m, current state active+undersized+remapped, last acting [10,19,5] pg 9.8d is 
stuck undersized for 35m, current state active+undersized+remapped, last acting [9,6] pg 9.8f is stuck undersized for 35m, current state active+undersized+remapped, last acting [19,26,17] pg 9.90 is stuck undersized for 35m, current state active+undersized+remapped, last acting [35,26] pg 9.91 is stuck undersized for 35m, current state active+undersized+degraded, last acting [17,5] pg 9.92 is stuck undersized for 35m, current state active+undersized+degraded, last acting [21,26] pg 9.93 is stuck undersized for 35m, current state active+undersized+remapped, last acting [19,26,5] pg 9.94 is stuck undersized for 35m, current state active+undersized+degraded, last acting [21,11] pg 9.95 is stuck undersized for 34m, current state active+undersized+degraded, last acting [8,19] pg 9.96 is stuck undersized for 35m, current state active+undersized+degraded, last acting [17,6] pg 9.97 is stuck undersized for 34m, current state active+undersized+degraded, last acting [8,9,16] pg 9.98 is stuck undersized for 35m, current state active+undersized+degraded, last acting [6,21] pg 9.99 is stuck undersized for 35m, current state active+undersized+degraded, last acting [10,9] pg 9.9a is stuck undersized for 34m, current state active+undersized+remapped, last acting [4,16,10] pg 9.9b is stuck undersized for 34m, current state active+undersized+degraded, last acting [12,4,11] pg 9.9c is stuck undersized for 35m, current state active+undersized+degraded, last acting [9,16] pg 9.9d is stuck undersized for 35m, current state active+undersized+degraded, last acting [26,35] pg 9.9f is stuck undersized for 35m, current state active+undersized+degraded, last acting [9,17,26] pg 12.70 is stuck undersized for 35m, current state active+undersized+degraded, last acting [21,35] pg 12.71 is active+undersized+degraded, acting [6,12] pg 12.72 is stuck undersized for 34m, current state active+undersized+degraded, last acting [10,14,4] pg 12.73 is stuck undersized for 35m, current state 
active+undersized+remapped, last acting [5,17,11] pg 12.78 is stuck undersized for 34m, current state active+undersized+degraded, last acting [5,8,35] pg 12.79 is stuck undersized for 34m, current state active+undersized+degraded, last acting [4,17] pg 12.7a is stuck undersized for 35m, current state active+undersized+degraded, last acting [10,21] pg 12.7b is stuck undersized for 35m, current state active+undersized+remapped, last acting [17,21,11] pg 12.7c is stuck undersized for 35m, current state active+undersized+degraded, last acting [32,21,16] pg 12.7d is stuck undersized for 35m, current state active+undersized+degraded, last acting [35,6,9] pg 12.7e is stuck undersized for 34m, current state active+undersized+degraded, last acting [26,4] pg 12.7f is stuck undersized for 35m, current state active+undersiz
[ceph-users] Re: Issues after a shutdown
thru 31627 down_at 31208 last_clean_interval [30974,31195) [v2: 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2: 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] exists,up 9200a57e-2845-43ff-9787-8f1f3158fe90
osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 last_clean_interval [25521,30350) [v2: 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2: 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists 20c55d85-cf9a-4133-a189-7fdad2318f58
osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 last_clean_interval [25516,30314) [v2: 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2: 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists 77e0ef8f-c047-4f84-afb2-a8ad054e562f
osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 last_clean_interval [30958,31195) [v2: 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2: 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] exists,up 2d2de0cb-6d41-4957-a473-2bbe9ce227bf
osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 last_clean_interval [25491,29492) [v2: 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2: 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists 26114668-68b2-458b-89c2-cbad5507ab75

> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen <farnsworth.mcfad...@gmail.com> wrote:
> >
> > I transitioned some servers to a new rack and now I'm having major issues
> > with Ceph upon bringing things back up.
> >
> > I believe the issue may be related to the ceph nodes coming back up with
> > different IPs before VLANs were set. That's just a guess because I can't
> > think of any other reason this would happen.
> > > > Current state: > > > > Every 2.0s: ceph -s > > cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022 > > > > cluster: > >id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d > >health: HEALTH_WARN > >1 filesystem is degraded > >2 MDSs report slow metadata IOs > >2/5 mons down, quorum cn02,cn03,cn01 > >9 osds down > >3 hosts (17 osds) down > >Reduced data availability: 97 pgs inactive, 9 pgs down > >Degraded data redundancy: 13860144/30824413 objects degraded > > (44.965%), 411 pgs degraded, 482 pgs undersized > > > > services: > >mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, > > cn04 > >mgr: cn02.arszct(active, since 5m) > >mds: 2/2 daemons up, 2 standby > >osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs > > > > data: > >volumes: 1/2 healthy, 1 recovering > >pools: 8 pools, 545 pgs > >objects: 7.71M objects, 6.7 TiB > >usage: 15 TiB used, 39 TiB / 54 TiB avail > >pgs: 0.367% pgs unknown > > 17.431% pgs not active > > 13860144/30824413 objects degraded (44.965%) > > 1137693/30824413 objects misplaced (3.691%) > > 280 active+undersized+degraded > > 67 undersized+degraded+remapped+backfilling+peered > > 57 active+undersized+remapped > > 45 active+clean+remapped > > 44 active+undersized+degraded+remapped+backfilling > > 18 undersized+degraded+peered > > 10 active+undersized > > 9 down > > 7 active+clean > > 3 active+undersized+remapped+backfilling > > 2 active+undersized+degraded+remapped+backfill_wait > > 2 unknown > > 1 undersized+peered > > > > io: > >client: 170 B/s rd, 0 op/s rd, 0 op/s wr > >recovery: 168 MiB/s, 158 keys/s, 166 objects/s > > > > I have to disable and re-enable the dashboard just to use it. It seems > to > > get bogged down after a few moments. 
> > > > The three servers that were moved to the new rack Ceph has marked as > > "Down", but if I do a cephadm host-check, they all seem to pass: > > > > ceph > > - cn01.ceph.- > > podman (/usr/bin/podman) version 4.0.2 is present > > systemctl is present > > lvcreate is present > > Unit chronyd.service is enabled and running > > Host looks OK > > - cn02.ceph.- > > podman (/usr/bin/podman) version 4.0.2 is present > > systemctl is present > > lvcreate is present > > Unit chronyd.service is enabled a
[ceph-users] Issues after a shutdown
I transitioned some servers to a new rack and now I'm having major issues with Ceph upon bringing things back up.

I believe the issue may be related to the ceph nodes coming back up with different IPs before VLANs were set. That's just a guess because I can't think of any other reason this would happen.

Current state:

Every 2.0s: ceph -s    cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022

  cluster:
    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
    health: HEALTH_WARN
            1 filesystem is degraded
            2 MDSs report slow metadata IOs
            2/5 mons down, quorum cn02,cn03,cn01
            9 osds down
            3 hosts (17 osds) down
            Reduced data availability: 97 pgs inactive, 9 pgs down
            Degraded data redundancy: 13860144/30824413 objects degraded
            (44.965%), 411 pgs degraded, 482 pgs undersized

  services:
    mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
    mgr: cn02.arszct(active, since 5m)
    mds: 2/2 daemons up, 2 standby
    osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   8 pools, 545 pgs
    objects: 7.71M objects, 6.7 TiB
    usage:   15 TiB used, 39 TiB / 54 TiB avail
    pgs:     0.367% pgs unknown
             17.431% pgs not active
             13860144/30824413 objects degraded (44.965%)
             1137693/30824413 objects misplaced (3.691%)
             280 active+undersized+degraded
             67  undersized+degraded+remapped+backfilling+peered
             57  active+undersized+remapped
             45  active+clean+remapped
             44  active+undersized+degraded+remapped+backfilling
             18  undersized+degraded+peered
             10  active+undersized
             9   down
             7   active+clean
             3   active+undersized+remapped+backfilling
             2   active+undersized+degraded+remapped+backfill_wait
             2   unknown
             1   undersized+peered

  io:
    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 168 MiB/s, 158 keys/s, 166 objects/s

I have to disable and re-enable the dashboard just to use it. It seems to get bogged down after a few moments.
Ceph has marked the three servers that were moved to the new rack as "Down", but if I do a cephadm host-check, they all seem to pass:

ceph
- cn01.ceph -
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
- cn02.ceph -
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
- cn03.ceph -
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
- cn04.ceph -
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
- cn05.ceph -
podman|docker (/usr/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
- cn06.ceph -
podman (/usr/bin/podman) version 4.0.2 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK

It seems to be recovering with what it has left, but a large number of OSDs are down. When trying to restart one of the downed OSDs, I see a huge dump.
Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+ 7fce14a6c080 0 osd.34 30689 done with init, starting boot process
Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+ 7fce14a6c080 1 osd.34 30689 start_boot
Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+ 7fcdfd12d700 1 osd.34 30689 start_boot
Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+ 7fcdfd12d700 1 osd.34 30689 start_boot
Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+ 7fcdfd12d700 1 osd.34 30689 start_boot

At this point it just keeps printing start_boot, but the dashboard has it marked as "in" but "down". On these three hosts that moved, there were a bunch marked as "out" and "down", and some with "in" but "down".

Not sure where to go next. I'm going to let the recovery continue and hope that my 4x replication on these pools saves me. Not sure where to go from here. Any help is very much appreciated. This Ceph cluster holds all of our Cloudstack images... it would be terrible to lose this data.
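An OSD that loops on `start_boot` forever is usually one that cannot reach the monitors or its heartbeat peers on the addresses the cluster expects, which fits the bad-IPs-before-VLANs theory above. A hedged diagnostic sketch (OSD id and commands are illustrative; run from a node with admin keyring access):

```shell
# Compare the addresses the cluster has on record for a booting-but-down OSD
# against where the daemon actually lives now. Not invoked here; it needs a
# live cluster.
osd_boot_checks() {
    local id="$1"
    ceph osd dump | grep "^osd.${id} "          # v1/v2 addresses in the osdmap
    ceph config get "osd.${id}" public_network  # networks the OSD is told to bind to
    ceph config get "osd.${id}" cluster_network
}

# Usage: osd_boot_checks 34
```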
[ceph-users] Network issues with a CephFS client mount via a Cloudstack instance
I’m going to post this to the Cloudstack list as well.

Attempting to rsync a large file to the Ceph volume, the instance becomes unresponsive at the network level. It eventually returns, but it will continually drop offline as the file copies. Dmesg shows this on the Cloudstack host machine:

[ 7144.888744] e1000e :00:19.0 eno1: Detected Hardware Unit Hang:
  TDH <80>
  TDT
  next_to_use
  next_to_clean <7f>
  buffer_info[next_to_clean]:
  time_stamp <100686d46>
  next_to_watch <80>
  jiffies <100687140>
  next_to_watch.status <0>
  MAC Status <80083>
  PHY Status <796d>
  PHY 1000BASE-T Status <3800>
  PHY Extended Status <3000>
  PCI Status <10>
[ 7146.872563] e1000e :00:19.0 eno1: Detected Hardware Unit Hang:
  TDH <80>
  TDT
  next_to_use
  next_to_clean <7f>
  buffer_info[next_to_clean]:
  time_stamp <100686d46>
  next_to_watch <80>
  jiffies <100687900>
  next_to_watch.status <0>
  MAC Status <80083>
  PHY Status <796d>
  PHY 1000BASE-T Status <3800>
  PHY Extended Status <3000>
  PCI Status <10>
[ 7148.856703] e1000e :00:19.0 eno1: Detected Hardware Unit Hang:
  TDH <80>
  TDT
  next_to_use
  next_to_clean <7f>
  buffer_info[next_to_clean]:
  time_stamp <100686d46>
  next_to_watch <80>
  jiffies <1006880c0>
  next_to_watch.status <0>
  MAC Status <80083>
  PHY Status <796d>
  PHY 1000BASE-T Status <3800>
  PHY Extended Status <3000>
  PCI Status <10>
[ 7150.199756] e1000e :00:19.0 eno1: Reset adapter unexpectedly

The host machine:

System Information
  Manufacturer: Dell Inc.
  Product Name: OptiPlex 990

Running CentOS 8.4. I also see the same error on another host of a different hw type:

  Manufacturer: Hewlett-Packard
  Product Name: HP Compaq 8200 Elite SFF PC

but both are using e1000 drivers. I upgraded the kernel to 5.13.x and I thought this fixed the issue, but now I see the error again. Migrating the instance to a bigger server class machine (also e1000e, an old Rackable system) where I have a bigger pipe via bonding, I don't seem to have the issue.

Just curious if this could be a known bug with e1000e and if there is any kind of workaround.
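One workaround often suggested for e1000e "Detected Hardware Unit Hang" resets is to disable segmentation and receive offloads so the NIC stops stalling on large transmits under sustained load. This is an assumption to test, not a confirmed fix for this hardware:

```shell
# Commonly suggested e1000e mitigation (an assumption, not a guaranteed fix):
# turn off TSO/GSO/GRO on the flaky interface, then show the resulting state.
# The setting lasts until reboot; persist it via your distro's network config
# if it helps. Interface name taken from the dmesg output above.
e1000e_offloads_off() {
    local ifc="${1:-eno1}"
    ethtool -K "$ifc" tso off gso off gro off
    ethtool -k "$ifc" | grep -E 'segmentation-offload|generic-receive-offload'
}

# Usage (as root): e1000e_offloads_off eno1
```

The cost is higher CPU use per packet, which is usually an acceptable trade on a desktop-class host like the ones described above.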
Thanks
-jeremy

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Only 2/5 mon services running
It looks like the second mon server was down from my reboot. Restarted and everything is functional again, but I still can't figure out why 3 of the 5 mon servers are down and won't start. If they were all functioning, I probably wouldn't have noticed the cluster being down.

Thanks
-jeremy

> On Jun 7, 2021, at 7:53 PM, Jeremy Hansen wrote:
>
> Signed PGP part
>
> In an attempt to troubleshoot why only 2/5 mon services were running, I
> believe I've broken something:
>
> [ceph: root@cn01 /]# ceph orch ls
> NAME                       PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
> alertmanager                      1/1      81s ago    9d   count:1
> crash                             6/6      7m ago     9d   *
> grafana                           1/1      80s ago    9d   count:1
> mds.testfs                        2/2      81s ago    9d   cn01.ceph.la1.clx.corp;cn02.ceph.la1.clx.corp;cn03.ceph.la1.clx.corp;cn04.ceph.la1.clx.corp;cn05.ceph.la1.clx.corp;cn06.ceph.la1.clx.corp;count:2
> mgr                               2/2      81s ago    9d   count:2
> mon                               2/5      81s ago    9d   count:5
> node-exporter                     6/6      7m ago     9d   *
> osd.all-available-devices         20/26    7m ago     9d   *
> osd.unmanaged                     7/7      7m ago     -
> prometheus                        2/2      80s ago    9d   count:2
>
> I tried to stop and start the mon service, but now the cluster is pretty much
> unresponsive, I'm assuming because I stopped mon:
>
> [ceph: root@cn01 /]# ceph orch stop mon
> Scheduled to stop mon.cn01 on host 'cn01.ceph.la1.clx.corp'
> Scheduled to stop mon.cn02 on host 'cn02.ceph.la1.clx.corp'
> Scheduled to stop mon.cn03 on host 'cn03.ceph.la1.clx.corp'
> Scheduled to stop mon.cn04 on host 'cn04.ceph.la1.clx.corp'
> Scheduled to stop mon.cn05 on host 'cn05.ceph.la1.clx.corp'
> [ceph: root@cn01 /]# ceph orch start mon
>
> ^CCluster connection aborted
>
> Now even after a reboot of the cluster, it's unresponsive. How do I get mon
> started again?
>
> I'm going through Ceph and breaking things left and right, so I apologize for
> all the questions. I learn best from breaking things and figuring out how to
> resolve the issues.
>
> Thank you
> -jeremy
[ceph-users] Only 2/5 mon services running
In an attempt to troubleshoot why only 2/5 mon services were running, I believe I've broken something:

[ceph: root@cn01 /]# ceph orch ls
NAME                       PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager                      1/1      81s ago    9d   count:1
crash                             6/6      7m ago     9d   *
grafana                           1/1      80s ago    9d   count:1
mds.testfs                        2/2      81s ago    9d   cn01.ceph.la1.clx.corp;cn02.ceph.la1.clx.corp;cn03.ceph.la1.clx.corp;cn04.ceph.la1.clx.corp;cn05.ceph.la1.clx.corp;cn06.ceph.la1.clx.corp;count:2
mgr                               2/2      81s ago    9d   count:2
mon                               2/5      81s ago    9d   count:5
node-exporter                     6/6      7m ago     9d   *
osd.all-available-devices         20/26    7m ago     9d   *
osd.unmanaged                     7/7      7m ago     -
prometheus                        2/2      80s ago    9d   count:2

I tried to stop and start the mon service, but now the cluster is pretty much unresponsive, I'm assuming because I stopped mon:

[ceph: root@cn01 /]# ceph orch stop mon
Scheduled to stop mon.cn01 on host 'cn01.ceph.la1.clx.corp'
Scheduled to stop mon.cn02 on host 'cn02.ceph.la1.clx.corp'
Scheduled to stop mon.cn03 on host 'cn03.ceph.la1.clx.corp'
Scheduled to stop mon.cn04 on host 'cn04.ceph.la1.clx.corp'
Scheduled to stop mon.cn05 on host 'cn05.ceph.la1.clx.corp'
[ceph: root@cn01 /]# ceph orch start mon

^CCluster connection aborted

Now even after a reboot of the cluster, it's unresponsive. How do I get mon started again?

I'm going through Ceph and breaking things left and right, so I apologize for all the questions. I learn best from breaking things and figuring out how to resolve the issues.

Thank you
-jeremy
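When `ceph orch` itself hangs because the mons are stopped, the orchestrator can no longer help: it needs a mon quorum to accept commands. The mon containers can still be started directly with systemd on each mon host, since cephadm deploys them as `ceph-<fsid>@mon.<shortname>.service` units. A sketch using the fsid that appears elsewhere in this archive (verify unit names with `systemctl list-units 'ceph-*'` on each host before relying on this):

```shell
# fsid of the cluster in this thread; substitute your own.
FSID="bfa2ad58-c049-11eb-9098-3c8cf8ed728d"

# Build the cephadm systemd unit name for a mon; cephadm registers mons
# by short hostname, so strip the domain.
mon_unit() {
    echo "ceph-${FSID}@mon.${1%%.*}.service"
}

# Not invoked here: start the mon on a given host over ssh (root assumed).
start_mon() {
    ssh "root@$1" systemctl start "$(mon_unit "$1")"
}

# Usage: start_mon cn01.ceph.la1.clx.corp
```

Once enough mons are up to form quorum, `ceph orch` commands become usable again.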
[ceph-users] Re: Global Recovery Event
This seems to have recovered on its own.

Thank you
-jeremy

> On Jun 7, 2021, at 5:44 PM, Neha Ojha wrote:
>
> On Mon, Jun 7, 2021 at 5:24 PM Jeremy Hansen <jer...@skidrow.la> wrote:
>>
>> I'm seeing this in my health status:
>>
>> progress:
>>   Global Recovery Event (13h)
>>     [] (remaining: 5w)
>>
>> I'm not sure how this was initiated but this is a cluster with almost zero
>> objects. Is there a way to halt this process? Why would it estimate 5
>> weeks to recover a cluster with almost zero data?
>
> You could be running into https://tracker.ceph.com/issues/49988. You
> can try to run "ceph progress clear" and see if that helps or just
> turn the progress module off and turn it back on.
>
> - Neha
>
>> [ceph: root@cn01 /]# ceph -s -w
>>   cluster:
>>     id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>>     health: HEALTH_OK
>>
>>   services:
>>     mon: 2 daemons, quorum cn02,cn05 (age 13h)
>>     mgr: cn01.ceph.la1.clx.corp.xnkoft(active, since 13h), standbys: cn02.arszct
>>     mds: 1/1 daemons up, 1 standby
>>     osd: 27 osds: 27 up (since 13h), 27 in (since 16h)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   3 pools, 65 pgs
>>     objects: 22.09k objects, 86 GiB
>>     usage:   261 GiB used, 98 TiB / 98 TiB avail
>>     pgs:     65 active+clean
>>
>>   progress:
>>     Global Recovery Event (13h)
>>       [] (remaining: 5w)
>>
>> Thanks
>> -jeremy
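The workaround quoted in the reply above (for the stale progress event tracked in issue 49988) can be sketched as a single step; this is just the suggestion from the thread wrapped up, not a general fix for all progress-bar oddities:

```shell
# Clear a stale "Global Recovery Event"; if clearing is not enough,
# bounce the mgr progress module as suggested in the thread.
# Not invoked here; it needs a live cluster.
reset_progress() {
    ceph progress clear || {
        ceph mgr module disable progress
        ceph mgr module enable progress
    }
}

# Usage: reset_progress
```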
[ceph-users] Global Recovery Event
I’m seeing this in my health status:

  progress:
    Global Recovery Event (13h)
      [] (remaining: 5w)

I’m not sure how this was initiated but this is a cluster with almost zero objects. Is there a way to halt this process? Why would it estimate 5 weeks to recover a cluster with almost zero data?

[ceph: root@cn01 /]# ceph -s -w
  cluster:
    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
    health: HEALTH_OK

  services:
    mon: 2 daemons, quorum cn02,cn05 (age 13h)
    mgr: cn01.ceph.la1.clx.corp.xnkoft(active, since 13h), standbys: cn02.arszct
    mds: 1/1 daemons up, 1 standby
    osd: 27 osds: 27 up (since 13h), 27 in (since 16h)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 22.09k objects, 86 GiB
    usage:   261 GiB used, 98 TiB / 98 TiB avail
    pgs:     65 active+clean

  progress:
    Global Recovery Event (13h)
      [] (remaining: 5w)

Thanks
-jeremy
[ceph-users] Re: CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
cephadm rm-daemon --name osd.29 on the node with the stale daemon did the trick.
-jeremy

> On Jun 7, 2021, at 2:24 AM, Jeremy Hansen wrote:
>
> Signed PGP part
> So I found the failed daemon:
>
> [root@cn05 ~]# systemctl | grep 29
> ● ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.29.service
>       loaded failed failed    Ceph osd.29 for bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>
> But I’ve already replaced this OSD, so this is perhaps left over from a
> previous osd.29 on this host. How would I go about removing this cleanly and,
> more importantly, in a way that Ceph is aware of the change, therefore
> clearing the warning.
>
> Thanks
> -jeremy
>
>> On Jun 7, 2021, at 1:54 AM, Jeremy Hansen wrote:
>>
>> Signed PGP part
>> Thank you. So I see this:
>>
>> 2021-06-07T08:41:24.133493+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1494 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
>> 2021-06-07T08:44:37.650022+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1592 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
>> 2021-06-07T08:47:07.039405+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1667 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
>> 2021-06-07T08:51:00.094847+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1785 : cephadm [INF] Reconfiguring osd.29 (monmap changed)…
>>
>> Yet…
>>
>> ceph osd ls
>> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 31 33 34
>>
>> So how would I approach fixing this?
>>
>>> On Jun 7, 2021, at 1:10 AM, 赵贺东 wrote:
>>>
>>> Hello Jeremy Hansen,
>>>
>>> try:
>>> ceph log last cephadm
>>>
>>> or see the files below
>>> /var/log/ceph/cephadm.log
>>>
>>>> On Jun 7, 2021, at 15:49, Jeremy Hansen wrote:
>>>>
>>>> What’s the proper way to track down where this error is coming from? Thanks.
>>>>
>>>> 6/7/21 12:40:00 AM [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
>>>> 6/7/21 12:40:00 AM [WRN] Health detail: HEALTH_WARN 1 failed cephadm daemon(s)
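For anyone hitting the same leftover-daemon state, the cleanup above as a sketch (run on the host carrying the stale systemd unit; the fsid shown is this cluster's — substitute your own from "ceph fsid"):

```shell
# Show what cephadm thinks is deployed on this host; the stale
# daemon appears with state "error"
cephadm ls

# Remove the stale daemon record and its systemd unit
cephadm rm-daemon --name osd.29 --fsid bfa2ad58-c049-11eb-9098-3c8cf8ed728d

# Back on an admin node, the CEPHADM_FAILED_DAEMON warning should clear shortly
ceph health detail
```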
[ceph-users] Re: CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
So I found the failed daemon:

[root@cn05 ~]# systemctl | grep 29
● ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.29.service
      loaded failed failed    Ceph osd.29 for bfa2ad58-c049-11eb-9098-3c8cf8ed728d

But I’ve already replaced this OSD, so this is perhaps left over from a previous
osd.29 on this host. How would I go about removing this cleanly and, more
importantly, in a way that Ceph is aware of the change, therefore clearing the
warning.

Thanks
-jeremy

> On Jun 7, 2021, at 1:54 AM, Jeremy Hansen wrote:
>
> Signed PGP part
> Thank you. So I see this:
>
> 2021-06-07T08:41:24.133493+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1494 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
> 2021-06-07T08:44:37.650022+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1592 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
> 2021-06-07T08:47:07.039405+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1667 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
> 2021-06-07T08:51:00.094847+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1785 : cephadm [INF] Reconfiguring osd.29 (monmap changed)…
>
> Yet…
>
> ceph osd ls
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 31 33 34
>
> So how would I approach fixing this?
>
>> On Jun 7, 2021, at 1:10 AM, 赵贺东 wrote:
>>
>> Hello Jeremy Hansen,
>>
>> try:
>> ceph log last cephadm
>>
>> or see the files below
>> /var/log/ceph/cephadm.log
>>
>>> On Jun 7, 2021, at 15:49, Jeremy Hansen wrote:
>>>
>>> What’s the proper way to track down where this error is coming from? Thanks.
>>>
>>> 6/7/21 12:40:00 AM [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
>>> 6/7/21 12:40:00 AM [WRN] Health detail: HEALTH_WARN 1 failed cephadm daemon(s)
[ceph-users] Re: CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
Thank you. So I see this:

2021-06-07T08:41:24.133493+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1494 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
2021-06-07T08:44:37.650022+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1592 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
2021-06-07T08:47:07.039405+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1667 : cephadm [INF] Reconfiguring osd.29 (monmap changed)...
2021-06-07T08:51:00.094847+ mgr.cn01.ceph.la1.clx.corp.xnkoft (mgr.224161) 1785 : cephadm [INF] Reconfiguring osd.29 (monmap changed)…

Yet…

ceph osd ls
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 31 33 34

So how would I approach fixing this?

> On Jun 7, 2021, at 1:10 AM, 赵贺东 wrote:
>
> Hello Jeremy Hansen,
>
> try:
> ceph log last cephadm
>
> or see the files below
> /var/log/ceph/cephadm.log
>
>> On Jun 7, 2021, at 15:49, Jeremy Hansen wrote:
>>
>> What’s the proper way to track down where this error is coming from? Thanks.
>>
>> 6/7/21 12:40:00 AM [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
>> 6/7/21 12:40:00 AM [WRN] Health detail: HEALTH_WARN 1 failed cephadm daemon(s)
[ceph-users] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
What’s the proper way to track down where this error is coming from? Thanks.

6/7/21 12:40:00 AM [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
6/7/21 12:40:00 AM [WRN] Health detail: HEALTH_WARN 1 failed cephadm daemon(s)
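A sketch of the usual triage for this warning, pulled together from the replies in this thread (all standard commands):

```shell
# Which daemon is failing, and on which host?
ceph health detail
ceph orch ps            # failed daemons show up with status "error"

# Recent cephadm activity as logged by the mgr
ceph log last cephadm

# On the affected host, cephadm's local log
tail -n 100 /var/log/ceph/cephadm.log
```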
[ceph-users] HEALTH_WARN Reduced data availability: 33 pgs inactive
I’m trying to understand this situation:

ceph health detail
HEALTH_WARN Reduced data availability: 33 pgs inactive
[WRN] PG_AVAILABILITY: Reduced data availability: 33 pgs inactive
    pg 1.0 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.0 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.2 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.3 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.4 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.5 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.6 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.7 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.8 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.9 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.a is stuck inactive for 20h, current state unknown, last acting []
    pg 2.b is stuck inactive for 20h, current state unknown, last acting []
    pg 2.c is stuck inactive for 20h, current state unknown, last acting []
    pg 2.d is stuck inactive for 20h, current state unknown, last acting []
    pg 2.e is stuck inactive for 20h, current state unknown, last acting []
    pg 2.f is stuck inactive for 20h, current state unknown, last acting []
    pg 2.10 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.11 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.12 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.13 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.14 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.15 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.16 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.17 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.18 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.19 is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1a is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1b is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1c is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1d is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1e is stuck inactive for 20h, current state unknown, last acting []
    pg 2.1f is stuck inactive for 20h, current state unknown, last acting []

[ceph: root@cn01 /]# date
Sat May 29 01:28:37 UTC 2021

[ceph: root@cn01 /]# ceph pg dump_stuck inactive
PG_STAT  STATE    UP  UP_PRIMARY  ACTING  ACTING_PRIMARY
2.1f     unknown  []          -1      []              -1
2.1e     unknown  []          -1      []              -1
2.1d     unknown  []          -1      []              -1
2.1c     unknown  []          -1      []              -1
2.1b     unknown  []          -1      []              -1
2.1a     unknown  []          -1      []              -1
2.19     unknown  []          -1      []              -1
2.18     unknown  []          -1      []              -1
2.17     unknown  []          -1      []              -1
2.16     unknown  []          -1      []              -1
2.15     unknown  []          -1      []              -1
2.14     unknown  []          -1      []              -1
2.13     unknown  []          -1      []              -1
2.12     unknown  []          -1      []              -1
2.11     unknown  []          -1      []              -1
2.10     unknown  []          -1      []              -1
2.f      unknown  []          -1      []              -1
2.9      unknown  []          -1      []              -1
2.b      unknown  []          -1      []              -1
2.c      unknown  []          -1      []              -1
2.e      unknown  []          -1      []              -1
2.a      unknown  []          -1      []              -1
2.d      unknown  []          -1      []              -1
2.8      unknown  []          -1      []              -1
2.7      unknown  []          -1      []              -1
2.6      unknown  []          -1      []              -1
2.5      unknown  []          -1      []              -1
2.0      unknown  []          -1      []              -1
1.0      unknown  []          -1      []              -1
2.3      unknown  []          -1      []              -1
2.1      unknown  []          -1      []              -1
2.2      unknown  []          -1      []              -1
2.4      unknown  []          -1      []              -1
ok

[ceph: root@cn01 /]# ceph pg 2.4 query
Couldn't parse JSON : Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1310, in
    retval = main()
  File "/usr/bin/ceph", line 1230, in main
    si
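PGs stuck "unknown" with an empty acting set mean no OSD is currently reporting them, which usually points at CRUSH being unable to place the PG (or the OSDs not being up/registered). A sketch of how I'd narrow it down, using only standard commands:

```shell
# Can CRUSH map the PG to any OSDs at all?
ceph pg map 2.4

# Do the pool's size and crush_rule make sense for the topology?
ceph osd pool ls detail
ceph osd crush rule dump

# Are the OSDs actually up and placed in the CRUSH tree?
ceph osd tree
```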
[ceph-users] Re: Remapping OSDs under a PG
:15.122042+
PG    OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP             ACTING         SCRUB_STAMP                  DEEP_SCRUB_STAMP
2.10  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [10,17,7]p10   [10,17,7]p10   2021-05-28T00:46:38.770867+  2021-05-28T00:46:15.122042+
2.11  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [34,20,1]p34   [34,20,1]p34   2021-05-28T00:46:39.572906+  2021-05-28T00:46:15.122042+
2.12  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:56  [5,24,26]p5    [5,24,26]p5    2021-05-28T00:46:38.802818+  2021-05-28T00:46:15.122042+
2.13  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [21,35,3]p21   [21,35,3]p21   2021-05-28T00:46:39.517117+  2021-05-28T00:46:15.122042+
2.14  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [18,9,7]p18    [18,9,7]p18    2021-05-28T00:46:38.078800+  2021-05-28T00:46:15.122042+
2.15  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [7,14,34]p7    [7,14,34]p7    2021-05-28T00:46:38.748425+  2021-05-28T00:46:15.122042+
2.16  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [0,23,7]p0     [0,23,7]p0     2021-05-28T00:46:42.000503+  2021-05-28T00:46:15.122042+
2.17  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [21,5,11]p21   [21,5,11]p21   2021-05-28T00:46:46.515686+  2021-05-28T00:46:15.122042+
2.18  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [18,9,33]p18   [18,9,33]p18   2021-05-28T00:46:40.104875+  2021-05-28T00:46:15.122042+
2.19  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [13,23,4]p13   [13,23,4]p13   2021-05-28T00:46:38.739980+  2021-05-28T00:46:35.469823+
2.1a  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [3,23,28]p3    [3,23,28]p3    2021-05-28T00:46:41.549389+  2021-05-28T00:46:15.122042+
2.1b  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:56  [5,28,23]p5    [5,28,23]p5    2021-05-28T00:46:40.824368+  2021-05-28T00:46:15.122042+
2.1c  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [33,29,31]p33  [33,29,31]p33  2021-05-28T00:46:38.106675+  2021-05-28T00:46:15.122042+
2.1d  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [10,33,28]p10  [10,33,28]p10  2021-05-28T00:46:39.785338+  2021-05-28T00:46:15.122042+
2.1e  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [3,21,13]p3    [3,21,13]p3    2021-05-28T00:46:40.584803+  2021-05-28T00:46:40.584803+
2.1f  0  0  0  0  0  0  0  0  active+clean  21h  0'0  254:42  [22,7,34]p22   [22,7,34]p22   2021-05-28T00:46:38.061932+  2021-05-28T00:46:15.122042+

PG 1.0, which has all the objects, is still using osd.28, which is an SSD drive.

ceph pg ls-by-pool device_health_metrics
PG   OBJECTS  DEGRADED  MISPLACED  UNFOUND  BYTES  OMAP_BYTES*  OMAP_KEYS*  LOG  STATE         SINCE  VERSION  REPORTED  UP             ACTING         SCRUB_STAMP                  DEEP_SCRUB_STAMP
1.0  41  0  0  0  0  0  0  71  active+clean  22h  205'71  253:484  [28,33,10]p28  [28,33,10]p28  2021-05-27T14:44:37.466384+  2021-05-26T04:23:11.758060+

Also, I attempted to add my “crush location”, and I believe I’m missing something
fundamental. It claims no change, but that doesn’t make sense, because I haven’t
previously specified this information:

ceph osd crush set osd.24 3.63869 root=default datacenter=la1 rack=rack1 host=cn06 room=room1 row=6
set item id 24 name 'osd.24' weight 3.63869 at location {datacenter=la1,host=cn06,rack=rack1,room=room1,root=default,row=6}: no change

My end goal is to create a crush map that is aware of two separate racks with
independent UPS power, to increase our availability in the event of power going
out on one of our racks.

Thank you
-jeremy

> On May 28, 2021, at 5:01 AM, Jeremy Hansen wrote:
>
> I’m continuing to read and it’s becoming more clear.
>
> The CRUSH map seems pretty amazing!
>
> -jeremy
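For the rack-aware end goal, a sketch of the usual sequence (bucket names "rack1"/"rack2" are illustrative; also note that "ceph osd crush set" appears to report "no change" when the OSD is already at an equivalent weight and location, so restructuring is normally done by moving buckets with "crush move" instead):

```shell
# Create rack buckets under the default root
ceph osd crush add-bucket rack1 rack
ceph osd crush add-bucket rack2 rack
ceph osd crush move rack1 root=default
ceph osd crush move rack2 root=default

# Move whole hosts (their OSDs follow) under the racks
ceph osd crush move cn06 rack=rack1

# Replicated rule with rack as the failure domain
ceph osd crush rule create-replicated replicated_racks default rack

# Point a pool at the new rule
ceph osd pool set device_health_metrics crush_rule replicated_racks
```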
[ceph-users] Re: Remapping OSDs under a PG
I’m continuing to read and it’s becoming more clear.

The CRUSH map seems pretty amazing!

-jeremy

> On May 28, 2021, at 1:10 AM, Jeremy Hansen wrote:
>
> Thank you both for your response. So this leads me to the next question:
>
> ceph osd crush rule create-replicated <name> <root> <type> [<class>]
>
> What is <root> and <type> in this case?
>
> It also looks like this is responsible for things like “rack awareness” type
> attributes, which is something I’d like to utilize:
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 zone
> type 10 region
> type 11 root
>
> This is something I will eventually take advantage of as well.
>
> Thank you!
> -jeremy
>
>> On May 28, 2021, at 12:03 AM, Janne Johansson wrote:
>>
>> Create a crush rule that only chooses non-ssd drives, then
>>   ceph osd pool set <pool> crush_rule YourNewRuleName
>> and it will move over to the non-ssd OSDs.
>>
>> Den fre 28 maj 2021 kl 02:18 skrev Jeremy Hansen:
>>>
>>> I’m very new to Ceph, so if this question makes no sense, I apologize.
>>> Continuing to study, but I thought an answer to this question would help me
>>> understand Ceph a bit more.
>>>
>>> Using cephadm, I set up a cluster. Cephadm automatically creates a pool
>>> for Ceph metrics. It looks like one of my SSD OSDs was allocated for the
>>> PG. I’d like to understand how to remap this PG so it’s not using the SSD
>>> OSDs.
>>>
>>> ceph pg map 1.0
>>> osdmap e205 pg 1.0 (1.0) -> up [28,33,10] acting [28,33,10]
>>>
>>> OSD 28 is the SSD.
>>>
>>> Is this possible? Does this make any sense? I’d like to reserve the SSDs
>>> for their own pool.
>>>
>>> Thank you!
>>> -jeremy
>>
>> --
>> May the most significant bit of your life be positive.
[ceph-users] Re: Remapping OSDs under a PG
Thank you both for your response. So this leads me to the next question:

ceph osd crush rule create-replicated <name> <root> <type> [<class>]

What is <root> and <type> in this case?

It also looks like this is responsible for things like “rack awareness” type
attributes, which is something I’d like to utilize:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root

This is something I will eventually take advantage of as well.

Thank you!
-jeremy

> On May 28, 2021, at 12:03 AM, Janne Johansson wrote:
>
> Create a crush rule that only chooses non-ssd drives, then
>   ceph osd pool set <pool> crush_rule YourNewRuleName
> and it will move over to the non-ssd OSDs.
>
> Den fre 28 maj 2021 kl 02:18 skrev Jeremy Hansen:
>>
>> I’m very new to Ceph, so if this question makes no sense, I apologize.
>> Continuing to study, but I thought an answer to this question would help me
>> understand Ceph a bit more.
>>
>> Using cephadm, I set up a cluster. Cephadm automatically creates a pool for
>> Ceph metrics. It looks like one of my SSD OSDs was allocated for the PG.
>> I’d like to understand how to remap this PG so it’s not using the SSD OSDs.
>>
>> ceph pg map 1.0
>> osdmap e205 pg 1.0 (1.0) -> up [28,33,10] acting [28,33,10]
>>
>> OSD 28 is the SSD.
>>
>> Is this possible? Does this make any sense? I’d like to reserve the SSDs
>> for their own pool.
>>
>> Thank you!
>> -jeremy
>
> --
> May the most significant bit of your life be positive.
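In the create-replicated signature, <root> is the CRUSH subtree to draw OSDs from (usually "default") and <type> is the failure-domain bucket type (host, rack, ...). A sketch; the rule name "only_hdd" is illustrative, and the default rule in a cephadm cluster is typically named "replicated_rule":

```shell
# List and inspect the stock rule for reference
ceph osd crush rule ls
ceph osd crush rule dump replicated_rule

# New rule: draw from root "default", spread across hosts,
# restricted to the "hdd" device class (the class argument is optional)
ceph osd crush rule create-replicated only_hdd default host hdd
```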
[ceph-users] Remapping OSDs under a PG
I’m very new to Ceph, so if this question makes no sense, I apologize.
Continuing to study, but I thought an answer to this question would help me
understand Ceph a bit more.

Using cephadm, I set up a cluster. Cephadm automatically creates a pool for
Ceph metrics. It looks like one of my SSD OSDs was allocated for the PG. I’d
like to understand how to remap this PG so it’s not using the SSD OSDs.

ceph pg map 1.0
osdmap e205 pg 1.0 (1.0) -> up [28,33,10] acting [28,33,10]

OSD 28 is the SSD.

Is this possible? Does this make any sense? I’d like to reserve the SSDs for
their own pool.

Thank you!
-jeremy
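The approach the replies in this thread converge on, sketched end-to-end (the rule name is illustrative; "device_health_metrics" is the cephadm-created metrics pool discussed above):

```shell
# Device class per OSD shows in the CLASS column (hdd/ssd)
ceph osd tree

# Rule restricted to hdd-class OSDs, failure domain = host
ceph osd crush rule create-replicated replicated_hdd default host hdd

# Repoint the metrics pool; pg 1.0 should remap away from the SSD (osd.28)
ceph osd pool set device_health_metrics crush_rule replicated_hdd
ceph pg map 1.0
```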