[ceph-users] Re: Ceph Cluster Taking An Awful Long Time To Rebalance

2021-03-16 Thread duluxoz

Yeap - that was the issue: an incorrect CRUSH rule

Thanks for the help

Dulux-Oz

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Erasure-coded Block Device Image Creation With qemu-img - Help

2021-03-16 Thread duluxoz

Hi Guys,

So, new issue (I'm gonna get the hang of this if it kills me :-) ).

I have a working/healthy Ceph (Octopus) Cluster (with qemu-img, libvirt, 
etc., installed), and an erasure-coded pool called "my_pool". I now need 
to create a "my_data" image within the "my_pool" pool. As this is for a 
KVM host / block device (hence qemu-img et al.) I'm attempting to 
use qemu-img, so the command I am using is:


```

qemu-img create -f rbd rbd:my_pool/my_data 1T

```

The error message I received was:

```

qemu-img: rbd:my_pool/my_data: error rbd create: Operation not supported

```

So, I tried the 'raw' rbd command:

```

rbd create -s 1T my_pool/my_data

```

and got the error:

```

_add_image_to_directory: error adding image to directory: (95) Operation 
not supported

rbd: create error: (95) Operation not supported

```

So I don't believe the issue is with the 'qemu-img' command - but I may 
be wrong.


After doing some research I *think* I need to specify a replicated (as 
opposed to erasure-coded) pool for my_pool's metadata (e.g. 
'my_pool_metadata'), and thus use the command:


```

rbd create -s 1T --data-pool my_pool my_pool_metadata/my_data

```

First Question: Is this correct?

Second Question: What is the qemu-img equivalent command - is it:

```

qemu-img create -f rbd rbd:--data-pool my_pool my_pool_metadata/my_data 1T

```

or something similar?

Thanks in advance

Dulux-Oz
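
For the record, the errors above are what one typically sees when trying to 
create an RBD image directly in an erasure-coded pool. A minimal sketch of 
the layout being asked about, assuming the EC pool has overwrites enabled 
and a replicated pool (hypothetically named 'my_pool_metadata') holds the 
image metadata:

```
# EC pools must allow overwrites before they can hold RBD data objects
ceph osd pool set my_pool allow_ec_overwrites true

# the image (metadata/omap) lives in the replicated pool, data objects go to the EC pool
rbd create --size 1T --data-pool my_pool my_pool_metadata/my_data

# qemu-img may be able to reach the same layout by passing librbd's
# rbd_default_data_pool option through the rbd: URI (assumption: option
# pass-through works on your qemu build)
qemu-img create -f rbd rbd:my_pool_metadata/my_data:rbd_default_data_pool=my_pool 1T
```

If the qemu-img option pass-through does not work, creating the image with 
rbd first and only consuming it via qemu/libvirt afterwards achieves the 
same result.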
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Diskless boot for Ceph nodes

2021-03-16 Thread Stefan Kooman

On 3/16/21 6:37 PM, Stephen Smith6 wrote:
Hey folks - thought I'd check and see if anyone has ever tried to use 
ephemeral (tmpfs / ramfs based) boot disks for Ceph nodes?


croit.io does that quite successfully, I believe [1].

Gr. Stefan

[1]: https://www.croit.io/software/features
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Tony Liu
"but you may see significant performance improvement with a
second "cluster" network in a large cluster."

"does not usually have a significant impact on overall performance."

The above two statements look conflicting to me and cause confusion.

What's the purpose of the "cluster" network: simply increasing total
bandwidth, or some kind of isolation?

For example:
1 network on 1 bond with 2 x 40Gb ports
vs.
2 networks on 2 bonds, each with 2 x 20Gb ports

They have the same total bandwidth of 80Gb, so they will support
the same performance, right?


Thanks!
Tony
> -Original Message-
> From: Andrew Walker-Brown 
> Sent: Tuesday, March 16, 2021 9:18 AM
> To: Tony Liu ; Stefan Kooman ;
> Dave Hall ; ceph-users 
> Subject: RE: [ceph-users] Re: Networking Idea/Question
> 
> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
> 
> 
> 
> Sent from Mail for Windows 10
> 
> 
> 
> From: Tony Liu 
> Sent: 16 March 2021 16:16
> To: Stefan Kooman  ; Dave Hall
>  ; ceph-users 
> Subject: [ceph-users] Re: Networking Idea/Question
> 
> 
> 
> > -Original Message-
> > From: Stefan Kooman 
> > Sent: Tuesday, March 16, 2021 4:10 AM
> > To: Dave Hall ; ceph-users 
> > Subject: [ceph-users] Re: Networking Idea/Question
> >
> > On 3/15/21 5:34 PM, Dave Hall wrote:
> > > Hello,
> > >
> > > If anybody out there has tried this or thought about it, I'd like to
> > > know...
> > >
> > > I've been thinking about ways to squeeze as much performance as
> > > possible from the NICs  on a Ceph OSD node.  The nodes in our
> cluster
> > > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > > Currently, one port is assigned to the front-side network, and one
> to
> > > the back-side network.  However, there are times when the traffic on
> > > one side or the other is more intense and might benefit from a bit
> > more bandwidth.
> >
> > What is (are) the reason(s) to choose a separate cluster and public
> > network?
> 
> That used to be the recommendation: separate client traffic from
> cluster traffic. I heard that's no longer recommended in the latest docs.
> It would be good if someone could point to a link for the current
> recommendation.
> 
> 
> Thanks!
> Tony
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Diskless boot for Ceph nodes

2021-03-16 Thread Nico Schottelius

On 2021-03-16 22:06, Stefan Kooman wrote:

On 3/16/21 6:37 PM, Stephen Smith6 wrote:
Hey folks - thought I'd check and see if anyone has ever tried to use 
ephemeral (tmpfs / ramfs based) boot disks for Ceph nodes?


croit.io does that quite successfully, I believe [1].


Same here at ungleich, all our ceph nodes with the exception of the 
monitors are PXE booted. And as a matter of fact, 90% of them even 
netboot via IPv6 only using ipxe.


From an operational point of view I believe there is not much difference 
to a regular setup, besides you'll need to have a trigger in place to 
configure the servers when rebooting, as the boot image itself is rather 
empty in our case.


Best regards,

Nico

--
Renewable hosting with full IPv6 support. Checkout 
www.datacenterlight.ch for details.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about migrating from iSCSI to RBD

2021-03-16 Thread Richard Bade
Hi Justin,
I did some testing with iscsi a year or so ago. It was just using
standard rbd images in the backend so yes I think your theory of
stopping iscsi to release the locks and then providing access to the
rbd image would work.

Rich
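
In case it helps, a rough sketch of those steps (pool and image names are 
taken from the output quoted below; the client name is made up):

```
# after stopping/disabling tcmu-runner (and rbd-target-api/rbd-target-gw) on all
# gateways, confirm the exclusive lock is gone, or remove a stale one
rbd lock ls iscsi/pool-name
rbd lock rm iscsi/pool-name "auto 259361792" client.3618592   # only if a stale lock remains

# a client keyring limited to that pool, then map the image natively
ceph auth get-or-create client.kvmhost mon 'profile rbd' osd 'profile rbd pool=iscsi'
rbd map iscsi/pool-name --id kvmhost
```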

On Wed, 17 Mar 2021 at 09:53, Justin Goetz  wrote:
>
> Hello!
>
> I was hoping to inquire if anyone here has attempted similar operations,
> and if they ran into any issues. To give a brief overview of my
> situation, I have a standard octopus cluster running 15.2.2, with
> ceph-iscsi installed via ansible. The original scope of a project we
> were working on changed, and we no longer need the iSCSI overhead added
> to the project (the machine using CEPH is Linux, so we would like to use
> native RBD block devices instead).
>
> Ideally we would create some new pools and migrate the data from the
> iSCSI pools over to the new pools, however, due to the massive amount of
> data (close to 200 TB), we lack the physical resources necessary to copy
> the files.
>
> Digging a bit on the backend of the pools utilized by ceph-iscsi, it
> appears that the iSCSI utility uses standard RBD images on the actual
> backend:
>
> ~]# rbd info iscsi/pool-name
> rbd image 'pool-name':
>  size 200 TiB in 52428800 objects
>  order 22 (4 MiB objects)
>  snapshot_count: 0
>  id: 137b45a37ad84a
>  block_name_prefix: rbd_data.137b45a37ad84a
>  format: 2
>  features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>  op_features:
>  flags: object map invalid, fast diff invalid
>  create_timestamp: Thu Nov 12 16:14:31 2020
>  access_timestamp: Tue Mar 16 16:13:41 2021
>  modify_timestamp: Tue Mar 16 16:15:36 2021
>
> And I can also see that, like a standard rbd image, our 1st iSCSI
> gateway currently holds the lock on the image:
>
> ]# rbd lock ls --pool iscsi pool-name
> There is 1 exclusive lock on this image.
> Locker  ID  Address
> client.3618592  auto 259361792  10.101.12.61:0/1613659642
>
> Theoretically speaking, would I be able to simply stop & disable the
> tcmu-runner processes on all iSCSI gateways in our cluster, which would
> release the lock on the RBD image, then create another user with rwx
> permissions to the iscsi pool? Would this work, or am I missing
> something that would come back to bite me later on?
>
> Looking for any advice on this topic. Thanks in advance for reading!
>
> --
>
> Justin Goetz
> Systems Engineer, TeraSwitch Inc.
> jgo...@teraswitch.com
> 412-945-7045 (NOC) | 412-459-7945 (Direct)
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Question about migrating from iSCSI to RBD

2021-03-16 Thread Justin Goetz

Hello!

I was hoping to inquire if anyone here has attempted similar operations, 
and if they ran into any issues. To give a brief overview of my 
situation, I have a standard octopus cluster running 15.2.2, with 
ceph-iscsi installed via ansible. The original scope of a project we 
were working on changed, and we no longer need the iSCSI overhead added 
to the project (the machine using CEPH is Linux, so we would like to use 
native RBD block devices instead).


Ideally we would create some new pools and migrate the data from the 
iSCSI pools over to the new pools, however, due to the massive amount of 
data (close to 200 TB), we lack the physical resources necessary to copy 
the files.


Digging a bit on the backend of the pools utilized by ceph-iscsi, it 
appears that the iSCSI utility uses standard RBD images on the actual 
backend:


~]# rbd info iscsi/pool-name
rbd image 'pool-name':
    size 200 TiB in 52428800 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 137b45a37ad84a
    block_name_prefix: rbd_data.137b45a37ad84a
    format: 2
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    op_features:
    flags: object map invalid, fast diff invalid
    create_timestamp: Thu Nov 12 16:14:31 2020
    access_timestamp: Tue Mar 16 16:13:41 2021
    modify_timestamp: Tue Mar 16 16:15:36 2021

And I can also see that, like a standard rbd image, our 1st iSCSI 
gateway currently holds the lock on the image:


]# rbd lock ls --pool iscsi pool-name
There is 1 exclusive lock on this image.
Locker  ID  Address
client.3618592  auto 259361792  10.101.12.61:0/1613659642

Theoretically speaking, would I be able to simply stop & disable the 
tcmu-runner processes on all iSCSI gateways in our cluster, which would 
release the lock on the RBD image, then create another user with rwx 
permissions to the iscsi pool? Would this work, or am I missing 
something that would come back to bite me later on?


Looking for any advice on this topic. Thanks in advance for reading!

--

Justin Goetz
Systems Engineer, TeraSwitch Inc.
jgo...@teraswitch.com
412-945-7045 (NOC) | 412-459-7945 (Direct)

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_max_backfills = 1 for one OSD

2021-03-16 Thread Frank Schilder
ceph config rm ...
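
For the osd.1 override from the message below, that would presumably be:

```
ceph config rm osd.1 osd_max_backfills

# verify the override is gone and the osd-level value (20) applies again
ceph config dump | grep osd_max_backfills
ceph config get osd.1 osd_max_backfills
```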

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Dave Hall 
Sent: 16 March 2021 16:41:47
To: ceph-users
Subject: [ceph-users] osd_max_backfills = 1 for one OSD

Hello,

I've been trying to get an OSD ready for removal for about a week.  Right
now I have 2 pgs backfilling and 21 in backfill_wait.  Looking around I
found reference to the following command (followed by my output).

# ceph config dump | egrep
"osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
global advanced osd_max_backfills   10

  osd  advanced osd_max_backfills   20

  osd  advanced osd_recovery_max_active 20

osd.1  advanced osd_max_backfills   1


I'm not sure why osd.1 is special.  I'd like to know how to get rid of
this.  If I increase this value for osd.1 it will likely speed up my
backfills, but the special setting for osd.1 will persist.  There must be a
way to unset this so the general setting is applied, right?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osd_max_backfills = 1 for one OSD

2021-03-16 Thread Dave Hall
Hello,

I've been trying to get an OSD ready for removal for about a week.  Right
now I have 2 pgs backfilling and 21 in backfill_wait.  Looking around I
found reference to the following command (followed by my output).

# ceph config dump | egrep
"osd_recovery_max_active|osd_recovery_op_priority|osd_max_backfills"
global advanced osd_max_backfills   10

  osd  advanced osd_max_backfills   20

  osd  advanced osd_recovery_max_active 20

osd.1  advanced osd_max_backfills   1


I'm not sure why osd.1 is special.  I'd like to know how to get rid of
this.  If I increase this value for osd.1 it will likely speed up my
backfills, but the special setting for osd.1 will persist.  There must be a
way to unset this so the general setting is applied, right?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: *****SPAM***** Diskless boot for Ceph nodes

2021-03-16 Thread Marc
croit.io(?) have some solution for booting from the network. They are very 
active here, you will undoubtedly hear from them ;)

> -Original Message-
> Sent: 16 March 2021 18:37
> To: ceph-users@ceph.io
> Subject: *SPAM* [ceph-users] Diskless boot for Ceph nodes
> 
> Hey folks - thought I'd check and see if anyone has ever tried to use
> ephemeral (tmpfs / ramfs based) boot disks for Ceph nodes?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Diskless boot for Ceph nodes

2021-03-16 Thread Stephen Smith6
Hey folks - thought I'd check and see if anyone has ever tried to use ephemeral (tmpfs / ramfs based) boot disks for Ceph nodes?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inactive pg, how to make it active / or delete

2021-03-16 Thread Szabo, Istvan (Agoda)
Can’t bring them back; now trying to make the cluster hosts equal.

This signal error is not going away when the OSDs start.

I think I’ll try the ceph-objectstore-tool PG export/import on the dead OSDs 
and put them back onto another one. Let’s see.
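
For the archives, a rough sketch of that export/import (OSD IDs and the file 
path are placeholders; both the source and destination OSD daemons must be 
stopped while the tool runs):

```
# on the host with the dead OSD (daemon stopped, data still readable)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dead-id> \
    --pgid 44.1f0s0 --op export --file /tmp/44.1f0s0.export

# on a healthy OSD (daemon stopped), import the PG, then restart the OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<target-id> \
    --op import --file /tmp/44.1f0s0.export
```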

> On 2021. Mar 16., at 18:54, Frank Schilder  wrote:
> 
> The PG says blocked_by at least 2 of your down-OSDs. When you look at the 
> history (past_intervals), it needs to backfill from the down OSDs 
> (down_osds_we_would_probe). Since it's more than 1, it can't proceed. You need 
> to get the OSDs up.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: Szabo, Istvan (Agoda) 
> Sent: 16 March 2021 10:44:10
> To: Ceph Users
> Subject: [ceph-users] Inactive pg, how to make it active / or delete
> 
> Hi,
> 
> I have 4 inactive PGs in my cluster; the OSDs they were on before have died. 
> How can I make them work again? Or should I just throw them away, because last_backfill=MAX?
> Based on the pg query they are fully up on other OSDs.
> It is an EC 3+1 pool.
> 
> This is an example inactive pg:
> 
> ceph pg 44.1f0 query
> {
>"snap_trimq": "[]",
>"snap_trimq_len": 0,
>"state": "incomplete",
>"epoch": 5541839,
>"up": [
>46,
>34,
>62,
>74
>],
>"acting": [
>46,
>34,
>62,
>74
>],
>"info": {
>"pgid": "44.1f0s0",
>"last_update": "4863820'2109288",
>"last_complete": "4863820'2109288",
>"log_tail": "3881944'2103139",
>"last_user_version": 11189093,
>"last_backfill": "MAX",
>"purged_snaps": [],
>"history": {
>"epoch_created": 192123,
>"epoch_pool_created": 175799,
>"last_epoch_started": 4865266,
>"last_interval_started": 4865265,
>"last_epoch_clean": 4865007,
>"last_interval_clean": 4865006,
>"last_epoch_split": 192123,
>"last_epoch_marked_full": 0,
>"same_up_since": 5541572,
>"same_interval_since": 5541572,
>"same_primary_since": 5520915,
>"last_scrub": "4863820'2109288",
>"last_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
>"last_deep_scrub": "4808731'2109261",
>"last_deep_scrub_stamp": "2021-01-15T21:17:56.729962+0100",
>"last_clean_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
>"prior_readable_until_ub": 0
>},
>"stats": {
>"version": "4863820'2109288",
>"reported_seq": "12355046",
>"reported_epoch": "5541839",
>"state": "incomplete",
>"last_fresh": "2021-03-16T10:39:21.058569+0100",
>"last_change": "2021-03-16T10:39:21.058569+0100",
>"last_active": "2021-01-20T01:07:18.246158+0100",
>"last_peered": "2021-01-20T01:07:13.931842+0100",
>"last_clean": "2021-01-20T01:07:07.392736+0100",
>"last_became_active": "2021-01-20T01:07:09.187047+0100",
>"last_became_peered": "2021-01-20T01:07:09.187047+0100",
>"last_unstale": "2021-03-16T10:39:21.058569+0100",
>"last_undegraded": "2021-03-16T10:39:21.058569+0100",
>"last_fullsized": "2021-03-16T10:39:21.058569+0100",
>"mapping_epoch": 5541572,
>"log_start": "3881944'2103139",
>"ondisk_log_start": "3881944'2103139",
>"created": 192123,
>"last_epoch_clean": 4865007,
>"parent": "0.0",
>"parent_split_bits": 9,
>"last_scrub": "4863820'2109288",
>"last_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
>"last_deep_scrub": "4808731'2109261",
>"last_deep_scrub_stamp": "2021-01-15T21:17:56.729962+0100",
>"last_clean_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
>"log_size": 6149,
>"ondisk_log_size": 6149,
>"stats_invalid": false,
>"dirty_stats_invalid": false,
>"omap_stats_invalid": false,
>"hitset_stats_invalid": false,
>"hitset_bytes_stats_invalid": false,
>"pin_stats_invalid": false,
>"manifest_stats_invalid": false,
>"snaptrimq_len": 0,
>"stat_sum": {
>"num_bytes": 356705195545,
>"num_objects": 98594,
>"num_object_clones": 758,
>"num_object_copies": 394376,
>"num_objects_missing_on_primary": 0,
>"num_objects_missing": 0,
>"num_objects_degraded": 0,
>"num_objects_misplaced": 0,
>"num_objects_unfound": 0,
>"num_objects_dirty": 98594,
>"num_whiteouts": 750,
>"num_read": 30767,
>"num_read_kb": 37109881,
>"num_write": 31023,
>

[ceph-users] Re: Has anyone contact Data for Samsung Datacenter SSD Support ?

2021-03-16 Thread Jake Grimmett

Hi Christoph,

We use the DC Toolkit to over-provision PM983 SSDs.


[root@pcterm25-1g ~]# ./Samsung_SSD_DC_Toolkit_for_Linux_V2.1 -L

Samsung DC Toolkit Version 2.1.L.Q.0
Copyright (C) 2017 SAMSUNG Electronics Co. Ltd. All rights reserved.

-----------------------------------------------------------------------------------------------------------------------------------------------------------
| Disk Number | Path       | Model                  | Serial Number   | Firmware | Optionrom Version | Capacity | Drive Health | Total Bytes Written | NVMe Driver |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0:c         | /dev/nvme0 | SAMSUNG MZQLB1T9HAJR-7 | S439NE0M100721  | EDA5202Q | XNUSRG29          | 1788 GB  | GOOD         | 0.00 TB             | Unknown     |
-----------------------------------------------------------------------------------------------------------------------------------------------------------

smartctl reports:
Total NVM Capacity: 1,920,383,410,176 [1.92 TB]

converting this to LBA:
1,920,383,410,176 / 512 = 3750748848

Based on the 960 GB 983 DCT example on page 4 of the white paper, we
estimate 25% OP is needed for 3 DWPD.

0.75 * 3750748848 = 2813061636 LBAs (≈1.34 TiB, i.e. 1.44 TB)
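
(A quick way to reproduce that arithmetic, purely illustrative:)

```
# 25% over-provisioning on the 1.92 TB drive, expressed in 512-byte LBAs
echo $(( 1920383410176 * 3 / 4 / 512 ))    # -> 2813061636
```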

Set the Capacity:
[root@pcterm25-1g ~]# ./Samsung_SSD_DC_Toolkit_for_Linux_V2.1 --disk 0:c 
--nvme-management-namespace --set-lba 2813061636


Samsung DC Toolkit Version 2.1.L.Q.0
Copyright (C) 2017 SAMSUNG Electronics Co. Ltd. All rights reserved.


Disk Number: 0:c | Model Name: SAMSUNG MZQLB1T9HAJR-7 | Firmware 
Version: EDA5202Q


  [[ WARNING ]]

  Please Note that Set LBA feature feature may change the disk 
information and you will lose your data
  Please Ensure that data backup is taken before proceeding to Set LBA 
feature
  If you are sure then only proceed, otherwise restart the application 
after taking a backup

  Continue Set LBA feature ? [ yes ]: yes

[SUCCESS] Namespace Management feature completed successfully


[root@pcterm25-1g ~]# ./Samsung_SSD_DC_Toolkit_for_Linux_V2.1 -L

Samsung DC Toolkit Version 2.1.L.Q.0
Copyright (C) 2017 SAMSUNG Electronics Co. Ltd. All rights reserved.

-----------------------------------------------------------------------------------------------------------------------------------------------------------
| Disk Number | Path       | Model                  | Serial Number   | Firmware | Optionrom Version | Capacity | Drive Health | Total Bytes Written | NVMe Driver |
-----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0:c         | /dev/nvme0 | SAMSUNG MZQLB1T9HAJR-7 | S439NE0M100721  | EDA5202Q | XNUSRG29          | 1341 GB  | GOOD         | 0.00 TB             | Unknown     |
-----------------------------------------------------------------------------------------------------------------------------------------------------------

smartctl confirms:

Total NVM Capacity: 1,920,383,410,176 [1.92 TB]
Unallocated NVM Capacity:   480,095,852,544 [480 GB]
Namespace 1 Size/Capacity:  1,440,287,557,632 [1.44 TB]

Best,

Jake

On 15/03/2021 19:44, Reed Dier wrote:

Not a direct answer to your question, but it looks like Samsung's DC Toolkit 
may allow for user adjusted over-provisioning.
https://www.samsung.com/semiconductor/global.semi.static/S190311-SAMSUNG-Memory-Over-Provisioning-White-paper.pdf
 


Reed


On Mar 11, 2021, at 7:25 AM, Christoph Adomeit  
wrote:

Hi,

I hope someone here can help me out with some contact data, an email address or 
phone number for Samsung Datacenter SSD Support? If I contact standard Samsung 
Datacenter Support they tell me they are not there 

[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Andrew Walker-Brown
https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

Sent from Mail for Windows 10

From: Tony Liu
Sent: 16 March 2021 16:16
To: Stefan Kooman; Dave 
Hall; ceph-users
Subject: [ceph-users] Re: Networking Idea/Question

> -Original Message-
> From: Stefan Kooman 
> Sent: Tuesday, March 16, 2021 4:10 AM
> To: Dave Hall ; ceph-users 
> Subject: [ceph-users] Re: Networking Idea/Question
>
> On 3/15/21 5:34 PM, Dave Hall wrote:
> > Hello,
> >
> > If anybody out there has tried this or thought about it, I'd like to
> > know...
> >
> > I've been thinking about ways to squeeze as much performance as
> > possible from the NICs  on a Ceph OSD node.  The nodes in our cluster
> > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > Currently, one port is assigned to the front-side network, and one to
> > the back-side network.  However, there are times when the traffic on
> > one side or the other is more intense and might benefit from a bit
> more bandwidth.
>
> What is (are) the reason(s) to choose a separate cluster and public
> network?

That used to be the recommendation: separate client traffic from
cluster traffic. I heard that's no longer recommended in the latest docs.
It would be good if someone could point to a link for the current
recommendation.


Thanks!
Tony

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Tony Liu
> -Original Message-
> From: Stefan Kooman 
> Sent: Tuesday, March 16, 2021 4:10 AM
> To: Dave Hall ; ceph-users 
> Subject: [ceph-users] Re: Networking Idea/Question
> 
> On 3/15/21 5:34 PM, Dave Hall wrote:
> > Hello,
> >
> > If anybody out there has tried this or thought about it, I'd like to
> > know...
> >
> > I've been thinking about ways to squeeze as much performance as
> > possible from the NICs  on a Ceph OSD node.  The nodes in our cluster
> > (6 x OSD, 3 x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.
> > Currently, one port is assigned to the front-side network, and one to
> > the back-side network.  However, there are times when the traffic on
> > one side or the other is more intense and might benefit from a bit
> more bandwidth.
> 
> What is (are) the reason(s) to choose a separate cluster and public
> network?

That used to be the recommendation: separate client traffic from
cluster traffic. I heard that's no longer recommended in the latest docs.
It would be good if someone could point to a link for the current
recommendation.


Thanks!
Tony

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Inactive pg, how to make it active / or delete

2021-03-16 Thread Frank Schilder
The PG says blocked_by at least 2 of your down-OSDs. When you look at the 
history (past_intervals), it needs to backfill from the down OSDs 
(down_osds_we_would_probe). Since it's more than 1, it can't proceed. You need 
to get the OSDs up.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Szabo, Istvan (Agoda) 
Sent: 16 March 2021 10:44:10
To: Ceph Users
Subject: [ceph-users] Inactive pg, how to make it active / or delete

Hi,

I have 4 inactive PGs in my cluster; the OSDs they were on before have died. How 
can I make them work again? Or should I just throw them away, because last_backfill=MAX?
Based on the pg query they are fully up on other OSDs.
It is an EC 3+1 pool.

This is an example inactive pg:

ceph pg 44.1f0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "incomplete",
"epoch": 5541839,
"up": [
46,
34,
62,
74
],
"acting": [
46,
34,
62,
74
],
"info": {
"pgid": "44.1f0s0",
"last_update": "4863820'2109288",
"last_complete": "4863820'2109288",
"log_tail": "3881944'2103139",
"last_user_version": 11189093,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 192123,
"epoch_pool_created": 175799,
"last_epoch_started": 4865266,
"last_interval_started": 4865265,
"last_epoch_clean": 4865007,
"last_interval_clean": 4865006,
"last_epoch_split": 192123,
"last_epoch_marked_full": 0,
"same_up_since": 5541572,
"same_interval_since": 5541572,
"same_primary_since": 5520915,
"last_scrub": "4863820'2109288",
"last_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"last_deep_scrub": "4808731'2109261",
"last_deep_scrub_stamp": "2021-01-15T21:17:56.729962+0100",
"last_clean_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"prior_readable_until_ub": 0
},
"stats": {
"version": "4863820'2109288",
"reported_seq": "12355046",
"reported_epoch": "5541839",
"state": "incomplete",
"last_fresh": "2021-03-16T10:39:21.058569+0100",
"last_change": "2021-03-16T10:39:21.058569+0100",
"last_active": "2021-01-20T01:07:18.246158+0100",
"last_peered": "2021-01-20T01:07:13.931842+0100",
"last_clean": "2021-01-20T01:07:07.392736+0100",
"last_became_active": "2021-01-20T01:07:09.187047+0100",
"last_became_peered": "2021-01-20T01:07:09.187047+0100",
"last_unstale": "2021-03-16T10:39:21.058569+0100",
"last_undegraded": "2021-03-16T10:39:21.058569+0100",
"last_fullsized": "2021-03-16T10:39:21.058569+0100",
"mapping_epoch": 5541572,
"log_start": "3881944'2103139",
"ondisk_log_start": "3881944'2103139",
"created": 192123,
"last_epoch_clean": 4865007,
"parent": "0.0",
"parent_split_bits": 9,
"last_scrub": "4863820'2109288",
"last_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"last_deep_scrub": "4808731'2109261",
"last_deep_scrub_stamp": "2021-01-15T21:17:56.729962+0100",
"last_clean_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"log_size": 6149,
"ondisk_log_size": 6149,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 356705195545,
"num_objects": 98594,
"num_object_clones": 758,
"num_object_copies": 394376,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 98594,
"num_whiteouts": 750,
"num_read": 30767,
"num_read_kb": 37109881,
"num_write": 31023,
"num_write_kb": 66655410,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 1000136,
"num_bytes_recovered": 3631523354253,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
  

[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Stefan Kooman

On 3/15/21 5:34 PM, Dave Hall wrote:

Hello,

If anybody out there has tried this or thought about it, I'd like to 
know...


I've been thinking about ways to squeeze as much performance as possible 
from the NICs  on a Ceph OSD node.  The nodes in our cluster (6 x OSD, 3 
x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.  Currently, one port 
is assigned to the front-side network, and one to the back-side 
network.  However, there are times when the traffic on one side or the 
other is more intense and might benefit from a bit more bandwidth.


What is (are) the reason(s) to choose a separate cluster and public network?

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Dave Hall
Burkhard,

I woke up with the same conclusion - LACP load balancing can break down
when the traffic traverses a router, since all frames are addressed to the
router and thus the Ethernet headers carry the same two MAC
addresses.

(I think that in a pure layer 2 fabric the MAC addresses vary enough to
produce reasonable - not perfect - LACP load balancing.)

Then add VXLAN, SDN, and other newer networking technologies and it all
gets even more confusing.

But I always come back to the starter cluster, likely a proof of concept
demonstration, that might be built with left-over parts.  Networking is
frequently an afterthought.  In this case node-level traffic management -
weighted fair queueing - could make all the difference.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu

On Tue, Mar 16, 2021 at 4:20 AM Burkhard Linke <
burkhard.li...@computational.bio.uni-giessen.de> wrote:

> Hi,
>
> On 16.03.21 03:40, Dave Hall wrote:
> > Andrew,
> >
> > I agree that the choice of hash function is important for LACP. My
> > thinking has always been to stay down in layers 2 and 3.  With enough
> > hosts it seems likely that traffic would be split close to evenly.
> > Heads or tails - 50% of the time you're right.  TCP ports should also
> > be nearly equally split, but listening ports could introduce some
> > asymmetry.
>
>
> Just a comment on the hashing methods. The LACP spec does not include
> layer3+4, so running it is somewhat outside of the spec.
>
> The main reason for it being present is the fact that LACP load
> balancing does not work well in case of routing. If all your clients are
> in a different network reachable via a gateway, all your traffic will be
> directed to the MAC address of the gateway. As a result all that traffic
> will use a single link only.
>
> Also keep in mind that these hashing methods only affect the traffic that
> originates from the corresponding system. In the case of a Ceph host, only the
> traffic sent from the host is controlled by it; the traffic from the
> switch to the host uses the switch's hashing setting.
>
>
> We use layer 3+4 hashing on all baremetal hosts (including ceph hosts)
> and all switches, and traffic is roughly evenly distributed between the
> links.
>
>
> Regards,
>
> Burkhard
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Inactive pg, how to make it active / or delete

2021-03-16 Thread Szabo, Istvan (Agoda)
Hi,

I have 4 inactive PGs in my cluster; the OSDs they were on before have died. How 
can I make them work again? Or should I just throw them away, because last_backfill=MAX?
Based on the pg query they are fully up on other OSDs.
It is an EC 3+1 pool.

This is an example inactive pg:

ceph pg 44.1f0 query
{
"snap_trimq": "[]",
"snap_trimq_len": 0,
"state": "incomplete",
"epoch": 5541839,
"up": [
46,
34,
62,
74
],
"acting": [
46,
34,
62,
74
],
"info": {
"pgid": "44.1f0s0",
"last_update": "4863820'2109288",
"last_complete": "4863820'2109288",
"log_tail": "3881944'2103139",
"last_user_version": 11189093,
"last_backfill": "MAX",
"purged_snaps": [],
"history": {
"epoch_created": 192123,
"epoch_pool_created": 175799,
"last_epoch_started": 4865266,
"last_interval_started": 4865265,
"last_epoch_clean": 4865007,
"last_interval_clean": 4865006,
"last_epoch_split": 192123,
"last_epoch_marked_full": 0,
"same_up_since": 5541572,
"same_interval_since": 5541572,
"same_primary_since": 5520915,
"last_scrub": "4863820'2109288",
"last_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"last_deep_scrub": "4808731'2109261",
"last_deep_scrub_stamp": "2021-01-15T21:17:56.729962+0100",
"last_clean_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"prior_readable_until_ub": 0
},
"stats": {
"version": "4863820'2109288",
"reported_seq": "12355046",
"reported_epoch": "5541839",
"state": "incomplete",
"last_fresh": "2021-03-16T10:39:21.058569+0100",
"last_change": "2021-03-16T10:39:21.058569+0100",
"last_active": "2021-01-20T01:07:18.246158+0100",
"last_peered": "2021-01-20T01:07:13.931842+0100",
"last_clean": "2021-01-20T01:07:07.392736+0100",
"last_became_active": "2021-01-20T01:07:09.187047+0100",
"last_became_peered": "2021-01-20T01:07:09.187047+0100",
"last_unstale": "2021-03-16T10:39:21.058569+0100",
"last_undegraded": "2021-03-16T10:39:21.058569+0100",
"last_fullsized": "2021-03-16T10:39:21.058569+0100",
"mapping_epoch": 5541572,
"log_start": "3881944'2103139",
"ondisk_log_start": "3881944'2103139",
"created": 192123,
"last_epoch_clean": 4865007,
"parent": "0.0",
"parent_split_bits": 9,
"last_scrub": "4863820'2109288",
"last_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"last_deep_scrub": "4808731'2109261",
"last_deep_scrub_stamp": "2021-01-15T21:17:56.729962+0100",
"last_clean_scrub_stamp": "2021-01-19T23:44:40.885414+0100",
"log_size": 6149,
"ondisk_log_size": 6149,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 356705195545,
"num_objects": 98594,
"num_object_clones": 758,
"num_object_copies": 394376,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 98594,
"num_whiteouts": 750,
"num_read": 30767,
"num_read_kb": 37109881,
"num_write": 31023,
"num_write_kb": 66655410,
"num_scrub_errors": 0,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 1000136,
"num_bytes_recovered": 3631523354253,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0,
   

[ceph-users] Re: Unhealthy Cluster | Remove / Purge duplicate osds | Fix daemon

2021-03-16 Thread Sebastian Wagner
Hi Oliver,

I don't know how you managed to remove all MGRs from the cluster, but
there is the documentation to manually recover from this:


> https://docs.ceph.com/en/latest/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon

Hope that helps,
Sebastian


On 15.03.21 at 18:24, Oliver Weinmann wrote:
> Hi Sebastian,
> 
> thanks that seems to have worked. At least on one of the two nodes. But
> now I have another problem. It seems that all mgr daemons are gone and
> ceph command is stuck.
> 
> [root@gedasvl02 ~]# cephadm ls | grep mgr
> 
> I tried to deploy a new mgr but this doesn't seem to work either:
> 
> [root@gedasvl02 ~]# cephadm ls | grep mgr
> [root@gedasvl02 ~]# cephadm deploy --fsid
> d0920c36-2368-11eb-a5de-005056b703af --name mgr.gedaopl03
> INFO:cephadm:Deploy daemon mgr.gedaopl03 ...
> 
> At least I can't see a mgr container on node gedaopl03:
> 
> [root@gedaopl03 ~]# podman ps
> CONTAINER ID  IMAGE
> COMMAND   CREATED STATUS PORTS  NAMES
> 63518d95201b  docker.io/prom/node-exporter:v0.18.1 
> --no-collector.ti...  3 days ago  Up 3 days ago
> ceph-d0920c36-2368-11eb-a5de-005056b703af-node-exporter.gedaopl03
> aa9b57fd77b8  docker.io/ceph/ceph:v15   -n
> client.crash.g...  3 days ago  Up 3 days ago
> ceph-d0920c36-2368-11eb-a5de-005056b703af-crash.gedaopl03
> 8b02715f9cb4  docker.io/ceph/ceph:v15   -n osd.2 -f
> --set...  3 days ago  Up 3 days ago
> ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.2
> 40f15a6357fe  docker.io/ceph/ceph:v15   -n osd.7 -f
> --set...  3 days ago  Up 3 days ago
> ceph-d0920c36-2368-11eb-a5de-005056b703af-osd.7
> bda260378239  docker.io/ceph/ceph:v15   -n
> mds.cephfs.ged...  3 days ago  Up 3 days ago
> ceph-d0920c36-2368-11eb-a5de-005056b703af-mds.cephfs.gedaopl03.kybzgy
> [root@gedaopl03 ~]# systemctl --failed
>  
> UNIT  
> LOAD  
> ACTIVE SUB    DESCRIPTION
> ●
> ceph-d0920c36-2368-11eb-a5de-005056b703af@crash.gedaopl03.service 
> loaded
> failed failed Ceph crash.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ●
> ceph-d0920c36-2368-11eb-a5de-005056b703af@mon.gedaopl03.service   
> loaded
> failed failed Ceph mon.gedaopl03 for d0920c36-2368-11eb-a5de-005056b703af
> ●
> ceph-d0920c36-2368-11eb-a5de-005056b703af@node-exporter.gedaopl03.service 
> loaded
> failed failed Ceph node-exporter.gedaopl03 for
> d0920c36-2368-11eb-a5de-005056b703af
> ●
> ceph-d0920c36-2368-11eb-a5de-005056b703af@osd.3.service   
> loaded
> failed failed Ceph osd.3 for d0920c36-2368-11eb-a5de-005056b703af
> 
> LOAD   = Reflects whether the unit definition was properly loaded.
> ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
> SUB    = The low-level unit activation state, values depend on unit type.
> 
> 4 loaded units listed. Pass --all to see loaded but inactive units, too.
> To show all installed unit files use 'systemctl list-unit-files'.
> 
> Maybe it's best to just scrap the whole cluster. It is only for testing,
> but I guess it is also a good practice for recovery. :)
> 
> On 12 March 2021 at 12:35, Sebastian Wagner wrote:
> 
>> Hi Oliver,
>>
>> # ssh gedaopl02
>> # cephadm rm-daemon osd.0
>>
>> should do the trick.
>>
>> Be careful to remove the broken OSD :-)
>>
>> Best,
>>
>> Sebastian
>>
>>> On 11.03.21 at 22:10, Oliver Weinmann wrote:
>>> Hi,
>>>
>>> On my 3 node Octopus 15.2.5 test cluster, that I haven't used for quite
>>> a while, I noticed that it shows some errors:
>>>
>>> [root@gedasvl02 ~]# ceph health detail
>>> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
>>> INFO:cephadm:Inferring config
>>> /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> HEALTH_WARN 2 failed cephadm daemon(s)
>>> [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
>>>     daemon osd.0 on gedaopl02 is in error state
>>>     daemon node-exporter.gedaopl01 on gedaopl01 is in error state
>>>
>>> The error about the osd.0 is strange since osd.0 is actually up and
>>> running but on a different node. I guess I missed to correctly remove it
>>> from node gedaopl02 and then added a new osd to a different node
>>> gedaopl01 and now there are duplicate osd ids for osd.0 and osd.2.
>>>
>>> [root@gedasvl02 ~]# ceph orch ps
>>> INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
>>> INFO:cephadm:Inferring config
>>> /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
>>> INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
>>> NAME HOST   STATUS    REFRESHED AGE 
>>> VERSION    IMAGE NAME    IMAGE ID 
>>> CONTAINER ID
>>> alertmanager.gedasvl02   gedasvl02  running (6h)  7m ago 4M  
>>> 0.20.0 

[ceph-users] Re: Ceph Cluster Taking An Awful Long Time To Rebalance

2021-03-16 Thread duluxoz

Ah, right, that makes sense - I'll have a go at that

Thank you

On 16/03/2021 19:12, Janne Johansson wrote:

  pgs: 88.889% pgs not active
   6/21 objects misplaced (28.571%)
   256 creating+incomplete

For new clusters, "creating+incomplete" sounds like you created a pool
(with 256 PGs) with some crush rule that doesn't allow it to find
suitable placements, like "replication = 3" and "failure domain =
host" but only having 2 hosts, or something to that effect. Unless you
add hosts (in my example), this will not "fix itself" until you either
add hosts, or change the crush rules to something less reliable.


--

*Matthew J BLACK*
  M.Inf.Tech.(Data Comms)
  MBA
  B.Sc.
  MACS (Snr), CP, IP3P

When you want it done /right/ ‒ the first time!

Phone:  +61 4 0411 0089
Email:  matt...@peregrineit.net 
Web:www.peregrineit.net 




This Email is intended only for the addressee.  Its use is limited to 
that intended by the author at the time and it is not to be distributed 
without the author’s consent.  You must not use or disclose the contents 
of this Email, or add the sender’s Email address to any database, list 
or mailing list unless you are expressly authorised to do so.  Unless 
otherwise stated, Peregrine I.T. Pty Ltd accepts no liability for the 
contents of this Email except where subsequently confirmed in 
writing.  The opinions expressed in this Email are those of the author 
and do not necessarily represent the views of Peregrine I.T. Pty 
Ltd.  This Email is confidential and may be subject to a claim of legal 
privilege.


If you have received this Email in error, please notify the author and 
delete this message immediately.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: should I increase the amount of PGs?

2021-03-16 Thread Boris Behrens
Hi Dan,

my EC profile looks very "default" to me.
[root@s3db1 ~]# ceph osd erasure-code-profile ls
default
[root@s3db1 ~]# ceph osd erasure-code-profile get default
k=2
m=1
plugin=jerasure
technique=reed_sol_van

I don't understand the output, but the balancing got worse overnight:

[root@s3db1 ~]# ceph-scripts/tools/ceph-pool-pg-distribution 11
Searching for PGs in pools: ['11']
Summary: 1024 PGs on 84 osds

Num OSDs with X PGs:
15: 8
16: 7
17: 6
18: 10
19: 1
32: 10
33: 4
34: 6
35: 8
65: 5
66: 5
67: 4
68: 10
[root@s3db1 ~]# ceph-scripts/tools/ceph-pg-histogram --normalize --pool=11
# NumSamples = 84; Min = 4.12; Max = 5.09
# Mean = 4.553355; Variance = 0.052415; SD = 0.228942; Median 4.561608
# each ∎ represents a count of 1
4.1244 - 4.2205 [ 8]: 
4.2205 - 4.3166 [ 6]: ∎∎
4.3166 - 4.4127 [11]: ∎∎∎
4.4127 - 4.5087 [10]: ∎∎
4.5087 - 4.6048 [11]: ∎∎∎
4.6048 - 4.7009 [19]: ∎∎∎
4.7009 - 4.7970 [ 6]: ∎∎
4.7970 - 4.8931 [ 8]: 
4.8931 - 4.9892 [ 4]: 
4.9892 - 5.0852 [ 1]: ∎
[root@s3db1 ~]# ceph osd df tree | sort -nk 17 | tail
 14   hdd   3.63689  1.0 3.6 TiB 2.9 TiB 724 GiB   19 GiB 0 B 724
GiB 80.56 1.07  56 up osd.14
 19   hdd   3.68750  1.0 3.7 TiB 3.0 TiB 2.9 TiB  466 MiB 7.9 GiB 708
GiB 81.25 1.08  53 up osd.19
  4   hdd   3.63689  1.0 3.6 TiB 3.0 TiB 698 GiB  703 MiB 0 B 698
GiB 81.27 1.08  48 up osd.4
 24   hdd   3.63689  1.0 3.6 TiB 3.0 TiB 695 GiB  640 MiB 0 B 695
GiB 81.34 1.08  46 up osd.24
 75   hdd   3.68750  1.0 3.7 TiB 3.0 TiB 2.9 TiB  440 MiB 8.1 GiB 704
GiB 81.35 1.08  48 up osd.75
 71   hdd   3.68750  1.0 3.7 TiB 3.0 TiB 3.0 TiB  7.5 MiB 8.0 GiB 663
GiB 82.44 1.09  47 up osd.71
 76   hdd   3.68750  1.0 3.7 TiB 3.1 TiB 3.0 TiB  251 MiB 9.0 GiB 617
GiB 83.65 1.11  50 up osd.76
 33   hdd   3.73630  1.0 3.7 TiB 3.1 TiB 3.0 TiB  399 MiB 8.1 GiB 618
GiB 83.85 1.11  55 up osd.33
 35   hdd   3.73630  1.0 3.7 TiB 3.1 TiB 3.0 TiB  317 MiB 8.8 GiB 617
GiB 83.87 1.11  50 up osd.35
 34   hdd   3.73630  1.0 3.7 TiB 3.2 TiB 3.1 TiB  451 MiB 8.7 GiB 545
GiB 85.75 1.14  54 up osd.34
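
(Dan's suggestion below, removing some existing upmap entries so the balancer
is free to recreate them, could presumably be done per PG with something like
the following; the PG ID is just an example:)

```
# list the current upmap exceptions for pool 11, then drop one
ceph osd dump | grep 'pg_upmap_items 11\.'
ceph osd rm-pg-upmap-items 11.3f
```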

On Mon, 15 Mar 2021 at 17:23, Dan van der Ster <d...@vanderster.com> wrote:

> Hi,
>
> How wide are your EC profiles? If they are really wide, you might be
> reaching the limits of what is physically possible. Also, I'm not sure
> that upmap in 14.2.11 is very smart about *improving* existing upmap
> rules for a given PG, in the case that a PG already has an upmap-items
> entry but it would help the distribution to add more mapping pairs to
> that entry. What this means, is that it might sometimes be useful to
> randomly remove some upmap entries and see if the balancer does a
> better job when it replaces them.
>
> But before you do that, I re-remembered that looking at the total PG
> numbers is not useful -- you need to check the PGs per OSD for the
> eu-central-1.rgw.buckets.data pool only.
>
> We have a couple tools that can help with this:
>
> 1. To see the PGs per OSD for a given pool:
>
> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-pool-pg-distribution
>
> E.g.: ./ceph-pool-pg-distribution 11  # to see the distribution of
> your eu-central-1.rgw.buckets.data pool.
>
> The output looks like this on my well balanced clusters:
>
> # ceph-scripts/tools/ceph-pool-pg-distribution 15
> Searching for PGs in pools: ['15']
> Summary: 256 pgs on 56 osds
>
> Num OSDs with X PGs:
>  13: 16
>  14: 40
>
> You should expect a trimodal for your cluster.
>
> 2. You can also use another script from that repo to see the PGs per
> OSD normalized to crush weight:
> ceph-scripts/tools/ceph-pg-histogram --normalize --pool=15
>
>This might explain what is going wrong.
>
> Cheers, Dan
>
>
> On Mon, Mar 15, 2021 at 3:04 PM Boris Behrens  wrote:
> >
> > Absolutly:
> > [root@s3db1 ~]# ceph osd df tree
> > ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATA OMAP META
> AVAIL%USE  VAR  PGS STATUS TYPE NAME
> >  -1   673.54224- 674 TiB 496 TiB  468 TiB   97 GiB 1.2 TiB
> 177 TiB 73.67 1.00   -root default
> >  -258.30331-  58 TiB  42 TiB   38 TiB  9.2 GiB  99 GiB
>  16 TiB 72.88 0.99   -host s3db1
> >  23   hdd  14.65039  1.0  15 TiB  11 TiB   11 TiB  714 MiB  25 GiB
> 3.7 TiB 74.87 1.02 194 up osd.23
> >  69   hdd  14.55269  1.0  15 TiB  11 TiB   11 TiB  1.6 GiB  40 GiB
> 3.4 TiB 76.32 1.04 199 up osd.69
> >  73   hdd  14.55269  1.0  15 TiB  11 TiB   11 TiB  1.3 GiB  34 GiB
> 3.8 TiB 74.15 1.01 203 up osd.73
> >  79   hdd   3.63689  1.0 3.6 TiB 2.4 TiB  1.3 TiB  1.8 GiB 0 B
> 1.3 TiB 65.44 0.89  47 up osd.79
> >  80   hdd   

[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Burkhard Linke

Hi,

On 16.03.21 03:40, Dave Hall wrote:

Andrew,

I agree that the choice of hash function is important for LACP. My 
thinking has always been to stay down in layers 2 and 3.  With enough 
hosts it seems likely that traffic would be split close to evenly.  
Heads or tails - 50% of the time you're right.  TCP ports should also 
be nearly equally split, but listening ports could introduce some 
asymmetry.



Just a comment on the hashing methods. The LACP spec does not include 
layer3+4, so running it is somewhat outside of the spec.


The main reason for it being present is the fact that LACP load 
balancing does not work well in case of routing. If all your clients are 
in a different network reachable via a gateway, all your traffic will be 
directed to the MAC address of the gateway. As a result all that traffic 
will use a single link only.


Also keep in mind that these hashing methods only affect the traffic that 
originates from the corresponding system. In the case of a Ceph host, only the 
traffic sent from the host is controlled by it; the traffic from the 
switch to the host uses the switch's hashing setting.



We use layer 3+4 hashing on all baremetal hosts (including ceph hosts) 
and all switches, and traffic is roughly evenly distributed between the 
links.
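
For reference, a minimal sketch of that kind of bond on a Debian-style host 
(interface names and the address are made up); the switch-side LAG needs its 
own, matching hash setting:

```
# /etc/network/interfaces fragment -- hypothetical NIC names and IP
auto bond0
iface bond0 inet static
    address 192.168.10.11/24
    bond-slaves enp1s0f0 enp1s0f1
    bond-mode 802.3ad
    bond-miimon 100
    bond-lacp-rate fast
    bond-xmit-hash-policy layer3+4
```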



Regards,

Burkhard

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Cluster Taking An Awful Long Time To Rebalance

2021-03-16 Thread Janne Johansson
>  pgs: 88.889% pgs not active
>   6/21 objects misplaced (28.571%)
>   256 creating+incomplete

For new clusters, "creating+incomplete" sounds like you created a pool
(with 256 PGs) with some crush rule that doesn't allow it to find
suitable placements, like "replication = 3" and "failure domain =
host" but only having 2 hosts, or something to that effect. Unless you
add hosts (in my example), this will not "fix itself" until you either
add hosts, or change the crush rules to something less reliable.
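
For the archives, inspecting and swapping the rule could look roughly like 
this (pool and rule names are placeholders; an OSD-level failure domain is 
exactly the "less reliable" option mentioned above):

```
# which rule does the pool use, and what does that rule require?
ceph osd pool get my_pool crush_rule
ceph osd crush rule dump replicated_rule

# e.g. fall back to an OSD-level failure domain until more hosts exist
ceph osd crush rule create-replicated replicated_osd default osd
ceph osd pool set my_pool crush_rule replicated_osd
```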

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Networking Idea/Question

2021-03-16 Thread Dave Hall

Andrew,

I agree that the choice of hash function is important for LACP. My 
thinking has always been to stay down in layers 2 and 3.  With enough 
hosts it seems likely that traffic would be split close to evenly.  
Heads or tails - 50% of the time you're right.  TCP ports should also be 
nearly equally split, but listening ports could introduce some asymmetry.


What I'm concerned about is the next level up:  With the client network 
and the cluster network (Marc's terms are more descriptive) on the same 
NICs/Switch Ports, with or without LACP and LAGs, it seems possible that 
at times the bandwidth consumed by cluster traffic could overwhelm and 
starve the client traffic. Or the other way around, which would be worse 
if the cluster nodes can't communicate on their 'private' network to 
keep the cluster consistent.  These overloads could  happen in the 
packet queues in the NIC drivers, or maybe in the switch fabric.


Maybe these starvation scenarios aren't that likely in clusters with 
10GB networking.  Maybe it's hard to fill up a 10GB pipe, much less 
two.  But it could happen with 1GB NICs, even in LAGs of 4 or 6 ports, 
and eventually it will be possible with faster NVMe drives to easily 
fill a 10GB pipe.


So, what could we do with some of the 'exotic' queuing mechanisms 
available in Linux to keep the balance - to assure that the lesser 
category can transmit proportionally?  (And is 'proportional' the right 
answer, or should one side get a slight advantage?)
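
To make that concrete, a purely illustrative 'tc' sketch (the interface name, 
rates and the cluster subnet are all assumptions) that gives the client side 
roughly a 2:1 weight while letting either class borrow up to the full 20Gb link:

```
# weighted sharing of a 20Gb bond between client (1:10) and cluster (1:20) traffic
tc qdisc add dev bond0 root handle 1: htb default 10
tc class add dev bond0 parent 1: classid 1:10 htb rate 13gbit ceil 20gbit
tc class add dev bond0 parent 1: classid 1:20 htb rate 7gbit ceil 20gbit
# steer traffic destined for the (hypothetical) cluster subnet into class 1:20
tc filter add dev bond0 parent 1: protocol ip u32 match ip dst 192.168.20.0/24 flowid 1:20
```

This only shapes traffic the node itself transmits; what the switch does on 
the way back in is a separate question.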


-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
On 3/15/2021 12:48 PM, Andrew Walker-Brown wrote:


Dave

That’s the way our cluster is set up. It’s relatively small: 5 hosts, 12 OSDs.

Each host has 2x10G with LACP to the switches.  We’ve vlan’d public/private 
networks.

Making best use of the LACP lag will to a greater extent be down to choosing 
the best hashing policy.  At the moment we’re using layer3+4 on the Linux 
config and switch configs.  We’re monitoring link utilisation to make sure the 
balancing is as close to equal as possible.

Hope this helps

A

Sent from my iPhone

On 15 Mar 2021, at 16:39, Marc  wrote:

I have client and cluster network on one 10gbit port (with different vlans).
I think many smaller clusters do this ;)


I've been thinking about ways to squeeze as much performance as possible
from the NICs  on a Ceph OSD node.  The nodes in our cluster (6 x OSD, 3
x MGR/MON/MDS/RGW) currently have 2 x 10GB ports.  Currently, one port
is assigned to the front-side network, and one to the back-side
network.  However, there are times when the traffic on one side or the
other is more intense and might benefit from a bit more bandwidth.

The idea I had was to bond the two ports together, and to run the
back-side network in a tagged VLAN on the combined 20GB LACP port.  In
order to keep the balance and prevent starvation from either side it
would be necessary to apply some sort of a weighted fair queuing
mechanism via the 'tc' command.  The idea is that if the client side
isn't using up the full 10GB/node, and there is a burst of re-balancing
activity, the bandwidth consumed by the back-side traffic could swell to
15GB or more.   Or vice versa.

 From what I have read and studied, these algorithms are fairly
responsive to changes in load and would thus adjust rapidly if the
demand from either side suddenly changed.

Maybe this is a crazy idea, or maybe it's really cool.  Your thoughts?

Thanks.

-Dave

--
Dave Hall
Binghamton University
kdh...@binghamton.edu
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Safe to remove osd or not? Which statement is correct?

2021-03-16 Thread Szabo, Istvan (Agoda)
Hi Boris,

Yeah, this is the reason:

   -1> 2021-03-15T16:21:35.307+0100 7f8b1fd8d700  5 prioritycache tune_memory target: 4294967296 mapped: 454098944 unmapped: 8560640 heap: 462659584 old mem: 2845415832 new mem: 2845415832
    0> 2021-03-15T16:21:35.311+0100 7f8b11570700 -1 *** Caught signal (Aborted) **
       in thread 7f8b11570700 thread_name:tp_osd_tp
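
As for which statement is correct, the cluster can be asked directly before 
anything is removed; a small sketch (the OSD IDs are just examples):

```
# would stopping / destroying these OSDs leave any PG without enough copies?
ceph osd ok-to-stop osd.12 osd.13
ceph osd safe-to-destroy osd.12
```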


On 2021. Mar 14., at 15:10, Boris Behrens  wrote:



Hi,
do you know why the OSDs are not starting?

When I had the problem that a start does not work, I tried the 'ceph-volume lvm 
activate --all' on the host, which brought the OSDs back up.

But I can't tell you if it is safe to remove the OSD.

Cheers
 Boris

On Sun, 14 Mar 2021 at 02:38, Szabo, Istvan (Agoda) <istvan.sz...@agoda.com> wrote:
Hi Gents,

There is a cluster with 14 hosts in this state:

https://i.ibb.co/HPF3Pdr/6-ACB2-C5-B-6-B54-476-B-835-D-227-E9-BFB1247.jpg

There is a host-based CRUSH rule with EC 3:1, and there are 3 hosts with OSDs 
down.
Unfortunately there are also pools with 3 replicas, which are host based.

2 hosts have 2 OSDs down and 1 host has 1 OSD down, which I guess means that if we don’t 
bring them back, 0.441% data loss is going to happen. Am I right?
I hope I’m not right, because there aren’t any misplaced objects, only degraded ones.

The problem is that the OSDs are not starting, so somehow these 5 OSDs should be 
removed, but I’m curious which of my statements is correct.

Thank you in advance.





--
The "UTF-8 problems" self-help group will, as an exception, meet in the large 
hall this time.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: millions slow ops on a cluster without load

2021-03-16 Thread Szabo, Istvan (Agoda)
Yeah, the MTU on the cluster network’s NIC cards is 8982; ping works 
with 8954-byte packets between the interfaces.
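
(For the record, one way to verify the path MTU end-to-end with the Don't 
Fragment bit set; the peer address is a placeholder:)

```
# 8982 MTU - 20 bytes IP header - 8 bytes ICMP header = 8954 bytes of payload
ping -M do -s 8954 -c 3 <cluster-network-peer-ip>
```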

On 2021. Mar 15., at 23:40, Matthew H  wrote:


Might be an MTU problem, have you checked your network and MTU settings?




From: Szabo, Istvan (Agoda) 
Sent: Monday, March 15, 2021 12:08 PM
To: Ceph Users 
Subject: [ceph-users] millions slow ops on a cluster without load

We have a cluster with a huge number of warnings like this, even though nothing is 
going on in the cluster.

It fills the MGR's physical memory, the MON DB is maxed out, and 5 OSDs can't start :/

[WRN] slow request osd_op(mds.0.537792:26453 43.38 
43:1d6c5587:::1fe56a6.:head [create,setxattr parent (367) 
in=373b,setxattr layout (30) in=36b] snapc 0=[] RETRY=131 
ondisk+retry+write+known_if_redirected+full_force e5532574) initiated 
2021-03-15T16:44:18.428585+0100 currently delayed

Is there a way to somehow stop it?

MDS service stopped also.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io