[ceph-users] Transmit rate metric based per bucket

2023-06-19 Thread Szabo, Istvan (Agoda)
Hello,

I'd like to know whether there is a way to query metrics or logs in Octopus (or
in newer versions, which I'm also interested in for the future) about the
bandwidth used per bucket for PUT/GET operations?

Thank you
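
One avenue that might help here (a sketch, not a definitive answer: it assumes the
RGW usage log is enabled, and <user-id> is a placeholder) is the per-user,
per-bucket usage counters, which record bytes sent and received per operation
category:

  # enable usage logging for RGW (takes effect once the RGW daemons are restarted)
  ceph config set client.rgw rgw_enable_usage_log true

  # per-bucket totals (ops, bytes_sent, bytes_received) for a user, optionally time-bounded
  radosgw-admin usage show --uid=<user-id> --start-date=2023-06-01 --end-date=2023-06-19

Note that this yields cumulative byte counters rather than an instantaneous
transmit rate; a rate would have to be derived by sampling the counters
periodically.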




[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)

2023-06-19 Thread Angelo Hongens
As a side note: there's the Windows RBD driver, which will get you way
more performance. It's labeled beta, but it seems to work fine for a lot
of people. If you have a test lab you could try that.


Angelo.

On 19/06/2023 18:16, Work Ceph wrote:

I see, thanks for the feedback guys!

It is interesting that Ceph Manager does not allow us to export iSCSI
blocks without selecting 2 or more iSCSI portals. Therefore, we will always
use at least two, and as a consequence that feature is not going to be
supported. Can I export an RBD image via the iSCSI gateway using only one
portal via gwcli?

@Maged Mokhtar, I am not sure I follow. Do you guys have an iSCSI
implementation that we can use to somehow replace the default iSCSI server
in the default Ceph iSCSI Gateway? I didn't quite understand what the
PetaSAN project is, and whether it is an open-source solution from which we
can somehow just pick and use one of the modules you have (e.g. just the
iSCSI implementation).

On Mon, Jun 19, 2023 at 10:07 AM Maged Mokhtar  wrote:


Windows Clustered Shared Volumes and Failover Clustering require the
support of clustered persistent reservations by the block device to
coordinate access by multiple hosts. The default iSCSI implementation in
Ceph does not support this; you can use the iSCSI implementation in the
PetaSAN project:

www.petasan.org

which supports this feature and provides a high performance
implementation. We currently use Ceph 17.2.5


On 19/06/2023 14:47, Work Ceph wrote:

Hello guys,

We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows
clients.

Recently, we had the need to add some VMware clusters as clients for the
iSCSI GW and also Windows systems with the use of Clustered Storage Volumes
(CSV), and we are facing a weird situation. In Windows, for instance, the
iSCSI block can be mounted, formatted and consumed by all nodes, but when
we add it to a CSV it fails with some generic exception. The same happens in
VMware: when we try to use it with VMFS it fails.

We do not seem to find the root cause for these errors. However, the errors
seem to be linked to multiple nodes consuming the same block via shared file
systems. Have you guys seen this before?

Are we missing some basic configuration in the iSCSI GW?


[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)

2023-06-19 Thread Work Ceph
I see, thanks for the feedback guys!

It is interesting that Ceph Manager does not allow us to export iSCSI
blocks without selecting 2 or more iSCSI portals. Therefore, we will always
use at least two, and as a consequence that feature is not going to be
supported. Can I export an RBD image via the iSCSI gateway using only one
portal via gwcli?

@Maged Mokhtar, I am not sure I follow. Do you guys have an iSCSI
implementation that we can use to somehow replace the default iSCSI server
in the default Ceph iSCSI Gateway? I didn't quite understand what the
PetaSAN project is, and whether it is an open-source solution from which we
can somehow just pick and use one of the modules you have (e.g. just the
iSCSI implementation).

On Mon, Jun 19, 2023 at 10:07 AM Maged Mokhtar  wrote:

> Windows Clustered Shared Volumes and Failover Clustering require the
> support of clustered persistent reservations by the block device to
> coordinate access by multiple hosts. The default iSCSI implementation in
> Ceph does not support this, you can use the iSCSI implementation in
> PetaSAN project:
>
> www.petasan.org
>
> which supports this feature and provides a high performance
> implementation. We currently use Ceph 17.2.5
>
>
> On 19/06/2023 14:47, Work Ceph wrote:
> > Hello guys,
> >
> > We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
> > for some workloads, RadosGW (via S3) for others, and iSCSI for some
> Windows
> > clients.
> >
> > Recently, we had the need to add some VMWare clusters as clients for the
> > iSCSI GW and also Windows systems with the use of Clustered Storage
> Volumes
> > (CSV), and we are facing a weird situation. In windows for instance, the
> > iSCSI block can be mounted, formatted and consumed by all nodes, but when
> > we add in the CSV it fails with some generic exception. The same happens
> in
> > VMWare, when we try to use it with VMFS it fails.
> >
> > We do not seem to find the root cause for these errors. However, the
> errors
> > seem to be linked to the situation of multiple nodes consuming the same
> > block by shared file systems. Have you guys seen this before?
> >
> > Are we missing some basic configuration in the iSCSI GW?


[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Eugen Block

Hi,
adding the dev mailing list, hopefully someone there can chime in. But  
apparently the LRC code hasn't been maintained for a few years  
(https://github.com/ceph/ceph/tree/main/src/erasure-code/lrc). Let's  
see...


Zitat von Michel Jouvin :


Hi Eugen,

Thank you very much for these detailed tests, which match what I
observed and reported earlier. I'm happy to see that we have the
same understanding of how it should work (based on the
documentation). Is there any way other than this list to get in
contact with the plugin developers, as it seems they are not
following this (very high volume) list... Or could somebody pass the
email thread on to one of them?


Help would be really appreciated. Cheers,

Michel

Le 19/06/2023 à 14:09, Eugen Block a écrit :
Hi, I have a real hardware cluster for testing available now. I'm  
not sure whether I'm completely misunderstanding how it's supposed  
to work or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs), I  
intend to use 15 nodes to be able to recover if one node fails.
Given that I need one additional locality chunk per DC I need a  
profile with k + m = 12. So I chose k=9, m=3, l=4 which creates 15  
chunks in total across those 3 DCs, one chunk per host, I checked  
the chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4  
crush-failure-domain=host crush-locality=room crush-device-class=hdd


I created a pool with only one PG to make the output more readable.

This profile should allow the cluster to sustain the loss of three  
chunks, the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with  
one missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being  
marked as "down". That was unexpected since with m=3 I expected the  
PG to still be active but degraded. Before test #3 I started all  
OSDs to have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and  
the PG was still active.


Apparently, this profile is able to sustain the loss of m chunks,
but not an entire DC. I get the impression (and I also discussed
this with a colleague) that LRC with this implementation is either
designed only for losing single OSDs, which can then be recovered
more quickly from fewer surviving OSDs, saving bandwidth, or this is
a bug, because according to the low-level description [1] the
algorithm works its way up in reverse order within the configured
layers, like in this example (not displaying my k, m, l
requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be
recovered, and maybe step 2 also fails, eventually step 1 contains
the actual k and m chunks, which should sustain the loss of an
entire DC. My impression is that the algorithm somehow doesn't
arrive at step 1 and therefore the PG stays down although there are
enough surviving chunks. I'm not sure if my observations and
conclusion are correct; I'd love to have a comment from the
developers on this topic. But in this state I would not recommend
using the LRC plugin when the resiliency requirement is to sustain
the loss of an entire DC.


Thanks,
Eugen

[1]  
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

I realize that the crushmap I attached to one of my emails,
probably required to understand the discussion here, has been
stripped by mailman. To avoid polluting the thread with a long
output, I put it at
https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if
you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really  
urgent. If you manage to do some tests that help me to understand  
the problem I remain interested. I propose to keep this thread  
for that.


Zitat, I shared my crush map in the email you answered if the  
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like  
a unique use

case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What is your current crush steps rule?  I know you made 

[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots

2023-06-19 Thread Janek Bevendorff

Hi Patrick,


The event log size of three of the five MDSs is also still very high. mds.1,
mds.3, and mds.4 report between 4 and 5 million events, mds.0 around 1.4
million, and mds.2 between 0 and 200,000. The numbers have been constant
since my last MDS restart four days ago.

I ran your ceph-gather.sh script a couple of times, but it dumps only
mds.0. Should I modify it to dump mds.3 instead so you can have a look?

Yes, please.


The session load on mds.3 had already resolved itself after a few days, 
so I cannot reproduce it any more. Right now, mds.0 has the highest load 
and a steadily growing event log, but it's not crazy (yet). Nonetheless, 
I've sent you my dumps with upload ID 
b95ee882-21e1-4ea1-a419-639a86acc785. The older dumps are from when 
mds.3 was under load, but they are all from mds.0. I also attached a 
newer batch, which I created just a few minutes ago.


Janek


--

Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de


[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Michel Jouvin

Hi Eugen,

Thank you very much for these detailed tests, which match what I observed
and reported earlier. I'm happy to see that we have the same
understanding of how it should work (based on the documentation). Is
there any way other than this list to get in contact with the plugin
developers, as it seems they are not following this (very high volume)
list... Or could somebody pass the email thread on to one of them?


Help would be really appreciated. Cheers,

Michel

Le 19/06/2023 à 14:09, Eugen Block a écrit :
Hi, I have a real hardware cluster for testing available now. I'm not 
sure whether I'm completely misunderstanding how it's supposed to work 
or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs), I 
intend to use 15 nodes to be able to recover if one node fails.
Given that I need one additional locality chunk per DC I need a 
profile with k + m = 12. So I chose k=9, m=3, l=4 which creates 15 
chunks in total across those 3 DCs, one chunk per host, I checked the 
chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 
crush-failure-domain=host crush-locality=room crush-device-class=hdd


I created a pool with only one PG to make the output more readable.

This profile should allow the cluster to sustain the loss of three 
chunks, the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with one 
missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being 
marked as "down". That was unexpected since with m=3 I expected the PG 
to still be active but degraded. Before test #3 I started all OSDs to 
have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and 
the PG was still active.


Apparently, this profile is able to sustain the loss of m chunks, but
not an entire DC. I get the impression (and I also discussed this with
a colleague) that LRC with this implementation is either designed only
for losing single OSDs, which can then be recovered more quickly from
fewer surviving OSDs, saving bandwidth, or this is a bug, because
according to the low-level description [1] the algorithm works its way
up in reverse order within the configured layers, like in this
example (not displaying my k, m, l requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be
recovered, and maybe step 2 also fails, eventually step 1 contains
the actual k and m chunks, which should sustain the loss of an entire
DC. My impression is that the algorithm somehow doesn't arrive at step
1 and therefore the PG stays down although there are enough surviving
chunks. I'm not sure if my observations and conclusion are correct;
I'd love to have a comment from the developers on this topic. But in
this state I would not recommend using the LRC plugin when the
resiliency requirement is to sustain the loss of an entire DC.


Thanks,
Eugen

[1] 
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

I realize that the crushmap I attached to one of my emails, probably
required to understand the discussion here, has been stripped by
mailman. To avoid polluting the thread with a long output, I put it
at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if
you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent. 
If you manage to do some tests that help me to understand the 
problem I remain interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the 
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a 
unique use
case to expand my knowledge. I don't use LRC or anything outside 
basic

erasure coding.

What is your current crush steps rule?  I know you made changes 
since your
first post and had some thoughts I wanted to share, but wanted to 
see your
rule first so I could try to visualize the distribution better. 
 The only
way I can currently visualize it working is with more servers, I'm 
thinking
6 or 9 per data center min, but that could be my lack of 

[ceph-users] Re: header_limit in AsioFrontend class

2023-06-19 Thread Casey Bodley
On Sat, Jun 17, 2023 at 8:37 AM Vahideh Alinouri
 wrote:
>
> Dear Ceph Users,
>
> I am writing to request the backporting of changes related to the
> AsioFrontend class, specifically regarding the header_limit value.
>
> In the Pacific release of Ceph, the header_limit value in the
> AsioFrontend class was set to 4096. Since the Quincy release, a
> configurable option has been introduced to set the header_limit value,
> and the default value is 16384.
>
> I would greatly appreciate it if someone from the Ceph development
> team could backport this change to the older version.
>
> Best regards,
> Vahideh Alinouri

Hi Vahideh, I've prepared that Pacific backport. You can follow its
progress in https://tracker.ceph.com/issues/61728


[ceph-users] Re: Starting v17.2.5 RGW SSE with default key (likely others) no longer works

2023-06-19 Thread Casey Bodley
On Sat, Jun 17, 2023 at 1:11 PM Jayanth Reddy
 wrote:
>
> Hello Folks,
>
> I've been experimenting with RGW encryption and found this out.
> Focusing on Quincy and Reef dev: for SSE (any method) to work, transit
> has to be encrypted end to end; however, if there is a proxy, [1] can
> be used to tell RGW that SSL is being terminated. As per the docs, RGW can
> still accept SSE if rgw_crypt_require_ssl is set to false, overriding
> the requirement for encryption in transit. Below are my
> observations.
>
> Until v17.2.3 (
> quay.io/ceph/ceph@sha256:43f6e905f3e34abe4adbc9042b9d6f6b625dee8fa8d93c2bae53fa9b61c3df1a),
> setting the same key as in [2] would leave the object unreadable when
> copied using
> # rados -p default.rgw.buckets.data get
> 03c2ef32-b7c8-4e18-8e0c-ebac10a42f10.17254.1_file.plain file.enc
> The object would be unreadable, while the original object is plain text.
> Of course, this was with rgw_crypt_require_ssl set to false or [1] in place.
>
> However, from v17.2.4 onwards, and even in my recent testing
> with reef-dev (18.0.0-4353-g1e3835ab
> 1e3835abb2d19ce6ac4149c260ef804f1041d751),
> when I get the same object onto disk using the rados command, the
> object (containing plain text) is still readable.
>
> Has something changed since v17.2.4? I'll also test with Pacific and let
> you know. Not sure if it affects other SSE mechanisms as well.
>
> [1]
> https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_trust_forwarded_https
> [2]
> https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only
>
> Thanks,
> Jayanth Reddy

Hi Jayanth,

17.2.4 coincides with backports of the SSE-S3 and PutBucketEncryption
features. Those changes include a regression where the
rgw_crypt_default_encryption_key configurable no longer applies. You
can track the fix for this in https://tracker.ceph.com/issues/61473
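
For anyone wanting to reproduce the check described above, a rough sketch (the
key is generated on the spot and is suitable for testing only, as [2] stresses;
the RADOS object name and file names are placeholders):

  # set a default encryption key for testing only (config database or ceph.conf),
  # then restart the RGW daemons
  ceph config set client.rgw rgw_crypt_default_encryption_key "$(openssl rand -base64 32)"

  # after uploading an object with any S3 client, fetch the raw RADOS object
  # and compare it with the plaintext
  rados -p default.rgw.buckets.data get <bucket-marker>_<object-name> file.raw
  cmp file.raw original.txt && echo "stored unencrypted"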


[ceph-users] Re: same OSD in multiple CRUSH hierarchies

2023-06-19 Thread Budai Laszlo

Hi,

Actually, I've learned that a rule does not need to start from a root
bucket, so I can have rules that will only consider a subtree of my total
resources and achieve what I was trying to do with the different disjoint
hierarchies.
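
For example, a replicated rule bound to a specific subtree can be created like
this (a sketch; the bucket, rule and pool names are made up):

  # rule that only places data on OSDs under the CRUSH bucket "app1-root",
  # spreading replicas across hosts (optionally restricted to a device class)
  ceph osd crush rule create-replicated app1-rule app1-root host
  ceph osd crush rule create-replicated app1-ssd-rule app1-root host ssd

  # use it for a pool
  ceph osd pool set app1-pool crush_rule app1-rule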

BTW: it is possible to have different trees with different roots, with some 
OSDs being part of multiple such trees, and create different rules that will 
start with one root or the other. But I was told that this could mess up the 
calculation for pg autoscaler and other housekeeping functions. So it seems a 
better option to have each OSD in one single tree, and use rules that will only 
consider subtrees ...

Regards,
Laszlo


Date: Mon, 19 Jun 2023 07:41:35 +
From: Eugen Block
Subject: [ceph-users] Re: same OSD in multiple CRUSH hierarchies
To:ceph-users@ceph.io

Hi,
I don't think this is going to work. Each OSD belongs to a specific
host and you can't have multiple buckets (e.g. bucket type "host")
with the same name in the crush tree. But if I understand your
requirement correctly, there should be no need to do it this way. If
you structure your crush tree according to your separation
requirements and the critical pools use designated rules, you can
still have a rule that doesn't care about the data separation but
distributes the replicas across the available hosts (given your
failure domain would be "host"), which is already the default for the
replicated_rule. Did I misunderstand something?

Regards,
Eugen


Zitat von Budai Laszlo:


Hi there,

I'm curious if there is anything against configuring an OSD to be
part of multiple CRUSH hierarchies. I'm thinking of the following
scenario:

I want to create pools that use distinct sets of OSDs. I want
to make sure that a piece of data which is replicated at the application level
will not end up on the same OSD. So I would create multiple CRUSH
hierarchies (root - host - osd) but using different OSDs in each,
and different rules that are using those hierarchies. Then I would
create pools with the different rules, and use those different pools
for storing the data for the different application instances. But I
would also like to use the OSDs in the "default hierarchy" set up by
ceph where all the hosts are in the same root bucket, and have the
default replicated rule, so my generic data volumes would be able to
spread across all the OSDs available.

Is there something against this setup?

Thank you for any advice!
Laszlo


[ceph-users] How does a "ceph orch restart SERVICE" affect availability?

2023-06-19 Thread Mikael Öhman
The documentation very briefly explains a few core commands for restarting
things;
https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons
but I feel I'm lacking quite some details of what is safe to do.

I have a system in production, clusters connected via CephFS and some
shared block devices.
We would like to restart some things due to some new network
configurations. Going daemon by daemon would take forever, so I'm curious
as to what happens if one tries the command;

ceph orch restart osd

Will that try to be smart and just restart a few at a time to keep things
up and available? Or will it just trigger a restart everywhere
simultaneously?

I guess in my current scenario, restarting one host at a time makes the most
sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host, but I'm still curious as to what the "ceph orch restart xxx" command
would do (but not curious enough to try it out in production).
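
A per-host sketch of that approach (assuming the cluster starts out healthy;
the host names and {fsid} are placeholders):

  for host in host1 host2 host3; do
      ssh "$host" "systemctl restart ceph-{fsid}.target"
      # wait until the cluster reports healthy again before the next host
      until ceph health | grep -q HEALTH_OK; do sleep 30; done
  done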

Best regards, Mikael
Chalmers University of Technology


[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)

2023-06-19 Thread Maged Mokhtar
Windows Clustered Shared Volumes and Failover Clustering require the
support of clustered persistent reservations by the block device to
coordinate access by multiple hosts. The default iSCSI implementation in
Ceph does not support this; you can use the iSCSI implementation in the
PetaSAN project:


www.petasan.org

which supports this feature and provides a high performance 
implementation. We currently use Ceph 17.2.5
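
As a side note, whether a given iSCSI LUN advertises SCSI-3 persistent
reservations can be checked from a Linux initiator roughly like this (a sketch;
it assumes the sg3_utils package, and /dev/sdX is a placeholder for the mapped LUN):

  sg_persist --in --report-capabilities /dev/sdX   # PR capabilities of the LUN
  sg_persist --in --read-keys /dev/sdX             # currently registered reservation keys, if any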



On 19/06/2023 14:47, Work Ceph wrote:

Hello guys,

We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows
clients.

Recently, we had the need to add some VMware clusters as clients for the
iSCSI GW and also Windows systems with the use of Clustered Storage Volumes
(CSV), and we are facing a weird situation. In Windows, for instance, the
iSCSI block can be mounted, formatted and consumed by all nodes, but when
we add it to a CSV it fails with some generic exception. The same happens in
VMware: when we try to use it with VMFS it fails.

We do not seem to find the root cause for these errors. However, the errors
seem to be linked to the situation of multiple nodes consuming the same
block by shared file systems. Have you guys seen this before?

Are we missing some basic configuration in the iSCSI GW?


[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)

2023-06-19 Thread Robert Sander

On 19.06.23 13:47, Work Ceph wrote:


Recently, we had the need to add some VMWare clusters as clients for the
iSCSI GW and also Windows systems with the use of Clustered Storage Volumes
(CSV), and we are facing a weird situation. In windows for instance, the
iSCSI block can be mounted, formatted and consumed by all nodes, but when
we add in the CSV it fails with some generic exception. The same happens in
VMWare, when we try to use it with VMFS it fails.


The iSCSI target used does not support SCSI persistent group 
reservations when in multipath mode.


https://docs.ceph.com/en/quincy/rbd/iscsi-initiators/

AFAIK VMware uses these in VMFS.

Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin


[ceph-users] Re: Grafana service fails to start due to bad directory name after Quincy upgrade

2023-06-19 Thread Eugen Block

Hi,

so grafana is starting successfully now? What did you change?  
Regarding the container images, yes there are defaults in cephadm  
which can be overridden with ceph config. Can you share this output?


ceph config dump | grep container_image

I tend to always use a specific image as described here [2]. I also  
haven't deployed grafana via dashboard yet so I can't really comment  
on that as well as on the warnings you report.


Regards,
Eugen

[2]  
https://docs.ceph.com/en/latest/cephadm/services/monitoring/#using-custom-images
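
If the dashboard keeps deploying an old Grafana, one workaround that might be
worth trying (a sketch based on [2]; the image tags are assumptions and should
be checked against the release notes) is to pin the monitoring images via the
cephadm module and redeploy:

  ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:8.3.5
  ceph config set mgr mgr/cephadm/container_image_prometheus quay.io/prometheus/prometheus:v2.33.4
  ceph orch redeploy grafana
  ceph orch redeploy prometheus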


Zitat von "Adiga, Anantha" :


Hi Eugen,

Thank you for your response, here is the update.

The upgrade to Quincy was done  following the cephadm orch upgrade procedure
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6

The upgrade completed without errors. After the upgrade, upon creating
the Grafana service from the Ceph dashboard, it deployed Grafana 6.7.4.
The version is hardcoded in the code; should it not be 8.3.5, as
listed in the Quincy documentation? See below.


[Grafana service started from Ceph dashboard]

Quincy documentation states: https://docs.ceph.com/en/latest/releases/quincy/
……documentation snippet
Monitoring and alerting:
43 new alerts have been added (totalling 68) improving observability  
of events affecting: cluster health, monitors, storage devices, PGs  
and CephFS.
Alerts can now be sent externally as SNMP traps via the new SNMP  
gateway service (the MIB is provided).

Improved integrated full/nearfull event notifications.
Grafana Dashboards now use grafonnet format (though they’re still  
available in JSON format).
Stack update: images for monitoring containers have been updated.  
Grafana 8.3.5, Prometheus 2.33.4, Alertmanager 0.23.0 and Node  
Exporter 1.3.1. This reduced exposure to several Grafana  
vulnerabilities (CVE-2021-43798, CVE-2021-39226, CVE-2021-43798,  
CVE-2020-29510, CVE-2020-29511).

……….

I notice that the versions of the rest of the stack that the Ceph
dashboard deploys are also older than what is documented:
Prometheus 2.7.2, Alertmanager 0.16.2 and Node Exporter 0.17.0.


And the Grafana 6.7.4 service reports a few warnings, highlighted below:

root@fl31ca104ja0201:/home/general# systemctl status  
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service
●  
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service -  
Ceph grafana.fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
 Loaded: loaded  
(/etc/systemd/system/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@.service;  
enabled; vendor preset: enabled)

 Active: active (running) since Tue 2023-06-13 03:37:58 UTC; 11h ago
   Main PID: 391896 (bash)
  Tasks: 53 (limit: 618607)
 Memory: 17.9M
 CGroup:  
/system.slice/system-ceph\x2dd0a3b6e0\x2dd2c3\x2d11ed\x2dbe05\x2da7a3a1d7a87e.slice/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104j>
 ├─391896 /bin/bash  
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana.fl31ca104ja0201/unit.run
 └─391969 /usr/bin/docker run --rm --ipc=host  
--stop-signal=SIGTERM --net=host --init --name  
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-grafana-fl>
-- Logs begin at Sun 2023-06-11 20:41:51 UTC, end at Tue 2023-06-13  
15:35:12 UTC. --
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="alter user_auth.auth_id to length 190"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="Add OAuth access token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="Add OAuth refresh token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="Add OAuth token type to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="Add OAuth expiry to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="Add index to user_id column in user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="create server_lock table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="add index server_lock.operation_uid"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="create user auth token table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  
t=2023-06-13T03:37:59+ lvl=info msg="Executing migration"  
logger=migrator id="add unique index user_auth_token.auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:  

[ceph-users] Re: Help needed to configure erasure coding LRC plugin

2023-06-19 Thread Eugen Block
Hi, I have a real hardware cluster for testing available now. I'm not  
sure whether I'm completely misunderstanding how it's supposed to work  
or if it's a bug in the LRC plugin.
This cluster has 18 HDD nodes available across 3 rooms (or DCs), I  
intend to use 15 nodes to be able to recover if one node fails.
Given that I need one additional locality chunk per DC I need a  
profile with k + m = 12. So I chose k=9, m=3, l=4 which creates 15  
chunks in total across those 3 DCs, one chunk per host, I checked the  
chunk placement and it is correct. This is the profile I created:


ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4  
crush-failure-domain=host crush-locality=room crush-device-class=hdd


I created a pool with only one PG to make the output more readable.
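
For reference, a minimal sketch of how such a single-PG test pool can be
created from the profile above and its chunk placement inspected (the pool
name is arbitrary and <pgid> is a placeholder):

  ceph osd pool create lrc-test 1 1 erasure lrc1
  ceph pg ls-by-pool lrc-test    # note the PG id
  ceph pg map <pgid>             # shows the up/acting set, i.e. the chunk placement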

This profile should allow the cluster to sustain the loss of three  
chunks, the results are interesting. This is what I tested:


1. I stopped all OSDs on one host and the PG was still active with one  
missing chunk, everything's good.
2. Stopping a second host in the same DC resulted in the PG being  
marked as "down". That was unexpected since with m=3 I expected the PG  
to still be active but degraded. Before test #3 I started all OSDs to  
have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing and  
the PG was still active.


Apparently, this profile is able to sustain the loss of m chunks, but
not an entire DC. I get the impression (and I also discussed this with
a colleague) that LRC with this implementation is either designed only
for losing single OSDs, which can then be recovered more quickly from
fewer surviving OSDs, saving bandwidth, or this is a bug, because
according to the low-level description [1] the algorithm works its way
up in reverse order within the configured layers, like in this
example (not displaying my k, m, l requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be
recovered, and maybe step 2 also fails, eventually step 1 contains
the actual k and m chunks, which should sustain the loss of an entire
DC. My impression is that the algorithm somehow doesn't arrive at step
1 and therefore the PG stays down although there are enough surviving
chunks. I'm not sure if my observations and conclusion are correct;
I'd love to have a comment from the developers on this topic. But in
this state I would not recommend using the LRC plugin when the
resiliency requirement is to sustain the loss of an entire DC.
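
For reference, the chunk diagram above corresponds to the low-level layers
example in [1], which (as far as I recall, for k=4, m=2, l=3, not the profile
tested here) looks roughly like:

  ceph osd erasure-code-profile set LRCprofile \
      plugin=lrc \
      mapping=__DD__DD \
      layers='[
          [ "_cDD_cDD", "" ],
          [ "cDDD____", "" ],
          [ "____cDDD", "" ]
      ]'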


Thanks,
Eugen

[1]  
https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration


Zitat von Michel Jouvin :


Hi,

I realize that the crushmap I attached to one of my emails, probably
required to understand the discussion here, has been stripped
by mailman. To avoid polluting the thread with a long output, I put
it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download
it if you are interested.


Best regards,

Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen,

My LRC pool is also somewhat experimental so nothing really urgent.  
If you manage to do some tests that help me to understand the  
problem I remain interested. I propose to keep this thread for that.


Zitat, I shared my crush map in the email you answered if the  
attachment was not suppressed by mailman.


Cheers,

Michel
Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block  a écrit :


Hi, I don’t have a good explanation for this yet, but I’ll soon get
the opportunity to play around with a decommissioned cluster. I’ll try
to get a better understanding of the LRC plugin, but it might take
some time, especially since my vacation is coming up. :-)
I have some thoughts about the down PGs with failure domain OSD, but I
don’t have anything to confirm it yet.

Zitat von Curt :


Hi,

I've been following this thread with interest as it seems like a  
unique use

case to expand my knowledge. I don't use LRC or anything outside basic
erasure coding.

What is your current crush steps rule?  I know you made changes since your
first post and had some thoughts I wanted to share, but wanted to see your
rule first so I could try to visualize the distribution better.  The only
way I can currently visualize it working is with more servers,  
I'm thinking

6 or 9 per data center min, but that could be my lack of knowledge on some
of the step rules.

Thanks
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin <
michel.jou...@ijclab.in2p3.fr> wrote:


Hi Eugen,

Yes, sure, no problem to share it. I attach it to this email (as it may
clutter the discussion if inline).

If somebody on the list has some clue on the LRC plugin, I'm still
interested by understand what I'm doing wrong!

Cheers,

Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi,

I don't think you've shared your osd tree 

[ceph-users] Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)

2023-06-19 Thread Work Ceph
Hello guys,

We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD
for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows
clients.

Recently, we had the need to add some VMware clusters as clients for the
iSCSI GW and also Windows systems with the use of Clustered Storage Volumes
(CSV), and we are facing a weird situation. In Windows, for instance, the
iSCSI block can be mounted, formatted and consumed by all nodes, but when
we add it to a CSV it fails with some generic exception. The same happens in
VMware: when we try to use it with VMFS it fails.

We do not seem to find the root cause for these errors. However, the errors
seem to be linked to the situation of multiple nodes consuming the same
block by shared file systems. Have you guys seen this before?

Are we missing some basic configuration in the iSCSI GW?


[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-19 Thread Jayanth Reddy
Hello Weiwen,

Thank you for the response. I've attached the output for all PGs in state
incomplete and remapped+incomplete. Thank you!

Thanks,
Jayanth Reddy

On Sat, Jun 17, 2023 at 11:00 PM 胡 玮文  wrote:

> Hi Jayanth,
>
> Can you post the complete output of “ceph pg <pgid> query”? So that we can
> understand the situation better.
>
> Can you get OSD 3 or 4 back into the cluster? If you are sure they cannot
> rejoin, you may try “ceph osd lost <id>” (the docs say this may result in
> permanent data loss. I didn’t have a chance to try this myself).
>
> Weiwen Hu
>
> > On 18 Jun 2023, at 00:26, Jayanth Reddy wrote:
> >
> > Hello Nino / Users,
> >
> > After some initial analysis, I had increased max_pg_per_osd to 480, but
> > we're out of luck. I also tried force-backfill and force-repair.
> > On querying a PG using "# ceph pg <pgid> query", the output lists blocked_by
> > with 3 to 4 OSDs which are already out of the cluster. Guessing these have
> > something to do with the recovery.
> >
> > Thanks,
> > Jayanth Reddy
> >
> >> On Sat, Jun 17, 2023 at 12:31 PM Jayanth Reddy <
> jayanthreddy5...@gmail.com>
> >> wrote:
> >>
> >> Thanks, Nino.
> >>
> >> Would give these initial suggestions a try and let you know at the
> >> earliest.
> >>
> >> Regards,
> >> Jayanth Reddy
> >> --
> >> *From:* Nino Kotur 
> >> *Sent:* Saturday, June 17, 2023 12:16:09 PM
> >> *To:* Jayanth Reddy 
> >> *Cc:* ceph-users@ceph.io 
> >> *Subject:* Re: [ceph-users] EC 8+3 Pool PGs stuck in remapped+incomplete
> >>
> >> The problem is just that some of your OSDs have too many PGs and the pool cannot
> >> recover, as it cannot create more PGs:
> >>
> >> [osd.214,osd.223,osd.548,osd.584] have slow ops.
> >>    too many PGs per OSD (330 > max 250)
> >>
> >> I'd have to guess that the safest thing would be permanently or
> >> temporarily adding more storage so that PGs drop below 250; another option
> >> is just dropping down the total number of PGs, but I don't know if I would
> >> perform that action before my pool was healthy!
> >>
> >> In case there is only one OSD that has this number of PGs but all
> >> other OSDs have fewer than 100-150, then you can just reweight the problematic
> >> OSD so it rebalances those "too many PGs".
> >>
> >> But it looks to me that you have way too many PGs, which is also
> >> negatively impacting performance.
> >>
> >> Another option is to increase the max allowed PGs per OSD to, say, 350; this
> >> should also allow the cluster to rebuild. Honestly, even though this may be the
> >> easiest option, I'd never do this: performance suffers greatly with over 150 PGs
> >> per OSD.
> >>
> >>
> >> kind regards,
> >> Nino
> >>
> >>
> >> On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy <
> jayanthreddy5...@gmail.com>
> >> wrote:
> >>
> >> Hello Users,
> >> Greetings. We've a Ceph Cluster with the version
> >> *ceph version 14.2.5-382-g8881d33957
> >> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
> >>
> >> 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
> >> incomplete+remapped states. Below are the PGs,
> >>
> >> # ceph pg dump_stuck inactive
> >> ok
> >> PG_STAT STATE   UP
> >> UP_PRIMARY ACTING
> >> ACTING_PRIMARY
> >> 15.251e  incomplete
> [151,464,146,503,166,41,555,542,9,565,268]
> >> 151
> >> [151,464,146,503,166,41,555,542,9,565,268]151
> >> 15.3f3   incomplete
> [584,281,672,699,199,224,239,430,355,504,196]
> >> 584
> >> [584,281,672,699,199,224,239,430,355,504,196]584
> >> 15.985  remapped+incomplete
> [396,690,493,214,319,209,546,91,599,237,352]
> >> 396
> >>
> >>
> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
> >>   214
> >> 15.39d3 remapped+incomplete
> [404,221,223,585,38,102,533,471,568,451,195]
> >> 404
> >> [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
> >> 223
> >> 15.d46  remapped+incomplete
> [297,646,212,254,110,169,500,372,623,470,678]
> >> 297
> >>
> [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
> >> 548
> >>
> >> Some of the OSDs had gone down on the cluster. Below is the # ceph
> status
> >>
> >> # ceph -s
> >>  cluster:
> >>id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
> >>health: HEALTH_WARN
> >>noscrub,nodeep-scrub flag(s) set
> >>1 pools have many more objects per pg than average
> >>Reduced data availability: 5 pgs inactive, 5 pgs incomplete
> >>Degraded data redundancy: 44798/8718528059 objects degraded
> >> (0.001%), 1 pg degraded, 1 pg undersized
> >>22726 pgs not deep-scrubbed in time
> >>23552 pgs not scrubbed in time
> >>77 slow ops, oldest one blocked for 56400 sec, daemons
> >> [osd.214,osd.223,osd.548,osd.584] have slow ops.
> >>too many PGs per OSD (330 > max 250)
> >>
> >>  services:
> >>mon: 3 daemons, quorum 

[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-19 Thread Jayanth Reddy
Hello Weiwen,

Thank you for the response. I've attached the output for all PGs in state
incomplete and remapped+incomplete. Thank you!

Thanks,
Jayanth Reddy

On Sun, Jun 18, 2023 at 4:09 PM Jayanth Reddy 
wrote:

> Hello Weiwen,
>
> Thank you for the response. I've attached the output for all PGs in state
> incomplete and remapped+incomplete. Thank you!
>
> Thanks,
> Jayanth Reddy
>
> On Sat, Jun 17, 2023 at 11:00 PM 胡 玮文  wrote:
>
>> Hi Jayanth,
>>
>> Can you post the complete output of “ceph pg  query”? So that we can
>> understand the situation better.
>>
>> Can you get OSD 3 or 4 back into the cluster? If you are sure they cannot
>> rejoin, you may try “ceph osd lost ” (doc says this may result in
>> permanent data lost. I didn’t have a chance to try this myself).
>>
>> Weiwen Hu
>>
>> > On 18 Jun 2023, at 00:26, Jayanth Reddy wrote:
>> >
>> > Hello Nino / Users,
>> >
>> > After some initial analysis, I had increased max_pg_per_osd to 480, but
>> > we're out of luck. Also tried force-backfill and force-repair as well.
>> > On querying PG using *# ceph pg ** query* the output says
>> blocked_by
>> > 3 to 4 OSDs which are out of the cluster already. Guessing if these
>> have to
>> > do something with the recovery.
>> >
>> > Thanks,
>> > Jayanth Reddy
>> >
>> >> On Sat, Jun 17, 2023 at 12:31 PM Jayanth Reddy <
>> jayanthreddy5...@gmail.com>
>> >> wrote:
>> >>
>> >> Thanks, Nino.
>> >>
>> >> Would give these initial suggestions a try and let you know at the
>> >> earliest.
>> >>
>> >> Regards,
>> >> Jayanth Reddy
>> >> --
>> >> *From:* Nino Kotur 
>> >> *Sent:* Saturday, June 17, 2023 12:16:09 PM
>> >> *To:* Jayanth Reddy 
>> >> *Cc:* ceph-users@ceph.io 
>> >> *Subject:* Re: [ceph-users] EC 8+3 Pool PGs stuck in
>> remapped+incomplete
>> >>
>> >> problem is just that some of your OSDs have too much PGs and pool
>> cannot
>> >> recover as it cannot create more PGs
>> >>
>> >> [osd.214,osd.223,osd.548,osd.584] have slow ops.
>> >>too many PGs per OSD (330 > max 250)
>> >>
>> >> I'd have to guess that the safest thing would be permanently or
>> >> temporarily adding more storage so that PGs drop below 250, another
>> option
>> >> is just dropping down the total number of PGs but I don't know if I
>> would
>> >> perform that action before my pool was healthy!
>> >>
>> >> in case that there is only one OSD that has this number of OSDs but all
>> >> other OSDs have less than 100-150 than you can just reweight
>> problematic
>> >> OSD so it rebalances those "too many PGs"
>> >>
>> >> But it looks to me that you have way too many PGs which is also super
>> >> negatively impacting performance.
>> >>
>> >> Another option is to increase max allowed PGs per OSD to say 350 this
>> >> should also allow cluster to rebuild honestly even tho this may be
>> easiest
>> >> option, i'd never do this, performance cost of having over 150 PGs per
>> OSD
>> >> suffer greatly.
>> >>
>> >>
>> >> kind regards,
>> >> Nino
>> >>
>> >>
>> >> On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy <
>> jayanthreddy5...@gmail.com>
>> >> wrote:
>> >>
>> >> Hello Users,
>> >> Greetings. We've a Ceph Cluster with the version
>> >> *ceph version 14.2.5-382-g8881d33957
>> >> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
>> >>
>> >> 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
>> >> incomplete+remapped states. Below are the PGs,
>> >>
>> >> # ceph pg dump_stuck inactive
>> >> ok
>> >> PG_STAT STATE   UP
>> >> UP_PRIMARY ACTING
>> >> ACTING_PRIMARY
>> >> 15.251e  incomplete
>> [151,464,146,503,166,41,555,542,9,565,268]
>> >> 151
>> >> [151,464,146,503,166,41,555,542,9,565,268]151
>> >> 15.3f3   incomplete
>> [584,281,672,699,199,224,239,430,355,504,196]
>> >> 584
>> >> [584,281,672,699,199,224,239,430,355,504,196]584
>> >> 15.985  remapped+incomplete
>> [396,690,493,214,319,209,546,91,599,237,352]
>> >> 396
>> >>
>> >>
>> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
>> >>   214
>> >> 15.39d3 remapped+incomplete
>> [404,221,223,585,38,102,533,471,568,451,195]
>> >> 404
>> >>
>> [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
>> >> 223
>> >> 15.d46  remapped+incomplete
>> [297,646,212,254,110,169,500,372,623,470,678]
>> >> 297
>> >>
>> [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
>> >> 548
>> >>
>> >> Some of the OSDs had gone down on the cluster. Below is the # ceph
>> status
>> >>
>> >> # ceph -s
>> >>  cluster:
>> >>id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
>> >>health: HEALTH_WARN
>> >>noscrub,nodeep-scrub flag(s) set
>> >>1 pools have many more objects per pg than average
>> >>Reduced data availability: 5 pgs inactive, 5 pgs incomplete
>> >>Degraded data redundancy: 44798/8718528059 objects degraded
>> >> (0.001%), 1 pg 

[ceph-users] Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation

2023-06-19 Thread Frédéric Nass

Hello, 

This message does not concern Ceph itself but a hardware vulnerability which 
can lead to permanent loss of data on a Ceph cluster equipped with the same 
hardware in separate fault domains. 

The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives 
of the 13G generation of DELL servers are subject to a vulnerability which 
renders them unusable after 70,000 hours of operation, i.e. approximately 7 
years and 11 months of activity. 

This topic has been discussed here: 
https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438
 

The risk is all the greater since these disks may die at the same time in the 
same server leading to the loss of all data in the server. 

To date, DELL has not provided any firmware fixing this vulnerability, the 
latest firmware version being "A3B3", released on Sept. 12, 2016: 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k

If you have servers running these drives, check their uptime. If they are 
close to the 70,000-hour limit, replace them immediately.

The smartctl tool does not report the power-on hours for these SSDs, but if you have 
HDDs in the same server, you can query their SMART status and get their power-on 
hours, which should be about the same as for the SSDs. 
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the 
drive's device ID on the MegaRAID controller).
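
A rough sketch for surveying power-on hours across drives behind a MegaRAID/PERC
controller (the device-ID range and /dev/sda are placeholders; the exact SMART
field name differs between SAS and SATA drives):

  for id in $(seq 0 15); do
      echo "=== megaraid device ID $id ==="
      smartctl -a -d megaraid,$id /dev/sda 2>/dev/null | grep -iE 'serial number|power.?on'
  done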

We have informed DELL about this but have no information yet on the arrival of 
a fix. 

We have lost 6 disks, in 3 different servers, in the last few weeks. Our 
observation shows that the drives don't survive full shutdown and restart of 
the machine (power off then power on in iDrac), but they may also die during a 
single reboot (init 6) or even while the machine is running. 

Fujitsu released a corrective firmware in June 2021 but this firmware is most 
certainly not applicable to DELL drives: 
https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf 

Regards, 
Frederic 

Sous-direction Infrastructure and Services 
Direction du Numérique 
Université de Lorraine 


[ceph-users] Re: OpenStack (cinder) volumes retyping on Ceph back-end

2023-06-19 Thread Eugen Block

Hi,

I don't quite understand the issue yet, maybe you can clarify.

If I perform a "change volume type" from OpenStack on volumes  
attached to the VMs the system successfully migrates the volume from  
the source pool to the destination pool and at the end of the  
process the volume is visible in the new pool and is removed from  
the old pool.


Are these volumes root disks or just additional volumes? But  
apparently, the retype works.


The problem encountered is that when reconfiguring the VM, to  
specify the new pool associated with the volumes (performed through  
a resize of the VM, I haven't found any other method to change the  
information on the nova/cinder db automatically.


If the retype already works then what is your goal by "reconfiguring  
the VM"? What information is wrong in the DB? This part needs some  
clarification for me. Can you give some examples?


The VM after the retype continues to work perfectly in RW but the  
"new" volume created in the new pool is not used to write data and  
consequently when the VM is shut down all the changes are lost.


Just wondering, did you shut down the VM before retyping the volume?  
I'll try to reproduce this in a test cluster.
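
One way to see which RBD image a running VM is actually writing to might be the
following (a sketch, assuming libvirt/KVM compute nodes and admin access to the
cluster; the domain, pool and image names are placeholders):

  virsh domblklist <domain>      # disks currently attached to the VM, as libvirt sees them
  rbd status <pool>/<image>      # watchers on the image; an active writer should show up here
  rbd info <old-pool>/<image>    # check whether the old image still exists after the retype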


Regards,
Eugen

Zitat von andrea.mar...@oscct.it:


Hello,
I configured different back-end storage on OpenStack (Yoga release)  
and using Ceph (ceph version 17.2.4) with different pools (volumes,  
cloud-basic, shared-hosting-os, shared-hosting-homes,...) for RBD  
application.
I created different volume types towards each of the backends and  
everything works perfectly.
If I perform a "change volume type" from OpenStack on volumes  
attached to the VMs the system successfully migrates the volume from  
the source pool to the destination pool and at the end of the  
process the volume is visible in the new pool and is removed from  
the old pool.
The problem encountered is that when reconfiguring the VM to
point at the new pool associated with the volumes (performed through
a resize of the VM, as I haven't found any other method to change the
information in the nova/cinder DB automatically; I also did some
tests shutting off the VM, modifying the XML through virsh
edit and starting the VM again), the volume presented to the VM is exactly
the version and content as of the retype date of the volume itself. All
data written and modified after the retype is lost.
The VM after the retype continues to work perfectly in RW but the  
"new" volume created in the new pool is not used to write data and  
consequently when the VM is shut down all the changes are lost.
Do you have any idea how to carry out a check and possibly how to  
proceed in order not to lose the data of the vm of which I have  
retyped the volume?

The data is written somewhere because the VMs work perfectly.
Thank you


[ceph-users] Re: same OSD in multiple CRUSH hierarchies

2023-06-19 Thread Eugen Block

Hi,
I don't think this is going to work. Each OSD belongs to a specific  
host and you can't have multiple buckets (e.g. bucket type "host")  
with the same name in the crush tree. But if I understand your  
requirement correctly, there should be no need to do it this way. If  
you structure your crush tree according to your separation  
requirements and the critical pools use designated rules, you can  
still have a rule that doesn't care about the data separation but  
distributes the replicas across the available hosts (given your  
failure domain would be "host"), which is already the default for the  
replicated_rule. Did I misunderstand something?


Regards,
Eugen


Zitat von Budai Laszlo :


Hi there,

I'm curious if there is anything against configuring an OSD to be
part of multiple CRUSH hierarchies. I'm thinking of the following
scenario:

I want to create pools that use distinct sets of OSDs. I want
to make sure that a piece of data which is replicated at the application level
will not end up on the same OSD. So I would create multiple CRUSH
hierarchies (root - host - osd) but using different OSDs in each,
and different rules that are using those hierarchies. Then I would  
create pools with the different rules, and use those different pools  
for storing the data for the different application instances. But I  
would also like to use the OSDs in the "default hierarchy" set up by  
ceph where all the hosts are in the same root bucket, and have the  
default replicated rule, so my generic data volumes would be able to  
spread across all the OSDs available.


Is there something against this setup?

Thank you for any advice!
Laszlo


[ceph-users] autocaling not work and active+remapped+backfilling

2023-06-19 Thread farhad kh
Hi,

I have a problem with Ceph 17.2.6 (CephFS with MDS daemons) and I am seeing
unusual behavior.
I created a data pool with the default CRUSH rule, but data is only stored on 3
specific OSDs while the other OSDs stay clean.
PG auto-scaling is also active, but the PG count does not change when the pool
gets bigger.
I changed it manually, but the problem was not solved and I got the warning that
PGs are not balanced across OSDs.
How do I solve this problem? Is this a bug? I did not have this problem in
previous versions.

I have since solved that part: there were several identical CRUSH rules, all using

step chooseleaf firstn 0 type host

I think this confused the balancer and the autoscaler, and the output of "ceph osd
pool autoscale-status" was empty.
After removing the other CRUSH rules the autoscaler is running again,
but moving data from the full OSDs to the clean OSDs is slow. By reducing the
weight of the filled OSDs with "ceph osd reweight-by-utilization" I am trying to
prioritize the use of the other OSDs, and I hope this works.
Is there a solution that makes the process of autoscaling and cleaning up
placement groups faster?
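
On 17.2.6 (where mClock is the default OSD scheduler), a couple of knobs and
checks that might help (a sketch; revert the profile once recovery has caught up):

  ceph config set osd osd_mclock_profile high_recovery_ops   # prioritize recovery/backfill over client I/O
  ceph balancer status
  ceph osd pool autoscale-status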

---

[root@opcsdfpsbpp0201 ~]# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "r3-host",
"type": 1,
"steps": [
{
"op": "take",
"item": -2,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "r3",
"type": 1,
"steps": [
{
"op": "take",
"item": -2,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]



# ceph osd status | grep back
23  opcsdfpsbpp0211  1900G   147G  00   00
backfillfull,exists,up
48  opcsdfpsbpp0201  1900G   147G  00   00
backfillfull,exists,up
61  opcsdfpsbpp0205  1900G   147G  00   00
backfillfull,exists,up

--

Every 2.0s: ceph -s

 opcsdfpsbpp0201: Sun Jun 18 11:44:29 2023

  cluster:
id: 79a2627c-0821-11ee-a494-00505695c58c
health: HEALTH_WARN
3 backfillfull osd(s)
6 pool(s) backfillfull

  services:
mon: 3 daemons, quorum opcsdfpsbpp0201,opcsdfpsbpp0205,opcsdfpsbpp0203
(age 6d)
mgr: opcsdfpsbpp0201.vttwxa(active, since 5d), standbys:
opcsdfpsbpp0205.tpodbs, opcsdfpsbpp0203.jwjkcl
mds: 1/1 daemons up, 2 standby
osd: 74 osds: 74 up (since 7d), 74 in (since 7d); 107 remapped pgs

  data:
volumes: 1/1 healthy
pools:   6 pools, 359 pgs
objects: 599.64k objects, 2.2 TiB
usage:   8.1 TiB used, 140 TiB / 148 TiB avail
pgs: 923085/1798926 objects misplaced (51.313%)
 252 active+clean
 87  active+remapped+backfill_wait
 20  active+remapped+backfilling

  io:
client:   255 B/s rd, 0 op/s rd, 0 op/s wr
recovery: 33 MiB/s, 8 objects/s

  progress:
Global Recovery Event (5h)
  [===.] (remaining: 2h)


[ceph-users] cephfs mount with kernel driver

2023-06-19 Thread farhad kh
I noticed that in my scenario, when I mount CephFS via the kernel module,
data is copied directly to only one or three of the OSDs, and the writing speed
of the client is higher than the speed of replication and auto-scaling. This
causes write operations to stop as soon as those OSDs are filled, with an error
that no free space is available. What should be done to solve this problem? Is
there a way to increase the speed of scaling or of moving objects between OSDs?
Or a way to mount CephFS that does not have these problems?
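
A few checks that might narrow this down (a sketch; <data-pool> is a placeholder
for the CephFS data pool):

  ceph osd df tree                       # per-OSD utilization and PG counts
  ceph osd pool autoscale-status
  ceph osd pool get <data-pool> pg_num
  # a pool with far too few PGs concentrates writes on a few OSDs; pg_num can be
  # raised directly, or the autoscaler can be given a sizing hint:
  ceph osd pool set <data-pool> target_size_ratio 0.8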