[ceph-users] Re: Ceph newbee questions

2023-12-22 Thread Anthony D'Atri


>> 
>> You can do that for a PoC, but that's a bad idea for any production 
>> workload.  You'd want at least three nodes with OSDs to use the default RF=3 
>> replication.  You can do RF=2, but at the peril of your mortal data.
> 
> I'm not sure I agree - I think size=2, min_size=2 is no worse than
> RAID1 for data security.

size=2, min_size=2 *is* RAID1.  Except that you become unavailable if a single 
drive is unavailable.

> That isn't even the main risk as I understand it.  Of course a double
> failure is going to be a problem with size=2, or traditional RAID1,
> and I think anybody choosing this configuration accepts this risk.

We see people often enough who don’t know that.  I’ve seen double failures.  
ymmv.


>  As I understand it, the reason min_size=1 is a trap has nothing to do
> with double failures per se.

It’s one of the concerns.

> 
> The issue is that Ceph OSDs are somewhat prone to flapping during
> recovery (OOM, etc).  So even if the disk is fine, an OSD can go down
> for a short time.  If you have size=2, min=1 configured, then when
> this happens the PG will become degraded and will continue operating
> on the other OSD, and the flapping OSD becomes stale.  Then when it
> comes back up it recovers.  The problem is that if the other OSD has a
> permanent failure (disk crash/etc) while the first OSD is flapping,
> now you have no good OSDs, because when the flapping OSD comes back up
> it is stale, and its PGs have no peer.  

Indeed, arguably that’s an overlapping failure.  I’ve seen this too, and have a 
pg query to demonstrate it.
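
For the curious, that's just the output of something along these lines; the PG
ID below is a placeholder, you'd take a real one from "ceph health detail":

  # Dump a placement group's peering and recovery state, including which
  # OSDs it knows about and why it cannot go active+clean
  ceph pg 2.1f query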


> I suspect there are ways to re-activate it, though this will result in 
> potential data
> inconsistency since writes were allowed to the cluster and will then
> get rolled back.

Yep.

> With only two OSDs I'm guessing that would be the
> main impact (well, depending on journaling behavior/etc), but if you
> have more OSDs than that then you could have situations where one file
> is getting rolled back, and some other file isn't, and so on.

But you’d have a voting majority.

> 
> With min_size=2 you're fairly safe from flapping because there will
> always be two replicas that have the most recent version of every PG,
> and so you can still tolerate a permanent failure of one of them.

Exactly.

> 
> size=2, min=2 doesn't suffer this failure mode, because anytime there
> is flapping the PG goes inactive and no writes can be made, so when
> the other OSD comes back up there is nothing to recover.  Of course
> this results in IO blocks and downtime, which is obviously
> undesirable, but it is likely a more recoverable state than
> inconsistent writes.

Agreed, the difference between availability and durability.  Depends what’s 
important to you.


> 
> Apologies if I've gotten any of that wrong, but my understanding is
> that it is these sorts of failure modes that cause min_size=1 to be a
> trap.  This isn't the sort of thing that typically happens in a RAID1
> config, or at least that admins don't think about.

It’s both.




[ceph-users] Re: Ceph newbee questions

2023-12-22 Thread Rich Freeman
Disclaimer: I'm fairly new to Ceph, but I've read a bunch of threads
on the min_size=1 issue since it perplexed me when I started: one
replica is generally considered fine in many other applications.
However, there really are some concerns unique to Ceph beyond just the
number of disks you can lose...

On Fri, Dec 22, 2023 at 3:09 PM Anthony D'Atri  wrote:
>
> > 2 x "storage nodes"
>
> You can do that for a PoC, but that's a bad idea for any production workload. 
>  You'd want at least three nodes with OSDs to use the default RF=3 
> replication.  You can do RF=2, but at the peril of your mortal data.

I'm not sure I agree - I think size=2, min_size=2 is no worse than
RAID1 for data security.  Maybe some consider RAID1 inappropriate, and
if so then size=2 is the same, but I think many are quite comfortable
with it.

The issue is that if you lose a disk your PGs become inactive - the
data is perfectly safe, but you have downtime.  Of course that is
probably not what you're expecting, as that isn't what happens with
RAID1.  Read on...

> > If just one data node goes down we will still not loose any
> > data and that is fine until we have a new server.
>
> ... unless one of the drives in the surviving node fails.
>

That isn't even the main risk as I understand it.  Of course a double
failure is going to be a problem with size=2, or traditional RAID1,
and I think anybody choosing this configuration accepts this risk.  As
I understand it, the reason min_size=1 is a trap has nothing to do
with double failures per se.

The issue is that Ceph OSDs are somewhat prone to flapping during
recovery (OOM, etc).  So even if the disk is fine, an OSD can go down
for a short time.  If you have size=2, min=1 configured, then when
this happens the PG will become degraded and will continue operating
on the other OSD, and the flapping OSD becomes stale.  Then when it
comes back up it recovers.  The problem is that if the other OSD has a
permanent failure (disk crash/etc) while the first OSD is flapping,
now you have no good OSDs, because when the flapping OSD comes back up
it is stale, and its PGs have no peer.  I suspect there are ways to
re-activate it, though this will result in potential data
inconsistency since writes were allowed to the cluster and will then
get rolled back.  With only two OSDs I'm guessing that would be the
main impact (well, depending on journaling behavior/etc), but if you
have more OSDs than that then you could have situations where one file
is getting rolled back, and some other file isn't, and so on.

With min_size=2 you're fairly safe from flapping because there will
always be two replicas that have the most recent version of every PG,
and so you can still tolerate a permanent failure of one of them.

size=2, min=2 doesn't suffer this failure mode, because anytime there
is flapping the PG goes inactive and no writes can be made, so when
the other OSD comes back up there is nothing to recover.  Of course
this results in IO blocks and downtime, which is obviously
undesirable, but it is likely a more recoverable state than
inconsistent writes.
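
If you want to see that state when it happens, it shows up in the usual
health commands - a quick sketch, nothing exotic:

  ceph health detail                   # flags inactive/undersized PGs and the pools they belong to
  ceph pg dump_stuck inactive          # list PGs stuck in an inactive state
  ceph osd pool get <pool> min_size    # confirm what min_size a pool is actually running with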

Apologies if I've gotten any of that wrong, but my understanding is
that it is these sorts of failure modes that cause min_size=1 to be a
trap.  This isn't the sort of thing that typically happens in a RAID1
config, or at least that admins don't think about.  Those
implementations are simpler and less prone to flapping, though to its
credit I'm guessing ceph would be far better about detecting this sort
of thing in the first place.

> Ceph is a scale-out solution, not meant for a very small number of servers.  
> For replication, you really want at least 3 nodes with OSDs and 
> size=3,min_size=2.  More nodes is better.  If you need a smaller-scale 
> solution, DRBD or ZFS might be better choices.

Agree that pretty-much all distributed filesystems suffer performance
issues at the very least with a small number of nodes, especially with
hard disks.

Moosefs is a decent distributed solution on small clusters, but it
lacks high availability on the FOSS version.  I've found that on small
clusters on hard disks it actually performs much better than cephfs,
but it certainly won't scale up nearly as well.  With only 2-3 hard
disks though it still will perform fairly poorly.

Of course ZFS will perform best of all, but it lacks any kind of
host-level redundancy.

--
Rich


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Johan Hattne

On 2023-12-22 03:28, Robert Sander wrote:

Hi,

On 22.12.23 11:41, Albert Shih wrote:


for n in 1-100
   Put off line osd on server n
   Uninstall docker on server n
   Install podman on server n
   redeploy on server n
end


Yep, that's basically the procedure.

But first try it on a test cluster.

Regards


For reference, this was also discussed about two years ago:

  https://www.spinics.net/lists/ceph-users/msg70108.html

Worked for me.

// Johan


[ceph-users] Re: Ceph newbee questions

2023-12-22 Thread Anthony D'Atri


> I have manually configured a ceph cluster with ceph fs on debian bookworm.

Bookworm support is very, very recent I think.

> What is the difference from installing with cephadm compared to manual 
> install,
> any benefits that you miss with manual install?

A manual install is dramatically more work and much easier to get wrong.  
There's also Rook if you skate k8s.

> There are also another couple of things that I can not figure out
> reading the documentation.
> 
> Most of our files are small and from my understanding replication is then 
> recommended, right?

How small is "small"? 
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253

If your files are super small, say <256 KB, you may consume measurably more 
underlying storage space than you expect.

CephFS isn't my strong suit, but my understanding is that it's designed for 
reasonably large files.  As with RGW, if you store zillions of 1KB files you 
may not have the ideal experience.
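
Back-of-the-envelope, and an assumption on my part about the setup: with
BlueStore's default 4 KiB min_alloc_size on recent releases and 3x replication,
a 1 KiB file still occupies a full 4 KiB allocation unit on each of its three
OSDs, i.e. roughly 12 KiB of raw space for 1 KiB of data - about 12x the
logical size, versus the 3x you'd expect from replication alone, before
counting any metadata.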

> 
> The plan is to set ceph up like this:
> 1 x "admin node"

MDS AIUI is single-threaded and so will benefit from a high-frequency CPU more 
than a high-core-count CPU.

> 2 x "storage nodes"

You can do that for a PoC, but that's a bad idea for any production workload.  
You'd want at least three nodes with OSDs to use the default RF=3 replication.  
You can do RF=2, but at the peril of your mortal data.

> This works well to set up, but what I can not get my head around is
> how things are replicated over nodes and disks.
> In ceph.conf I set the following:
> osd pool default size = 2
> osd pool default min size = 1
> So the idea is that we always have 2 copies of the data.

Those are only defaults if you don't specify them when creating a pool.  I 
suggest always specifying the replication parameters explicitly when creating a 
pool.
min_size = 1 is a trap for any data you care about.
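
Something along these lines - the pool name and PG count are just placeholders:

  # Create a replicated pool and set the replication parameters explicitly,
  # rather than relying on whatever ceph.conf defaults are in effect
  ceph osd pool create cephfs_data 128 128 replicated
  ceph osd pool set cephfs_data size 3
  ceph osd pool set cephfs_data min_size 2
  # ... and verify
  ceph osd pool get cephfs_data size
  ceph osd pool get cephfs_data min_size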


> I do not seem to be able to figure out the replication
> when things start to fail.
> If the admin node goes down, one of the data nodes will
> run the mon, mgr and mds. This will slow things down but
> will be fine until we have a new admin node in place again.
> (or if there is something I am missing here?)

If you have 3 mons, that's mostly true.  The MDS situation is more nuanced.

> If just one data node goes down we will still not lose any
> data and that is fine until we have a new server.

... unless one of the drives in the surviving node fails.  

> But what if one data node goes down and one disk of the other
> data node breaks, will I lose data then?

It most likely will be at least unavailable until you get the first node back 
up with all OSDs.  This is one reason why RF=2 is okay for a sandbox but a bad 
idea for any data you care about.  There are legit situations where one doesn't 
care so much about losing data, but they are infrequent.

> Or how many disks can I lose before I lose data?
> This is what I can not get my head around, how to think
> when disaster strikes, how much hardware can I lose before
> I lose data?
> Or have I got it all wrong?
> Is it a bad idea with just 2 fileservers, or are more servers required?

Ceph is a scale-out solution, not meant for a very small number of servers.  
For replication, you really want at least 3 nodes with OSDs and 
size=3,min_size=2.  More nodes is better.  If you need a smaller-scale 
solution, DRBD or ZFS might be better choices.


> 
> The second thing I have a problem with is snapshots.
> I managed to create a snapshot in the root with the command:
> ceph fs subvolume snapshot create  / 
> But it fails if I try to create a snapshot in any
> other directory than the root.
> Second of all, if I try to create a snapshot from the
> client with:
> mkdir /mnt-ceph/.snap/my_snapshot
> I get the same error in all directories:
> Permission denied.
> I have not found any solution to this,
> am I missing something here as well?
> Any config missing?
> 
> Many thanks for your support!!
> 
> Best regards
> Marcus
> 


[ceph-users] Ceph newbee questions

2023-12-22 Thread Marcus

Hi all,
I am all new with ceph and I come from gluster.
We have had our eyes on ceph for several years
and as the gluster project seems to slow down we
now think it is time to start looking into ceph.

I have manually configured a ceph cluster with ceph fs on debian 
bookworm.
What is the difference between installing with cephadm and a manual 
install, and are there any benefits that you miss with a manual install?

There are also a couple of other things that I can not figure out
from reading the documentation.

Most of our files are small, and from my understanding replication is 
then recommended, right?


The plan is to set ceph up like this:
1 x "admin node"
2 x "storage nodes"

The admin node will run mon, mgr and mds.
The storage nodes will run mon, mgr, mds and 8x osd (8 disks).

This works well to set up, but what I can not get my head around is
how things are replicated over nodes and disks.
In ceph.conf I set the following:
osd pool default size = 2
osd pool default min size = 1
So the idea is that we always have 2 copies of the data.

I do not seem to be able to figure out the replication
when things start to fail.
If the admin node goes down, one of the data nodes will
run the mon, mgr and mds. This will slow things down but
will be fine until we have a new admin node in place again.
(or is there something I am missing here?)
If just one data node goes down we will still not lose any
data and that is fine until we have a new server.
But what if one data node goes down and one disk of the other
data node breaks, will I lose data then?
Or how many disks can I lose before I lose data?
This is what I can not get my head around, how to think
when disaster strikes, how much hardware can I lose before
I lose data?
Or have I got it all wrong?
Is it a bad idea with just 2 fileservers, or are more servers required?

The second thing I have a problem with is snapshots.
I managed to create a snapshot in the root with the command:
ceph fs subvolume snapshot create  / 
But it fails if I try to create a snapshot in any
other directory than the root.
Second of all, if I try to create a snapshot from the
client with:
mkdir /mnt-ceph/.snap/my_snapshot
I get the same error in all directories:
Permission denied.
I have not found any solution to this,
am I missing something here as well?
Any config missing?

Many thanks for your support!!

Best regards
Marcus



[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Anthony D'Atri


> Sorry I thought of one more thing.
> 
> I was actually re-reading the hardware recommendations for Ceph and it seems 
> to imply that both RAID controllers as well as HBAs are bad ideas.

Advice I added most likely ;)   "RAID controllers" *are* a subset of HBAs BTW.  
The nomenclature can be confusing and there's this guy on Reddit 

> I believe I remember knowing that RAID controllers are suboptimal but I 
> guess I don't understand how you would actually build a cluster with many 
> disks (12-14 per server) without any HBAs in the servers.

NVMe

> Are there certain HBAs that are worse than others? Sorry I am just confused.

For liability and professionalism I won't name names; especially in serverland 
there's one manufacturer who dominates.

There are three main points, informed by years of wrangling the things.  I 
posted a litany of my experiences to this very list a few years back, including 
data-losing firmware / utility bugs and operationally expensive ECOs.

* RoC HBAs aka "RAID controllers" are IMHO a throwback to the days when x86 / 
x64 servers didn't have good software RAID.  In the land of the Sun we had VxVM 
(at $$$) that worked well, and Sun's SDS/ODS that ... got better over time.  I 
dunno if the Microsoft world has bootable software RAID now or not.  They are 
in my experience flaky and a pain to monitor.  Granted they offer the potential 
for a system to boot without intervention if the first boot drive is horqued, 
but IMHO that doesn't happen nearly often enough to justify the hassle.

* These things can have significant cost, especially if one shells out for 
cache RAM, BBU/FBWC, etc.  Today I have a handful of legacy systems that were 
purchased with a tri-mode HBA that in 2018 had a list price of USD 2000.  The  
*only* thing it's doing is mirroring two SATA boot drives.  That money would be 
better spent on SSDs, either with a non-RAID aka JBOD HBA, or better NVMe.  

* RAID HBAs confound observability.  Many models today have a JBOD / 
passthrough mode -- in which case why pay for all the RAID-fu?  Some, 
bizarrely, still don't, and one must set up a single-drive RAID0 volume around 
every drive for the system to see it.  This makes iostat even less useful than 
it already is, and one has to jump through hoops to get SMART info.  Hoops 
that, for example, the very promising upstream smartctl_exporter doesn't have.

There's a certain current M.2 boot drive module like this: the OS cannot see 
the drives AT ALL unless they're in a virtual drive.  Like Chong said, there's 
too much recession.
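
The SMART hoop, for the record - a sketch that assumes a MegaRAID-family
controller with the physical drive at device ID 4 behind /dev/sda; the device
ID itself has to be dug out of storcli or the like first:

  # Plain "smartctl -a /dev/sda" only sees the virtual drive the RoC HBA
  # presents; the physical disk behind it must be addressed explicitly
  smartctl -a -d megaraid,4 /dev/sda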

When using SATA or SAS, you can get a non-RAID HBA for much less money than a 
RAID HBA.  But the nuance here is that unless you have pre-existing gear, SATA 
and especially SAS *do not save money*.  This is heterodox to conventional 
wisdom.

An NVMe-only chassis does not need an HBA of any kind.  NVMe *is* PCI-e.  It 
especially doesn't need an astronomically expensive NVMe-capable RAID 
controller, at least not for uses like Ceph.  If one has an unusual use-case 
that absolutely requires a single volume, if LVM doesn't cut it for some reason 
-- maybe.  And there are things like Phison and Scaleflux that are out of 
scope, we're talking about Ceph here.

Some chassis vendors try hard to stuff an RoC HBA down your throat, with rather 
high markups.  Others may offer a basic SATA HBA built into the motherboard if 
you need it for some reason.

So when you don't have to spend USD hundreds to a thousand on an RoC HBA + 
BBU/cache/FBWC and jump through hoops to have one more thing to monitor, and an 
NVMe SSD doesn't cost significantly more than a SATA SSD, an all-NVMe system 
can easily be *less* expensive than SATA or especially SAS.  SAS is very, very 
much in its last days in the marketplace; SATA is right behind it.  In 5-10 
years you'll be hard-pressed to find enterprise SAS/SATA SSDs, or if you can, 
might only be from a single manufacturer -- which is an Akbarian trap.

This calculator can help show the false economy of SATA / SAS, and especially 
of HDDs.  Yes, in the enterprise, HDDs are *not* less expensive unless you're a 
slave to $chassisvendor.

https://www.snia.org/forums/cmsi/programs/TCOcalc


You read that right.

Don't plug in the cost of a 22TB SMR SATA drive, it likely won't be usable in 
real life.  It's not uncommon to limit spinners to say 8TB just because of the 
interface and seek bottlenecks.  The above tool has a multiplier for how many 
additional spindles one has to provision to get semi-acceptable IOPS, for RUs, 
power, AFR, cost to repair, etc.

At scale, consider how many chassis you need when you can stuff 32x 60TB SSDs 
into one, vs 12x 8TB HDDs.  Consider also the risks when it takes a couple of 
weeks to migrate data onto a replacement spinner, or if you can't do 
maintenance because your cluster becomes unusable to users.


-- aad


> 
> Thanks,
> -Drew
> 
> -Original Message-
> From: Drew Weaver  
> Sent: Thursday, December 21, 

[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Drew Weaver
Sorry I thought of one more thing.

I was actually re-reading the hardware recommendations for Ceph and it seems to 
imply that both RAID controllers as well as HBAs are bad ideas.

I believe I remember knowing that RAID controllers are suboptimal but I guess 
I don't understand how you would actually build a cluster with many disks (12-14 
per server) without any HBAs in the servers.

Are there certain HBAs that are worse than others? Sorry I am just confused.

Thanks,
-Drew

-Original Message-
From: Drew Weaver  
Sent: Thursday, December 21, 2023 8:51 AM
To: 'ceph-users@ceph.io' 
Subject: [ceph-users] Building new cluster had a couple of questions

Howdy,

I am going to be replacing an old cluster pretty soon and I am looking for a 
few suggestions.

#1 cephadm or ceph-ansible for management?
#2 Since the whole... CentOS thing... what distro appears to be the most 
straightforward to use with Ceph?  I was going to try and deploy it on Rocky 9.

That is all I have.

Thanks,
-Drew



[ceph-users] Re: RGW requests piling up

2023-12-22 Thread Gauvain Pocentek
I'd like to say that it was something smart but it was a bit of luck.

I logged in on a hypervisor (we run OSDs and OpenStack hypervisors on the
same hosts) to deal with another issue, and while checking the system I
noticed that one of the OSDs was using a lot more CPU than the others. It
made me think that the increased IOPS could put a strain on some of the
OSDs without impacting the whole cluster, so I decided to increase pg_num to
spread the operations to more OSDs, and it did the trick. The qlen metric
went back to something similar to what we had before the problems started.
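
For reference, the change itself was nothing fancier than this - the index pool
name depends on your zone, and if the pg_autoscaler manages the pool you'll
want to take that into account:

  ceph osd pool get default.rgw.buckets.index pg_num
  ceph osd pool set default.rgw.buckets.index pg_num 512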

We're going to look into adding CPU/RAM monitoring for all the OSDs next.

Gauvain

On Fri, Dec 22, 2023 at 2:58 PM Drew Weaver  wrote:

> Can you say how you determined that this was a problem?
>
> -Original Message-
> From: Gauvain Pocentek 
> Sent: Friday, December 22, 2023 8:09 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: RGW requests piling up
>
> Hi again,
>
> It turns out that our rados cluster wasn't that happy, the rgw index pool
> wasn't able to handle the load. Scaling the PG number helped (256 to 512),
> and the RGW is back to a normal behaviour.
>
> There is still a huge number of read IOPS on the index, and we'll try to
> figure out what's happening there.
>
> Gauvain
>
> On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek <
> gauvainpocen...@gmail.com>
> wrote:
>
> > Hello Ceph users,
> >
> > We've been having an issue with RGW for a couple days and we would
> > appreciate some help, ideas, or guidance to figure out the issue.
> >
> > We run a multi-site setup which has been working pretty fine so far.
> > We don't actually have data replication enabled yet, only metadata
> > replication. On the master region we've started to see requests piling
> > up in the rgw process, leading to very slow operations and failures
> > all over the place (clients time out before getting responses from
> > rgw). The workaround for now is to restart the rgw containers regularly.
> >
> > We've made a mistake and forcefully deleted a bucket on a secondary
> > zone, this might be the trigger but we are not sure.
> >
> > Other symptoms include:
> >
> > * Increased memory usage of the RGW processes (we bumped the container
> > limits from 4G to 48G to cater for that)
> > * Lots of read IOPS on the index pool (4 or 5 times more compared to
> > what we were seeing before)
> > * The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> > active requests) seem to show that the number of concurrent requests
> > increases with time, although we don't see more requests coming in on
> > the load-balancer side.
> >
> > The current thought is that the RGW process doesn't close the requests
> > properly, or that some requests just hang. After a restart of the
> > process things look OK but the situation turns bad fairly quickly
> > (after 1 hour we start to see many timeouts).
> >
> > The rados cluster seems completely healthy, it is also used for rbd
> > volumes, and we haven't seen any degradation there.
> >
> > Has anyone experienced that kind of issue? Anything we should be
> > looking at?
> >
> > Thanks for your help!
> >
> > Gauvain
> >


[ceph-users] RGW - user created bucket with name of already created bucket

2023-12-22 Thread Ondřej Kukla
Hello,

I would like to share a quite worrying experience I’ve just found on one of my 
production clusters.

A user successfully created a bucket with the name of a bucket that already exists!

He is not the bucket owner - the original user is - but he is able to see it when he 
does ListBuckets over the S3 API. (Both accounts are able to see it now - only the 
original owner is able to interact with it.)

This bucket is also counted toward the new user's usage stats.

Has anyone noticed this before? This cluster is running on Quincy - 17.2.6.

Is there a way to detach the bucket from the new owner so he doesn’t have a 
bucket that doesn’t belong to him?

Regards,

Ondrej





[ceph-users] ceph-dashboard odd behavior when visiting through haproxy

2023-12-22 Thread Demian Romeijn


I'm currently trying to set up a ceph-dashboard using the official documentation 
on how to do so.
I've managed to log in by just visiting the URL & port, and by visiting it 
through haproxy. However, using haproxy to visit the site results in odd 
behavior.


At my first login, nothing loads on the page and eventually at ~5s it times me 
out, sending me back to the log-in screen.
After logging back on to the dashboard, everything loads and functions as 
expected. I can refresh my browser as many times as I want and it still keeps 
on working.
After some time, usually ~30 minutes or so of inactivity, the problem arises 
again. 



Haproxy tells us the server is down for ~10 seconds, and running a simple 
HTTP check results in the following as well: CRITICAL - Socket timeout after 10 
seconds.
In the ceph-mgr logs there isn't any special error other than: [dashboard ERROR 
frontend.error] (https://*redacted*/#/login): Http failure response for 
https://*redacted*/ui-api/orchestrator/get_name: 401 OK None
It seems as though the ceph dashboard is "overloaded"; changing the haproxy config 
(following the official ceph documentation on how to set it up) to do 
health-checks less often results in the problem happening less often.



Anything I might've overlooked that could sort out the issue?


[ceph-users] Re: RGW requests piling up

2023-12-22 Thread Gauvain Pocentek
Hi again,

It turns out that our rados cluster wasn't that happy, the rgw index pool
wasn't able to handle the load. Scaling the PG number helped (256 to 512),
and the RGW is back to a normal behaviour.

There is still a huge number of read IOPS on the index, and we'll try to
figure out what's happening there.

Gauvain

On Thu, Dec 21, 2023 at 1:40 PM Gauvain Pocentek 
wrote:

> Hello Ceph users,
>
> We've been having an issue with RGW for a couple days and we would
> appreciate some help, ideas, or guidance to figure out the issue.
>
> We run a multi-site setup which has been working pretty fine so far. We
> don't actually have data replication enabled yet, only metadata
> replication. On the master region we've started to see requests piling up
> in the rgw process, leading to very slow operations and failures all over
> the place (clients time out before getting responses from rgw). The
> workaround for now is to restart the rgw containers regularly.
>
> We've made a mistake and forcefully deleted a bucket on a secondary zone,
> this might be the trigger but we are not sure.
>
> Other symptoms include:
>
> * Increased memory usage of the RGW processes (we bumped the container
> limits from 4G to 48G to cater for that)
> * Lots of read IOPS on the index pool (4 or 5 times more compared to what
> we were seeing before)
> * The prometheus ceph_rgw_qlen and ceph_rgw_qactive metrics (number of
> active requests) seem to show that the number of concurrent requests
> increases with time, although we don't see more requests coming in on the
> load-balancer side.
>
> The current thought is that the RGW process doesn't close the requests
> properly, or that some requests just hang. After a restart of the process
> things look OK but the situation turns bad fairly quickly (after 1 hour we
> start to see many timeouts).
>
> The rados cluster seems completely healthy, it is also used for rbd
> volumes, and we haven't seen any degradation there.
>
> Has anyone experienced that kind of issue? Anything we should be looking
> at?
>
> Thanks for your help!
>
> Gauvain
>


[ceph-users] MDS subtree pinning

2023-12-22 Thread Sake Ceph
Hi!

As I'm reading through the documentation about subtree pinning, I was wondering 
if the following is possible.

We've got the following directory structure. 
/
  /app1
  /app2
  /app3
  /app4

Can I pin /app1 to MDS rank 0 and 1, the directory /app2 to rank 2 and finally 
/app3 and /app4 to rank 3?

I would like to load balance the subfolders of /app1 across 2 (or 3) MDS servers.

Best regards,
Sake


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Robert Sander

On 22.12.23 11:46, Marc wrote:


Does podman still have what docker has - that if you kill the docker 
daemon, all tasks are killed?


Podman does not come with a daemon to start containers.

The Ceph orchestrator creates systemd units to start the daemons in 
podman containers.
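
Concretely, on a cephadm-managed host you end up with per-daemon units along 
these lines (the fsid and daemon names will obviously differ):

  systemctl list-units 'ceph-*'      # e.g. ceph-<fsid>@osd.3.service, ceph-<fsid>@mon.<host>.service
  systemctl status 'ceph-<fsid>@osd.3.service'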


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread André Gemünd
Podman doesn't use a daemon; that's one of the basic ideas.

We also use it in production btw.

- Am 22. Dez 2023 um 11:46 schrieb Marc m...@f1-outsourcing.eu:

>> >
>> >> It's been claimed to me that almost nobody uses podman in
>> >> production, but I have no empirical data.
> 
> As opposed to docker or to having no containers at all?
> 
>> > I even converted clusters from Docker to podman while they stayed
>> > online thanks to "ceph orch redeploy".
>> >
> 
> Does podman still have what docker has - that if you kill the docker
> daemon, all tasks are killed?
> 
> 

-- 
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemu...@scai.fraunhofer.de
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Robert Sander

Hi,

On 22.12.23 11:41, Albert Shih wrote:


for n in 1-100
   Put off line osd on server n
   Uninstall docker on server n
   Install podman on server n
   redeploy on server n
end


Yep, that's basically the procedure.

But first try it on a test cluster.
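
Per host, something along these lines - a rough sketch that assumes a 
cephadm-managed cluster and a Debian-ish package manager, so adjust to taste:

  ceph osd set noout                  # avoid rebalancing while the host's OSDs bounce
  # on the host itself:
  systemctl stop ceph.target          # or ceph-<fsid>.target, depending on how the units are wired up
  apt-get remove -y docker.io && apt-get install -y podman
  # back on a node with the admin keyring:
  ceph orch daemon redeploy osd.12    # repeat for each daemon living on that host
  ceph osd unset noout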

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin


[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Marc


> >
> >> It's been claimed to me that almost nobody uses podman in
> >> production, but I have no empirical data.

As opposed to docker or to having no containers at all?

> > I even converted clusters from Docker to podman while they stayed
> > online thanks to "ceph orch redeploy".
> >

Does podman still have what docker has - that if you kill the docker 
daemon, all tasks are killed?




[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Eugen Block
That's good to know. I have the same in mind for one of the clusters 
but haven't had the time to test it yet.


Zitat von Robert Sander :


On 21.12.23 22:27, Anthony D'Atri wrote:

It's been claimed to me that almost nobody uses podman in  
production, but I have no empirical data.


I even converted clusters from Docker to podman while they stayed  
online thanks to "ceph orch redeploy".


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin


[ceph-users] RGW rate-limiting or anti-hammering for (external) auth requests // Anti-DoS measures

2023-12-22 Thread Christian Rohmann

Hey Ceph-Users,


RGW does have options [1] to rate limit ops or bandwidth per bucket or user.
But those only come into play when the request is authenticated.
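
For reference, the per-user knobs from [1] look roughly like this - the user 
ID and the numbers are placeholders:

  radosgw-admin ratelimit set --ratelimit-scope=user --uid=someuser \
      --max-read-ops=1024 --max-write-ops=256
  radosgw-admin ratelimit enable --ratelimit-scope=user --uid=someuser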

I'd like to also protect the authentication subsystem from malicious or 
invalid requests.
So in case e.g. some EC2 credentials are not valid (anymore) and clients 
start hammering the RGW with those requests, I'd like to make it cheap 
to deal with those requests. Especially in case some external 
authentication like OpenStack Keystone [2] is used, valid access tokens 
are cached within the RGW. But requests with invalid credentials end up 
being sent at full rate to the external API [3] as there is no negative 
caching. And even if there was, that would only limit the external auth 
requests for the same set of invalid credentials, but it would surely 
reduce the load in that case:


Since the HTTP request is blocking  



[...]
2023-12-18T15:25:55.861+ 7fec91dbb640 20 sending request to 
https://keystone.example.com/v3/s3tokens
2023-12-18T15:25:55.861+ 7fec91dbb640 20 register_request 
mgr=0x561a407ae0c0 req_data->id=778, curl_handle=0x7fedaccb36e0
2023-12-18T15:25:55.861+ 7fec91dbb640 20 WARNING: blocking http 
request
2023-12-18T15:25:55.861+ 7fede37fe640 20 link_request 
req_data=0x561a40a418b0 req_data->id=778, curl_handle=0x7fedaccb36e0

[...]



this does not only stress the external authentication API (keystone in 
this case), but also blocks RGW threads for the duration of the external 
call.


I am currently looking into using the capabilities of HAProxy to rate 
limit requests based on their resulting http-response [4], in essence 
to rate-limit or tarpit clients that "produce" a high number of 403 
"InvalidAccessKeyId" responses. To cause less collateral damage it might make 
sense to limit based on the presented credentials themselves, but this 
would require extracting and tracking HTTP headers or URL parameters 
(presigned URLs) [5] and putting them into tables.
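
To make that concrete, the simplest per-source-IP variant would look roughly 
like this - a sketch only: the frontend name, bind line and thresholds are 
made up, and limiting per credential instead of per client IP would need the 
header/URL extraction mentioned above:

  frontend rgw
      bind :443 ssl crt /etc/haproxy/rgw.pem   # however the frontend is already bound
      stick-table type ip size 100k expire 10m store http_err_rate(60s)
      http-request track-sc0 src
      # refuse clients that generated more than 50 4xx responses per minute
      http-request deny deny_status 429 if { sc_http_err_rate(0) gt 50 }
      default_backend rgw_backends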



* What are your thoughts on the matter?
* What kind of measures did you put in place?
* Does it make sense to extend RGW's capabilities to deal with those 
cases itself?

** adding negative caching
** rate limits on concurrent external authentication requests (or is 
there a pool of connections for those requests?)




Regards


Christian



[1] https://docs.ceph.com/en/latest/radosgw/admin/#rate-limit-management
[2] 
https://docs.ceph.com/en/latest/radosgw/keystone/#integrating-with-openstack-keystone
[3] 
https://github.com/ceph/ceph/blob/86bb77eb9633bfd002e73b5e58b863bc2d0df594/src/rgw/rgw_auth_keystone.cc#L475
[4] 
https://www.haproxy.com/documentation/haproxy-configuration-manual/latest/#4.2-http-response%20track-sc0
[5] 
https://docs.aws.amazon.com/AmazonS3/latest/API/sig-v4-authenticating-requests.html#auth-methods-intro



[ceph-users] Re: Building new cluster had a couple of questions

2023-12-22 Thread Robert Sander

On 21.12.23 22:27, Anthony D'Atri wrote:


It's been claimed to me that almost nobody uses podman in production, but I 
have no empirical data.


I even converted clusters from Docker to podman while they stayed online 
thanks to "ceph orch redeploy".


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin