[ceph-users] Re: Nautilus: Decommission an OSD Node

2023-11-05 Thread Richard Bade
Hi Dave,
It's been a few days and I haven't seen any follow-up on the list, so
I'm wondering if the issue is that there was a typo in your osd list.
It appears that you have 16 included again in the destination instead of 26?
"24,25,16,27,28"
I'm not familiar with the pgremapper script so I may be
misunderstanding your command.

Rich

On Thu, 2 Nov 2023 at 09:39, Dave Hall  wrote:
>
> Hello,
>
> I've recently made the decision to gradually decommission my Nautilus
> cluster and migrate the hardware to a new Pacific or Quincy cluster. By
> gradually, I mean that as I expand the new cluster I will move (copy/erase)
> content from the old cluster to the new, making room to decommission more
> nodes and move them over.
>
> In order to do this I will, of course, need to remove OSD nodes by first
> emptying the OSDs on each node.
>
> I noticed that pgremapper (a version prior to October 2021) has a 'drain'
> subcommand that allows one to control which target OSDs would receive the
> PGs from the source OSD being drained.  This seemed like a good idea:  If
> one simply marks an OSD 'out', its contents would be rebalanced to other
> OSDs on the same node that are still active, which seems like it would make
> a lot of unnecessary data movement and also make removing the next OSD take
> longer.
>
> So I went through the trouble of creating a 'really long' pgremapper drain
> command excluding the OSDs of two nodes as targets:
>
> # bin/pgremapper drain 16 --target-osds
> 00,01,02,03,04,05,06,07,24,25,16,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
> --allow-movement-across host  --max-source-backfills 75 --concurrency 20
> --verbose --yes
>
>
> However, when this is complete OSD 16 actually contains more PGs than
> before I started.  It appears that the mapping generated by pgremapper also
> back-filled the OSD as it was draining it.
>
> So did I miss something here?  What is the best way to proceed?  I
> understand that it would be mayhem to mark 8 of 72 OSDs out and then turn
> backfill/rebalance/recover back on.  But it seems like there should be a
> better way.
>
> Suggestions?
>
> Thanks.
>
> -Dave
>
> --
> Dave Hall
> Binghamton University
> kdh...@binghamton.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Manual resharding with multisite

2023-10-08 Thread Richard Bade
Hi Yixin,
I am interested in the answers to your questions also but I think I
can provide some useful information for you.
We also have a multisite setup where we need to reshard sometimes as
the buckets have grown. However, we have bucket sync turned off for
these buckets, as they only reside on one gateway and not the other.
For these buckets I have been able to manually reshard using this command:
radosgw-admin bucket reshard --rgw-zone={zone_name}
--bucket={bucket_name} --num-shards {new_shard_number}
--yes-i-really-mean-it

I have not seen any issues with this, but like I said I only have data
on that one zone and not the other. This may not be useful for your
situation but I thought I'd mention it anyway.
I would really like to know what the correct procedure is for buckets
that have more than 100k objects per shard in a multisite environment.

Regards,
Rich

On Thu, 5 Oct 2023 at 06:51, Yixin Jin  wrote:
>
> Hi folks,
>
> I am aware that dynamic resharding isn't supported before Reef with 
> multisite. However, does manual resharding work? It doesn't seem to be so, 
> either. First of all, running "bucket reshard" has to be in the master zone. 
> But if the objects of that bucket aren't in the master zone, resharding in the
> master zone seems to render those objects inaccessible in the zone that
> actually has them. So, what is the recommended practice for resharding with
> multisite? No resharding at all?
>
> Thanks,
> Yixin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

2023-09-06 Thread Richard Bade
Yes, I agree with Anthony. If your cluster is healthy and you don't
*need* to bring them back in, it's going to be less work and time to
just deploy them as new.

I usually set norebalance, purge the osds in ceph, remove the vg from
the disks and re-deploy. Then unset norebalance at the end once
everything is peered and happy. This is so that it doesn't start
moving stuff around when you purge.
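For reference, the rough sequence I use looks something like this (osd
ids, volume group and device names are placeholders):
ceph osd set norebalance
ceph osd purge {osd_id} --yes-i-really-mean-it
# on the osd host, wipe the old volume group before re-deploying:
sudo vgremove -y {ceph_vg_name}
sudo ceph-volume lvm create --data /dev/sd{X}
ceph osd unset norebalance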

Rich

On Thu, 7 Sept 2023 at 02:21, Anthony D'Atri  wrote:
>
> Resurrection usually only makes sense if fate or a certain someone resulted 
> in enough overlapping removed OSDs that you can't meet min_size. I've had to do
> that a couple of times :-/
>
> If an OSD is down for more than a short while, backfilling a redeployed OSD 
> will likely be faster than waiting for it to peer and do deltas -- if it can 
> at all.
>
> > On Sep 6, 2023, at 10:16, Malte Stroem  wrote:
> >
> > Hi ceph-m...@rikdvk.mailer.me,
> >
> > you could squeeze the OSDs back in but it does not make sense.
> >
> > Just clean the disks with dd for example and add them as new disks to your 
> > cluster.
> >
> > Best,
> > Malte
> >
> > Am 04.09.23 um 09:39 schrieb ceph-m...@rikdvk.mailer.me:
> >> Hello,
> >> I have a ten node cluster with about 150 OSDs. One node went down a while 
> >> back, several months. The OSDs on the node have been marked as down and 
> >> out since.
> >> I am now in the position to return the node to the cluster, with all the 
> >> OS and OSD disks. When I boot up the now working node, the OSDs do not 
> >> start.
> >> Essentially, it seems to complain with "fail[ing] to load OSD map for 
> >> [various epoch]s, got 0 bytes".
> >> I'm guessing the OSDs on disk maps are so old, they can't get back into 
> >> the cluster?
> >> My questions are whether it's possible or worth it to try to squeeze these 
> >> OSDs back in or to just replace them. And if I should just replace them, 
> >> what's the best way? Manually remove [1] and recreate? Replace [2]? Purge 
> >> in dashboard?
> >> [1] 
> >> https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#removing-osds-manual
> >> [2] 
> >> https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#replacing-an-osd
> >> Many thanks!
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] radosgw-admin sync error trim seems to do nothing

2023-08-20 Thread Richard Bade
Hi Matthew,
At least for nautilus (14.2.22) I have discovered through trial and
error that you need to specify a beginning or end date. Something like
this:
radosgw-admin sync error trim --end-date="2023-08-20 23:00:00"
--rgw-zone={your_zone_name}

I specify the zone as there's an error list for each zone.
Hopefully that helps.

Rich

--

Date: Sat, 19 Aug 2023 12:48:55 -0400
From: Matthew Darwin 
Subject: [ceph-users] radosgw-admin sync error trim seems to do
  nothing
To: Ceph Users 
Message-ID: <95e7edfd-ca29-fc0e-a30a-987f1c43e...@mdarwin.ca>
Content-Type: text/plain; charset=UTF-8; format=flowed

Hello all,

"radosgw-admin sync error list" returns errors from 2022.  I want to
clear those out.

I tried "radosgw-admin sync error trim" but it seems to do nothing.
The man page seems to offer no suggestions
https://protect-au.mimecast.com/s/26o0CzvkGRhLoOXfXjZR3?domain=docs.ceph.com

Any ideas what I need to do to remove old errors? (or at least I want
to see more recent errors)

ceph version 17.2.6 (quincy)

Thanks.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-18 Thread Richard Bade
Hi Yixin,
One place to start when trying to figure this out is the sync
error logs. You may have already looked here:
sudo radosgw-admin sync error list --rgw-zone={zone_name}
If there's a lot in there, you can trim it to a specific date so you
can see if the errors are still occurring:
sudo radosgw-admin sync error trim --end-date="2023-06-16 03:00:00"
--rgw-zone={zone_name}
There's a log for both sides of the sync, so make sure you check both
your zones.

The next thing I try is re-running a full sync, metadata and then data:
sudo radosgw-admin metadata sync init --rgw-zone=zone1 --source_zone=zone2
sudo radosgw-admin metadata sync init --rgw-zone=zone2 --source_zone=zone1

sudo radosgw-admin data sync init --rgw-zone=zone1 --source_zone=zone2
sudo radosgw-admin data sync init --rgw-zone=zone2 --source_zone=zone1

You need to restart all the rgw processes to get this to start.
Obviously, if you have a massive amount of data you don't want to
re-run a full data sync.

Lastly, I had this stuck sync happen for me with an old cluster that
had explicit placement in the buckets. I think this is because the
pool name was different in each of my zones so the explicit placement
couldn't find anywhere to put the data and the sync never finished.
Might be worth checking for this situation as there is also another
thread on the mailing list recently where someone had explicit
placement causing issues with regards to sync.
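If you want to check whether a bucket has explicit placement set, the
bucket instance metadata should show it; something like this (bucket
name and instance id are placeholders):
radosgw-admin metadata get bucket:{bucket_name}
radosgw-admin metadata get bucket.instance:{bucket_name}:{instance_id} | grep -A4 explicit_placement
If the explicit_placement pools are non-empty then that bucket is
pinned to those pools.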

I hope that helps you track down the issue.
Rich

On Sat, 17 Jun 2023 at 08:41, Yixin Jin  wrote:
>
> Hi ceph gurus,
>
> I am experimenting with rgw multisite sync feature using Quincy release 
> (17.2.5). I am using the zone-level sync, not bucket-level sync policy. 
> During my experiment, somehow my setup got into a situation that it doesn't 
> seem to get out of. One zone is perpetually behind the other, although there 
> is no ongoing client request.
>
> Here is the output of my "sync status":
>
> root@mon1-z1:~# radosgw-admin sync status
>   realm f90e4356-3aa7-46eb-a6b7-117dfa4607c4 (test-realm)
>   zonegroup a5f23c9c-0640-41f2-956f-a8523eccecb3 (zg)
>zone bbe3e2a1-bdba-4977-affb-80596a6fe2b9 (z1)
>   metadata sync no sync (zone is master)
>   data sync source: 9645a68b-012e-4889-bf24-096e7478f786 (z2)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is behind on 14 shards
> behind shards: 
> [56,61,63,107,108,109,110,111,112,113,114,115,116,117]
>
>
> It stays behind forever while rgw is almost completely idle (1% of CPU).
>
> Any suggestion on how to drill deeper to see what happened?
>
> Thanks,
> Yixin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [RGW] what is log_meta and log_data config in a multisite config?

2023-06-07 Thread Richard Bade
Hi Gilles,
I'm not 100% sure, but I believe this relates to the logs kept for
doing incremental sync. When these are false, changes are not
tracked and sync doesn't happen.
My reference is this Red Hat documentation on configuring zones
without replication.
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html/object_gateway_guide/advanced-configuration#configuring-multiple-zones-without-replication_rgw
"Open the file for editing, and set the log_meta, log_data, and
sync_from_all fields to false"
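In practice changing those fields is the usual zonegroup round-trip;
roughly (zonegroup name is a placeholder):
radosgw-admin zonegroup get --rgw-zonegroup={zonegroup_name} > zonegroup.json
# edit log_meta / log_data in zonegroup.json, then:
radosgw-admin zonegroup set --rgw-zonegroup={zonegroup_name} < zonegroup.json
radosgw-admin period update --commit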

I hope that helps.
Rich

On Mon, 5 Jun 2023 at 20:42, Gilles Mocellin
 wrote:
>
> Hi Cephers,
>
> In a multisite config, with one zonegroup and 2 zones, when I look at
> `radosgw-admin zonegroup get`,
> I see by default these two parameters:
>  "log_meta": "false",
>  "log_data": "true",
>
> Where can I find documentation on these, I can't find.
>
> I set log_meta to true, because, why not ?
> Is it a bad thing ?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Rados gateway data-pool replacement.

2023-04-25 Thread Richard Bade
Hi Gaël,
I'm actually embarking on a similar project to migrate an EC pool from
k=2,m=1 to k=4,m=2 using rgw multisite sync.
Before you do a lot of work for nothing, I just thought I'd check:
when you say failure domain, do you mean the crush failure domain,
not k and m? If it is the failure domain you mean, I wonder if you
realise that you can change the crush rule on an EC pool?
You can change the rule the same as other pool types like this:
sudo ceph osd pool set {pool_name} crush_rule {rule_name}
At least that is my understanding and I have done so on a couple of my
pools (changed from Host to Chassis failure domain).
I found the docs a bit confusing here: you can't change the EC
profile of a pool (because of the k and m numbers), and the crush rule
is defined in the profile as well, but you can still change the rule
outside of the profile.
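As a sketch of what I mean (profile and rule names are just examples,
and the k+m of the new profile must match the pool's existing one), you
can create a rule with the new failure domain and point the pool at it
without touching the pool's profile:
ceph osd erasure-code-profile set ec42-chassis k=4 m=2 crush-failure-domain=chassis
ceph osd crush rule create-erasure ec42-chassis-rule ec42-chassis
ceph osd pool set {pool_name} crush_rule ec42-chassis-rule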

Regards,
Rich

On Mon, 24 Apr 2023 at 20:55, Gaël THEROND  wrote:
>
> Hi casey,
>
> I’ve tested that while you answered me actually :-)
>
> So, all in all, we can’t stop the radosgw for now and tier cache option
> can’t work as we use EC based pools (at least for nautilus).
>
> Due to those constraints we’re currently thinking of the following
> procedure:
>
> 1°/- Create the new EC Profile.
> 2°/- Create the new EC based pool and assign it the new profile.
> 3°/- Create a new storage class that use this new pool.
> 4°/- Add this storage class to the default placement policy.
> 5°/- Force a bucket lifecycle objects migration (possible??).
>
> It seems at least one user attempted to do just that in here:
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/RND652IBFIG6ESSQXVGNX7NAGCNEVYOU
>
> The only part of that thread that I don’t get is the:
>
> « I think actually moving an already-stored object requires a lifecycle
> transition policy… » part of the Matt Benjamin answer.
>
> What kind of policy should I write to do that ??
>
> Is this procedure something that looks ok to you?
>
> Kind regards!
>
> Le mer. 19 avr. 2023 à 14:49, Casey Bodley  a écrit :
>
> > On Wed, Apr 19, 2023 at 5:13 AM Gaël THEROND 
> > wrote:
> > >
> > > Hi everyone, quick question regarding radosgw zone data-pool.
> > >
> > > I’m currently planning to migrate an old data-pool that was created with
> > > inappropriate failure-domain to a newly created pool with appropriate
> > > failure-domain.
> > >
> > > If I’m doing something like:
> > > radosgw-admin zone modify —rgw-zone default —data-pool 
> > >
> > > Will data from the old pool be migrated to the new one or do I need to do
> > > something else to migrate those data out of the old pool?
> >
> > radosgw won't migrate anything. you'll need to use rados tools to do
> > that first. make sure you stop all radosgws in the meantime so it
> > doesn't write more objects to the old data pool
> >
> > > I’ve read a lot
> > > of mail archive with peoples willing to do that but I can’t get a clear
> > > answer from those archives.
> > >
> > > I’m running on nautilus release of it ever help.
> > >
> > > Thanks a lot!
> > >
> > > PS: This mail is a redo of the old one as I’m not sure the former one
> > > worked (missing tags).
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Can I delete rgw log entries?

2023-04-20 Thread Richard Bade
Ok, cool. Thanks for clarifying that Daniel and Casey.
I'll clean up my sync logs now but leave the rest alone.
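For anyone else in the same boat (i.e. sync is no longer configured),
the clean-up itself is just deleting the sync-related rados objects
from the log pool; roughly something like this (I'm assuming the
default zone's log pool name, so adjust):
rados -p default.rgw.log ls | egrep '^(data_log|meta\.full-sync|data\.full-sync|datalog\.sync-status|bucket\.sync-status)' \
  | while read obj; do rados -p default.rgw.log rm "$obj"; done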

Rich

On Fri, 21 Apr 2023, 05:46 Daniel Gryniewicz,  wrote:

> On 4/20/23 10:38, Casey Bodley wrote:
> > On Sun, Apr 16, 2023 at 11:47 PM Richard Bade  wrote:
> >>
> >> Hi Everyone,
> >> I've been having trouble finding an answer to this question. Basically
> >> I'm wanting to know if stuff in the .log pool is actively used for
> >> anything or if it's just logs that can be deleted.
> >> In particular I was wondering about sync logs.
> >> In my particular situation I have had some tests of zone sync setup,
> >> but now I've removed the secondary zone and pools. My primary zone is
> >> filled with thousands of logs like this:
> >> data_log.71
> >> data.full-sync.index.e2cf2c3e-7870-4fc4-8ab9-d78a17263b4f.47
> >> meta.full-sync.index.7
> >> datalog.sync-status.shard.e2cf2c3e-7870-4fc4-8ab9-d78a17263b4f.13
> >>
> bucket.sync-status.f3113d30-ecd3-4873-8537-aa006e54b884:{bucketname}:default.623958784.455
> >>
> >> I assume that because I'm not doing any sync anymore I can delete all
> >> the sync related logs? Is anyone able to confirm this?
> >
> > yes
> >
> >> What about if the sync is running? Are these being written and read
> >> from and therefore must be left alone?
> >
> > right. while a multisite configuration is operating, the replication
> > logs will be trimmed in the background. in addition to the replication
> > logs, the log pool also contains sync status objects. these track the
> > progress of replication, and removing those objects would generally
> > cause sync to start over from the beginning
> >
> >> It seems like these are more of a status than just a log and that
> >> deleting them might confuse the sync process. If so, does that mean
> >> that the log pool is not just output that can be removed as needed?
> >> Are there perhaps other things in there that need to stay?
> >
> > the log pool is used by several subsystems like multisite sync,
> > garbage collection, bucket notifications, and lifecycle. those
> > features won't work reliably if you delete their rados objects
> >
>
> Also, to be clear (in case you were confused), these logs are not data
> to be read by admins (like "log files") but structured data that
> represents changes to be used by syncing (like "log structured
> filesystem").  So deleting logs while sync is running will break sync.
>
> Daniel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Can I delete rgw log entries?

2023-04-16 Thread Richard Bade
Hi Everyone,
I've been having trouble finding an answer to this question. Basically
I'm wanting to know if stuff in the .log pool is actively used for
anything or if it's just logs that can be deleted.
In particular I was wondering about sync logs.
In my particular situation I have had some tests of zone sync setup,
but now I've removed the secondary zone and pools. My primary zone is
filled with thousands of logs like this:
data_log.71
data.full-sync.index.e2cf2c3e-7870-4fc4-8ab9-d78a17263b4f.47
meta.full-sync.index.7
datalog.sync-status.shard.e2cf2c3e-7870-4fc4-8ab9-d78a17263b4f.13
bucket.sync-status.f3113d30-ecd3-4873-8537-aa006e54b884:{bucketname}:default.623958784.455

I assume that because I'm not doing any sync anymore I can delete all
the sync related logs? Is anyone able to confirm this?
What about if the sync is running? Are these being written and read
from and therefore must be left alone?
It seems like these are more of a status than just a log and that
deleting them might confuse the sync process. If so, does that mean
that the log pool is not just output that can be removed as needed?
Are there perhaps other things in there that need to stay?

Regards,
Richard
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 10x more used space than expected

2023-03-14 Thread Richard Bade
Hi,
I found the documentation for metadata get to be unhelpful for what
syntax to use. I eventually found that it's this:
radosgw-admin metadata get bucket:{bucket_name}
or
radosgw-admin metadata get bucket.instance:{bucket_name}:{instance_id}

Hopefully that helps you or someone else struggling with this.

Rich

On Wed, 15 Mar 2023 at 07:18, Gaël THEROND  wrote:
>
> Alright,
> Seems something is odd out there, if I do a radosgw-admin metadata list
>
> I’ve got the following list:
>
> [
> ”bucket”,
> ”bucket.instance”,
> ”otp”,
> ”user”
> ]
>
> BUT
>
> When I try a radosgw-admin metadata get bucket or bucket.instance it
> complains with the following error:
>
> ERROR: can’t get key: (22) Invalid argument
>
> Ok, fine for the api, I’ll deal with the s3 api.
>
> Even if a radosgw-admin bucket flush version —keep-current or something
> similar would be much appreciated xD
>
> Le mar. 14 mars 2023 à 19:07, Robin H. Johnson  a
> écrit :
>
> > On Tue, Mar 14, 2023 at 06:59:51PM +0100, Gaël THEROND wrote:
> > > Versioning wasn’t enabled, at least not explicitly and for the
> > > documentation it isn’t enabled by default.
> > >
> > > Using nautilus.
> > >
> > > I’ll get all the required missing information on tomorrow morning, thanks
> > > for the help!
> > >
> > > Is there a way to tell Ceph to delete versions that aren't the currently
> > > used one with radosgw-admin?
> > >
> > > If not I’ll use the rest api no worries.
> > Nope, s3 API only.
> >
> > You should also check for incomplete multiparts. For that, I recommend
> > using AWSCLI or boto directly. Specifically not s3cmd, because s3cmd
> > doesn't respect the  flag properly.
> >
> > --
> > Robin Hugh Johnson
> > Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
> > E-Mail   : robb...@gentoo.org
> > GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
> > GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Undo "radosgw-admin bi purge"

2023-02-21 Thread Richard Bade
Hi Robert,
A colleague and I ran into this a few weeks ago. The way we managed to
get access back to delete the bucket properly (using radosgw-admin
bucket rm) was to reshard the bucket.
This created a new bucket index and therefore it was then possible to delete it.
If you are looking to get access back to the objects, then as Eric
said there's no way to get those indexes back but the objects will
still be there in the pool.
sudo radosgw-admin bucket reshard --bucket={bucket_name} --num-shards {number}
If you are doing multi-site replication, resharding can cause some
issues on earlier versions of ceph so check that out if that applies
to you. If you're just going to delete the bucket anyway it may not be
an issue.
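If you want to sanity-check the new index before deleting the bucket,
something like this should do it:
radosgw-admin bucket stats --bucket={bucket_name}
radosgw-admin bucket list --bucket={bucket_name} --max-entries 10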

Regards,
Richard

On Wed, 22 Feb 2023 at 07:43, J. Eric Ivancich  wrote:
>
> When the admin runs “bi purge” they have the option of supplying a bucket_id 
> with the “--bucket-id” command-line argument. This was useful back when 
> resharding did not automatically remove the older bucket index shards (which 
> it now does), which had a different bucket_id from the current bucket index 
> shards.
>
> If the admin doesn't supply a bucket_id it assumes the current bucket index 
> shards are to be purged. Because this is generally not wanted, the admin is 
> required to supply the "--yes-i-really-mean-it” command-line argument to 
> verify that they know that this is generally not done.
>
> There is no “undo” for “bi purge”, because it removes the metadata objects 
> that contain the bucket listing.
>
> The objects should still be readable if you know their names.
>
> > After this operation the bucket cannot be listed or removed any more.
>
>
> Knowing the above, what is your goal? Removal/clean-up? Recovery of as much 
> as possible? Both are possible to a degree (not 100%) but the processes are 
> not simple and highly manual.
>
> Eric
> (he/him)
>
> > On Feb 20, 2023, at 10:01 AM, Robert Sander  
> > wrote:
> >
> > Hi,
> >
> > There is an operation "radosgw-admin bi purge" that removes all bucket 
> > index objects for one bucket in the rados gateway.
> >
> > What is the undo operation for this?
> >
> > After this operation the bucket cannot be listed or removed any more.
> >
> > Regards
> > --
> > Robert Sander
> > Heinlein Consulting GmbH
> > Schwedter Str. 8/9b, 10119 Berlin
> >
> > http://www.heinlein-support.de
> >
> > Tel: 030 / 405051-43
> > Fax: 030 / 405051-19
> >
> > Zwangsangaben lt. §35a GmbHG:
> > HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
> > Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus to Octopus when RGW already on Octopus

2023-02-06 Thread Richard Bade
Hi,
We're actually on very similar setup to you with 18.04 and Nautilus
and thinking about the 20.04 upgrade process.

As for your RGW, I think I would not consider the downgrade. I believe
the order is about avoiding issues with newer RGW connecting to older
mons and osds. Since you're already in this situation and not having
any issues, I would probably continue forward with the upgrade on
Mons, then Managers, then osds as per documentation. Then just restart
the RGW at the end.
I think that trying to downgrade at this point may introduce new
issues that you don't currently have.
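For reference, the sequence I have in mind is roughly this (your
package tooling does the actual upgrades and restarts):
ceph osd set noout
# upgrade and restart ceph-mon on each monitor host, then the mgrs,
# then the osd hosts one at a time; check progress with:
ceph versions
# once every OSD is running Octopus:
ceph osd require-osd-release octopus
ceph osd unset noout
# finally restart the radosgw daemons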

This is just my opinion though, as I have not actually tried this. Do
you have a test cluster you could practice on?
I would be keen to hear how your upgrade goes.

Regards,
Richard

On Sat, 4 Feb 2023 at 22:10,  wrote:
>
> We are finally going to upgrade our Ceph from Nautilus to Octopus, before 
> looking at moving onward.  We are still on Ubuntu 18.04, so once on Octopus, 
> we will then upgrade the OS to 20.04, ready for the next upgrade.
>
> Unfortunately, we have already upgraded our rados gateways to Ubuntu 20.04, 
> last Sept, which had the side effect of upgrading the RGWs to Octopus. So I'm 
> looking to downgrade the rados gateways, back to Nautilus, just to be safe.  
> We can then do the upgrade in the right order.
>
> I have no idea if the newer Octopus rados gateways will have altered any 
> metadata, that would affect a downgrade back to Nautilus.
>
> Any advise.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2023-02-06 Thread Richard Bade
) only or are they 
> > > required for every restart?
> > >
> > > Is this setting "bluestore debug enforce settings=hdd" in the ceph config 
> > > data base or set somewhere else? How does this work if deploying HDD- and 
> > > SSD-OSDs at the same time?
> > >
> > > Ideally, all these tweaks should be applicable and settable at creation 
> > > time only without affecting generic settings (that is, at the ceph-volume 
> > > command line and not via config side effects). Otherwise it becomes 
> > > really tedious to manage these.
> > >
> > > For example, would the following work-flow apply the correct settings 
> > > *permanently* across restarts:
> > >
> > > 1) Prepare OSD on fresh HDD with ceph-volume lvm batch --prepare ...
> > > 2) Assign dm_cache to logical OSD volume created in step 1
> > > 3) Start OSD, restart OSDs, boot server ...
> > >
> > > I would assume that the HDD settings are burned into the OSD in step 1 
> > > and will be used in all future (re-)starts without the need to do 
> > > anything despite the device being detected as non-rotational after step 
> > > 2. Is this assumption correct?
> > >
> > > Thanks and best regards,
> > > =
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > 
> > > From: Richard Bade 
> > > Sent: 06 April 2022 00:43:48
> > > To: Igor Fedotov
> > > Cc: Ceph Users
> > > Subject: [Warning Possible spam]  [ceph-users] Re: Ceph Bluestore tweaks 
> > > for Bcache
> > >
> > > Just for completeness for anyone that is following this thread. Igor
> > > added that setting in Octopus, so unfortunately I am unable to use it
> > > as I am still on Nautilus.
> > >
> > > Thanks,
> > > Rich
> > >
> > > On Wed, 6 Apr 2022 at 10:01, Richard Bade  wrote:
> > > > Thanks Igor for the tip. I'll see if I can use this to reduce the
> > > > number of tweaks I need.
> > > >
> > > > Rich
> > > >
> > > > On Tue, 5 Apr 2022 at 21:26, Igor Fedotov  wrote:
> > > > > Hi Richard,
> > > > >
> > > > > just FYI: one can use "bluestore debug enforce settings=hdd" config
> > > > > parameter to manually enforce HDD-related  settings for a BlueStore
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Igor
> > > > >
> > > > > On 4/5/2022 1:07 AM, Richard Bade wrote:
> > > > > > Hi Everyone,
> > > > > > I just wanted to share a discovery I made about running bluestore on
> > > > > > top of Bcache in case anyone else is doing this or considering it.
> > > > > > We've run Bcache under Filestore for a long time with good results 
> > > > > > but
> > > > > > recently rebuilt all the osds on bluestore. This caused some
> > > > > > degradation in performance that I couldn't quite put my finger on.
> > > > > > Bluestore osds have some smarts where they detect the disk type.
> > > > > > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > > > > > the HDD parameters are better suited.
> > > > > > I changed the following parameters to match the HDD default values 
> > > > > > and
> > > > > > immediately saw my average osd latency during normal workload drop
> > > > > > from 6ms to 2ms. Peak performance didn't change really, but a test
> > > > > > machine that I have running a constant iops workload was much more
> > > > > > stable as was the average latency.
> > > > > > Performance has returned to Filestore or better levels.
> > > > > > Here are the parameters.
> > > > > >
> > > > > >; Make sure that we use values appropriate for HDD not SSD - 
> > > > > > Bcache
> > > > > > gets detected as SSD
> > > > > >bluestore_prefer_deferred_size = 32768
> > > > > >bluestore_compression_max_blob_size = 524288
> > > > > >bluestore_deferred_batch_ops = 64
> > > > > >bluestore_max_blob_size = 524288
> > > > > >bluestore_min_alloc_size = 65536
> > > > > >

[ceph-users] Re: Request for Info: What has been your experience with bluestore_compression_mode?

2022-08-18 Thread Richard Bade
Hi Laura,
We have used pool compression in the past and found it to work well.
We had it on a 4/2 EC pool and found data ended up near 1:1 pool:raw.
We were storing backup data in this cephfs pool; however, we changed
the backup product, and as the data is now encrypted at rest by the
application, the bluestore compression is unnecessary overhead for zero
gain, so we are no longer using it.
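For context, enabling it for us was just the standard per-pool
options, something like:
ceph osd pool set {pool_name} compression_algorithm snappy
ceph osd pool set {pool_name} compression_mode aggressive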

Regards,
Rich

On Fri, 19 Aug 2022 at 09:10, Laura Flores  wrote:
>
> Hi everyone,
>
> We sent an earlier inquiry on this topic asking how many people are using
> bluestore_compression_mode, but now, we would like to know about users'
> experience in a more general sense. *Do you currently have
> bluestore_compression_mode enabled? Have you tried enabling it in the past?
> Have you chosen not to enable it? We would like to know about your
> experience!*
>
> The purpose of this inquiry is that we are trying to get a sense of
> people's experiences -- positive, negative, or anything in between -- with
> bluestore_compression_mode or the per-pool compression_mode options (these
> were introduced early in bluestore's life, but as far as we know, may not
> widely be used).  We might be able to reduce complexity in bluestore's blob
> code if we could do compression in some other fashion, so we are trying to
> get a sense of whether or not it's something worth looking into more.
>
> It would help immensely to know:
>
>1. What has been *your* experience with bluestore_compression_mode?
>Whether you currently have it enabled, tried enabling it in the past, or
>have chosen not to enable it, we would like to know about your experience.
>2. If you currently have it enabled (or had it enabled in the past),
>what is/was your use case?
>
>
> Thanks,
> Laura Flores
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage
>
> Red Hat Inc. 
>
> La Grange Park, IL
>
> lflo...@redhat.com
> M: +17087388804
> @RedHat    Red Hat
>   Red Hat
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_disk_thread_ioprio_class deprecated?

2022-05-18 Thread Richard Bade
> See this PR
> https://github.com/ceph/ceph/pull/19973

> Doing "git log -Sosd_disk_thread_ioprio_class -u
> src/common/options.cc" in the Ceph source indicates that they were
> removed in commit 3a331c8be28f59e2b9d952e5b5e864256429d9d5 which first
> appeared in Mimic.

Thanks Matthew and Josh for the info. That clears it up.
I'll pull those settings out of my config.
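(If they were set in the config database rather than ceph.conf, that's
just something like:
ceph config rm osd osd_disk_thread_ioprio_class
ceph config rm osd osd_disk_thread_ioprio_priority
otherwise it's simply deleting the lines from ceph.conf.)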

Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] osd_disk_thread_ioprio_class deprecated?

2022-05-17 Thread Richard Bade
Hi Everyone,
I've been going through our config trying to remove settings that are
no longer relevant or which are now the default setting.
The osd_disk_thread_ioprio_class and osd_disk_thread_ioprio_priority
settings come up a few times in the mailing list but no longer appear
in the ceph documentation
(https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/).
However they *do* still exist on the Red Hat Ceph Storage
documentation 
(https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/)

I couldn't find any reference to them having been deprecated in the
release notes when I searched.

They also don't come up when I do a ceph daemon osd.0 config show so
I'm fairly confident that they're deprecated.

Could anyone confirm this? And which release it was deprecated in?

Perhaps someone should also let Red Hat know to update their documentation :)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: DM-Cache for spinning OSDs

2022-05-17 Thread Richard Bade
Hey Felix,
I run bcache pretty much in the way you're describing, but we have
smaller spinning disks (4TB). We mostly share a 1TB NVMe between 6
osds, with 33GB db/wal per osd and the rest as a shared bcache cache.
The performance is definitely improved over not running a cache. We
run this mostly for a replicated rbd pool for VM disks.
As I learnt when discussing bcache in another thread here, it's
important to set rotational in sysfs for the device (e.g.
/sys/devices/virtual/block/bcache0/queue/rotational) before creating
the osd; otherwise ceph detects it as SSD storage and defaults a
bunch of parameters that may not be appropriate. I was able to
manually set some of those back and improve latency, but I think for
my setup, and possibly yours, performance is better with ceph
seeing it as fast spinning rather than slow SSD/NVMe.
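For reference, the sysfs tweak itself is just this (bcache0 being
whatever your bcache device is called):
echo 1 | sudo tee /sys/devices/virtual/block/bcache0/queue/rotational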

The reason I've done it the way you have also suggested, with a
separate db/wal rather than what Burkhard has done with the db/wal on
the cached disk, is that I want the db/wal to always be on the fast
device for all osds. I've also set bluestore_prefer_deferred_size so
that small i/o passes through the wal, which is the default behaviour
for spinning disks.
For me this gives the lowest latency, with the caveat that it means
multiple writes due to wal->bcache->spinning.
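The setting I mean is applied per device class, e.g.:
sudo ceph config set osd/class:hdd bluestore_prefer_deferred_size 32768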

Statistically the db/wal should always be in the cache anyway due to
constant read/write, and as Burkhard said it may waste space, but I
feel it's an ok compromise for consistency.

Regards,
Rich

> > Hey guys,
> >
> > i have three servers with 12x 12 TB Sata HDDs and 1x 3,4 TB NVME. I am 
> > thinking of putting DB/WAL on the NVMe as well as an 5GB DM-Cache for each 
> > spinning disk. Is anyone running something like this in a production 
> > environment?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced Cluster

2022-05-05 Thread Richard Bade
Hi David,
Something else you could try with that other pool, if it contains little or
no data, is to reduce the PG number. This does cause some backfill
operations as it does a pg merge, but this doesn't take long if the pgs are
virtually empty. If you want some advice on a suitable number, the
autoscaler has a mode where it can make recommendations for you without
actually doing anything; then you can set it manually.
If the empty pg's are a factor in the balance issues then this will help.

Also, the upmap mode on the balancer is far more effective than reweight.
It has an option where you can control the max deviation. I have this set
to one and it achieves a 5% spread for my EC cluster.
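If it helps, the commands I'm referring to are along these lines:
ceph balancer mode upmap
ceph config set mgr mgr/balancer/upmap_max_deviation 1
ceph balancer on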

Note, you'll need to reweight everything back to 1, which will cause
backfill to occur.
If you have your backfillfull level set to the default, this should stop any
osds over 85% full from doing any backfill.

Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unbalanced Cluster

2022-05-04 Thread Richard Bade
Hi David,
I think that part of the problem with unbalanced osds is that your EC
rule k=7,m=2 gives 9 total chunks and you have 9 total servers. This
is essentially tying Ceph's hands, as it has no choice about where to
put the pgs. Assuming a failure domain of host, each EC shard needs to
be on a different host.
Therefore adding another host would help, but you're still going to be
limited. For balance, a rule of k=4,m=2 would work better on 9 servers.
I had a similar experience with 6 hosts and a 4,2 EC rule.
This may also cause you other problems if you have a host failure, as
you wouldn't be able to backfill away from the failed host because 8
hosts is fewer than the 9 shards required. You would stay degraded. At
least that is my understanding; I'm happy to be corrected if I've got
that wrong.

Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph OSD purge doesn't work while rebalancing

2022-04-26 Thread Richard Bade
I agree that it would be better if it were less sensitive to unrelated
backfill. I've noticed this recently too, especially when purging
multiple osds (like a whole host). The first one succeeds but the next
one fails, even though I have norebalance set and the osd was already
out.
I guess if my process were to remove the osd from the crush map and let
it rebalance (as compared to just setting it out), then there would be
no rebalance left to do when it was purged (roughly the sequence
sketched below). That doesn't cover your case where there's
pre-existing backfill though.
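Sketch of that variant (osd id is a placeholder):
ceph osd crush remove osd.{id}
# wait for backfill to finish and PGs to go active+clean, stop the
# daemon, then:
ceph osd purge {id} --yes-i-really-mean-it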

Rich

On Tue, 26 Apr 2022 at 20:09, Benoît Knecht  wrote:
>
> Hi Stefan,
>
> On Fri, Apr 22, 2022 at 11:13:36AM +0200, Stefan Kooman wrote:
> > On 4/22/22 09:25, Benoît Knecht wrote:
> > > We use the following procedure to remove an OSD from a Ceph cluster (to 
> > > replace
> > > a defective disk for instance):
> > >
> > ># ceph osd crush reweight 559 0
> > >(Wait for the cluster to rebalance.)
> > ># ceph osd out 559
> > ># ceph osd ok-to-stop 559
> > ># ceph osd safe-to-destroy 559
> > >(Stop the OSD daemon.)
> > ># ceph osd purge 559
> > >
> > > This works great when there's no rebalancing happening on the cluster, 
> > > but if
> > > there is, the last step (ceph osd purge 559) fails with
> > >
> > ># ceph osd purge 559
> > >Error EAGAIN: OSD(s) 559 have no reported stats, and not all PGs are 
> > > active+clean; we cannot draw any conclusions.
> > >You can proceed by passing --force, but be warned that this will 
> > > likely mean real, permanent data loss.
> > >
> > > But none of the PGs are degraded, so it isn't clear to me why Ceph thinks 
> > > this
> > > is a risky operation. The only PGs that are not active+clean are
> > > active+remapped+backfill_wait or active+remapped+backfilling.
> > >
> > > Is the ceph osd purge command overly cautious, or am I overlooking an 
> > > edge-case
> > > that could lead to data loss? I know I could use --force, but I don't 
> > > want to
> > > override these safety checks if they're legitimate.
> >
> > To me this looks like Ceph being overly cautious. It appears to only
> > accept PGs in active+clean state. When you have not set "norebalance",
> > "norecover", "nobackfill" an out OSD should not have PGs mapped to it.
> >
> > Instead of purge you can do "ceph osd rm $id", "ceph osd auth rm $id"
> > and "ceph osd crush rm $id" ... but that's probably the same as using
> > "--force" with the purge command.
>
> Thanks for your feedback! I think I'll try to submit a patch for `ceph osd
> safe-to-destroy` to be a bit more permissive about acceptable PG states, as it
> would be quite convenient to be able to purge OSDs even if the cluster is
> rebalancing.
>
> Cheers,
>
> --
> Ben
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: replaced osd's get systemd errors

2022-04-21 Thread Richard Bade
Yeah, I've seen this happen when replacing osds. Like Eugen said,
there are some services that get created for mounting the volumes.
You can disable them like this:
systemctl disable ceph-volume@lvm-{osdid}-{fsid}.service

list the contents of
/etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-* to find
the enabled ones.

If you've got a lot of them, I usually disable them all:
for service in /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-*;
do service=`basename $service`; sudo systemctl disable $service; done
then re-enable them from the lvm tags to get just the ones that exist,
like this:
for lv in /dev/mapper/ceph*; do
  osd=`sudo lvs -o lv_tags $lv | tail -1 | grep -Po "ceph.osd_id=([0-9]*)" | gawk -F= '{ print $2 }'`
  fsid=`sudo lvs -o lv_tags $lv | tail -1 | grep -Po "ceph.osd_fsid=([a-z\-_0-9]*)" | gawk -F= '{ print $2 }'`
  sudo systemctl enable ceph-volume@lvm-$osd-$fsid
  sudo systemctl enable --runtime ceph-osd@$osd.service
done

Rich

On Thu, 21 Apr 2022 at 20:29, Eugen Block  wrote:
>
> These are probably remainders of previous OSDs, I remember having to
> clean up orphaned units from time to time. Compare the UUIDs to your
> actual OSDs and disable the units of the non-existing OSDs.
>
> Zitat von Marc :
>
> > I added some osd's which are up and running with:
> >
> > ceph-volume lvm create --data /dev/sdX --dmcrypt
> >
> > But I am still getting such messages of the newly created osd's
> >
> > systemd: Job
> > dev-disk-by\x2duuid-7a8df80d\x2d4a7a\x2d469f\x2d868f\x2d8fd9b7b0f09d.device/start
> >  timed
> > out.
> > systemd: Timed out waiting for device
> > dev-disk-by\x2duuid-7a8df80d\x2d4a7a\x2d469f\x2d868f\x2d8fd9b7b0f09d.device.
> > systemd: Dependency failed for /var/lib/ceph/osd/ceph-11.
> > systemd: Job
> > dev-disk-by\x2duuid-864b01aa\x2d1abf\x2d4dc0\x2da532\x2dced7cb321f4a.device/start
> >  timed
> > out.
> > systemd: Timed out waiting for device
> > dev-disk-by\x2duuid-864b01aa\x2d1abf\x2d4dc0\x2da532\x2dced7cb321f4a.device.
> > systemd: Dependency failed for /var/lib/ceph/osd/ceph-1.
> >
> >
> >
> > ceph 14.2.22
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-10 Thread Richard Bade
Ok, further testing and thinking...
Frank, you mentioned creating the osd without the cache so that it'd
be picked up as HDD not SSD. Also back in this thread Aleksandr
mentioned that the parameter rotational in sysfs is used for this.
So I checked what this parameter is being set to with bcache with and
without a cache disk attached. It's always set to zero. However I also
noticed that since this is in the "virtual" disks section this
parameter is writable.
e.g.
$ echo 1 | sudo tee /sys/devices/virtual/block/bcache0/queue/rotational
1
$ cat /sys/devices/virtual/block/bcache0/queue/rotational
1

So then I did some testing on my test cluster and created an osd on
this bcache disk set to rotational=1. The class gets correctly set to
HDD and when I query the osd metadata I get rotational=1
$ sudo ceph osd metadata 15
--->8---
"bluefs_db_rotational": "0",
"bluefs_db_type": "ssd",
--->8---
"bluestore_bdev_rotational": "1",
"bluestore_bdev_type": "hdd",
--->8---
"devices": "bcache0,md1",
--->8---
"osd_data": "/var/lib/ceph/osd/ceph-15",
"osd_objectstore": "bluestore",
"rotational": "1"

So that's looking pretty good for working around the issue for me.
I'll need to do a bunch more testing as it'll obviously be changing a
few other settings that weren't getting set before on osd creation.
It's an easy addition to my process though: after creating the bcache device,
set rotational to 1 and carry on as before.

Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-10 Thread Richard Bade
Ok, so I did some testing on each of these parameters one by one;
removing them from the config, watching the latency for a few minutes
then adding them back again.
None of them had any conclusive, statistically significant impact on
the latency except bluestore_prefer_deferred_size.
I removed it like this:
sudo ceph config rm osd/class:hdd bluestore_prefer_deferred_size

and my latency immediately increased from 2ms to 6ms. So I added it back again:
sudo ceph config set osd/class:hdd bluestore_prefer_deferred_size 32768

latency immediately dropped back to 2ms.
So this parameter is definitely able to be applied at runtime and
makes a difference to how my osds perform. As I am using separate db
partitions on the ssd this is to be expected when more is being pushed
through the wal, which I believe is what this parameter is doing.
I also notice the activity on the wal increases across these osds.

I also tested the other way, by removing all the parameters I
mentioned earlier and just adding this one. The results were the same.

So I guess an update to my original post is that when using bcache
make sure that you tweak the bluestore_prefer_deferred_size at least.
The bluestore_prefer_deferred_size_hdd value of 32768 works well for
me but there may be other values that are better.

Rich

On Mon, 11 Apr 2022 at 09:23, Richard Bade  wrote:
>
> Hi Frank,
> Thanks for your insight on this. I had done a bunch of testing on this
> over a year ago and found improvements with these settings. I then
> applied them all at once to our production cluster and confirmed the
> 3x reduction in latency, however I did not test the settings
> individually.
> It could well be, as you say, that the settings cannot be changed at
> runtime and that in fact only the other settings such as op queue and
> throttle cost are making the difference. I'll attempt to test the
> settings again this week and see which ones are actually affecting
> latency during runtime setting.
>
> > I'm not sure why with your OSD creation procedure the data part is created 
> > with the correct HDD parameters.
> I believe that at prepare time my osds get all SSD parameters. That's
> why I manually change the class and these runtime settings.
>
> Rich
>
> On Sat, 9 Apr 2022 at 00:22, Frank Schilder  wrote:
> >
> > Hi Richard,
> >
> > thanks for the additional info, now I understand the whole scenario and 
> > what might be different when using lvm and dm_cache.
> >
> > > In my process, bcache is added before osd creation as bcache creates a
> > > disk device called /dev/bcache0 for example. This is used for the data
> >
> > This is an important detail. As far as I know, dm_cache is transparent. It 
> > can be added/removed at run time and doesn't create a new device. However, 
> > I don't know if it changes the rotational attribute of the LVM device.
> >
> > I'm not sure why with your OSD creation procedure the data part is created 
> > with the correct HDD parameters. I believe at least these if not more 
> > parameters are used at prepare time only and cannot be changed after the 
> > OSD is created:
> >
> > bluestore_prefer_deferred_size = 32768
> > bluestore_compression_max_blob_size = 524288
> > bluestore_max_blob_size = 524288
> > bluestore_min_alloc_size = 65536
> >
> > If you set these for osd/class:hdd they should *not* be used if the initial 
> > device class is ssd. If I understood you correctly, you create an OSD with 
> > class=ssd and then change its class to class=hdd. At this point, however, 
> > it is too late, the hard-coded ssd options should persist. I wonder if 
> > using a command like
> >
> > ceph-volume lvm batch --crush-device-class hdd ...
> >
> > will select the right parameters irrespective of the rotational flag. How 
> > did you do it? I believe the only way to get the burned-in bluestore values 
> > was to start an OSD with high debug logging. The "config show" commands 
> > will show what is in the config DB and not what is burned onto disk (and 
> > actually used).
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Richard Bade 
> > Sent: 08 April 2022 00:08
> > To: Frank Schilder
> > Cc: Igor Fedotov; Ceph Users
> > Subject: [Warning Possible spam]  Re: [Warning Possible spam] Re: [Warning 
> > Possible spam] [ceph-users] Re: Ceph Bluestore tweaks for Bcache
> >
> > Hi Frank,
> > Yes, I think you have got to the crux of the issue.
> > > - some_config_value_hdd is used for "rotational=0" d

[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-10 Thread Richard Bade
Hi Frank,
Thanks for your insight on this. I had done a bunch of testing on this
over a year ago and found improvements with these settings. I then
applied them all at once to our production cluster and confirmed the
3x reduction in latency, however I did not test the settings
individually.
It could well be, as you say, that the settings cannot be changed at
runtime and that in fact only the other settings such as op queue and
throttle cost are making the difference. I'll attempt to test the
settings again this week and see which ones are actually affecting
latency during runtime setting.

> I'm not sure why with your OSD creation procedure the data part is created 
> with the correct HDD parameters.
I believe that at prepare time my osds get all SSD parameters. That's
why I manually change the class and these runtime settings.

Rich

On Sat, 9 Apr 2022 at 00:22, Frank Schilder  wrote:
>
> Hi Richard,
>
> thanks for the additional info, now I understand the whole scenario and what 
> might be different when using lvm and dm_cache.
>
> > In my process, bcache is added before osd creation as bcache creates a
> > disk device called /dev/bcache0 for example. This is used for the data
>
> This is an important detail. As far as I know, dm_cache is transparent. It 
> can be added/removed at run time and doesn't create a new device. However, I 
> don't know if it changes the rotational attribute of the LVM device.
>
> I'm not sure why with your OSD creation procedure the data part is created 
> with the correct HDD parameters. I believe at least these if not more 
> parameters are used at prepare time only and cannot be changed after the OSD 
> is created:
>
> bluestore_prefer_deferred_size = 32768
> bluestore_compression_max_blob_size = 524288
> bluestore_max_blob_size = 524288
> bluestore_min_alloc_size = 65536
>
> If you set these for osd/class:hdd they should *not* be used if the initial 
> device class is ssd. If I understood you correctly, you create an OSD with 
> class=ssd and then change its class to class=hdd. At this point, however, it 
> is too late, the hard-coded ssd options should persist. I wonder if using a 
> command like
>
> ceph-volume lvm batch --crush-device-class hdd ...
>
> will select the right parameters irrespective of the rotational flag. How did 
> you do it? I believe the only way to get the burned-in bluestore values was 
> to start an OSD with high debug logging. The "config show" commands will show 
> what is in the config DB and not what is burned onto disk (and actually used).
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Richard Bade 
> Sent: 08 April 2022 00:08
> To: Frank Schilder
> Cc: Igor Fedotov; Ceph Users
> Subject: [Warning Possible spam]  Re: [Warning Possible spam] Re: [Warning 
> Possible spam] [ceph-users] Re: Ceph Bluestore tweaks for Bcache
>
> Hi Frank,
> Yes, I think you have got to the crux of the issue.
> > - some_config_value_hdd is used for "rotational=0" devices and
> > - osd/class:hdd values are used for "device_class=hdd" OSDs,
>
> The class is something that is user defined and you can actually
> define your own class names. By default the class is set to ssd for
> rotational=0 and hdd for rotational=1. I override this so my osds end
> up in the right pools as my pools are class based. I also have another
> class called nvme for all nvme storage.
> So the rotational=0 and the class=ssd are actually disconnected and
> used for two different purposes.
>
> > Or are you observing that an HDD+bcache OSD comes up in device class hdd 
> > *but* bluestore thinks it is an ssd and applies SSD defaults 
> > (some_config_value_ssd) *unless* you explicitly set the config option for 
> > device class hdd?
>
> Yes, this is what I am observing, because I am manually changing the
> device class to HDD.
>
> > - OSD is prepared on HDD and put into device class hdd (with correct 
> > persistent prepare-time options)
> > - bcache is added *after* OSD creation (???)
> > - after this, on (re-)start the OSD comes up in device class hdd but 
> > bluestore thinks now its an SSD and uses some incorrect run-time config 
> > option defaults
> > - to fix the incorrect run-time options, you explicitly copy some 
> > hdd-defaults to the config data base with filter "osd/class:hdd"
>
> In my process, bcache is added before osd creation as bcache creates a
> disk device called /dev/bcache0 for example. This is used for the data
> disk. As you have surmised bluestore thinks my disks are ssd and
applies settings as such. I set the class to HDD and then I correct
runtime settings based on the class.

[ceph-users] Re: [Warning Possible spam] Re: [Warning Possible spam] Re: Ceph Bluestore tweaks for Bcache

2022-04-07 Thread Richard Bade
Hi Frank,
Yes, I think you have got to the crux of the issue.
> - some_config_value_hdd is used for "rotational=0" devices and
> - osd/class:hdd values are used for "device_class=hdd" OSDs,

The class is something that is user defined and you can actually
define your own class names. By default the class is set to ssd for
rotational=0 and hdd for rotational=1. I override this so my osds end
up in the right pools as my pools are class based. I also have another
class called nvme for all nvme storage.
So the rotational=0 and the class=ssd are actually disconnected and
used for two different purposes.

> Or are you observing that an HDD+bcache OSD comes up in device class hdd 
> *but* bluestore thinks it is an ssd and applies SSD defaults 
> (some_config_value_ssd) *unless* you explicitly set the config option for 
> device class hdd?

Yes, this is what I am observing, because I am manually changing the
device class to HDD.

> - OSD is prepared on HDD and put into device class hdd (with correct 
> persistent prepare-time options)
> - bcache is added *after* OSD creation (???)
> - after this, on (re-)start the OSD comes up in device class hdd but 
> bluestore thinks now its an SSD and uses some incorrect run-time config 
> option defaults
> - to fix the incorrect run-time options, you explicitly copy some 
> hdd-defaults to the config data base with filter "osd/class:hdd"

In my process, bcache is added before osd creation as bcache creates a
disk device called /dev/bcache0 for example. This is used for the data
disk. As you have surmised bluestore thinks my disks are ssd and
applies settings as such. I set the class to HDD and then I correct
runtime settings based on the class.
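
For reference, forcing the class is just the usual crush device-class
commands, along these lines (the osd id here is only an example):

  ceph osd crush rm-device-class osd.12
  ceph osd crush set-device-class hdd osd.12

The rm-device-class step is needed first because set-device-class won't
overwrite a class that's already assigned.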

> There is actually an interesting follow up on this. With bcache/dm_cache 
> large enough it should make sense to use SSD rocks-DB settings, because the 
> data base will fit into the cache. Are there any recommendations for tweaking 
> the prepare-time config options, in particular, the rocks-db options for such 
> hybrid drives?

In my case, this doesn't apply as I have used volumes on the ssd
specifically for the db. This means I know the db will always be on
the fast storage.
But yes, a larger cache size may change the performance and make it
closer to what ceph expects from an ssd. In my experience the ssd
settings made performance considerably worse than the hdd settings (3x
average latency) on bcache.

Regards,
Rich

On Fri, 8 Apr 2022 at 02:03, Frank Schilder  wrote:
>
> Hi Richard,
>
> so you are tweaking run-time config values, not OSD prepare-time config 
> values. There is something I don't understand here:
>
> > What I do for my settings is to set them for the hdd class (ceph config set 
> > osd/class:hdd bluestore_setting_blah=blahblah.
> > I think that's the correct syntax, but I'm not currently at a computer) in 
> > the config database.
>
> If the OSD comes up as class=hdd, then the hdd defaults should be applied any 
> way and there is no point setting these values explicitly to their defaults. 
> How do you make the OSD come up in class hdd, wasn't it your original problem 
> that the OSDs came up in class ssd? Or are you observing that an HDD+bcache 
> OSD comes up in device class hdd *but* bluestore thinks it is an ssd and 
> applies SSD defaults (some_config_value_ssd) *unless* you explicitly set the 
> config option for device class hdd?
>
> I think I am confused about the OSD device class, the drive type detected by 
> bluestore and what options are used if there is a mis-match - if there is 
> any. If I understand you correctly, it seems you observe that:
>
> - OSD is prepared on HDD and put into device class hdd (with correct 
> persistent prepare-time options)
> - bcache is added *after* OSD creation (???)
> - after this, on (re-)start the OSD comes up in device class hdd but 
> bluestore thinks now its an SSD and uses some incorrect run-time config 
> option defaults
> - to fix the incorrect run-time options, you explicitly copy some 
> hdd-defaults to the config data base with filter "osd/class:hdd"
>
> If this is correct, then I believe the underlying issue is that:
>
> - some_config_value_hdd is used for "rotational=0" devices and
> - osd/class:hdd values are used for "device_class=hdd" OSDs,
>
> which is not the same despite the string "hdd" indicating that it is.
>
> There is actually an interesting follow up on this. With bcache/dm_cache 
> large enough it should make sense to use SSD rocks-DB settings, because the 
> data base will fit into the cache. Are there any recommendations for tweaking 
> the prepare-time config options, in particular, the rocks-db options for such 
> hybrid drives?
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Warning Possible spam] Re: Ceph Bluestore tweaks for Bcache

2022-04-07 Thread Richard Bade
Hi Frank,
I can't speak for the bluestore debug enforce settings as I don't have this
setting but I would guess it's the same.
What I do for my settings is to set them for the hdd class (ceph config set
osd/class:hdd bluestore_setting_blah=blahblah. I think that's the correct
syntax, but I'm not currently at a computer) in the config database. The
class is a permanent setting on the osd so when it starts or server reboots
it automatically applies these settings based on the osd class. That way any
new osds also get them as soon as the class is defined for the osd.
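
To make the syntax concrete, it looks roughly like this (the option and
value shown are just one of the tweaks from earlier in the thread):

  ceph config set osd/class:hdd bluestore_prefer_deferred_size 32768
  ceph config show osd.0 | grep bluestore_prefer_deferred_size

The second command is only there to confirm what a given daemon actually
ends up running with.
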
Hopefully that helps.

Rich


On Thu, 7 Apr 2022, 23:40 Frank Schilder,  wrote:

> Hi Richard and Igor,
>
> are these tweaks required at build-time (osd prepare) only or are they
> required for every restart?
>
> Is this setting "bluestore debug enforce settings=hdd" in the ceph config
> data base or set somewhere else? How does this work if deploying HDD- and
> SSD-OSDs at the same time?
>
> Ideally, all these tweaks should be applicable and settable at creation
> time only without affecting generic settings (that is, at the ceph-volume
> command line and not via config side effects). Otherwise it becomes really
> tedious to manage these.
>
> For example, would the following work-flow apply the correct settings
> *permanently* across restarts:
>
> 1) Prepare OSD on fresh HDD with ceph-volume lvm batch --prepare ...
> 2) Assign dm_cache to logical OSD volume created in step 1
> 3) Start OSD, restart OSDs, boot server ...
>
> I would assume that the HDD settings are burned into the OSD in step 1 and
> will be used in all future (re-)starts without the need to do anything
> despite the device being detected as non-rotational after step 2. Is this
> assumption correct?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Richard Bade 
> Sent: 06 April 2022 00:43:48
> To: Igor Fedotov
> Cc: Ceph Users
> Subject: [Warning Possible spam]  [ceph-users] Re: Ceph Bluestore tweaks
> for Bcache
>
> Just for completeness for anyone that is following this thread. Igor
> added that setting in Octopus, so unfortunately I am unable to use it
> as I am still on Nautilus.
>
> Thanks,
> Rich
>
> On Wed, 6 Apr 2022 at 10:01, Richard Bade  wrote:
> >
> > Thanks Igor for the tip. I'll see if I can use this to reduce the
> > number of tweaks I need.
> >
> > Rich
> >
> > On Tue, 5 Apr 2022 at 21:26, Igor Fedotov  wrote:
> > >
> > > Hi Richard,
> > >
> > > just FYI: one can use "bluestore debug enforce settings=hdd" config
> > > parameter to manually enforce HDD-related  settings for a BlueStore
> > >
> > >
> > > Thanks,
> > >
> > > Igor
> > >
> > > On 4/5/2022 1:07 AM, Richard Bade wrote:
> > > > Hi Everyone,
> > > > I just wanted to share a discovery I made about running bluestore on
> > > > top of Bcache in case anyone else is doing this or considering it.
> > > > We've run Bcache under Filestore for a long time with good results
> but
> > > > recently rebuilt all the osds on bluestore. This caused some
> > > > degradation in performance that I couldn't quite put my finger on.
> > > > Bluestore osds have some smarts where they detect the disk type.
> > > > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > > > the HDD parameters are better suited.
> > > > I changed the following parameters to match the HDD default values
> and
> > > > immediately saw my average osd latency during normal workload drop
> > > > from 6ms to 2ms. Peak performance didn't change really, but a test
> > > > machine that I have running a constant iops workload was much more
> > > > stable as was the average latency.
> > > > Performance has returned to Filestore or better levels.
> > > > Here are the parameters.
> > > >
> > > >   ; Make sure that we use values appropriate for HDD not SSD - Bcache
> > > > gets detected as SSD
> > > >   bluestore_prefer_deferred_size = 32768
> > > >   bluestore_compression_max_blob_size = 524288
> > > >   bluestore_deferred_batch_ops = 64
> > > >   bluestore_max_blob_size = 524288
> > > >   bluestore_min_alloc_size = 65536
> > > >   bluestore_throttle_cost_per_io = 67
> > > >
> > > >   ; Try to improve responsiveness when some disks are fully utilised
> > > &

[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-05 Thread Richard Bade
Just for completeness for anyone that is following this thread. Igor
added that setting in Octopus, so unfortunately I am unable to use it
as I am still on Nautilus.
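
For anyone on Octopus or newer who wants to try it, I believe the
config-db form of Igor's setting is something like the following
(untested by me, since I'm on Nautilus):

  ceph config set osd bluestore_debug_enforce_settings hdd

followed by an osd restart so the bluestore settings get re-evaluated.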

Thanks,
Rich

On Wed, 6 Apr 2022 at 10:01, Richard Bade  wrote:
>
> Thanks Igor for the tip. I'll see if I can use this to reduce the
> number of tweaks I need.
>
> Rich
>
> On Tue, 5 Apr 2022 at 21:26, Igor Fedotov  wrote:
> >
> > Hi Richard,
> >
> > just FYI: one can use "bluestore debug enforce settings=hdd" config
> > parameter to manually enforce HDD-related  settings for a BlueStore
> >
> >
> > Thanks,
> >
> > Igor
> >
> > On 4/5/2022 1:07 AM, Richard Bade wrote:
> > > Hi Everyone,
> > > I just wanted to share a discovery I made about running bluestore on
> > > top of Bcache in case anyone else is doing this or considering it.
> > > We've run Bcache under Filestore for a long time with good results but
> > > recently rebuilt all the osds on bluestore. This caused some
> > > degradation in performance that I couldn't quite put my finger on.
> > > Bluestore osds have some smarts where they detect the disk type.
> > > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > > the HDD parameters are better suited.
> > > I changed the following parameters to match the HDD default values and
> > > immediately saw my average osd latency during normal workload drop
> > > from 6ms to 2ms. Peak performance didn't change really, but a test
> > > machine that I have running a constant iops workload was much more
> > > stable as was the average latency.
> > > Performance has returned to Filestore or better levels.
> > > Here are the parameters.
> > >
> > >   ; Make sure that we use values appropriate for HDD not SSD - Bcache
> > > gets detected as SSD
> > >   bluestore_prefer_deferred_size = 32768
> > >   bluestore_compression_max_blob_size = 524288
> > >   bluestore_deferred_batch_ops = 64
> > >   bluestore_max_blob_size = 524288
> > >   bluestore_min_alloc_size = 65536
> > >   bluestore_throttle_cost_per_io = 67
> > >
> > >   ; Try to improve responsiveness when some disks are fully utilised
> > >   osd_op_queue = wpq
> > >   osd_op_queue_cut_off = high
> > >
> > > Hopefully someone else finds this useful.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > --
> > Igor Fedotov
> > Ceph Lead Developer
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-05 Thread Richard Bade
Thanks, this should help me with some debugging around the setting
Igor suggested.

Rich

On Tue, 5 Apr 2022 at 21:20, Rudenko Aleksandr  wrote:
>
> OSD uses sysfs device parameter "rotational" for detecting device type 
> (HDD/SSD).
>
> You can see it:
>
> ceph osd metadata {osd_id}
>
> On 05.04.2022, 11:49, "Richard Bade"  wrote:
>
> Hi Frank, yes I changed the device class to HDD but there seems to be some
> smarts in the background that apply the different settings that are not
> based on the class but some other internal mechanism.
> However, I did apply the class after creating the osd, rather than during.
> If someone knows how to manually specify this, I'd also be interested to
> know.
>
> I probably should also have said that I am using Nautilus and it may be
> different in newer versions.
>
> Rich
>
>
> On Tue, 5 Apr 2022, 20:39 Frank Schilder,  wrote:
>
> > Hi Richard,
> >
> > I'm planning to use dm_cache with bluestore OSDs on LVM. I was also
> > wondering how the device will be detected. I guess if I build the OSD
> > before assigning dm_cache space it will use the usual HDD defaults. Did 
> you
> > try forcing the OSD to be in class HDD on build? I believe the OSD 
> create
> > commands have a flag for that.
> >
> > If any of the OSD gurus looks at this, could you possibly point to a
> > reference about what parameters might need attention in such scenarios 
> and
> > what the preferred deployment method would be?
> >
> > Thanks and best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Richard Bade 
> > Sent: 05 April 2022 00:07:34
> > To: Ceph Users
> > Subject: [ceph-users] Ceph Bluestore tweaks for Bcache
> >
> > Hi Everyone,
> > I just wanted to share a discovery I made about running bluestore on
> > top of Bcache in case anyone else is doing this or considering it.
> > We've run Bcache under Filestore for a long time with good results but
> > recently rebuilt all the osds on bluestore. This caused some
> > degradation in performance that I couldn't quite put my finger on.
> > Bluestore osds have some smarts where they detect the disk type.
> > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > the HDD parameters are better suited.
> > I changed the following parameters to match the HDD default values and
> > immediately saw my average osd latency during normal workload drop
> > from 6ms to 2ms. Peak performance didn't change really, but a test
> > machine that I have running a constant iops workload was much more
> > stable as was the average latency.
> > Performance has returned to Filestore or better levels.
> > Here are the parameters.
> >
> >  ; Make sure that we use values appropriate for HDD not SSD - Bcache
> > gets detected as SSD
> >  bluestore_prefer_deferred_size = 32768
> >  bluestore_compression_max_blob_size = 524288
> >  bluestore_deferred_batch_ops = 64
> >  bluestore_max_blob_size = 524288
> >  bluestore_min_alloc_size = 65536
> >  bluestore_throttle_cost_per_io = 67
> >
> >  ; Try to improve responsiveness when some disks are fully utilised
> >  osd_op_queue = wpq
> >  osd_op_queue_cut_off = high
> >
> > Hopefully someone else finds this useful.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-05 Thread Richard Bade
Thanks Igor for the tip. I'll see if I can use this to reduce the
number of tweaks I need.

Rich

On Tue, 5 Apr 2022 at 21:26, Igor Fedotov  wrote:
>
> Hi Richard,
>
> just FYI: one can use "bluestore debug enforce settings=hdd" config
> parameter to manually enforce HDD-related  settings for a BlueStore
>
>
> Thanks,
>
> Igor
>
> On 4/5/2022 1:07 AM, Richard Bade wrote:
> > Hi Everyone,
> > I just wanted to share a discovery I made about running bluestore on
> > top of Bcache in case anyone else is doing this or considering it.
> > We've run Bcache under Filestore for a long time with good results but
> > recently rebuilt all the osds on bluestore. This caused some
> > degradation in performance that I couldn't quite put my finger on.
> > Bluestore osds have some smarts where they detect the disk type.
> > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > the HDD parameters are better suited.
> > I changed the following parameters to match the HDD default values and
> > immediately saw my average osd latency during normal workload drop
> > from 6ms to 2ms. Peak performance didn't change really, but a test
> > machine that I have running a constant iops workload was much more
> > stable as was the average latency.
> > Performance has returned to Filestore or better levels.
> > Here are the parameters.
> >
> >   ; Make sure that we use values appropriate for HDD not SSD - Bcache
> > gets detected as SSD
> >   bluestore_prefer_deferred_size = 32768
> >   bluestore_compression_max_blob_size = 524288
> >   bluestore_deferred_batch_ops = 64
> >   bluestore_max_blob_size = 524288
> >   bluestore_min_alloc_size = 65536
> >   bluestore_throttle_cost_per_io = 67
> >
> >   ; Try to improve responsiveness when some disks are fully utilised
> >   osd_op_queue = wpq
> >   osd_op_queue_cut_off = high
> >
> > Hopefully someone else finds this useful.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Bluestore tweaks for Bcache

2022-04-05 Thread Richard Bade
Hi Frank, yes I changed the device class to HDD but there seems to be some
smarts in the background that apply the different settings that are not
based on the class but some other internal mechanism.
However, I did apply the class after creating the osd, rather than during.
If someone knows how to manually specify this, I'd also be interested to
know.

I probably should also have said that I am using Nautilus and it may be
different in newer versions.

Rich


On Tue, 5 Apr 2022, 20:39 Frank Schilder,  wrote:

> Hi Richard,
>
> I'm planning to use dm_cache with bluestore OSDs on LVM. I was also
> wondering how the device will be detected. I guess if I build the OSD
> before assigning dm_cache space it will use the usual HDD defaults. Did you
> try forcing the OSD to be in class HDD on build? I believe the OSD create
> commands have a flag for that.
>
> If any of the OSD gurus looks at this, could you possibly point to a
> reference about what parameters might need attention in such scenarios and
> what the preferred deployment method would be?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Richard Bade 
> Sent: 05 April 2022 00:07:34
> To: Ceph Users
> Subject: [ceph-users] Ceph Bluestore tweaks for Bcache
>
> Hi Everyone,
> I just wanted to share a discovery I made about running bluestore on
> top of Bcache in case anyone else is doing this or considering it.
> We've run Bcache under Filestore for a long time with good results but
> recently rebuilt all the osds on bluestore. This caused some
> degradation in performance that I couldn't quite put my finger on.
> Bluestore osds have some smarts where they detect the disk type.
> Unfortunately in the case of Bcache it detects as SSD, when in fact
> the HDD parameters are better suited.
> I changed the following parameters to match the HDD default values and
> immediately saw my average osd latency during normal workload drop
> from 6ms to 2ms. Peak performance didn't change really, but a test
> machine that I have running a constant iops workload was much more
> stable as was the average latency.
> Performance has returned to Filestore or better levels.
> Here are the parameters.
>
>  ; Make sure that we use values appropriate for HDD not SSD - Bcache
> gets detected as SSD
>  bluestore_prefer_deferred_size = 32768
>  bluestore_compression_max_blob_size = 524288
>  bluestore_deferred_batch_ops = 64
>  bluestore_max_blob_size = 524288
>  bluestore_min_alloc_size = 65536
>  bluestore_throttle_cost_per_io = 67
>
>  ; Try to improve responsiveness when some disks are fully utilised
>  osd_op_queue = wpq
>  osd_op_queue_cut_off = high
>
> Hopefully someone else finds this useful.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph Bluestore tweaks for Bcache

2022-04-04 Thread Richard Bade
Hi Everyone,
I just wanted to share a discovery I made about running bluestore on
top of Bcache in case anyone else is doing this or considering it.
We've run Bcache under Filestore for a long time with good results but
recently rebuilt all the osds on bluestore. This caused some
degradation in performance that I couldn't quite put my finger on.
Bluestore osds have some smarts where they detect the disk type.
Unfortunately in the case of Bcache it detects as SSD, when in fact
the HDD parameters are better suited.
I changed the following parameters to match the HDD default values and
immediately saw my average osd latency during normal workload drop
from 6ms to 2ms. Peak performance didn't change really, but a test
machine that I have running a constant iops workload was much more
stable as was the average latency.
Performance has returned to Filestore or better levels.
Here are the parameters.

 ; Make sure that we use values appropriate for HDD not SSD - Bcache
gets detected as SSD
 bluestore_prefer_deferred_size = 32768
 bluestore_compression_max_blob_size = 524288
 bluestore_deferred_batch_ops = 64
 bluestore_max_blob_size = 524288
 bluestore_min_alloc_size = 65536
 bluestore_throttle_cost_per_io = 67

 ; Try to improve responsiveness when some disks are fully utilised
 osd_op_queue = wpq
 osd_op_queue_cut_off = high

Hopefully someone else finds this useful.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Centralized config mask not being applied to host

2021-11-25 Thread Richard Bade
Hi Mark,
I have noticed exactly the same thing on Nautilus where host didn't
work but chassis did work. I posted to this mailing list a few weeks
ago.
It's very strange that the host filter is not working. I also could
not find any errors logged for this, so it looks like it's just
ignoring the setting.
I don't know if this is still a problem in Octopus or Pacific as I
have not had a chance to upgrade our dev cluster yet.

My workaround is that I've got a chassis level in my crush above host
and the settings which differ are the same across a chassis. This may
not suit your situation though.
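
In case it's useful, setting that up is only a few commands, roughly
like this (the bucket and host names are just examples):

  ceph osd crush add-bucket chassis-a chassis
  ceph osd crush move chassis-a root=default
  ceph osd crush move ceph2 chassis=chassis-a
  ceph config set osd/chassis:chassis-a osd_memory_target 1073741824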

Rich

On Fri, 26 Nov 2021 at 15:23, Mark Kirkwood
 wrote:
>
> HI all,
>
> I'm looking at doing a Luminous to Nautilus upgrade.  I'd like to
> assimilate the config into the mon db. However we do have hosts with
> differing [osd] config sections in their current ceph.conf files. I was
> looking at using the crush type host:xxx to set these differently if
> required.
>
> However my test case is not applying the mask to the particular host, e.g:
>
> markir@ceph2:~$ sudo ceph config set osd/host:ceph2 osd_memory_target
> 1073741824
>
> markir@ceph2:~$ sudo ceph config dump
> WHO     MASK        LEVEL     OPTION                                          VALUE             RO
> global              advanced  auth_client_required                            cephx             *
> global              advanced  auth_cluster_required                           cephx             *
> global              advanced  auth_service_required                           cephx             *
> global              advanced  cluster_network                                 192.168.124.0/24  *
> global              advanced  osd_pool_default_size                           2
> global              advanced  public_network                                  192.168.123.0/24  *
>   mon               advanced  mon_warn_on_insecure_global_id_reclaim          false
>   mon               advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
>   osd   host:ceph2  basic     osd_memory_target                               1073741824
>
>
> markir@ceph2:~$ sudo ceph config get osd.1
> WHO     MASK        LEVEL     OPTION                 VALUE             RO
> global              advanced  auth_client_required   cephx             *
> global              advanced  auth_cluster_required  cephx             *
> global              advanced  auth_service_required  cephx             *
> global              advanced  cluster_network        192.168.124.0/24  *
> osd     host:ceph2  basic     osd_memory_target      1073741824
> global              advanced  osd_pool_default_size  2
> global              advanced  public_network         192.168.123.0/24  *
>
> markir@ceph2:~$ sudo ceph config show osd.1
> NAME  VALUE SOURCE OVERRIDES IGNORES
> auth_client_required  cephx mon
> auth_cluster_required cephx mon
> auth_service_required cephx mon
> cluster_network   192.168.124.0/24 mon
> daemonize false override
> keyring   $osd_data/keyring default
> leveldb_log default
> mon_host  192.168.123.20 file
> mon_initial_members   ceph0 file
> osd_pool_default_size 2 mon
> public_network192.168.123.0/24 mon
> rbd_default_features  61 default
> setgroup  ceph cmdline
> setuser   ceph  cmdline
>
> If I use a different mask, e.g: osd/class:hdd or even osd/root:default
> then the setting *is* applied. I'm scratching my head about this - no
> errors in either the mon or osd log to indicate why it is not being applied.
>
> The is ceph 14.2.22. And osd.1 is really on host ceph2:
>
> markir@ceph2:~$ sudo ceph osd tree
> ID CLASS WEIGHT  TYPE NAME  STATUS REWEIGHT PRI-AFF
> -1   0.11719 root default
> -3   0.02930 host ceph1
>   0   hdd 0.02930 osd.0  up  1.0 1.0
> -5   0.02930 host ceph2
>   1   hdd 0.02930 osd.1  up  1.0 1.0
> -7   0.02930 host ceph3
>   2   hdd 0.02930 osd.2  up  1.0 1.0
> -9   0.02930 host ceph4
>   3   hdd 0.02930 osd.3  up  1.0 1.0
>
> regards
>
> Mark
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] config db host filter issue

2021-10-19 Thread Richard Bade
Hi Everyone,
I think this might be a bug so I'm wondering if anyone else has seen this.
The issue is that config db filters for host don't seem to work.
I was able to reproduce this on both prod and dev clusters that I
tried it on with Nautilus 14.2.22.

The osd I'm testing (osd.0) is under this tree:
 -8   20.32744 datacenter dc01
 -3   20.32744 rack dc01-rack01
-178.16148 chassis chassis01
 -24.07590 host cstor01
  0   hdd  1.0 osd.0 up  1.0 1.0

I'm testing by changing the osd_max_backfills setting but it was the
same with others I tried too.
Here's an example setting with filter host then with filter chassis,
but it worked for rack too.
user@cstor01 DEV:~$ sudo ceph config set osd/host:cstor01 osd_max_backfills 2
user@cstor01 DEV:~$ sudo ceph config get osd.0 osd_max_backfills
2
user@cstor01 DEV:~$ sudo ceph config show osd.0 | grep osd_max_backfills
osd_max_backfills   1  mon
user@cstor01 DEV:~$ sudo ceph config rm osd/host:cstor01 osd_max_backfills
user@cstor01 DEV:~$ sudo ceph config set osd/chassis:chassis01
osd_max_backfills 2
user@cstor01 DEV:~$ sudo ceph config show osd.0 | grep osd_max_backfills
osd_max_backfills   2  mon

As you can see, the max backfills stays at 1 in the daemon when using
host filter but changes correctly to 2 when using the chassis filter.

Is this a bug? I did a search but couldn't locate anything that
sounded like this issue.
Are others able to reproduce?

Thanks,
Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer vs. Autoscaler

2021-09-22 Thread Richard Bade
If you look at the current pg_num in the 'pool ls detail' output that
Dan mentioned, you can set the pool's pg_num to that current value,
which will effectively pause the pg changes. I did this recently
when decreasing the number of pg's in a pool, which took several weeks
to complete. This let me get some other maintenance done before
setting the pg_num back to the target num again.
This works well for reduction, but I'm not sure if it works well for
increase as I think the pg_num may reach the target much faster and
then just the pgp_num changes till they match.
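
As a rough sketch of what I mean (the pool name and numbers are only
examples):

  ceph osd pool ls detail | grep mypool   # note the current pg_num and pg_num_target
  ceph osd pool set mypool pg_num 512     # set pg_num to whatever it currently is, to pause
  # ...do your maintenance...
  ceph osd pool set mypool pg_num 256     # set it back to the target to resume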

Rich

On Wed, 22 Sept 2021 at 23:06, Dan van der Ster  wrote:
>
> To get an idea how much work is left, take a look at `ceph osd pool ls
> detail`. There should be pg_num_target... The osds will merge or split PGs
> until pg_num matches that value.
>
> .. Dan
>
>
> On Wed, 22 Sep 2021, 11:04 Jan-Philipp Litza,  wrote:
>
> > Hi everyone,
> >
> > I had the autoscale_mode set to "on" and the autoscaler went to work and
> > started adjusting the number of PGs in that pool. Since this implies a
> > huge shift in data, the reweights that the balancer had carefully
> > adjusted (in crush-compat mode) are now rubbish, and more and more OSDs
> > become nearful (we sadly have very different sized OSDs).
> >
> > Now apparently both manager modules, balancer and pg_autoscaler, have
> > the same threshold for operation, namely target_max_misplaced_ratio. So
> > the balancer won't become active as long as the pg_autoscaler is still
> > adjusting the number of PGs.
> >
> > I already set the autoscale_mode to "warn" on all pools, but apparently
> > the autoscaler is determined to finish what it started.
> >
> > Is there any way to pause the autoscaler so the balancer has a chance of
> > fixing the reweights? Because even in manual mode (ceph balancer
> > optimize), the balancer won't compute a plan when the misplaced ratio is
> > higher than target_max_misplaced_ratio.
> >
> > I know about "ceph osd reweight-*", but they adjust the reweights
> > (visible in "ceph osd tree"), whereas the balancer adjusts the "compat
> > weight-set", which I don't know how to convert back to the old-style
> > reweights.
> >
> > Best regards,
> > Jan-Philipp
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Edit crush rule

2021-09-07 Thread Richard Bade
Hi Budai,
I agree with Nathan, just switch the crush rule. I've recently done
this on one of our clusters.
Create a new crush rule the same as your old one except with different
failure domain.
Then use: ceph osd pool set {pool_name} crush_rule {new_rule_name}
Very easy.
This may kick off some backfill so I'd suggest setting norebalance
before doing this.
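
As a sketch (the rule and pool names are only examples):

  ceph osd set norebalance
  ceph osd crush rule create-replicated replicated_rack default rack
  ceph osd pool set mypool crush_rule replicated_rack
  ceph osd unset norebalance   # when you're ready to let the backfill run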

Rich

On Wed, 8 Sept 2021 at 07:51, Nathan Fish  wrote:
>
> I believe you would create a new rule and switch?
>
> On Tue, Sep 7, 2021 at 3:46 PM Budai Laszlo  wrote:
> >
> > Dear all,
> >
> > is there a way to change the failure domain of a CRUSH rule using the CLI?
> >
> > I know I can do that by editing the crush map. I'm curious if there is a 
> > "CLI way"?
> >
> > Thank you,
> > Laszlo
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: BUG #51821 - client is using insecure global_id reclaim

2021-08-08 Thread Richard Bade
Hi Daniel,
I had a similar issue last week after upgrading my test cluster from
14.2.13 to 14.2.22, which included the fix for Global ID reclaim in
14.2.20. My issue was a rados gw that I was re-deploying on the latest
version. The problem seemed to be related to cephx authentication.
It kept displaying the error message you have and the service wouldn't
start.
I ended up stopping and removing the old rgw service, deleting all the
keys in /etc/ceph/ and all data in /var/lib/ceph/radosgw/ and
re-deploying the radosgw. This used the new rgw bootstrap keys and new
key for this radosgw.
So, I would suggest you double and triple check which keys your
clients are using and that cephx is enabled correctly on your cluster.
Check your admin key in /etc/ceph as well, as that's what's being used
for ceph status.
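
A few commands that can help narrow it down (and only flip the mon
setting to false once every client is on a patched version):

  ceph health detail   # should list which clients/daemons are still using insecure reclaim
  ceph config get mon auth_allow_insecure_global_id_reclaim
  ceph auth get client.admin   # check the key matches what's in /etc/ceph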

Regards,
Rich

On Sun, 8 Aug 2021 at 05:01, Daniel Persson  wrote:
>
> Hi everyone.
>
> I suggested asking for help here instead of in the bug tracker so that I
> will try it.
>
> https://tracker.ceph.com/issues/51821?next_issue_id=51820_issue_id=51824
>
> I have a problem that I can't seem to figure out how to resolve the issue.
>
> AUTH_INSECURE_GLOBAL_ID_RECLAIM: client is using insecure global_id reclaim
> AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
> global_id reclaim
>
>
> Both of these have to do with reclaiming ID and securing that no client
> could steal or reuse another client's ID. I understand the reason for this
> and want to resolve the issue.
>
> Currently, I have three different clients.
>
> * One Windows client using the latest Ceph-Dokan build. (ceph version
> 15.0.0-22274-g5656003758 (5656003758614f8fd2a8c49c2e7d4f5cd637b0ea) pacific
> (rc))
> * One Linux Debian build using the built packages for that kernel. (
> 4.19.0-17-amd64)
> * And one client that I've built from source for a raspberry PI as there is
> no arm build for the Pacific release. (5.11.0-1015-raspi)
>
> If I switch over to not allow global id reclaim, none of these clients
> could connect, and using the command "ceph status" on one of my nodes will
> also fail.
>
> All of them giving the same error message:
>
> monclient(hunting): handle_auth_bad_method server allowed_methods [2]
> but i only support [2]
>
>
> Has anyone encountered this problem and have any suggestions?
>
> PS. The reason I have 3 different hosts is that this is a test environment
> where I try to resolve and look at issues before we upgrade our production
> environment to pacific. DS.
>
> Best regards
> Daniel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CEPH logs to Graylog

2021-07-04 Thread Richard Bade
Hi Milosz,
I don't have any experience with the settings you're using so can't
help there, but I do log to graylog via syslog.
This is what I do, in case it's helpful as a workaround.
In ceph.conf global section or config db:
 log to syslog = true
 err to syslog = true

in rsyslog.conf add preserve hostname to get fqdn hostnames
$PreserveFQDN on

in rsyslog.d create a file to catch all. You could, of course, just
specify ceph related logs here if you don't want host logs.
*.* @ip_address_of_graylog

Regards,
Rich

On Fri, 2 Jul 2021 at 17:25,  wrote:
>
> Hi,
>
> I want to have the cluster logs in Graylog, but it seems like CEPH sends an
> empty "host" field. Can anyone help?
>
> CEPH 16.2.3
>   # ceph config dump | grep graylog
> global advanced clog_to_graylog true
> global advanced clog_to_graylog_host xx.xx.xx.xx
> global basic err_to_graylog true
> global basic log_graylog_host xx.xx.xx.xx *
> global basic log_to_graylog true
>
> I see that my Graylog is hit by traffic from ceph on port 12201 udp and
> parsed by GELF udp
>
> Grylog logs:
>
> 2021-07-01 12:16:57,355 ERROR:
> org.graylog2.shared.buffers.processors.DecodingProcessor - Error
> processing message RawMessage{id=3ad5c6a1-da66-11eb-a55c-0242ac120005,
> messageQueueId=2810784, codec=gelf, payloadSize=340,
> timestamp=2021-07-01T12:16:57.354Z, remoteAddress=/xx.xx.xx.xx:34049}
> java.lang.IllegalArgumentException: GELF message
> <3ad5c6a1-da66-11eb-a55c-0242ac120005> (received from
> ) has empty mandatory "host" field.
>  at
> org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:247)
> ~[graylog.jar:?]
>  at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:140)
> ~[graylog.jar:?]
>  at
> org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:153)
> ~[graylog.jar:?]
>  at
> org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94)
> [graylog.jar:?]
>  at
> org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90)
> [graylog.jar:?]
>  at
> org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47)
> [graylog.jar:?]
>  at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143)
> [graylog.jar:?]
>  at
> com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
> [graylog.jar:?]
>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
>
> Best regards Milosz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD bootstrap time

2021-06-08 Thread Richard Bade
Hi Jan-Philipp,
I've noticed this a couple of times on Nautilus after doing some large
backfill operations. It seems the osd map doesn't clear properly after
the cluster returns to Health OK and builds up on the mons. I do a
"du" on the mon folder e.g. du -shx /var/lib/ceph/mon/ and this shows
several GB of data.
I give all my mgrs and mons a restart and after a few minutes I can
see this osd map data getting purged from the mons. After a while it
should be back to a few hundred MB (depending on cluster size).
This may not be the problem in your case, but an easy thing to try.
Note, if your cluster is being held in Warning or Error by something
this can also explain the osd maps not clearing. Make sure you get the
cluster back to health OK first.
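
For reference, the check and restart I do is roughly this (a
package-based deployment is assumed, so adjust the unit names to suit
your setup):

  du -shx /var/lib/ceph/mon/ceph-$(hostname -s)    # several GB suggests old osdmaps are piling up
  sudo systemctl restart ceph-mgr@$(hostname -s)   # on each mgr host
  sudo systemctl restart ceph-mon@$(hostname -s)   # on each mon host, one at a time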

Rich

On Wed, 9 Jun 2021 at 08:29, Jan-Philipp Litza  wrote:
>
> Hi everyone,
>
> recently I'm noticing that starting OSDs for the first time takes ages
> (like, more than an hour) before they are even picked up by the monitors
> as "up" and start backfilling. I'm not entirely sure if this is a new
> phenomenon or if it always was that way. Either way, I'd like to
> understand why.
>
> When I execute `ceph daemon osd.X status`, it says "state: preboot" and
> I can see the "newest_map" increase slowly. Apparently, a new OSD
> doesn't fetch the latest OSD map and gets to work, but instead fetches
> hundreds of thousands of OSD maps from the mon, burning CPU while
> parsing them.
>
> I wasn't able to find any good documentation on the OSDMap, in
> particular why its historical versions need to be kept and why the OSD
> seemingly needs so many of them. Can anybody point me in the right
> direction? Or is something wrong with my cluster?
>
> Best regards,
> Jan-Philipp Litza
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: any experience on using Bcache on top of HDD OSD

2021-04-19 Thread Richard Bade
Hi,
I also have used bcache extensively on filestore with journals on SSD
for at least 5 years. This has worked very well in all versions up to
luminous. The iops improvement was definitely beneficial for vm disk
images in rbd. I am also using it under bluestore with db/wal on nvme
on both Luminous and Nautilus. This is a smaller cluster also for vm
disk images on rbd, and the iops appeared to be improved. However as
Matthias mentioned, be sure to get the right types of SSD's as you can
have strange performance issues. Make sure you do fio testing as I
found on paper Intel DC S4600 looked ok but performed very badly in
sequential 4k writes. Also keep an eye on your drive writes per day as
in a busy cluster you may hit 1DWPD or more.
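
The kind of fio test I mean is a small sync-write run against the
candidate SSD, something along these lines (destructive, so only on an
empty device; the device name is just an example):

  fio --name=ssd-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
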
Stability has been very good with bcache. I'm using it on Ubuntu and
have had to use newer kernels on some of the historical LTS releases.
In 16.04 the 4.15 edge kernel is recommended, newer versions stock
kernel is fine.
Rich

On Tue, 20 Apr 2021 at 03:37, Matthias Ferdinand  wrote:
>
> On Sun, Apr 18, 2021 at 10:31:30PM +0200, huxia...@horebdata.cn wrote:
> > Dear Cephers,
> >
> > Just curious about any one who has some experience on using Bcache on top 
> > of HDD OSD to accelerate IOPS performance?
> >
> > If any, how about the stability and the performance improvement, and for 
> > how long the running time?
>
> Hi,
>
> I have not used bcache with Bluestore, but I use bcache in a Jewel
> cluster with Filestore on XFS on bcache on HDD, and I haven't seen any
> bcache-related trouble with this setup so far. I don't have journal on
> bcache, journal is separated out to SSD.
> From time to time I drained and detached the caching device from an OSD
> just to see if the added complexity still has some value, but latency
> (as measured by iostat on the OSD) would go up by a factor of about 2 so
> I kept bcache active on HDD OSDs.
>
> For Bluestore I don't know how much (if at all) bcache would improve
> performance where WAL/DB is placed on SSD already.
>
> A word of warning: never use non-DC-class SSDs for bcache caching
> devices. bcache does some write amplification, and you will regret it if
> using consumer grade SSDs. Might even turn out slower than plain HDD
> with highly irregular latency spikes.
>
>
> Regards
> Matthias
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about migrating from iSCSI to RBD

2021-03-16 Thread Richard Bade
Hi Justin,
I did some testing with iscsi a year or so ago. It was just using
standard rbd images in the backend so yes I think your theory of
stopping iscsi to release the locks and then providing access to the
rbd image would work.
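
As a sketch of the cut-over (the image spec comes from your rbd info
output; the service names may differ depending on how ceph-iscsi was
deployed, and you'd want a maintenance window):

  # on each iSCSI gateway
  systemctl stop rbd-target-gw rbd-target-api tcmu-runner
  # confirm the exclusive lock has been released
  rbd lock ls iscsi/pool-name
  # then on the Linux client (older kernels may first need
  # rbd feature disable iscsi/pool-name object-map fast-diff deep-flatten)
  rbd map iscsi/pool-name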

Rich

On Wed, 17 Mar 2021 at 09:53, Justin Goetz  wrote:
>
> Hello!
>
> I was hoping to inquire if anyone here has attempted similar operations,
> and if they ran into any issues. To give a brief overview of my
> situation, I have a standard octopus cluster running 15.2.2, with
> ceph-iscsi installed via ansible. The original scope of a project we
> were working on changed, and we no longer need the iSCSI overhead added
> to the project (the machine using CEPH is Linux, so we would like to use
> native RBD block devices instead).
>
> Ideally we would create some new pools and migrate the data from the
> iSCSI pools over to the new pools, however, due to the massive amount of
> data (close to 200 TB), we lack the physical resources necessary to copy
> the files.
>
> Digging a bit on the backend of the pools utilized by ceph-iscsi, it
> appears that the iSCSI utility uses standard RBD images on the actual
> backend:
>
> ~]# rbd info iscsi/pool-name
> rbd image 'pool-name':
>  size 200 TiB in 52428800 objects
>  order 22 (4 MiB objects)
>  snapshot_count: 0
>  id: 137b45a37ad84a
>  block_name_prefix: rbd_data.137b45a37ad84a
>  format: 2
>  features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
>  op_features:
>  flags: object map invalid, fast diff invalid
>  create_timestamp: Thu Nov 12 16:14:31 2020
>  access_timestamp: Tue Mar 16 16:13:41 2021
>  modify_timestamp: Tue Mar 16 16:15:36 2021
>
> And I can also see that, like a standard rbd image, our 1st iSCSI
> gateway currently holds the lock on the image:
>
> ]# rbd lock ls --pool iscsi pool-name
> There is 1 exclusive lock on this image.
> Locker  ID  Address
> client.3618592  auto 259361792  10.101.12.61:0/1613659642
>
> Theoretically speaking, would I be able to simply stop & disable the
> tcmu-runner processes on all iSCSI gateways in our cluster, which would
> release the lock on the RBD image, then create another user with rwx
> permissions to the iscsi pool? Would this work, or am I missing
> something that would come back to bite me later on?
>
> Looking for any advice on this topic. Thanks in advance for reading!
>
> --
>
> Justin Goetz
> Systems Engineer, TeraSwitch Inc.
> jgo...@teraswitch.com
> 412-945-7045 (NOC) | 412-459-7945 (Direct)
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: best use of NVMe drives

2021-02-16 Thread Richard Bade
Hi Magnus,
I agree with your last suggestion, putting the OSD DB on NVMe would be
a good idea. I'm assuming you are referring to the Bluestore DB rather
than filestore journal since you mentioned your cluster is Nautilus.
We have a cephfs cluster set up in this way and it performs well. We
don't have the metadata on NVMe at this stage but I would think that
that would improve performance further.
For sizing your db volumes there are some messages in the mailing list
about rocksdb and the 3GB, 30GB, 300GB limits, so size your volumes
accordingly to make sure you're not wasting space.
However if you are adding these new nodes to your existing cluster
with DB on the data disks then you likely won't see much improvement
as you'll be limited by the slowest osd.
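
As a rough sketch, creating the OSDs with the DB carved out of the NVMe
could look something like this (the device names and DB size are only
illustrative; size the DB with the rocksdb levels in mind):

  ceph-volume lvm batch --bluestore /dev/sd[a-l] \
      --db-devices /dev/nvme0n1 --block-db-size 64G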

Rich

On Tue, 16 Feb 2021 at 22:27, Magnus HAGDORN  wrote:
>
> Hi there,
> we are in the process of growing our Nautilus ceph cluster. Currently,
> we have 6 nodes, 3 nodes with 2×5.5TB, 6x11TB disks and 8x186GB SSD and
> 3 nodes with 6×5.5TB and 6×7.5TB disks. All with dual link 10GE NICs.
> The SSDs are used for the CephFS metadata pool, the hard drives are
> used for the CephFS data pool. All OSD journals are kept on the drives
> themselves. Replication level is 3 for both data and metadata pools.
>
> The new servers have 12x12TB disks and 1 1.5TB NVMe drive. We expect to
> get another 3 similar nodes in the near future.
>
> My question is what is the most sensible thing to do with the NVMe
> drives. I would like to increase the replication level of the metadata
> pool. So my idea was to split the NVMes into say 4 partitions and add
> them to the metadata pool.
>
> Given the size of the drives and the metadata pool usage (~35GB) that
> seems overkill. Would it make sense to partition the drives further and
> stick the OSD journals on the NVMEs?
>
> Regards
> magnus
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent with empty inconsistent objects

2021-01-27 Thread Richard Bade
Thanks Dan and Anthony, your suggestions have pointed me in the right
direction. Looking back through the logs at when the first error was
detected I found this:

ceph-osd: 2021-01-24 01:04:55.905 7f0c17821700 -1 log_channel(cluster)
log [ERR] : 17.7ffs0 scrub : stat mismatch, got 112867/112868 objects,
0/0 clones, 112867/112868 dirty, 0/0 omap, 0/0 pinned, 0/0
hit_set_archive, 0/0 whiteouts, 473372381184/473376575488 bytes, 0/0
manifest objects, 0/0 hit_set_archive bytes.

As Anthony suggested the error is not in the rados objects but
actually in the stats.
I assume that a repair will fix this up?

Thanks again everyone.
Rich

On Thu, 28 Jan 2021 at 03:59, Dan van der Ster  wrote:
>
> Usually the ceph.log prints the reason for the inconsistency when it
> is first detected by scrubbing.
>
> -- dan
>
> On Wed, Jan 27, 2021 at 12:41 AM Richard Bade  wrote:
> >
> > Hi Everyone,
> > I also have seen this inconsistent with empty when you do 
> > list-inconsistent-obj
> >
> > $ sudo ceph health detail
> > HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent; 1
> > pgs not deep-scrubbed in time
> > OSD_SCRUB_ERRORS 1 scrub errors
> > PG_DAMAGED Possible data damage: 1 pg inconsistent
> > pg 17.7ff is active+clean+inconsistent, acting [232,242,34,280,266,21]
> > PG_NOT_DEEP_SCRUBBED 1 pgs not deep-scrubbed in time
> > pg 17.1c2 not deep-scrubbed since 2021-01-15 02:46:16.271811
> >
> > $ sudo rados list-inconsistent-obj 17.7ff --format=json-pretty
> > {
> > "epoch": 183807,
> > "inconsistents": []
> > }
> >
> > Usually these are caused by read errors on the disks, but I've checked
> > all osd hosts that are part of this osd and there's no smart or dmesg
> > errors.
> >
> > Rich
> >
> > --
> > >
> > > Date: Sun, 17 Jan 2021 14:00:01 +0330
> > > From: Seena Fallah 
> > > Subject: [ceph-users] Re: PG inconsistent with empty inconsistent
> > > objects
> > > To: "Alexander E. Patrakov" 
> > > Cc: ceph-users 
> > > Message-ID:
> > > 
> > > 
> > > Content-Type: text/plain; charset="UTF-8"
> > >
> > > It's for a long time ago and I don't have the `ceph health detail` output!
> > >
> > > On Sat, Jan 16, 2021 at 9:42 PM Alexander E. Patrakov 
> > > wrote:
> > >
> > > > For a start, please post the "ceph health detail" output.
> > > >
> > > > сб, 19 дек. 2020 г. в 23:48, Seena Fallah :
> > > > >
> > > > > Hi,
> > > > >
> > > > > I'm facing something strange! One of the PGs in my pool got 
> > > > > inconsistent
> > > > > and when I run `rados list-inconsistent-obj $PG_ID 
> > > > > --format=json-pretty`
> > > > > the `inconsistents` key was empty! What is this? Is it a bug in Ceph
> > > > or..?
> > > > >
> > > > > Thanks.
> > > > > ___
> > > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > > >
> > > >
> > > >
> > > > --
> > > > Alexander E. Patrakov
> > > > CV: http://u.pc.cd/wT8otalK
> > > >
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent with empty inconsistent objects

2021-01-26 Thread Richard Bade
Thanks Joe for your reply.
Yes I realise I can scrub the one that's behind, that's not my issue this
time. I'm interested in the inconsistent pg.
Usually the list inconsistent obj command gives which copy is wrong and
what the issue is. In this case it reports nothing.
I don't really want to blindly repair as in the past the repair has copied
the primary over the other copies (this may have been fixed).
Usually in this situation I match up the read errors with the failing disk,
set the disk out and run a deep scrub and all is well again.

Ceph v14.2.13 by the way.

Rich

On Wed, 27 Jan 2021, 12:57 Joe Comeau,  wrote:

> just issue the commands
>
> ceph pg deep-scrub 17.1c2
> this will deep scrub this pg
>
> ceph pg repair 17.7ff
> repairs the pg
>
>
>
>
>
> >>> Richard Bade  1/26/2021 3:40 PM >>>
> Hi Everyone,
> I also have seen this inconsistent with empty when you do
> list-inconsistent-obj
>
> $ sudo ceph health detail
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent; 1
> pgs not deep-scrubbed in time
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 17.7ff is active+clean+inconsistent, acting [232,242,34,280,266,21]
> PG_NOT_DEEP_SCRUBBED 1 pgs not deep-scrubbed in time
> pg 17.1c2 not deep-scrubbed since 2021-01-15 02:46:16.271811
>
> $ sudo rados list-inconsistent-obj 17.7ff --format=json-pretty
> {
> "epoch": 183807,
> "inconsistents": []
> }
>
> Usually these are caused by read errors on the disks, but I've checked
> all osd hosts that are part of this osd and there's no smart or dmesg
> errors.
>
> Rich
>
> --
> >
> > Date: Sun, 17 Jan 2021 14:00:01 +0330
> > From: Seena Fallah 
> > Subject: [ceph-users] Re: PG inconsistent with empty inconsistent
> > objects
> > To: "Alexander E. Patrakov" 
> > Cc: ceph-users 
> > Message-ID:
> > <
> cak3+omxvdc_x2r-kox-ui4k3osdvxh4o8zeqybztbumqmye...@mail.gmail.com>
> > Content-Type: text/plain; charset="UTF-8"
> >
> > It's for a long time ago and I don't have the `ceph health detail`
> output!
> >
> > On Sat, Jan 16, 2021 at 9:42 PM Alexander E. Patrakov <
> patra...@gmail.com>
> > wrote:
> >
> > > For a start, please post the "ceph health detail" output.
> > >
> > > сб, 19 дек. 2020 г. в 23:48, Seena Fallah :
> > > >
> > > > Hi,
> > > >
> > > > I'm facing something strange! One of the PGs in my pool got
> inconsistent
> > > > and when I run `rados list-inconsistent-obj $PG_ID
> --format=json-pretty`
> > > > the `inconsistents` key was empty! What is this? Is it a bug in Ceph
> > > or..?
> > > >
> > > > Thanks.
> > > > ___
> > > > ceph-users mailing list -- ceph-users@ceph.io
> > > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > >
> > >
> > > --
> > > Alexander E. Patrakov
> > > CV: http://u.pc.cd/wT8otalK
> > >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent with empty inconsistent objects

2021-01-26 Thread Richard Bade
Hi Everyone,
I also have seen this inconsistent with empty when you do list-inconsistent-obj

$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent; 1
pgs not deep-scrubbed in time
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 17.7ff is active+clean+inconsistent, acting [232,242,34,280,266,21]
PG_NOT_DEEP_SCRUBBED 1 pgs not deep-scrubbed in time
pg 17.1c2 not deep-scrubbed since 2021-01-15 02:46:16.271811

$ sudo rados list-inconsistent-obj 17.7ff --format=json-pretty
{
"epoch": 183807,
"inconsistents": []
}

Usually these are caused by read errors on the disks, but I've checked
all osd hosts that are part of this osd and there's no smart or dmesg
errors.

Rich

--
>
> Date: Sun, 17 Jan 2021 14:00:01 +0330
> From: Seena Fallah 
> Subject: [ceph-users] Re: PG inconsistent with empty inconsistent
> objects
> To: "Alexander E. Patrakov" 
> Cc: ceph-users 
> Message-ID:
> 
> Content-Type: text/plain; charset="UTF-8"
>
> It's for a long time ago and I don't have the `ceph health detail` output!
>
> On Sat, Jan 16, 2021 at 9:42 PM Alexander E. Patrakov 
> wrote:
>
> > For a start, please post the "ceph health detail" output.
> >
> > сб, 19 дек. 2020 г. в 23:48, Seena Fallah :
> > >
> > > Hi,
> > >
> > > I'm facing something strange! One of the PGs in my pool got inconsistent
> > > and when I run `rados list-inconsistent-obj $PG_ID --format=json-pretty`
> > > the `inconsistents` key was empty! What is this? Is it a bug in Ceph
> > or..?
> > >
> > > Thanks.
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
> > --
> > Alexander E. Patrakov
> > CV: http://u.pc.cd/wT8otalK
> >
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Small HDD cluster, switch from Bluestore to Filestore

2019-08-13 Thread Richard Bade
Hi Everyone,
There's been a few threads around about small HDD (spinning disk)
clusters and performance on Bluestore.
One recently from Christian
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036385.html)
was particularly interesting to us as we have a very similar setup to
what Christian has and we see similar performance.

We have a 6 node cluster each with 12x 4TB SATA HDD, IT mode LSI 3008, wal/db on
33GB NVMe partitions. Each node has a single Xeon Gold 6132 CPU @
2.60GHz and dual 10GB network.
We also use bcache with 1 180GB NVMe partition shared between 6 osd's.
Workload is via KVM (Proxmox)

I did the same benchmark fio tests as Christian. Here's my results (M
for me, C for Christian)
direct=0

M -- read : io=6008.0MB, bw=203264KB/s, iops=49, runt= 30267msec
C -- read: IOPS=40, BW=163MiB/s (171MB/s)(7556MiB/46320msec)

direct=1

M -- read : io=32768MB, bw=1991.4MB/s, iops=497, runt= 16455ms
C -- read: IOPS=314, BW=1257MiB/s (1318MB/s)(32.0GiB/26063msec)

direct=0

M -- write: io=32768MB, bw=471105KB/s, iops=115, runt= 71225msec
C -- write: IOPS=119, BW=479MiB/s (503MB/s)(32.0GiB/68348msec

direct=1

M -- write: io=32768MB, bw=479829KB/s, iops=117, runt= 69930msec
C -- write: IOPS=139, BW=560MiB/s (587MB/s)(32.0GiB/58519msec)

I should probably mention that there was some active workload on the
cluster at that time also, around 500iops write and 100MB/s
throughput.
The main problem that we're having with this cluster is how easy it is
for it to hit slow requests and we have one particular vm that ends up
doing scsi resets because of the latency.

So we're considering switching these osd's to filestore.
We have two other clusters using filestore/bcache/ssd journal and the
performance seems to be much better on those - taking into account the
different sizes.
What are peoples thoughts on this size cluster? Is it just not a good
fit with bluestore and our type of workload?
Also, does anyone have any knowledge on future support for filestore?
I'm concerned that we may have to migrate our other clusters off
filestore sometime in the future and that'll hurt us with the current
performance.

Rich
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io