[ceph-users] (belated) CLT notes

2024-08-08 Thread Gregory Farnum
Hi folks, the CLT met on Monday August 5. We discussed a few topics:

* The mailing lists have become difficult to moderate due to a huge
increase in spam. We have two problems: 1) the moderation system's web
front-end apparently isn't operational. That's getting fixed. 2) The
moderation is a big load regardless. Casey asked if we want to move to
the Linux Foundation-managed list system; Josh will take that to the
board since it involves some cost.

* Squid Release. 19.1.1 testing is making good progress.
* (Likely) Final quincy release 17.2.8 is next up in the queue. There
isn't a lot targeted for it right now, so please apply the explicit
17.2.8 milestone or set needs-qa on quincy-labeled PRs if you want
Yuri to help move them forward.

* The big topic is CentOS 8 Stream, in two ways:
1) We dropped building packages for it in the middle of the Reef
stable release stream because our build system can't handle the way
they closed the release. (Packages are still available at archival
URLs, but we aren't set up to use those and probably can't access them
the same way we normally do.) This has understandably upset some
people. Can we fix it?
2) We had many developers spend a lot of time unexpectedly updating
our QA systems when CentOS 8 Stream disappeared, and it's still an ongoing
problem for things like upgrade testing of backports. Is CentOS Stream
a reliable base for our container images, package building, and test
infrastructure? Should we move elsewhere?
We need more information and a wider discussion, so this was also a
topic at yesterday's CDM. We will continue to discuss and coordinate
with other interest groups and keep you informed.

-Greg
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph tracker broken?

2024-07-01 Thread Gregory Farnum
You currently have "Email notifications" set to "For any event on all
my projects". I believe that's the firehose setting, so I've gone
ahead and changed it to "Only for things I watch or I'm involved in".
I'm unaware of any reason that would have been changed on the back
end, though there were some upgrades recently. It's also possible you
got assigned to a new group or somehow joined some of the projects
(I'm not well-versed in all the terminology there).
-Greg

On Sun, Jun 30, 2024 at 10:35 PM Frank Schilder  wrote:
>
> Hi all, hopefully someone on this list can help me out. I recently started to 
> receive unsolicited e-mail from the ceph tracker and also certain merge/pull 
> requests. The latest one is:
>
> [CephFS - Bug #66763] (New) qa: revert commit to unblock snap-schedule 
> testing
>
> I have nothing to do with that and I have not subscribed to this tracker item 
> (https://tracker.ceph.com/issues/66763) either. Yet, I receive unsolicited 
> updates.
>
> Could someone please take a look and try to find out what the problem is?
>
> Thanks a lot!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to setup NVMeoF?

2024-05-30 Thread Gregory Farnum
There's a major NVMe effort underway but it's not even merged to
master yet, so I'm not sure how docs would have ended up in the Reef
doc tree. :/ Zac, any idea? Can we pull this out?
-Greg
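
(For anyone hitting the same port mismatch discussed below, a rough sanity
check of what cephadm actually deployed might look like this -- just a
sketch, and the port numbers are the ones from this thread:)

    ceph config get mgr mgr/cephadm/container_image_nvmeof   # which gateway image cephadm will use
    ceph orch ps | grep nvmeof                                # state of the deployed gateway daemon(s)
    ss -tlnp | grep -E '5500|5499'                            # on the gateway host: which port is really listening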


On Thu, May 30, 2024 at 7:03 AM Robert Sander
 wrote:
>
> Hi,
>
> On 5/30/24 14:18, Frédéric Nass wrote:
>
> > ceph config set mgr mgr/cephadm/container_image_nvmeof 
> > "quay.io/ceph/nvmeof:1.2.13"
>
> Thanks for the hint. With that the orchestrator deploys the current container 
> image.
>
> But: It suddenly listens on port 5499 instead of 5500 and:
>
> # podman run -it quay.io/ceph/nvmeof-cli:latest --server-address 10.128.8.29 
> --server-port 5500 subsystem add --subsystem nqn.2016-06.io.spdk:cephtest29
> Failure adding subsystem nqn.2016-06.io.spdk:cephtest29:
> <_InactiveRpcError of RPC that terminated with:
> status = StatusCode.UNAVAILABLE
> details = "failed to connect to all addresses; last error: UNKNOWN: 
> ipv4:10.128.8.29:5500: Failed to connect to remote host: Connection refused"
> debug_error_string = "UNKNOWN:failed to connect to all addresses; 
> last error: UNKNOWN: ipv4:10.128.8.29:5500: Failed to connect to remote host: 
> Connection refused {grpc_status:14, 
> created_time:"2024-05-30T13:59:33.24226686+00:00"}"
>
> # podman run -it quay.io/ceph/nvmeof-cli:latest --server-address 10.128.8.29 
> --server-port 5499 subsystem add --subsystem nqn.2016-06.io.spdk:cephtest29
> Failure adding subsystem nqn.2016-06.io.spdk:cephtest29:
> <_InactiveRpcError of RPC that terminated with:
> status = StatusCode.UNIMPLEMENTED
> details = "Method not found!"
> debug_error_string = "UNKNOWN:Error received from peer 
> ipv4:10.128.8.29:5499 {created_time:"2024-05-30T13:59:49.678809906+00:00", 
> grpc_status:12, grpc_message:"Method not found!"}"
>
>
> Is this not production ready?
> Why is it in the documentation for a released Ceph version?
>
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> https://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs-data-scan orphan objects while mds active?

2024-05-16 Thread Gregory Farnum
It's unfortunately more complicated than that. I don't think that
forward scrub tag gets persisted to the raw objects; it's just a
notation for you. And even if it was, it would only be on the first
object in every file — larger files would have many more objects
forward scrub doesn't touch.

This isn't a case anybody has really built tooling for. Your best bet
is probably to live with the data leakage, or else find a time to turn
it off and run the data-scan tools.
-Greg
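
(For reference, the offline pass mentioned above has roughly the shape below.
This is purely a sketch -- check the CephFS disaster-recovery docs for your
release before running anything, never run it against a mounted, live
filesystem, and the fs/pool names are placeholders:)

    ceph fs set <fs_name> down true           # cleanly take the filesystem offline first
    cephfs-data-scan scan_extents <data_pool>
    cephfs-data-scan scan_inodes <data_pool>
    cephfs-data-scan scan_links
    ceph fs set <fs_name> down false          # bring it back up once the scans complete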

On Tue, May 14, 2024 at 10:26 AM Olli Rajala  wrote:
>
> Tnx Gregory,
>
> Doesn't sound too safe then.
>
> Only reason to discover these orphans via scanning would be to delete the
> files again and I know all these files were at least one year old... so, I
> wonder if I could somehow do something like:
> 1) do forward scrub with a custom tag
> 2) iterate over all the objects in the pool and delete all objects without
> the tag and older than one year
>
> Is there any tooling to do such an operation? Any risks or flawed logic
> there?
>
> ...or any other ways to discover and get rid of these objects?
>
> Cheers!
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
>
>
> On Tue, May 14, 2024 at 9:41 AM Gregory Farnum  wrote:
>
> > The cephfs-data-scan tools are built with the expectation that they'll
> > be run offline. Some portion of them could be run without damaging the
> > live filesystem (NOT all, and I'd have to dig in to check which is
> > which), but they will detect inconsistencies that don't really exist
> > (due to updates that are committed to the journal but not fully
> > flushed out to backing objects) and so I don't think it would do any
> > good.
> > -Greg
> >
> > On Mon, May 13, 2024 at 4:33 AM Olli Rajala  wrote:
> > >
> > > Hi,
> > >
> > > I suspect that I have some orphan objects on a data pool after quite
> > > haphazardly evicting and removing a cache pool after deleting 17TB of
> > files
> > > from cephfs. I have forward scrubbed the mds and the filesystem is in
> > clean
> > > state.
> > >
> > > This is a production system and I'm curious if it would be safe to
> > > run cephfs-data-scan scan_extents and scan_inodes while the fs is online?
> > > Does it help if I give a custom tag while forward scrubbing and then
> > > use --filter-tag on the backward scans?
> > >
> > > ...or is there some other way to check and cleanup orphans?
> > >
> > > tnx,
> > > ---
> > > Olli Rajala - Lead TD
> > > Anima Vitae Ltd.
> > > www.anima.fi
> > > ---
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs-data-scan orphan objects while mds active?

2024-05-13 Thread Gregory Farnum
The cephfs-data-scan tools are built with the expectation that they'll
be run offline. Some portion of them could be run without damaging the
live filesystem (NOT all, and I'd have to dig in to check which is
which), but they will detect inconsistencies that don't really exist
(due to updates that are committed to the journal but not fully
flushed out to backing objects) and so I don't think it would do any
good.
-Greg

On Mon, May 13, 2024 at 4:33 AM Olli Rajala  wrote:
>
> Hi,
>
> I suspect that I have some orphan objects on a data pool after quite
> haphazardly evicting and removing a cache pool after deleting 17TB of files
> from cephfs. I have forward scrubbed the mds and the filesystem is in clean
> state.
>
> This is a production system and I'm curious if it would be safe to
> run cephfs-data-scan scan_extents and scan_inodes while the fs is online?
> Does it help if I give a custom tag while forward scrubbing and then
> use --filter-tag on the backward scans?
>
> ...or is there some other way to check and cleanup orphans?
>
> tnx,
> ---
> Olli Rajala - Lead TD
> Anima Vitae Ltd.
> www.anima.fi
> ---
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: question about rbd_read_from_replica_policy

2024-04-04 Thread Gregory Farnum
On Thu, Apr 4, 2024 at 8:23 AM Anthony D'Atri  wrote:
>
> Network RTT?

No, it's sadly not that clever. There's a crush_location configurable
that you can set on clients (to a host, or a datacenter, or any other
CRUSH bucket), and as long as part of it matches the CRUSH map then it
will feed IOs to OSDs within that CRUSH domain.
-Greg
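
(For illustration, the client-side configuration described above might look
like the snippet below. This is a sketch: the bucket names have to match ones
that actually exist in your CRUSH map, and the exact delimiter/format accepted
by crush_location should be checked against the docs for your release:)

    [client]
        # key=value pairs naming CRUSH buckets this client is "in"
        crush_location = datacenter=dc1|host=client-host01
        rbd_read_from_replica_policy = localize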

>
> > On Apr 4, 2024, at 03:44, Noah Elias Feldt  wrote:
> >
> > Hello,
> > I have a question about a setting for RBD.
> > How exactly does "rbd_read_from_replica_policy" with the value "localize" 
> > work?
> > According to the RBD documentation, read operations will be sent to the 
> > closest OSD as determined by the CRUSH map. How does the client know 
> > exactly which OSD I am closest to?
> > The client is not in the CRUSH map. I can't find much more information 
> > about it. How does this work?
> >
> > Thanks
> >
> >
> > noah feldt
> > infrastrutur
> > _
> >
> >
> > Mittwald CM Service GmbH & Co. KG
> > Königsberger Straße 4-6
> > 32339 Espelkamp
> >
> > Tel.: 05772 / 293-100
> > Fax: 05772 / 293-333
> >
> > supp...@mittwald.de 
> > https://www.mittwald.de 
> >
> > Geschäftsführer: Robert Meyer, Florian Jürgens
> >
> > USt-IdNr.: DE814773217, HRA 6640, AG Bad Oeynhausen
> > Komplementärin: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen
> >
> > Information about data processing in the course of our business
> > activities pursuant to Art. 13-14 GDPR is available at
> > www.mittwald.de/ds .
> >
> >
> >
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Are we logging IRC channels?

2024-03-22 Thread Gregory Farnum
I put it on the list for the next CLT. :) (though I imagine it will move to
the infrastructure meeting from there.)

On Fri, Mar 22, 2024 at 4:42 PM Mark Nelson  wrote:

> Sure!  I think Wido just did it all unofficially, but afaik we've lost
> all of those records now.  I don't know if Wido still reads the mailing
> list but he might be able to chime in.  There was a ton of knowledge in
> the irc channel back in the day.  With slack, it feels like a lot of
> discussions have migrated into different channels, though #ceph still
> gets some community traffic (and a lot of hardware design discussion).
>
> Mark
>
> On 3/22/24 02:15, Alvaro Soto wrote:
> > Should we bring to life this again?
> >
> > On Tue, Mar 19, 2024, 8:14 PM Mark Nelson  > > wrote:
> >
> > A long time ago Wido used to have a bot logging IRC afaik, but I
> think
> > that's been gone for some time.
> >
> >
> > Mark
> >
> >
> > On 3/19/24 19:36, Alvaro Soto wrote:
> >  > Hi Community!!!
> >  > Are we logging IRC channels? I ask this because a lot of people
> > only use
> >  > Slack, and the Slack we use doesn't have a subscription, so
> > messages are
> >  > lost after 90 days (I believe)
> >  >
> >  > I believe it's important to keep track of the technical knowledge
> > we see
> >  > each day over IRC+Slack
> >  > Cheers!
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST errors and slow osd_ops despite hardware being fine

2024-03-15 Thread Gregory Farnum
On Fri, Mar 15, 2024 at 6:15 AM Ivan Clayson  wrote:

> Hello everyone,
>
> We've been experiencing on our quincy CephFS clusters (one 17.2.6 and
> another 17.2.7) repeated slow ops with our client kernel mounts
> (Ceph 17.2.7 and version 4 Linux kernels on all clients) that seem to
> originate from slow ops on osds despite the underlying hardware being
> fine. Our 2 clusters are similar and are both Alma8 systems where more
> specifically:
>
>   * Cluster (1) is Alma8.8 running Ceph version 17.2.7 with 7 NVMe SSD
> OSDs storing the metadata and 432 spinning SATA disks storing the
> bulk data in an EC pool (8 data shards and 2 parity blocks) across
> 40 nodes. The whole cluster is used to support a single file system
> with 1 active MDS and 2 standby ones.
>   * Cluster (2) is Alma8.7 running Ceph version 17.2.6 with 4 NVMe SSD
> OSDs storing the metadata and 348 spinning SAS disks storing the
> bulk data in EC pools  (8 data shards and 2 parity blocks). This
> cluster houses multiple filesystems each with their own dedicated
> MDS along with 3 communal standby ones.
>
> Nearly daily we often find that we're get the following error messages:
> MDS_CLIENT_LATE_RELEASE, MDS_SLOW_METADATA_IO, and MDS_SLOW_REQUEST.
> Along with these messages, certain files and directory cannot be stat-ed
> and any processes involving these files hang indefinitely. We have been
> fixing this by:
>
> 1. First, finding the oldest blocked MDS op and the inode listed there:
>
> ~$ ceph tell mds.${my_mds} dump_blocked_ops 2>> /dev/null | grep
> -c description
>
> "description": "client_request(client.251247219:662 getattr
> AsLsXsFs #0x100922d1102 2024-03-13T12:51:57.988115+
> caller_uid=26983, caller_gid=26983)",
>
> # inode/ object of interest: 100922d1102
>
> 2. Second, finding all the current clients that have a cap for this
> blocked inode from the faulty MDS' session list (i.e. ceph tell
> mds.${my_mds} session ls --cap-dump) and then examining the client
> whose had the cap the longest:
>
> ~$ ceph tell mds.${my_mds} session ls --cap-dump ...
>
> 2024-03-13T13:01:36: client.251247219
>
> 2024-03-13T12:50:28: client.245466949
>
> 3. Then on the aforementioned oldest client, get the current ops in
> flight to the OSDs (via the "/sys/kernel/debug/ceph/*/osdc" files)
> and get the op corresponding to the blocked inode along with the OSD
> the I/O is going to:
>
> root@client245466949 $ grep 100922d1102
> /sys/kernel/debug/ceph/*/osdc
>
> 48366  osd79 2.249f8a51  2.a51s0
> [79,351,232,179,107,195,323,14,128,167]/79
> [79,351,232,179,107,195,323,14,128,167]/79  e374191
> 100922d1102.00f5  0x400024  1 write
>
> # osd causing errors is osd.79
>
> 4. Finally, we restart this "hanging" OSD where this results in ls
> and I/O on the previously "stuck" files no longer "hanging" .
>
> Once we get this OSD for which the blocked inode is waiting for, we can
> see in the system logs that the OSD has slow ops:
>
> ~$ systemctl --no-pager --full status ceph-osd@79
>
> ...
> 2024-03-13T12:49:37 -1 osd.79 374175 get_health_metrics reporting 3
> slow ops, oldest is osd_op(client.245466949.0:41350 2.ca4s0
> 2.ce648ca4 (undecoded) ondisk+write+known_if_redirected e374173)
> ...


Have you dumped_ops_in_flight on the OSD in question to see how far that op
got before getting stuck?

This is some kind of RADOS problem, which isn’t great, but I wonder if
we’ve exceeded some snapshot threshold that is showing up on hard drives as
slow ops, or if there’s a code bug that is just causing them to get lost. :/
-Greg
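
(For reference, the op dump mentioned above can be pulled over the admin
socket on the OSD host or via ceph tell -- a sketch, using the OSD id from
this thread:)

    ceph daemon osd.79 dump_ops_in_flight      # on the host running osd.79
    ceph tell osd.79 dump_ops_in_flight        # equivalent, from any admin node
    ceph tell osd.79 dump_historic_slow_ops    # recently completed slow ops, if your release supports it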



>
> Files that these "hanging" inodes correspond to as well as the
> directories housing these files can't be opened or stat-ed (causing
> directories to hang) where we've found restarting this OSD with slow ops
> to be the least disruptive way of resolving this (compared with a forced
> umount and then re-mount on the client). There are no issues with the
> underlying hardware for either the osd reporting these slow ops or any
> other drive within the acting PG and there seems to be no correlation
> between what processes are involved or what type of files these are.
>
> What could be causing these slow ops and certain files and directories
> to "hang"? There aren't workflows being performed that generate a large
> number of small files nor are there directories with a large number of
> files within them. This seems to happen with a wide range of hard-drives
> and we see this on SATA and SAS type drives where our nodes are
> interconnected with 25 Gb/s NICs so we can't see how the underlying
> hardware would be causing any I/O bottlenecks. Has anyone else seen this
> type of behaviour before and have any ideas? Is there a way to stop
> these from happening as we are havin

[ceph-users] Re: Telemetry endpoint down?

2024-03-11 Thread Gregory Farnum
We had a lab outage Thursday and it looks like this service wasn’t
restarted after that occurred. Fixed now and we’ll look at how to prevent
that in future.
-Greg

On Mon, Mar 11, 2024 at 6:46 AM Konstantin Shalygin  wrote:

> Hi, seems telemetry endpoint is down for a some days? We have connection
> errors from multiple places
>
>
> 1:ERROR Mar 10 00:46:10.653 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 2:ERROR Mar 10 01:48:20.061 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 3:ERROR Mar 10 02:50:29.473 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 4:ERROR Mar 10 03:52:38.877 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 5:ERROR Mar 10 04:54:48.285 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 6:ERROR Mar 10 05:56:57.693 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 7:ERROR Mar 10 06:59:07.105 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 8:ERROR Mar 10 08:01:16.509 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443
> 9:ERROR Mar 10 09:03:25.917 [564383]: opensock: Could not establish a
> connection to telemetry.ceph.com:443 
>
>
> Thanks,
> k
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-08 Thread Gregory Farnum
Much of our infrastructure (including website) was down for ~6 hours
yesterday. Some information on the sepia list, and more in the
slack/irc channel.
-Greg

On Fri, Mar 8, 2024 at 9:48 AM Zac Dover  wrote:
>
> I ping www.ceph.io and ceph.io with no difficulty:
>
>
> zdover@NUC8i7BEH:~$ ping www.ceph.io
> PING www.ceph.io (8.43.84.140) 56(84) bytes of data.
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=1 ttl=51 time=241 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=2 ttl=51 time=256 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=3 ttl=51 time=225 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=4 ttl=51 time=249 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=5 ttl=51 time=226 ms
>
>
> and
>
> zdover@NUC8i7BEH:~$ ping ceph.io
> PING ceph.io (8.43.84.140) 56(84) bytes of data.
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=1 ttl=51 time=237 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=2 ttl=51 time=229 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=3 ttl=51 time=250 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=4 ttl=51 time=258 ms
> 64 bytes from beta.ceph.io (8.43.84.140): icmp_seq=5 ttl=51 time=246 ms
>
>
> Zac
>
>
> On Friday, March 8th, 2024 at 3:23 AM, Marc m...@f1-outsourcing.eu wrote:
>
> What is this irc access then? Is there some webclient that can be used? Is 
> this ceph.io down? Can't get a website nor a ping.
>
> The slack workspace is bridged to our also-published irc channels. I
> don't think we've done anything to enable xmpp (and two protocols is
> enough work to keep alive!).
> -Greg
>
> On Wed, Mar 6, 2024 at 9:07 AM Marc m...@f1-outsourcing.eu wrote:
>
> Is it possible to access this also with xmpp?
>
> At the very bottom of this page is a link
> https://ceph.io/en/community/connect/
>
> Respectfully,
>
> Wes Dillingham
> w...@wesdillingham.com
> LinkedIn http://www.linkedin.com/in/wesleydillingham
>
> On Wed, Mar 6, 2024 at 11:45 AM Matthew Vernon mver...@wikimedia.org
> wrote:
>
> Hi,
>
> How does one get an invite to the ceph-storage slack, please?
>
> Thanks,
>
> Matthew
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Minimum amount of nodes needed for stretch mode?

2024-03-07 Thread Gregory Farnum
On Thu, Mar 7, 2024 at 9:09 AM Stefan Kooman  wrote:
>
> Hi,
>
> TL;DR
>
> Failure domain considered is data center. Cluster in stretch mode [1].
>
> - What is the minimum amount of monitor nodes (apart from tie breaker)
> needed per failure domain?

You need at least two monitors per site. This is because in stretch
mode, OSDs can *only* connect to monitors in their data center. You
don't want a restarting monitor to lead to the entire cluster
switching to degraded stretch mode.

>
> - What is the minimum amount of storage nodes needed per failure domain?

I don't think there's anything specific here beyond supporting 2
replicas per site (so, at least two OSD hosts unless you configure
things weirdly).

> - Are device classes supported with stretch mode?

Stretch mode doesn't have any specific logic around device classes, so
if you provide CRUSH rules that work with them it should be okay? But
definitely not tested.

> - is min_size = 1 in "degraded stretch mode" a hard coded requirement or
> can this be changed to it at leave min_size = 2 (yes, I'm aware that no
> other OSD may go down in the surviving data center or PGs will become
> unavailable).

Hmm, I'm pretty sure this is hard coded but haven't checked in a
while. Hopefully somebody else can chime in here.

> I've converted a (test) 3 node replicated cluster (2 storage nodes, 1
> node with monitor only, min_size=2, size=4) setup to a "stretch mode"
> setup [1]. That works as expected.
>
> CRUSH rule (adjusted to work with 1 host and 2 OSDs per device class per
> data center only)
>
> rule stretch_rule {
> id 5
> type replicated
> step take dc1
> step choose firstn 0 type host
> step chooseleaf firstn 2 type osd
> step emit
> step take dc2
> step choose firstn 0 type host
> step chooseleaf firstn 2 type osd
> step emit
> }
>
> The documentation seems to suggest that 2 storage nodes and 2 monitor
> nodes are needed at a minimum. Is that correct? I wonder why? For a
> minimal (as possible) cluster I don't see the need for one additional
> monitor per datacenter. Does the tiebreaker monitor function as a normal
> monitor (apart from it not allowed to become leader)?
>
> When stretch rules with device classes are used things don't work as
> expected anymore. Example crush rule:
>
>
> rule stretch_rule_ssd {
> id 4
> type replicated
> step take dc1 class ssd
> step choose firstn 0 type host
> step chooseleaf firstn 2 type osd
> step emit
> step take dc2 class ssd
> step choose firstn 0 type host
> step chooseleaf firstn 2 type osd
> step emit
> }
>
> A similar crush rule for hdd exists. When I change the crush_rule for
> one of the pools to use stretch_rule_ssd the PGs on OSDs with device
> class ssd become inactive as soon as one of the data centers goes
> offline (and "degraded stretched mode" has been activated, and only 1
> bucket, data center, is needed for peering). I don't understand why.
> Another issue with this is that as soon as the datacenter is online
> again, the recovery will never finish by itself and a "ceph osd
> force_healthy_stretch_mode --yes-i-really-mean-it" is needed to get
> HEALTH_OK

Hmm. My first guess is there's something going on with the shadow
CRUSH tree from device classes, but somebody would need to dig into
that. I think you should file a ticket.
-Greg
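
(Before filing that ticket, a quick look at the shadow trees mentioned above
might help narrow it down -- a sketch, using the rule name from the example
above and a placeholder pool name:)

    ceph osd crush tree --show-shadow          # shows the per-device-class shadow buckets (e.g. dc1~ssd)
    ceph osd crush rule dump stretch_rule_ssd  # check which (shadow) buckets the rule actually takes
    ceph osd pool get <pool> crush_rule        # confirm which rule the pool is using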

>
> Can anyone explain to me why this is?
>
> Gr. Stefan
>
> [1]: https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-07 Thread Gregory Farnum
The slack workspace is bridged to our also-published irc channels. I
don't think we've done anything to enable xmpp (and two protocols is
enough work to keep alive!).
-Greg

On Wed, Mar 6, 2024 at 9:07 AM Marc  wrote:
>
> Is it possible to access this also with xmpp?
>
> >
> > At the very bottom of this page is a link
> > https://ceph.io/en/community/connect/
> >
> > Respectfully,
> >
> > *Wes Dillingham*
> > w...@wesdillingham.com
> > LinkedIn 
> >
> >
> > On Wed, Mar 6, 2024 at 11:45 AM Matthew Vernon 
> > wrote:
> >
> > > Hi,
> > >
> > > How does one get an invite to the ceph-storage slack, please?
> > >
> > > Thanks,
> > >
> > > Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Gregory Farnum
On Wed, Mar 6, 2024 at 8:56 AM Matthew Vernon  wrote:
>
> Hi,
>
> On 06/03/2024 16:49, Gregory Farnum wrote:
> > Has the link on the website broken? https://ceph.com/en/community/connect/
> > We've had trouble keeping it alive in the past (getting a non-expiring
> > invite), but I thought that was finally sorted out.
>
> Ah, yes, that works. Sorry, I'd gone to
> https://docs.ceph.com/en/latest/start/get-involved/
>
> which lacks the registration link.

Whoops! Can you update that, docs master Zac? :)
-Greg
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-storage slack access

2024-03-06 Thread Gregory Farnum
Has the link on the website broken? https://ceph.com/en/community/connect/
We've had trouble keeping it alive in the past (getting a non-expiring
invite), but I thought that was finally sorted out.
-Greg

On Wed, Mar 6, 2024 at 8:46 AM Matthew Vernon  wrote:
>
> Hi,
>
> How does one get an invite to the ceph-storage slack, please?
>
> Thanks,
>
> Matthew
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs inode backtrace information

2024-01-31 Thread Gregory Farnum
The docs recommend a fast SSD pool for the CephFS *metadata*, but the
default data pool can be more flexible. The backtraces are relatively
small — it's an encoded version of the path an inode is located at,
plus the RADOS hobject, which is probably more of the space usage. So
it should fit fine in your SSD pool, but if all the cephfs file data
is living in the hard drive pool I'd just set it up there.
-Greg
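
(For illustration, the layout described above -- a small replicated default
data pool for backtraces plus an EC pool for the bulk data -- might be set up
roughly like this; a sketch, with example pool/fs names:)

    ceph fs new cephfs cephfs_metadata cephfs_default_data        # replicated default data pool holds the backtraces
    ceph osd pool set cephfs_ec_data allow_ec_overwrites true     # required before using an EC pool for CephFS data
    ceph fs add_data_pool cephfs cephfs_ec_data
    setfattr -n ceph.dir.layout.pool -v cephfs_ec_data /mnt/cephfs   # file data below this dir goes to the EC pool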

On Tue, Jan 30, 2024 at 2:03 AM Dietmar Rieder
 wrote:
>
> Hello,
>
> I have a question regarding the default pool of a cephfs.
>
> According to the docs it is recommended to use a fast ssd replicated
> pool as default pool for cephfs. I'm asking what are the space
> requirements for storing the inode backtrace information?
>
> Let's say I have a 85 TiB replicated ssd pool (hot data) and as 3 PiB EC
> data pool (cold data).
>
> Does it make sense to create a third pool as default pool which only
> holds the inode backtrace information (what would be a good size), or is
> it OK to use the ssd pool as default pool?
>
> Thanks
> Dietmar
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Debian 12 support

2023-11-15 Thread Gregory Farnum
There are versioning and dependency issues (both of packages, and compiler
toolchain pieces) which mean that the existing reef releases do not build
on Debian. Our upstream support for Debian has always been inconsistent
because we don’t have anybody dedicated or involved enough in both Debian
and Ceph to keep it working on a day-to-day basis. (I think the basic
problem is Debian makes much bigger and less frequent jumps in compiler
toolchains and packages than the other distros we work with, and none of
the developers have used it as their working OS since ~2010.)

Matthew has submitted a number of PRs to deal with those issues that are in
the reef branch and will let us do upstream builds for the next point
release. (Thanks Matthew!) Proxmox may have grabbed them or done their own
changes without pushing them upstream, or they might have found some other
workarounds that fit their needs.
-Greg
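
(For anyone who wants to build their own debs in the meantime -- once a reef
tag containing those fixes exists, or with the patches applied locally -- the
rough shape is below. A sketch: the tag is a placeholder, the script names are
from the Ceph source tree, and the build takes a long time:)

    git clone -b <reef-tag> https://github.com/ceph/ceph.git
    cd ceph
    ./install-deps.sh        # installs the Debian build dependencies
    ./make-debs.sh           # builds the .deb packages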

On Mon, Nov 13, 2023 at 8:42 AM Luke Hall 
wrote:

> On 13/11/2023 16:28, Daniel Baumann wrote:
> > On 11/13/23 17:14, Luke Hall wrote:
> >> How is it that Proxmox were able to release Debian12 packages for Quincy
> >> quite some time ago?
> >
> > because you can, as always, just (re-)build the package yourself.
>
> I guess I was just trying to point out that there seems to be nothing
> fundamentally blocking these builds which makes it more surprising that
> the official Ceph repo doesn't have Debian12 packages yet.
>
> >> My understanding is that they change almost nothing in their packages
> >> and just roll them to fit with their naming schema etc.
> >
> > yes, we're doing the same since kraken and put them in our own repo
> > (either builds of the "original" ceph sources, or backports from debian
> > - whichever is earlier available).. which is easier/simpler/more
> > reliable and avoids any dependency on external repositories.
> >
> > Regards,
> > Daniel
>
> --
> All postal correspondence to:
> The Positive Internet Company, 24 Ganton Street, London. W1F 7QY
>
> *Follow us on Twitter* @posipeople
>
> The Positive Internet Company Limited is registered in England and Wales.
> Registered company number: 3673639. VAT no: 726 7072 28.
> Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph fs (meta) data inconsistent

2023-11-01 Thread Gregory Farnum
We have seen issues like this a few times and they have all been kernel
client bugs with CephFS’ internal “capability” file locking protocol. I’m
not aware of any extant bugs like this in our code base, but kernel patches
can take a long and winding path before they end up on deployed systems.

Most likely, if you were to restart some combination of the client which
wrote the file and the client(s) reading it, the size would propagate
correctly. As long as you’ve synced the data, it’s definitely present in
the cluster.

Adding Xiubo, who has worked on these and may have other comments.
-Greg
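
(If it helps, a rough way to find and bounce the relevant clients -- a sketch,
with an MDS name taken from the fs status output below and placeholder mount
details:)

    ceph tell mds.ceph-15 session ls      # lists client ids, hostnames and client metadata
    # then, on the affected hosts, remount the kernel client so it drops its cached state:
    umount /mnt/cephfs
    mount -t ceph <mon_host>:/ /mnt/cephfs -o name=<client_name>,secretfile=/etc/ceph/<client>.secret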

On Wed, Nov 1, 2023 at 7:16 AM Frank Schilder  wrote:

> Dear fellow cephers,
>
> today we observed a somewhat worrisome inconsistency on our ceph fs. A
> file created on one host showed up as 0 length on all other hosts:
>
> [user1@host1 h2lib]$ ls -lh
> total 37M
> -rw-rw 1 user1 user1  12K Nov  1 11:59 dll_wrapper.py
>
> [user2@host2 h2lib]# ls -l
> total 34
> -rw-rw. 1 user1 user1 0 Nov  1 11:59 dll_wrapper.py
>
> [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test
> [user1@host1 h2lib]$ ls -l
> total 37199
> -rw-rw 1 user1 user111641 Nov  1 11:59 dll_wrapper.py
> -rw-rw 1 user1 user111641 Nov  1 13:10 dll_wrapper.py.test
>
> [user2@host2 h2lib]# ls -l
> total 45
> -rw-rw. 1 user1 user1 0 Nov  1 11:59 dll_wrapper.py
> -rw-rw. 1 user1 user1 11641 Nov  1 13:10 dll_wrapper.py.test
>
> Executing a sync on all these hosts did not help. However, deleting the
> problematic file and replacing it with a copy seemed to work around the
> issue. We saw this with ceph kclients of different versions, it seems to be
> on the MDS side.
>
> How can this happen and how dangerous is it?
>
> ceph fs status (showing ceph version):
>
> # ceph fs status
> con-fs2 - 1662 clients
> ===
> RANK  STATE MDS   ACTIVITY DNSINOS
>  0active  ceph-15  Reqs:   14 /s  2307k  2278k
>  1active  ceph-11  Reqs:  159 /s  4208k  4203k
>  2active  ceph-17  Reqs:3 /s  4533k  4501k
>  3active  ceph-24  Reqs:3 /s  4593k  4300k
>  4active  ceph-14  Reqs:1 /s  4228k  4226k
>  5active  ceph-13  Reqs:5 /s  1994k  1782k
>  6active  ceph-16  Reqs:8 /s  5022k  4841k
>  7active  ceph-23  Reqs:9 /s  4140k  4116k
> POOL   TYPE USED  AVAIL
>con-fs2-meta1 metadata  2177G  7085G
>con-fs2-meta2   data   0   7085G
> con-fs2-data   data1242T  4233T
> con-fs2-data-ec-ssddata 706G  22.1T
>con-fs2-data2   data3409T  3848T
> STANDBY MDS
>   ceph-10
>   ceph-08
>   ceph-09
>   ceph-12
> MDS version: ceph version 15.2.17
> (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>
> There is no health issue:
>
> # ceph status
>   cluster:
> id: abc
> health: HEALTH_WARN
> 3 pgs not deep-scrubbed in time
>
>   services:
> mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 9w)
> mgr: ceph-25(active, since 7w), standbys: ceph-26, ceph-01, ceph-03,
> ceph-02
> mds: con-fs2:8 4 up:standby 8 up:active
> osd: 1284 osds: 1279 up (since 2d), 1279 in (since 5d)
>
>   task status:
>
>   data:
> pools:   14 pools, 25065 pgs
> objects: 2.20G objects, 3.9 PiB
> usage:   4.9 PiB used, 8.2 PiB / 13 PiB avail
> pgs: 25039 active+clean
>  26active+clean+scrubbing+deep
>
>   io:
> client:   799 MiB/s rd, 55 MiB/s wr, 3.12k op/s rd, 1.82k op/s wr
>
> The inconsistency seems undiagnosed, I couldn't find anything interesting
> in the cluster log. What should I look for and where?
>
> I moved the folder to another location for diagnosis. Unfortunately, I
> don't have 2 clients any more showing different numbers, I see a 0 length
> now everywhere for the moved folder. I'm pretty sure though that the file
> still is non-zero length.
>
> Thanks for any pointers.
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Not able to find a standardized restoration procedure for subvolume snapshots.

2023-09-27 Thread Gregory Farnum
Unfortunately, there’s not any such ability. We are starting long-term work
on making this smoother, but CephFS snapshots are read-only and there’s no
good way to do a constant-time or low-time “clone” operation, so you just
have to copy the data somewhere and start work on it from that position :/
That’s what the subvolume clone interface does.
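
(The clone interface mentioned above looks roughly like this -- a sketch with
placeholder names:)

    ceph fs subvolume snapshot clone <vol> <subvol> <snap> <target_subvol>
    ceph fs clone status <vol> <target_subvol>          # poll until the state is "complete"
    ceph fs subvolume getpath <vol> <target_subvol>     # path of the new writable copy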

On Wed, Sep 27, 2023 at 10:39 AM Kushagr Gupta <
kushagrguptasps@gmail.com> wrote:

> Hi Team,
>
> Any update on this?
>
> Thanks and Regards,
> Kushagra Gupta
>
> On Thu, Sep 14, 2023 at 9:19 AM Kushagr Gupta <
> kushagrguptasps@gmail.com>
> wrote:
>
> > Hi Team,
> >
> > Any update on this?
> >
> > Thanks and Regards,
> > Kushagra Gupta
> >
> > On Tue, Sep 5, 2023 at 10:51 AM Kushagr Gupta <
> > kushagrguptasps@gmail.com> wrote:
> >
> >> *Ceph-version*: Quincy
> >> *OS*: Centos 8 stream
> >>
> >> *Issue*: Not able to find a standardized restoration procedure for
> >> subvolume snapshots.
> >>
> >> *Description:*
> >> Hi  team,
> >>
> >> We are currently working in a 3-node ceph cluster.
> >> We are currently exploring the scheduled snapshot capability of the
> >> ceph-mgr module.
> >> To enable/configure scheduled snapshots, we followed the following link:
> >>
> >> https://docs.ceph.com/en/quincy/cephfs/snap-schedule/
> >>
> >> The scheduled snapshots are working as expected. But we are unable to
> >> find any standardized restoration procedure for the same.
> >>
> >> We have found the following link( not official documentation):
> >> https://www.suse.com/support/kb/doc/?id=19627
> >>
> >> We have also found a link of cloning a new subvolume from snapshots:
> >> https://docs.ceph.com/en/reef/cephfs/fs-volumes/
> >> (Section: Cloning Snapshots)
> >>
> >> Is there a standard procedure to restore from a snapshot.
> >> By this I mean, is there some kind of command link maybe
> >> ceph fs subvolume snapshot restore 
> >>
> >> Or any other procedure please let us know.
> >>
> >> Thanks and Regards,
> >> Kushagra Gupta
> >>
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CVE-2023-43040 - Improperly verified POST keys in Ceph RGW?

2023-09-27 Thread Gregory Farnum
We discussed this in the CLT today and Casey can talk more about the impact
and technical state of affairs.

This was disclosed on the security list and it’s rated as a bug that did
not require hotfix releases due to the limited escalation scope.
-Greg

On Wed, Sep 27, 2023 at 1:37 AM Christian Rohmann <
christian.rohm...@inovex.de> wrote:

> Hey Ceph-users,
>
> i just noticed there is a post to oss-security
> (https://www.openwall.com/lists/oss-security/2023/09/26/10) about a
> security issue with Ceph RGW.
> Signed by IBM / Redhat and including a patch by DO.
>
>
> I also raised an issue on the tracker
> (https://tracker.ceph.com/issues/63004) about this, as I could not find
> one yet.
> It seems a weird way of disclosing such a thing and am wondering if
> anybody knew any more about this?
>
>
>
> Regards
>
>
> Christian
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Ceph leadership team notes 9/27

2023-09-27 Thread Gregory Farnum
Hi everybody,
The CLT met today as usual. We only had a few topics under discussion:
* the User + Dev relaunch went off well! We’d like reliable recordings and
have found Jitsi to be somewhat glitchy; Laura will communicate about
workarounds for that while we work on a longer-term solution (self-hosting
Jitsi has a better reputation and is a possibility). We also discussed a
GitHub repo for hosting presentation files, and organizing them on the
website.

* CVE handling. As noted elsewhere on the mailing list, CVE-2023-43040 (a
privilege escalation impacting RGW) was disclosed elsewhere, and we do not
have coordinated releases for it. This was not deemed important enough on
the security list for that effort, but we do want to be more prepared for
it than we were — our CVE handling process has broken down a bit since some
of the CVE work is now being handled by IBM instead of Red Hat. Tech leads
and IBM employees will be working on refining that so we have better
disclosures.
Also, if you were previously on the security mailing list and did not see
these emails, please reach out to the team — some subscribers were lost and
not recovered in the lab disaster end of last year. (For obvious reasons
this is a closed list — if you do not work for a Linux distribution or at a
large deployer with established relationships in Ceph and security
communities, it’s hard for us to put you there.)
-Greg
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RHEL / CephFS / Pacific / SELinux unavoidable "relabel inode" error?

2023-08-02 Thread Gregory Farnum
I don't think we've seen this reported before. SELinux gets a hefty
workout from Red Hat with their downstream ODF for OpenShift
(Kubernetes), so it certainly works at a basic level.

SELinux is a fussy beast though, so if you're eg mounting CephFS
across RHEL nodes and invoking SELinux against it, any differences
between those nodes (like differing UIDs, or probably lots of other
bits) could result in it looking wrong to them. I'm not even close to
an expert but IIUC, people generally turn off SELinux for their
shared/distributed data.
-Greg

On Wed, Aug 2, 2023 at 5:53 AM Harry G Coin  wrote:
>
> Hi!  No matter what I try, using the latest cephfs on an all
> ceph-pacific setup, I've not been able to avoid this error message,
> always similar to this on RHEL family clients:
>
> SELinux: inode=1099954719159 on dev=ceph was found to have an invalid
> context=system_u:object_r:unlabeled_t:s0.  This indicates you may need
> to relabel the inode or the filesystem in question.
>
> What's the answer?
>
>
> Thanks
>
> Harry Coin
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS snapshots: impact of moving data

2023-07-06 Thread Gregory Farnum
Moving files around within the namespace never changes the way the file
data is represented within RADOS. It’s just twiddling metadata bits. :)
-Greg
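
(A minimal version of the test Dan suggests below could be as simple as this
sketch -- the pool name and paths are examples:)

    rados -p cephfs_data ls | sort > objects.before
    mv /mnt/cephfs/test/sub1/XYZ /mnt/cephfs/test/sub2/XYZ
    mkdir /mnt/cephfs/test/.snap/snapB
    rados -p cephfs_data ls | sort > objects.after
    diff objects.before objects.after     # a pure rename should create no new data objects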

On Thu, Jul 6, 2023 at 3:26 PM Dan van der Ster 
wrote:

> Hi Mathias,
>
> Provided that both subdirs are within the same snap context (subdirs below
> where the .snap is created), I would assume that in the mv case, the space
> usage is not doubled: the snapshots point at the same inode and it is just
> linked at different places in the filesystem.
>
> However, if your cluster and livelihood depends on this being true, I
> suggest making a small test in a tiny empty cephfs, listing the rados pools
> before and after mv and snapshot operations to find out exactly which data
> objects are created.
>
> Cheers, Dan
>
> __
> Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com
>
>
>
>
> On Thu, Jun 22, 2023 at 8:54 AM Kuhring, Mathias <
> mathias.kuhr...@bih-charite.de> wrote:
>
> > Dear Ceph community,
> >
> > We want to restructure (i.e. move around) a lot of data (hundreds of
> > terabyte) in our CephFS.
> > And now I was wondering what happens within snapshots when I move data
> > around within a snapshotted folder.
> > I.e. do I need to account for a lot increased storage usage due to older
> > snapshots differing from the new restructured state?
> > In the end it is just metadata changes. Are the snapshots aware of this?
> >
> > Consider the following examples.
> >
> > Copying data:
> > Let's say I have a folder /test, with a file XYZ in sub-folder
> > /test/sub1 and an empty sub-folder /test/sub2.
> > I create snapshot snapA in /test/.snap, copy XYZ to sub-folder
> > /test/sub2, delete it from /test/sub1 and create another snapshot snapB.
> > I would have two snapshots each with distinct copies of XYZ, hence using
> > double the space in the FS:
> > /test/.snap/snapA/sub1/XYZ <-- copy 1
> > /test/.snap/snapA/sub2/
> > /test/.snap/snapB/sub1/
> > /test/.snap/snapB/sub2/XYZ <-- copy 2
> >
> > Moving data:
> > Let's assume the same structure.
> > But now after creating snapshot snapA, I move XYZ to sub-folder
> > /test/sub2 and then create the other snapshot snapB.
> > The directory tree will look the same. But how is this treated
> internally?
> > Once I move the data, will there be an actually copy created in snapA to
> > represent the old state?
> > Or will this remain the same data (like a link to the inode or so)?
> > And hence not double the storage used for that file.
> >
> > I couldn't find (or understand) anything related to this in the docs.
> > The closest seems to be the hard-link section here:
> > https://docs.ceph.com/en/quincy/dev/cephfs-snapshots/#hard-links
> > Which unfortunately goes a bit over my head.
> > So I'm not sure if this answers my question.
> >
> > Thank you all for your help. Appreciate it.
> >
> > Best Wishes,
> > Mathias Kuhring
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unexpected behavior of directory mtime after being set explicitly

2023-05-25 Thread Gregory Farnum
I haven’t checked the logs, but the most obvious way this happens is if the
mtime set on the directory is in the future compared to the time on the
client or server making changes — CephFS does not move times backwards.
(This causes some problems but prevents many, many others when times are
not synchronized well across the clients and servers.)
-Greg
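
(A quick way to check whether a future mtime is what's going on -- a sketch:)

    date                                  # the client's clock
    stat -c '%y  %n' dir1                 # if this timestamp is ahead of the clock on the
                                          # writing client/MDS, new entries won't move it backwards
    mkdir dir1/dir2 && stat -c '%y  %n' dir1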

On Thu, May 25, 2023 at 7:58 AM Sandip Divekar <
sandip.dive...@hitachivantara.com> wrote:

> Hi Chris,
>
> Kindly request you that follow steps given in previous mail and paste the
> output here.
>
> The reason behind this request is that we have encountered an issue which
> is easily reproducible on
> Latest version of both quincy and pacific, also we have thoroughly
> investigated the matter and we are certain that
> No other factors are at play in this scenario.
>
> Note :  We have used Debian 11 for testing.
> sdsadmin@ceph-pacific-1:~$ uname -a
> Linux ceph-pacific-1 5.10.0-10-amd64 #1 SMP Debian 5.10.84-1 (2021-12-08)
> x86_64 GNU/Linux
> sdsadmin@ceph-pacific-1:~$ sudo ceph -v
> ceph version 16.2.13 (5378749ba6be3a0868b51803968ee9cde4833a3e) pacific
> (stable)
>
> Thanks for your prompt reply.
>
>   Regards
>Sandip Divekar
>
> -Original Message-
> From: Chris Palmer 
> Sent: Thursday, May 25, 2023 7:25 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: Unexpected behavior of directory mtime after
> being set explicitly
>
> * EXTERNAL EMAIL *
>
> Hi Milind
> I just tried this using the ceph kernel client and ceph-common 17.2.6
> package in the latest Fedora kernel, against Ceph 17.2.6 and it worked
> perfectly...
> There must be some other factor in play.
> Chris
>
> On 25/05/2023 13:04, Sandip Divekar wrote:
> > Hello Milind,
> >
> > We are using Ceph Kernel Client.
> > But we found this same behavior while using Libcephfs library.
> >
> > Should we treat this as a bug?  Or
> > Is there any existing bug for similar issue ?
> >
> > Thanks and Regards,
> >Sandip Divekar
> >
> >
> > From: Milind Changire 
> > Sent: Thursday, May 25, 2023 4:24 PM
> > To: Sandip Divekar 
> > Cc: ceph-users@ceph.io; d...@ceph.io
> > Subject: Re: [ceph-users] Unexpected behavior of directory mtime after
> > being set explicitly
> >
> > * EXTERNAL EMAIL *
> > Sandip,
> > What type of client are you using ?
> > kernel client or fuse client ?
> >
> > If it's the kernel client, then it's a bug.
> >
> > FYI - Pacific and Quincy fuse clients do the right thing
> >
> >
> > On Wed, May 24, 2023 at 9:24 PM Sandip Divekar <
> sandip.dive...@hitachivantara.com>
> wrote:
> > Hi Team,
> >
> > I'm writing to bring to your attention an issue we have encountered with
> the "mtime" (modification time) behavior for directories in the Ceph
> filesystem.
> >
> > Upon observation, we have noticed that when the mtime of a directory
> > (let's say: dir1) is explicitly changed in CephFS, subsequent additions
> of files or directories within 'dir1' fail to update the directory's mtime
> as expected.
> >
> > This behavior appears to be specific to CephFS - we have reproduced this
> issue on both Quincy and Pacific.  Similar steps work as expected in the
> ext4 filesystem amongst others.
> >
> > Reproduction steps:
> > 1. Create a directory - mkdir dir1
> > 2. Modify mtime using the touch command - touch dir1 3. Create a file
> > or directory inside of 'dir1' - mkdir dir1/dir2 Expected result:
> > mtime for dir1 should change to the time the file or directory was
> > created in step 3 Actual result:
> > there was no change to the mtime for 'dir1'
> >
> > Note : For more detail, kindly find the attached logs.
> >
> > Our queries are :
> > 1. Is this expected behavior for CephFS?
> > 2. If so, can you explain why the directory behavior is inconsistent
> depending on whether the mtime for the directory has previously been
> manually updated.
> >
> >
> > Best Regards,
> >Sandip Divekar
> > Component QA Lead SDET.
> >
> >
> >
> > --
> > Milind
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [EXTERN] Re: cephfs max_file_size

2023-05-24 Thread Gregory Farnum
On Tue, May 23, 2023 at 11:52 PM Dietmar Rieder
 wrote:
>
> On 5/23/23 15:58, Gregory Farnum wrote:
> > On Tue, May 23, 2023 at 3:28 AM Dietmar Rieder
> >  wrote:
> >>
> >> Hi,
> >>
> >> can the cephfs "max_file_size" setting be changed at any point in the
> >> lifetime of a cephfs?
> >> Or is it critical for existing data if it is changed after some time? Is
> >> there anything to consider when changing, let's say, from 1TB (default)
> >> to 4TB ?
> >
> > Larger files take longer to delete (the MDS has to issue a delete op
> > on every one of the objects that may exist), and longer to recover if
> > their client crashes and the MDS has to probe all the objects looking
> > for the actual size and mtime.
> > This is all throttled so it shouldn't break anything, we just want to
> > avoid the situation somebody ran into once where they accidentally
> > created a 1 exabyte RBD on their little 3-node cluster and then had to
> > suffer through "deleting" it. :D
> > -Greg
> >
>
> Thanks for your detailed explanation.
>
> Would it also be ok if we set the max to 5 TB create some big files
> (>1TB) and then set the max back to 1 TB? Would the big files then still
> be available and usable?

It does look like that would work, but I wouldn't recommend it.
Workarounds like that will always be finicky.
-Greg

>
> Best
> Dietmar
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Help appreciated] ceph mds damaged

2023-05-23 Thread Gregory Farnum
On Tue, May 23, 2023 at 1:55 PM Justin Li  wrote:
>
> Dear All,
>
> After an unsuccessful upgrade to Pacific, the MDS daemons went offline and could not come 
> back up. I checked the MDS logs and found the entries below; cluster info is also included 
> below. I'd appreciate it if anyone can point me in the right direction. Thanks.

What made it unsuccessful? Do you mean you tried to upgrade and then
rolled back somehow, or that you ran the upgrade but this problem
occurred?
-Greg

>
>
> MDS log:
>
> 2023-05-24T06:21:36.831+1000 7efe56e7d700  1 mds.0.cache.den(0x600 
> 1005480d3b2) loaded already corrupt dentry: [dentry #0x100/stray0/1005480d3b2 
> [19ce,head] rep@0,-2.0 NULL (dversion lock) pv=0 
> v=2154265030 ino=(nil) state=0 0x556433addb80]
>
> -5> 2023-05-24T06:21:36.831+1000 7efe56e7d700 -1 mds.0.damage 
> notify_dentry Damage to dentries in fragment * of ino 0x600is fatal because 
> it is a system directory for this rank
>
> -4> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco 
> set_want_state: up:active -> down:damaged
>
> -3> 2023-05-24T06:21:36.831+1000 7efe56e7d700  5 mds.beacon.posco Sending 
> beacon down:damaged seq 5339
>
> -2> 2023-05-24T06:21:36.831+1000 7efe56e7d700 10 monclient: 
> _send_mon_message to mon.ceph-3 at v2:10.120.0.146:3300/0
>
> -1> 2023-05-24T06:21:37.659+1000 7efe60690700  5 mds.beacon.posco 
> received beacon reply down:damaged seq 5339 rtt 0.827966
>
>  0> 2023-05-24T06:21:37.659+1000 7efe56e7d700  1 mds.posco respawn!
>
>
> Cluster info:
> root@ceph-1:~# ceph -s
>   cluster:
> id: e2b93a76-2f97-4b34-8670-727d6ac72a64
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 filesystem is offline
> 1 mds daemon damaged
>
>   services:
> mon: 3 daemons, quorum ceph-1,ceph-2,ceph-3 (age 26h)
> mgr: ceph-3(active, since 15h), standbys: ceph-1, ceph-2
> mds: 0/1 daemons up, 3 standby
> osd: 135 osds: 133 up (since 10h), 133 in (since 2w)
>
>   data:
> volumes: 0/1 healthy, 1 recovering; 1 damaged
> pools:   4 pools, 4161 pgs
> objects: 230.30M objects, 276 TiB
> usage:   836 TiB used, 460 TiB / 1.3 PiB avail
> pgs: 4138 active+clean
>  13   active+clean+scrubbing
>  10   active+clean+scrubbing+deep
>
>
>
> root@ceph-1:~# ceph health detail
> HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon 
> damaged
> [WRN] FS_DEGRADED: 1 filesystem is degraded
> fs cephfs is degraded
> [ERR] MDS_ALL_DOWN: 1 filesystem is offline
> fs cephfs is offline because no MDS is active for it.
> [ERR] MDS_DAMAGE: 1 mds daemon damaged
> fs cephfs mds.0 is damaged
>
>
>
>
> Justin Li
> Senior Technical Officer
> School of Information Technology
> Faculty of Science, Engineering and Built Environment
>
> Request for assistance can be lodged to the SIT Technical Team using this 
> form
>
> Deakin University
> Melbourne Burwood Campus, 221 Burwood Highway, Burwood, VIC 3125
> +61 3 9246 8932
> justin...@deakin.edu.au
> http://www.deakin.edu.au
> Deakin University CRICOS Provider Code 00113B
>
> Important Notice: The contents of this email are intended solely for the 
> named addressee and are confidential; any unauthorised use, reproduction or 
> storage of the contents is expressly prohibited. If you have received this 
> email in error, please delete it and any attachments immediately and advise 
> the sender by return email or telephone.
>
> Deakin University does not warrant that this email and any attachments are 
> error or virus free.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs max_file_size

2023-05-23 Thread Gregory Farnum
On Tue, May 23, 2023 at 3:28 AM Dietmar Rieder
 wrote:
>
> Hi,
>
> can the cephfs "max_file_size" setting be changed at any point in the
> lifetime of a cephfs?
> Or is it critical for existing data if it is changed after some time? Is
> there anything to consider when changing, let's say, from 1TB (default)
> to 4TB ?

Larger files take longer to delete (the MDS has to issue a delete op
on every one of the objects that may exist), and longer to recover if
their client crashes and the MDS has to probe all the objects looking
for the actual size and mtime.
This is all throttled so it shouldn't break anything, we just want to
avoid the situation somebody ran into once where they accidentally
created a 1 exabyte RBD on their little 3-node cluster and then had to
suffer through "deleting" it. :D
-Greg
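
For reference, the change itself is a single filesystem setting; a minimal
sketch, assuming a filesystem named "cephfs" and a 4 TiB target (the value is
given in bytes):

$ ceph fs get cephfs | grep max_file_size        # show the current limit
$ ceph fs set cephfs max_file_size 4398046511104 # raise it to 4 TiB

Raising the limit does not touch existing files; it only changes how large a
file is allowed to grow.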

>
> We are running the latest Nautilus release, BTW.
>
> Thanks in advance
>Dietmar
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-16 Thread Gregory Farnum
On Fri, May 12, 2023 at 5:28 AM Frank Schilder  wrote:
>
> Dear Xiubo and others.
>
> >> I have never heard about that option until now. How do I check that and 
> >> how to I disable it if necessary?
> >> I'm in meetings pretty much all day and will try to send some more info 
> >> later.
> >
> > $ mount|grep ceph
>
> I get
>
> MON-IPs:SRC on DST type ceph 
> (rw,relatime,name=con-fs2-rit-pfile,secret=,noshare,acl,mds_namespace=con-fs2,_netdev)
>
> so async dirop seems disabled.
>
> > Yeah, the kclient just received a corrupted snaptrace from MDS.
> > So the first thing is you need to fix the corrupted snaptrace issue in 
> > cephfs and then continue.
>
> Ooookaaa. I will take it as a compliment that you seem to assume I know 
> how to do that. The documentation gives 0 hits. Could you please provide me 
> with instructions of what to look for and/or what to do first?
>
> > If possible you can parse the above corrupted snap message to check what 
> > exactly corrupted.
> > I haven't get a chance to do that.
>
> Again, how would I do that? Is there some documentation and what should I 
> expect?
>
> > You seems didn't enable the 'osd blocklist' cephx auth cap for mon:
>
> I can't find anything about an osd blocklist client auth cap in the 
> documentation. Is this something that came after octopus? Our caps are as 
> shown in the documentation for a ceph fs client 
> (https://docs.ceph.com/en/octopus/cephfs/client-auth/), the one for mon is 
> "allow r":
>
> caps mds = "allow rw path=/shares"
> caps mon = "allow r"
> caps osd = "allow rw tag cephfs data=con-fs2"
>
>
> > I checked that but by reading the code I couldn't get what had cause the 
> > MDS crash.
> > There seems something wrong corrupt the metadata in cephfs.
>
> He wrote something about an invalid xattrib (empty value). It would be really 
> helpful to get a clue how to proceed. I managed to dump the MDS cache with 
> the critical inode in cache. Would this help with debugging? I also managed 
> to get debug logs with debug_mds=20 during a crash caused by an "mds dump 
> inode" command. Would this contain something interesting? I can also pull the 
> rados objects out and can upload all of these files.

I was just guessing about the invalid xattr based on the very limited
crash info, so if it's clearly broken snapshot metadata from the
kclient logs I would focus on that.

I'm surprised/concerned your system managed to generate one of those,
of course...I'll let Xiubo work with you on that.
-Greg
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds dump inode crashes file system

2023-05-10 Thread Gregory Farnum
On Wed, May 10, 2023 at 7:33 AM Frank Schilder  wrote:
>
> Hi Gregory,
>
> thanks for your reply. Yes, I forgot, I can also inspect the rados head 
> object. My bad.
>
> The empty xattr might come from a crash of the SAMBA daemon. We export to 
> windows and this uses xattrs extensively to map to windows ACLs. It might be 
> possible that a crash at an inconvenient moment left an object in this state. 
> Do you think this is possible? Would it be possible to repair that?

I'm still a little puzzled that it's possible for the system to get
into this state, so we probably will need to generate some bugfixes.
And it might just be the dump function is being naughty. But I would
start by looking at what xattrs exist and if there's an obvious bad
one, deleting it.
-Greg
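
For completeness, a minimal way to do that inspection from a client that has
the path mounted; the path and attribute name below are placeholders, not
values taken from this thread:

$ getfattr -d -m - /mnt/cephfs/path/to/file              # dump all xattr names and values
$ setfattr -x user.broken.attr /mnt/cephfs/path/to/file  # remove a suspect xattr by name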

>
> I will report back what I find with the low-level access. Need to head home 
> now ...
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Gregory Farnum 
> Sent: Wednesday, May 10, 2023 4:26 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: mds dump inode crashes file system
>
> This is a very strange assert to be hitting. From a code skim my best
> guess is the inode somehow has an xattr with no value, but that's just
> a guess and I've no idea how it would happen.
> Somebody recently pointed you at the (more complicated) way of
> identifying an inode path by looking at its RADOS object and grabbing
> the backtrace, which ought to let you look at the file in-situ.
> -Greg
>
>
> On Wed, May 10, 2023 at 6:37 AM Frank Schilder  wrote:
> >
> > For the "mds dump inode" command I could find the crash in the log; see 
> > below. Most of the log contents is the past OPS dump from the 3 MDS 
> > restarts that happened. It contains the 1 last OPS before the crash and 
> > I can upload the log if someone can use it. The crash stack trace somewhat 
> > truncated for readability:
> >
> > 2023-05-10T12:54:53.142+0200 7fe971ca6700  1 mds.ceph-23 Updating MDS map 
> > to version 892464 from mon.4
> > 2023-05-10T13:39:50.962+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] 
> > : client.205899841 isn't responding to mclientcaps(revoke), ino 
> > 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 61.705410 
> > seconds ago
> > 2023-05-10T13:39:52.550+0200 7fe971ca6700  1 mds.ceph-23 Updating MDS map 
> > to version 892465 from mon.4
> > 2023-05-10T13:40:50.963+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] 
> > : client.205899841 isn't responding to mclientcaps(revoke), ino 
> > 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 121.706193 
> > seconds ago
> > 2023-05-10T13:42:50.966+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] 
> > : client.205899841 isn't responding to mclientcaps(revoke), ino 
> > 0x20011d3e5cb pending pAsLsXsFscr issued pAsLsXsFscr, sent 241.709072 
> > seconds ago
> > 2023-05-10T13:44:00.506+0200 7fe972ca8700  1 mds.ceph-23 asok_command: dump 
> > inode {number=2199322355147,prefix=dump inode} (starting...)
> > 2023-05-10T13:44:00.520+0200 7fe972ca8700 -1 
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/buffer.cc:
> >  In function 'const char* ceph::buffer::v15_2_0::ptr::c_str() const' thread 
> > 7fe972ca8700 time 2023-05-10T13:44:00.507652+0200
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/buffer.cc:
> >  501: FAILED ceph_assert(_raw)
> >
> >  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
> > (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> > const*)+0x158) [0x7fe979ae9b92]
> >  2: (()+0x27ddac) [0x7fe979ae9dac]
> >  3: (()+0x5ce831) [0x7fe979e3a831]
> >  4: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
> >  5: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
> >  6: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) 
> > [0x55c08c41e00c]
> >  7: (MDSRank::command_dump_inode(ceph::Formatter*, ..., 
> > std::ostream&)+0xb5) [0x55c08c353e75]
> >  8: (MDSRankDispatcher::handle_asok_command(std::basic_string_view > std::char_traits >, ..., ceph::buffer::v15_2_0::list&)>)+0x2296) 
> > [0x55c08c36c5f6]

[ceph-users] Re: mds dump inode crashes file system

2023-05-10 Thread Gregory Farnum
This is a very strange assert to be hitting. From a code skim my best
guess is the inode somehow has an xattr with no value, but that's just
a guess and I've no idea how it would happen.
Somebody recently pointed you at the (more complicated) way of
identifying an inode path by looking at its RADOS object and grabbing
the backtrace, which ought to let you look at the file in-situ.
-Greg
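
A rough sketch of that lookup, assuming the file data lives in a pool named
"cephfs_data"; the object name is the inode number in hex followed by
".00000000", so for the ino 0x20011d3e5cb seen in the logs above:

$ rados -p cephfs_data getxattr 20011d3e5cb.00000000 parent > /tmp/parent
$ ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json
# the decoded ancestors list gives the path components back up to the root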


On Wed, May 10, 2023 at 6:37 AM Frank Schilder  wrote:
>
> For the "mds dump inode" command I could find the crash in the log; see 
> below. Most of the log contents is the past OPS dump from the 3 MDS restarts 
> that happened. It contains the 1 last OPS before the crash and I can 
> upload the log if someone can use it. The crash stack trace somewhat 
> truncated for readability:
>
> 2023-05-10T12:54:53.142+0200 7fe971ca6700  1 mds.ceph-23 Updating MDS map to 
> version 892464 from mon.4
> 2023-05-10T13:39:50.962+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] : 
> client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb 
> pending pAsLsXsFscr issued pAsLsXsFscr, sent 61.705410 seconds ago
> 2023-05-10T13:39:52.550+0200 7fe971ca6700  1 mds.ceph-23 Updating MDS map to 
> version 892465 from mon.4
> 2023-05-10T13:40:50.963+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] : 
> client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb 
> pending pAsLsXsFscr issued pAsLsXsFscr, sent 121.706193 seconds ago
> 2023-05-10T13:42:50.966+0200 7fe96fca2700  0 log_channel(cluster) log [WRN] : 
> client.205899841 isn't responding to mclientcaps(revoke), ino 0x20011d3e5cb 
> pending pAsLsXsFscr issued pAsLsXsFscr, sent 241.709072 seconds ago
> 2023-05-10T13:44:00.506+0200 7fe972ca8700  1 mds.ceph-23 asok_command: dump 
> inode {number=2199322355147,prefix=dump inode} (starting...)
> 2023-05-10T13:44:00.520+0200 7fe972ca8700 -1 
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/buffer.cc:
>  In function 'const char* ceph::buffer::v15_2_0::ptr::c_str() const' thread 
> 7fe972ca8700 time 2023-05-10T13:44:00.507652+0200
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/common/buffer.cc:
>  501: FAILED ceph_assert(_raw)
>
>  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x158) [0x7fe979ae9b92]
>  2: (()+0x27ddac) [0x7fe979ae9dac]
>  3: (()+0x5ce831) [0x7fe979e3a831]
>  4: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
>  5: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
>  6: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) 
> [0x55c08c41e00c]
>  7: (MDSRank::command_dump_inode(ceph::Formatter*, ..., std::ostream&)+0xb5) 
> [0x55c08c353e75]
>  8: (MDSRankDispatcher::handle_asok_command(std::basic_string_view std::char_traits >, ..., ceph::buffer::v15_2_0::list&)>)+0x2296) 
> [0x55c08c36c5f6]
>  9: (MDSDaemon::asok_command(std::basic_string_view ceph::buffer::v15_2_0::list&)>)+0x75b) [0x55c08c340eab]
>  10: (MDSSocketHook::call_async(std::basic_string_view std::char_traits >, ..., ceph::buffer::v15_2_0::list&)>)+0x6a) 
> [0x55c08c34f9ca]
>  11: 
> (AdminSocket::execute_command(std::vector std::char_traits, ..., ceph::buffer::v15_2_0::list&)>)+0x6f9) 
> [0x7fe979bece59]
>  12: (AdminSocket::do_tell_queue()+0x289) [0x7fe979bed809]
>  13: (AdminSocket::entry()+0x4d3) [0x7fe979beefd3]
>  14: (()+0xc2ba3) [0x7fe977afaba3]
>  15: (()+0x81ca) [0x7fe9786bf1ca]
>  16: (clone()+0x43) [0x7fe977111dd3]
>
> 2023-05-10T13:44:00.522+0200 7fe972ca8700 -1 *** Caught signal (Aborted) **
>  in thread 7fe972ca8700 thread_name:admin_socket
>
>  ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus 
> (stable)
>  1: (()+0x12ce0) [0x7fe9786c9ce0]
>  2: (gsignal()+0x10f) [0x7fe977126a9f]
>  3: (abort()+0x127) [0x7fe9770f9e05]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
> const*)+0x1a9) [0x7fe979ae9be3]
>  5: (()+0x27ddac) [0x7fe979ae9dac]
>  6: (()+0x5ce831) [0x7fe979e3a831]
>  7: (InodeStoreBase::dump(ceph::Formatter*) const+0x153) [0x55c08c59b543]
>  8: (CInode::dump(ceph::Formatter*, int) const+0x144) [0x55c08c59b8d4]
>  9: (MDCache::dump_inode(ceph::Formatter*, unsigned long)+0x7c) 
> [0x55c08c41e00c]
>  10: (MDSRank::command_dump_inode(ceph::Formatter*, ..., std::ostream&)+0xb5) 
> [0x55c08c353e75]
>  11: (MDSRankDispatcher::handle_asok_command(std::basic_string_view std::char_traits >, ..., ceph::buffer::v15_2_0::list&)>)+0x2296) 
> [0x55c08c36c5f6]
>  12: (MDSDaemon::asok_command(std::basic_string_view std::char_traits >, ..., ceph::buffer::v15_2_0::list&)>)+0x75b) 
> [0x55c08c340eab]
>  13: (MDSSocketHook::call_async(std::

[ceph-users] Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

2023-05-02 Thread Gregory Farnum
On Tue, May 2, 2023 at 7:54 AM Igor Fedotov  wrote:
>
>
> On 5/2/2023 11:32 AM, Nikola Ciprich wrote:
> > I've updated cluster to 17.2.6 some time ago, but the problem persists. 
> > This is
> > especially annoying in connection with https://tracker.ceph.com/issues/56896
> > as restarting OSDs is quite painfull when half of them crash..
> > with best regards
> >
> Feel free to set osd_fast_shutdown_timeout to zero to workaround the
> above. IMO this assertion is a nonsence and I don't see any usage of
> this timeout parameter other than just throw an assertion.

This was added by Gabi in
https://github.com/ceph/ceph/commit/9b2a64a5f6ea743b2a4f4c2dbd703248d88b2a96;
presumably he has insight.

I wonder if it's just a debug config so we can see slow shutdowns in
our test runs? In which case it should certainly default to 0 and get
set for those test suites.
-Greg

>
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph stretch mode / POOL_BACKFILLFULL

2023-04-27 Thread Gregory Farnum
On Fri, Apr 21, 2023 at 7:26 AM Kilian Ries  wrote:
>
> Still didn't find out what will happen when the pool is full - but tried a 
> little bit in our testing environment and i were not able to get the pool 
> full before an OSD got full. So in first place one OSD reached the full ratio 
> (pool not quite full, about 98%) and IO stopped (like expected when an OSD 
> reaches full ratio).

I *think* pool full doesn't actually matter if you haven't set quotas,
but those properties have seen some code changes recently. CCing RADOS
people.
We do have a proposed fix but it seems to have languished. :(
-Greg
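
To double-check whether quotas are in play here, two read-only commands
(pool name taken from the output quoted below; neither command changes
anything):

$ ceph osd pool get-quota vm_stretch_live   # max_objects / max_bytes, unset means unlimited
$ ceph osd dump | grep ratio                # the full/backfillfull/nearfull ratios that actually gate I/O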

> I were able to re-balance the OSDs by manually doing reweights. Now, the 
> cluster is much more balanced and even the pool shows more free space (about 
> 75% used).
>
> Also the pg-autoscaler does not really play well with the stretch crush rule 
> ... had to increase / adjust the PGs manually to get a better distribution.
>
> Regards,
> Kilian
> 
> Von: Kilian Ries 
> Gesendet: Mittwoch, 19. April 2023 12:18:06
> An: ceph-users
> Betreff: [ceph-users] Ceph stretch mode / POOL_BACKFILLFULL
>
> Hi,
>
>
> we run a ceph cluster in stretch mode with one pool. We know about this bug:
>
>
> https://tracker.ceph.com/issues/56650
>
> https://github.com/ceph/ceph/pull/47189
>
>
> Can anyone tell me what happens when a pool gets to 100% full? At the moment 
> raw OSD usage is about 54% but ceph throws me a "POOL_BACKFILLFULL" error:
>
>
> $ ceph df
>
> --- RAW STORAGE ---
>
> CLASSSIZE   AVAILUSED  RAW USED  %RAW USED
>
> ssd63 TiB  29 TiB  34 TiB34 TiB  54.19
>
> TOTAL  63 TiB  29 TiB  34 TiB34 TiB  54.19
>
>
>
> --- POOLS ---
>
> POOL ID  PGS   STORED  OBJECTS USED  %USED  MAX AVAIL
>
> .mgr  11  415 MiB  105  1.2 GiB   0.041.1 TiB
>
> vm_stretch_live   2   64   15 TiB4.02M   34 TiB  95.53406 GiB
>
>
>
> So the pool warning / calculation is just a bug, because it thinks its 50% of 
> the total size. I know ceph will stop IO / set OSDs to read only if the hit a 
> "backfillfull_ratio" ... but what will happen if the pool gets to 100% full ?
>
>
> Will IO still be possible?
>
>
> No limits / quotas are set on the pool ...
>
>
> Thanks
>
> Regards,
>
> Kilian
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bug, pg_upmap_primaries.empty()

2023-04-26 Thread Gregory Farnum
Looks like you've somehow managed to enable the upmap balancer while
allowing a client that's too old to understand it to mount.

Radek, this is a commit from yesterday; is it a known issue?
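
A few read-only commands that usually help narrow this kind of thing down
(none of them change anything; they are generic diagnostics, not a confirmed
fix for the assert below):

$ ceph osd get-require-min-compat-client   # the minimum client release the cluster requires
$ ceph balancer status                     # whether the balancer is enabled and in which mode
$ ceph features                            # which client releases are actually connected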

On Wed, Apr 26, 2023 at 7:49 AM Nguetchouang Ngongang Kevin
 wrote:
>
> Good morning, i found a bug on ceph reef
>
> After installing ceph and deploying 9 osds with a cephfs layer. I got
> this error after many writing and reading operations on the ceph fs i
> deployed.
>
> ```{
> "assert_condition": "pg_upmap_primaries.empty()",
> "assert_file":
> "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-3593-g1e73409b/rpm/el8/BUILD/ceph-18.0.0-3593-g1e73409b/src/osd/OSDMap.cc",
> "assert_func": "void OSDMap::encode(ceph::buffer::v15_2_0::list&,
> uint64_t) const",
> "assert_line": 3239,
> "assert_msg":
> "/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-3593-g1e73409b/rpm/el8/BUILD/ceph-18.0.0-3593-g1e73409b/src/osd/OSDMap.cc:
> In function 'void OSDMap::encode(ceph::buffer::v15_2_0::list&, uint64_t)
> const' thread 7f86cb8e5700 time
> 2023-04-26T12:25:12.278025+\n/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.0.0-3593-g1e73409b/rpm/el8/BUILD/ceph-18.0.0-3593-g1e73409b/src/osd/OSDMap.cc:
> 3239: FAILED ceph_assert(pg_upmap_primaries.empty())\n",
> "assert_thread_name": "msgr-worker-0",
> "backtrace": [
> "/lib64/libpthread.so.0(+0x12cf0) [0x7f86d0d21cf0]",
> "gsignal()",
> "abort()",
> "(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x18f) [0x55ce1794774b]",
> "/usr/bin/ceph-osd(+0x6368b7) [0x55ce179478b7]",
> "(OSDMap::encode(ceph::buffer::v15_2_0::list&, unsigned long)
> const+0x1229) [0x55ce183e0449]",
> "(MOSDMap::encode_payload(unsigned long)+0x396)
> [0x55ce17ae2576]",
> "(Message::encode(unsigned long, int, bool)+0x2e)
> [0x55ce1825dbee]",
> "(ProtocolV1::prepare_send_message(unsigned long, Message*,
> ceph::buffer::v15_2_0::list&)+0x54) [0x55ce184e5914]",
> "(ProtocolV1::write_event()+0x511) [0x55ce184f4ce1]",
> "(EventCenter::process_events(unsigned int,
> std::chrono::duration
> >*)+0xa64) [0x55ce182eb484]",
> "/usr/bin/ceph-osd(+0xfdf276) [0x55ce182f0276]",
> "/lib64/libstdc++.so.6(+0xc2b13) [0x7f86d0369b13]",
> "/lib64/libpthread.so.0(+0x81ca) [0x7f86d0d171ca]",
> "clone()"
> ],
> "ceph_version": "18.0.0-3593-g1e73409b",
> "crash_id":
> "2023-04-26T12:25:12.286947Z_55675d7c-7833-4e91-b0eb-6df705104c2e",
> "entity_name": "osd.0",
> "os_id": "centos",
> "os_name": "CentOS Stream",
> "os_version": "8",
> "os_version_id": "8",
> "process_name": "ceph-osd",
> "stack_sig":
> "0ffad2c4bc07caf68ff1e124d3911823bc6fa6f5772444754b7f0a998774c8fe",
> "timestamp": "2023-04-26T12:25:12.286947Z",
> "utsname_hostname": "node1-link-1",
> "utsname_machine": "x86_64",
> "utsname_release": "5.4.0-100-generic",
> "utsname_sysname": "Linux",
> "utsname_version": "#113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022"
> }
>
> ```
>
> I really don't know what is this error for, Will appreciate any help.
>
> Cordially,
>
> --
> Nguetchouang Ngongang Kevin
> ENS de Lyon
> https://perso.ens-lyon.fr/kevin.nguetchouang/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS thrashing through the page cache

2023-04-05 Thread Gregory Farnum
On Fri, Mar 17, 2023 at 1:56 AM Ashu Pachauri  wrote:
>
> Hi Xiubo,
>
> As you have correctly pointed out, I was talking about the stipe_unit
> setting in the file layout configuration. Here is the documentation for
> that for anyone else's reference:
> https://docs.ceph.com/en/quincy/cephfs/file-layouts/
>
> As with any RAID0 setup, the stripe_unit is definitely workload dependent.
> Our use case requires us to read somewhere from a few kilobytes to a few
> hundred kilobytes at once. Having a 4MB default stripe_unit definitely
> hurts quite a bit. We were able to achieve almost 2x improvement in terms
> of average latency and overall throughput (for useful data) by reducing the
> stripe_unit. The rule of thumb is that you want to align the stripe_unit to
> your most common IO size.

There's a lot more that goes into the stripe_unit than IO size for
CephFS. This may improve your workload, but mostly it just means your
IO accesses go out to more objects (so more PGs and more OSDs). That
can be good if you have low client counts doing
random IO, but means that in general readahead (when not broken, as it
appears to be in your kernel! :/) is much much much less effective and
more expensive.
-Greg
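
For anyone following along, layouts are adjusted through the ceph.* virtual
xattrs on a mounted filesystem; a small sketch, assuming a kernel mount at
/mnt/cephfs (the values are examples only, and a directory layout applies
only to files created after it is set):

$ getfattr -n ceph.file.layout /mnt/cephfs/some_file                    # inspect an existing file's layout
$ mkdir /mnt/cephfs/smallio
$ setfattr -n ceph.dir.layout.stripe_unit -v 65536 /mnt/cephfs/smallio  # 64 KiB stripe_unit for new files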

>
> > BTW, have you tried to set 'rasize' option to a small size instead of 0
> > ? Won't this work ?
>
> No this won't work. I have tried it already. Since rasize simply impacts
> readahead, your minimum io size to the cephfs client will still be at the
> maximum of (rasize, stripe_unit).  rasize is a useful configuration only if
> it is required to be larger than the stripe_unit, otherwise it's not. Also,
> it's worth pointing out that simply setting rasize is not sufficient; one
> needs to change the corresponding configurations that control
> maximum/minimum readahead for ceph clients.
>
> Thanks and Regards,
> Ashu Pachauri
>
>
> On Fri, Mar 17, 2023 at 2:14 PM Xiubo Li  wrote:
>
> >
> > On 15/03/2023 17:20, Frank Schilder wrote:
> > > Hi Ashu,
> > >
> > > are you talking about the kernel client? I can't find "stripe size"
> > anywhere in its mount-documentation. Could you possibly post exactly what
> > you did? Mount fstab line, config setting?
> >
> > There is no mount option to do this in both userspace and kernel
> > clients. You need to change the file layout, which is (4MB stripe_unit,
> > 1 stripe_count and 4MB object_size) by default, instead.
> >
> > Certainly with a smaller size of the stripe_unit will work. But IMO it
> > will depend and be careful, changing the layout may cause other
> > performance issues in some case, for example too small stripe_unit size
> > may split the sync read into more osd requests to different OSDs.
> >
> > I will generate one patch to make the kernel client wiser instead of
> > blindly setting it to stripe_unit always.
> >
> > Thanks
> >
> > - Xiubo
> >
> >
> > >
> > > Thanks!
> > > =
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > 
> > > From: Ashu Pachauri 
> > > Sent: 14 March 2023 19:23:42
> > > To: ceph-users@ceph.io
> > > Subject: [ceph-users] Re: CephFS thrashing through the page cache
> > >
> > > Got the answer to my own question; posting here if someone else
> > > encounters the same problem. The issue is that the default stripe size
> > in a
> > > cephfs mount is 4 MB. If you are doing small reads (like 4k reads in the
> > > test I posted) inside the file, you'll end up pulling at least 4MB to the
> > > client (and then discarding most of the pulled data) even if you set
> > > readahead to zero. So, the solution for us was to set a lower stripe
> > size,
> > > which aligns better with our workloads.
> > >
> > > Thanks and Regards,
> > > Ashu Pachauri
> > >
> > >
> > > On Fri, Mar 10, 2023 at 9:41 PM Ashu Pachauri 
> > wrote:
> > >
> > >> Also, I am able to reproduce the network read amplification when I try
> > to
> > >> do very small reads from larger files. e.g.
> > >>
> > >> for i in $(seq 1 1); do
> > >>dd if=test_${i} of=/dev/null bs=5k count=10
> > >> done
> > >>
> > >>
> > >> This piece of code generates a network traffic of 3.3 GB while it
> > actually
> > >> reads approx 500 MB of data.
> > >>
> > >>
> > >> Thanks and Regards,
> > >> Ashu Pachauri
> > >>
> > >> On Fri, Mar 10, 2023 at 9:22 PM Ashu Pachauri 
> > >> wrote:
> > >>
> > >>> We have an internal use case where we back the storage of a proprietary
> > >>> database by a shared file system. We noticed something very odd when
> > >>> testing some workload with a local block device backed file system vs
> > >>> cephfs. We noticed that the amount of network IO done by cephfs is
> > almost
> > >>> double compared to the IO done in case of a local file system backed
> > by an
> > >>> attached block device.
> > >>>
> > >>> We also noticed that CephFS thrashes through the page cache very
> > quickly
> > >>> compared to the amount of data being read and think that the two issues
> > >>> mi

[ceph-users] Re: ln: failed to create hard link 'file name': Read-only file system

2023-03-22 Thread Gregory Farnum
On Wed, Mar 22, 2023 at 8:27 AM Frank Schilder  wrote:

> Hi Gregory,
>
> thanks for your reply. First a quick update. Here is how I get ln to work
> after it failed, there seems no timeout:
>
> $ ln envs/satwindspy/include/ffi.h
> mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h
> ln: failed to create hard link
> 'mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h': Read-only file system
> $ ls -l envs/satwindspy/include mambaforge/pkgs/libffi-3.3-h58526e2_2
> envs/satwindspy/include:
> total 7664
> -rw-rw-r--.   1 rit rit959 Mar  5  2021 ares_build.h
> [...]
> $ ln envs/satwindspy/include/ffi.h
> mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h
>
> After an ls -l on both directories ln works.
>
> To the question: How can I pull out a log from the nfs server? There is
> nothing in /var/log/messages.


So you’re using the kernel server and re-exporting, right?

I’m not very familiar with its implementation; I wonder if it’s doing
something strange via the kernel vfs.
AFAIK this isn’t really supportable for general use because nfs won’t
respect the CephFS file consistency protocol. But maybe it’s trying a bit
and that’s causing trouble?
-Greg



>
> I can't reproduce it with simple commands on the NFS client. It seems to
> occur only when a large number of files/dirs is created. I can make the
> archive available to you if this helps.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Gregory Farnum 
> Sent: Wednesday, March 22, 2023 4:14 PM
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: ln: failed to create hard link 'file name':
> Read-only file system
>
> Do you have logs of what the nfs server is doing?
> Managed to reproduce it in terms of direct CephFS ops?
>
>
> On Wed, Mar 22, 2023 at 8:05 AM Frank Schilder  wrote:
> I have to correct myself. It also fails on an export with "sync" mode.
> Here is an strace on the client (strace ln envs/satwindspy/include/ffi.h
> mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h):
>
> [...]
> stat("mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h",
> 0x7ffdc5c32820) = -1 ENOENT (No such file or directory)
> lstat("envs/satwindspy/include/ffi.h", {st_mode=S_IFREG|0664,
> st_size=13934, ...}) = 0
> linkat(AT_FDCWD, "envs/satwindspy/include/ffi.h", AT_FDCWD,
> "mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h", 0) = -1 EROFS
> (Read-only file system)
> [...]
> write(2, "ln: ", 4ln: ) = 4
> write(2, "failed to create hard link 'mamb"..., 80failed to create hard
> link 'mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h') = 80
> [...]
> write(2, ": Read-only file system", 23: Read-only file system) = 23
> write(2, "\n", 1
> )   = 1
> lseek(0, 0, SEEK_CUR)   = -1 ESPIPE (Illegal seek)
> close(0)= 0
> close(1)= 0
> close(2)= 0
> exit_group(1)   = ?
> +++ exited with 1 +++
>
> Has anyone advice?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: Wednesday, March 22, 2023 2:44 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] ln: failed to create hard link 'file name':
> Read-only file system
>
> Hi all,
>
> on an NFS re-export of a ceph-fs (kernel client) I observe a very strange
> error. I'm un-taring a larger package (1.2G) and after some time I get
> these errors:
>
> ln: failed to create hard link 'file name': Read-only file system
>
> The strange thing is that this seems only temporary. When I used "ln src
> dst" for manual testing, the command failed as above. However, after that I
> tried "ln -v src dst" and this command created the hard link with exactly
> the same path arguments. During the period when the error occurs, I can't
> see any FS in read-only mode, neither on the NFS client nor the NFS server.
> Funny thing is that file creation and write still works, its only the
> hard-link creation that fails.
>
> For details, the set-up is:
>
> file-server: mount ceph-fs at /shares/path, export /shares/path as nfs4 to
> other server
> other server: mount /shares/path as NFS
>
> More precisely, on the file-server:
>
> fstab: MON-IPs:/shares/folder

[ceph-users] Re: ln: failed to create hard link 'file name': Read-only file system

2023-03-22 Thread Gregory Farnum
Do you have logs of what the nfs server is doing?
Managed to reproduce it in terms of direct CephFS ops?


On Wed, Mar 22, 2023 at 8:05 AM Frank Schilder  wrote:

> I have to correct myself. It also fails on an export with "sync" mode.
> Here is an strace on the client (strace ln envs/satwindspy/include/ffi.h
> mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h):
>
> [...]
> stat("mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h",
> 0x7ffdc5c32820) = -1 ENOENT (No such file or directory)
> lstat("envs/satwindspy/include/ffi.h", {st_mode=S_IFREG|0664,
> st_size=13934, ...}) = 0
> linkat(AT_FDCWD, "envs/satwindspy/include/ffi.h", AT_FDCWD,
> "mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h", 0) = -1 EROFS
> (Read-only file system)
> [...]
> write(2, "ln: ", 4ln: ) = 4
> write(2, "failed to create hard link 'mamb"..., 80failed to create hard
> link 'mambaforge/pkgs/libffi-3.3-h58526e2_2/include/ffi.h') = 80
> [...]
> write(2, ": Read-only file system", 23: Read-only file system) = 23
> write(2, "\n", 1
> )   = 1
> lseek(0, 0, SEEK_CUR)   = -1 ESPIPE (Illegal seek)
> close(0)= 0
> close(1)= 0
> close(2)= 0
> exit_group(1)   = ?
> +++ exited with 1 +++
>
> Has anyone advice?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Frank Schilder 
> Sent: Wednesday, March 22, 2023 2:44 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] ln: failed to create hard link 'file name':
> Read-only file system
>
> Hi all,
>
> on an NFS re-export of a ceph-fs (kernel client) I observe a very strange
> error. I'm un-taring a larger package (1.2G) and after some time I get
> these errors:
>
> ln: failed to create hard link 'file name': Read-only file system
>
> The strange thing is that this seems only temporary. When I used "ln src
> dst" for manual testing, the command failed as above. However, after that I
> tried "ln -v src dst" and this command created the hard link with exactly
> the same path arguments. During the period when the error occurs, I can't
> see any FS in read-only mode, neither on the NFS client nor the NFS server.
> Funny thing is that file creation and write still works, its only the
> hard-link creation that fails.
>
> For details, the set-up is:
>
> file-server: mount ceph-fs at /shares/path, export /shares/path as nfs4 to
> other server
> other server: mount /shares/path as NFS
>
> More precisely, on the file-server:
>
> fstab: MON-IPs:/shares/folder /shares/nfs/folder ceph
> defaults,noshare,name=NAME,secretfile=sec.file,mds_namespace=FS-NAME,_netdev
> 0 0
> exports: /shares/nfs/folder
> -no_root_squash,rw,async,mountpoint,no_subtree_check DEST-IP
>
> On the host at DEST-IP:
>
> fstab: FILE-SERVER-IP:/shares/nfs/folder /mnt/folder nfs defaults,_netdev
> 0 0
>
> Both, the file server and the client server are virtual machines. The file
> server is on Centos 8 stream (4.18.0-338.el8.x86_64) and the client machine
> is on AlmaLinux 8 (4.18.0-425.13.1.el8_7.x86_64).
>
> When I change the NFS export from "async" to "sync" everything works.
> However, that's a rather bad workaround and not a solution. Although this
> looks like an NFS issue, I'm afraid it is a problem with hard links and
> ceph-fs. It looks like a race with scheduling and executing operations on
> the ceph-fs kernel mount.
>
> Has anyone seen something like that?
>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: mds damage cannot repair

2023-02-13 Thread Gregory Farnum
A "backtrace" is an xattr on the RADOS object storing data for a given
file, and it contains the file's (versioned) path from the root. So a
bad backtrace means there's something wrong with that — possibly just
that there's a bug in the version of the code that's checking it,
because they're generally out of date and that's okay; it's definitely
not critical because they're only used for disaster recovery or in
some rare ino-based lookup cases if there's a hard link to the file or
some rare nfs situations.

The "bad remote dentry" message means that you had a link and...I
don't actually remember the details of this one.

But my guess is that the relevant files were just auto-deleted and not
critical to the software anyway, so that's why the error went away.
Incorrect handling of them being checked while being deleted
might also be why a warning was spawned to begin with.
-Greg
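
For reference, roughly the commands involved in checking and clearing this
kind of entry, reusing the rank-0 addressing and the damage id already shown
in this thread (the repair scrub may or may not rewrite a given backtrace):

$ ceph tell mds.0 damage ls                                           # list damage entries and their ids
$ ceph tell mds.0 scrub start /hpc/home/euliz force,repair,recursive
$ ceph tell mds.0 damage rm 2287166658                                # drop the entry once it no longer applies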

On Mon, Feb 13, 2023 at 5:43 AM Andrej Filipcic  wrote:
>
> On 2/10/23 08:50, Andrej Filipcic wrote:
>
> FYI, the damage went away after a couple of days, not quite sure how.
>
> Best,
> Andrej
>
> >
> > Hi,
> >
> > there is mds damage on our cluster, version 17.2.5,
> >
> > [
> >{
> >"damage_type": "backtrace",
> >"id": 2287166658,
> >"ino": 3298564401782,
> >"path": "/hpc/home/euliz/.Xauthority"
> >}
> > ]
> >
> >
> >
> > The recursive repair does not fix it,
> > ...ceph tell mds.0 scrub start /hpc/home/euliz force,repair,recursive
> >
> > mds log:
> >
> > 2023-02-10T07:01:34.012+0100 7f46df3ea700  0 mds.0.cache  failed to
> > open ino 0x30001c26a76 err -116/0
> > 2023-02-10T07:01:34.012+0100 7f46df3ea700  0 mds.0.cache
> > open_remote_dentry_finish bad remote dentry [dentry
> > #0x1/hpc/home/euliz/.Xauthority [568,head] auth
> > REMOTE(reg) (dversion lock) pv=0 v=4425667830 ino=(nil)
> > state=1073741824 | ptrwaiter=1 0x5560eb33a780]
> >
> >
> > Any clue how to fix this? or remove the file from namespace? it is not
> > important...
> >
> > Thanks,
> > Andrej
> >
> > --
> > _
> > prof. dr. Andrej Filipcic,   E-mail:andrej.filip...@ijs.si
> > Department of Experimental High Energy Physics - F9
> > Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> > SI-1001 Ljubljana, Slovenia
> > Tel.: +386-1-477-3674Fax: +386-1-477-3166
> > -
>
>
> --
> _
> prof. dr. Andrej Filipcic,   E-mail:andrej.filip...@ijs.si
> Department of Experimental High Energy Physics - F9
> Jozef Stefan Institute, Jamova 39, P.o.Box 3000
> SI-1001 Ljubljana, Slovenia
> Tel.: +386-1-477-3674Fax: +386-1-477-3166
> -
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Health warning - POOL_TARGET_SIZE_BYTES_OVERCOMMITED

2023-02-13 Thread Gregory Farnum
On Mon, Feb 13, 2023 at 4:16 AM Sake Paulusma  wrote:
>
> Hello,
>
> I configured a stretched cluster on two datacenters. It's working fine, 
> except this weekend the Raw Capicity exceeded 50% and the error 
> POOL_TARGET_SIZE_BYTES_OVERCOMMITED showed up.
>
> The command "ceph df" is showing the correct cluster size, but "ceph osd pool 
> autoscale-status" is showing half of the total Raw Capacity.
>
> What could be wrong?

There's a bug with the statistics handling of pools in stretch mode,
and others like them. :(
https://tracker.ceph.com/issues/56650

-Greg
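
If the explicit target on the pool reporting the overcommit is not actually
needed, one way to quiet that particular warning while the bug is open is to
clear it; this is only a workaround, not a fix for the stats bug, and assumes
the target on the pool shown in the output below can be dropped:

$ ceph osd pool set cephfs.application-tst.data target_size_bytes 0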


>
>
>
> 
> [ceph: root@aqsoel11445 /]# ceph status
>   cluster:
> id: adbe7bb6-5h6d-11ed-8511-004449ede0c
> health: HEALTH_WARN
> 1 MDSs report oversized cache
> 1 subtrees have overcommitted pool target_size_bytes
>
>   services:
> mon: 5 daemons, quorum host1,host2,host3,host4,host5 (age 4w)
> mgr: aqsoel11445.nqamuz(active, since 5w), standbys: host1.wujgas
> mds: 2/2 daemons up, 2 standby
> osd: 12 osds: 12 up (since 5w), 12 in (since 9w)
>
>   data:
> volumes: 2/2 healthy
> pools:   5 pools, 193 pgs
> objects: 17.31M objects, 1.2 TiB
> usage:   5.0 TiB used, 3.8 TiB / 8.8 TiB avail
> pgs: 192 active+clean
>  1   active+clean+scrubbing
> 
>
> 
> [ceph: root@aqsoel11445 /]# ceph df
> --- RAW STORAGE ---
> CLASS SIZEAVAIL USED  RAW USED  %RAW USED
> ssd8.8 TiB  3.8 TiB  5.0 TiB   5.0 TiB  56.83
> TOTAL  8.8 TiB  3.8 TiB  5.0 TiB   5.0 TiB  56.83
>
> --- POOLS ---
> POOL   ID  PGS   STORED  OBJECTS USED  %USED  MAX 
> AVAIL
> .mgr11  449 KiB2  1.8 MiB  0320 
> GiB
> cephfs.application-tst.meta   2   16  540 MiB   18.79k  2.1 GiB   0.16320 
> GiB
> cephfs.application-tst.data   3   32  4.4 GiB8.01k   17 GiB   1.33320 
> GiB
> cephfs.application-acc.meta   4   16   11 GiB3.54M   45 GiB   3.37320 
> GiB
> cephfs.application-acc.data   5  128  1.2 TiB   13.74M  4.8 TiB  79.46320 
> GiB
> 
>
> 
> [ceph: root@aqsoel11445 /]# ceph osd pool autoscale-status
> POOL SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  
> TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> .mgr   448.5k4.0 4499G  0.
>   1.0   1  on False
> cephfs.application-tst.meta  539.8M4.0 4499G  0.0005  
> 4.0  16  on False
> cephfs.application-tst.data   4488M   51200M   4.0 4499G  0.0444  
> 1.0  32  on False
> cephfs.application-acc.meta  11430M4.0 4499G  0.0099  
> 4.0  16  on False
> cephfs.application-acc.data   1244G4.0 4499G  1.1062  
>   1.   0.9556   1.0 128  on False
> 
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Frequent calling monitor election

2023-02-09 Thread Gregory Farnum
Also, the fact that the current leader (ceph-01) is one of the monitors
proposing an election each time suggests the problem is with getting
commit acks back from one of its followers.
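
A minimal sketch of how to turn that logging up temporarily, using the mon
names from the quoted output (remember to set the values back afterwards):

$ ceph tell mon.ceph-01 injectargs '--debug_mon 10/10 --debug_paxos 10/10'
# ...reproduce an election, check the mon log for slow lease/commit handling, then revert:
$ ceph tell mon.ceph-01 injectargs '--debug_mon 1/5 --debug_paxos 1/5'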

On Thu, Feb 9, 2023 at 8:09 AM Dan van der Ster  wrote:
>
> Hi Frank,
>
> Check the mon logs with some increased debug levels to find out what
> the leader is busy with.
> We have a similar issue (though, daily) and it turned out to be
> related to the mon leader timing out doing a SMART check.
> See https://tracker.ceph.com/issues/54313 for how I debugged that.
>
> Cheers, Dan
>
> On Thu, Feb 9, 2023 at 7:56 AM Frank Schilder  wrote:
> >
> > Hi all,
> >
> > our monitors have enjoyed democracy since the beginning. However, I don't 
> > share a sudden excitement about voting:
> >
> > 2/9/23 4:42:30 PM[INF]overall HEALTH_OK
> > 2/9/23 4:42:30 PM[INF]mon.ceph-01 is new leader, mons 
> > ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 4:42:26 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:42:26 PM[INF]mon.ceph-26 calling monitor election
> > 2/9/23 4:42:26 PM[INF]mon.ceph-25 calling monitor election
> > 2/9/23 4:42:26 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:40:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:30:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:24:34 PM[INF]overall HEALTH_OK
> > 2/9/23 4:24:34 PM[INF]mon.ceph-01 is new leader, mons 
> > ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-03 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-26 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-25 calling monitor election
> > 2/9/23 4:24:29 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:24:04 PM[INF]overall HEALTH_OK
> > 2/9/23 4:24:03 PM[INF]mon.ceph-01 is new leader, mons 
> > ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 4:23:59 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 4:23:59 PM[INF]mon.ceph-02 calling monitor election
> > 2/9/23 4:20:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:10:00 PM[INF]overall HEALTH_OK
> > 2/9/23 4:00:00 PM[INF]overall HEALTH_OK
> > 2/9/23 3:50:00 PM[INF]overall HEALTH_OK
> > 2/9/23 3:43:13 PM[INF]overall HEALTH_OK
> > 2/9/23 3:43:13 PM[INF]mon.ceph-01 is new leader, mons 
> > ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 in quorum (ranks 0,1,2,3,4)
> > 2/9/23 3:43:08 PM[INF]mon.ceph-01 calling monitor election
> > 2/9/23 3:43:08 PM[INF]mon.ceph-26 calling monitor election
> > 2/9/23 3:43:08 PM[INF]mon.ceph-25 calling monitor election
> >
> > We moved a switch from one rack to another and after the switch came beck 
> > up, the monitors frequently bitch about who is the alpha. How do I get them 
> > to focus more on their daily duties again?
> >
> > Thanks for any help!
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS_DAMAGE dir_frag

2022-12-12 Thread Gregory Farnum
On Mon, Dec 12, 2022 at 12:10 PM Sascha Lucas  wrote:

> Hi Dhairya,
>
> On Mon, 12 Dec 2022, Dhairya Parmar wrote:
>
> > You might want to look at [1] for this, also I found a relevant thread
> [2]
> > that could be helpful.
> >
>
> Thanks a lot. I already found [1,2], too. But I did not considered it,
> because I felt not having a "disaster"? Nothing seems broken nor crashed:
> all servers/services up since weeks. No disk failures, no modifications on
> cluster etc.
>
> Also the Warning Box in [1] tells me (as a newbie) not to run anything of
> this. Or in other words: not to forcefully start a disaster ;-).
>
> A follow-up of [2] also mentioned having random meta-data corruption: "We
> have 4 clusters (all running same version) and have experienced meta-data
> corruption on the majority of them at some time or the other"


Jewel (and upgrading from that version) was much less stable than Luminous
(when we declared the filesystem “awesome” and said the Ceph upstream
considered it production-ready), and things have generally gotten better
with every release since then.


>
> [3] tells me, that metadata damage can happen either from data loss (which
> I'm convinced not to have), or from software bugs. The later would be
> worth fixing. Is there a way to find the root cause?


Yes, we’d very much like to understand this. What versions of the server
and kernel client are you using? What platform stack — I see it looks like
you are using CephFS through the volumes interface? The simplest
possibility I can think of here is that you are running with a bad kernel
and it used async ops poorly, maybe? But I don’t remember other spontaneous
corruptions of this type anytime recent.

Have you run a normal forward scrub (which is non-disruptive) to check if
there are other issues?
-Greg



>
> And is going through [1] relay the only option? It sounds being offline
> for days...
>
> At least I know now, what dirfrags[4] are.
>
> Thanks, Sascha.
>
> [1]
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
> [2] https://www.spinics.net/lists/ceph-users/msg53202.html
> [3]
> https://docs.ceph.com/en/quincy/cephfs/disaster-recovery/#metadata-damage-and-repair
> [4] https://docs.ceph.com/en/quincy/cephfs/dirfrags/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what happens if a server crashes with cephfs?

2022-12-08 Thread Gregory Farnum
Ceph clients keep updates buffered until they receive server
notification that the update is persisted to disk. On server crash,
the client connects to either the newly-responsible OSD (for file
data) or the standby/restarted MDS (for file metadata) and replays
outstanding operations. This is all transparent to the application or
filesystem users, except that IO calls may take a very long time to
resolve (if you don't have an MDS running for ten minutes, metadata
synchronization calls will just hang for those ten minutes).
Ceph is designed to run operationally when failures happen, and
"failures" include upgrades. You may see degraded capacity, but the
filesystem remains online and no IO errors will be returned to
applications as a result.[1] The whole point of data replication and
Ceph's architecture is to prevent the sort of IO error you might get
when a Lustre target dies.

So, no: your applications will not receive IO errors because an OSD fails. :)
-Greg

[1]: In a default configuration. Some admins/environments prefer to
receive error codes rather than hangs on arbitrary syscalls, and there
are some limited accommodations for them which can be set with config
options.
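
One example of such a knob, for the kernel client; this is an illustration
only, and the monitor address, client name and secretfile are placeholders:

$ mount -t ceph mon1:6789:/ /mnt/cephfs -o name=clientx,secretfile=/etc/ceph/clientx.secret,recover_session=clean
# recover_session=clean lets a blocklisted client reconnect automatically,
# dropping dirty data and returning errors on stale handles instead of hanging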

On Thu, Dec 8, 2022 at 9:52 AM Charles Hedrick  wrote:
>
> I'm aware that the file system will remain available. My concern is about 
> long jobs using it failing because a single operation returns an error. While 
> none of the discussion so far has been explicit, I assume this can happen if 
> an OSD fails, since it might have done an async acknowledgement for an 
> operation that won't actually be possible to complete. I'm assuming that it 
> won't happen during a cephadm upgrade.
> 
> From: Manuel Holtgrewe 
> Sent: Thursday, December 8, 2022 12:38 PM
> To: Charles Hedrick 
> Cc: Gregory Farnum ; Dhairya Parmar ; 
> ceph-users@ceph.io 
> Subject: Re: [ceph-users] Re: what happens if a server crashes with cephfs?
>
> Hi Charles,
>
> are you concerned with a single Ceph cluster server crash or the whole server 
> crashing? If you have sufficient redundancy, nothing bad should happen but 
> the file system should remain available. The same should be true if you 
> perform an upgrade in the "correct" way, e.g., through the cephadm commands.
>
> The folks over at 45 drives made a little show of tearing down a ceph cluster 
> bit by bit while it is running:
>
> https://www.youtube.com/watch?v=8paAkGx2_OA
>
> Cheers,
> Manuel
>
> On Thu, Dec 8, 2022 at 6:34 PM Charles Hedrick  wrote:
>
> network and local file systems have different requirements. If I have a long 
> job and the machine I'm running on crashes, I have to rerun it. The fact that 
> the last 500 msec of data didn't get flushed to disk is unlikely to matter.
>
> If I have a long job using a network file system, and the server crashes, my 
> job itself doesn't crash. You really want it to continue after the server 
> reboots without any errors. It's true that you could return an error for 
> write or close, and the job could detect that and either rewrite the file or 
> exit. However a very large amount of code is written for local files, and 
> doesn't check errors for write and close.
>
> I don't actually know how our long jobs would behave if a close fails. 
> Perhaps it's OK. It's mostly python. Presumably the python interpreter would 
> throw an I/O error.
>
> A related question: what is likely to happen when you do a version upgrade? 
> Is that done in a way that won't generate errors in user code?
>
> 
> From: Gregory Farnum 
> Sent: Thursday, December 8, 2022 11:44 AM
> To: Manuel Holtgrewe 
> Cc: Charles Hedrick ; Dhairya Parmar 
> ; ceph-users@ceph.io 
> Subject: Re: [ceph-users] Re: what happens if a server crashes with cephfs?
>
> On Thu, Dec 8, 2022 at 8:42 AM Manuel Holtgrewe  wrote:
> >
> > Hi Charles,
> >
> > as far as I know, CephFS implements POSIX semantics. That is, if the CephFS 
> > server cluster dies for whatever reason then this will translate in I/O 
> > errors. This is the same as if your NFS server dies or you run the program 
> > locally on a workstation/laptop and the machine loses power. POSIX file 
> > systems guarantee that data is persisted on the storage after a file is 
> > closed
>
> Actually the "commit on close" is *entirely* an NFS-ism and is not
> part of posix. If you expect a closed file to be flushed to disk
> anywhere else (including CephFS), you will be disappointed. You need
> to use fsync/fdatasync/sync/syncfs.
> -Greg
>
> > or fsync() is called. Otherwise, the data may still be "in flight", e.g.,

[ceph-users] Re: what happens if a server crashes with cephfs?

2022-12-08 Thread Gregory Farnum
On Thu, Dec 8, 2022 at 8:42 AM Manuel Holtgrewe  wrote:
>
> Hi Charles,
>
> as far as I know, CephFS implements POSIX semantics. That is, if the CephFS 
> server cluster dies for whatever reason then this will translate in I/O 
> errors. This is the same as if your NFS server dies or you run the program 
> locally on a workstation/laptop and the machine loses power. POSIX file 
> systems guarantee that data is persisted on the storage after a file is closed

Actually the "commit on close" is *entirely* an NFS-ism and is not
part of posix. If you expect a closed file to be flushed to disk
anywhere else (including CephFS), you will be disappointed. You need
to use fsync/fdatasync/sync/syncfs.
-Greg
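
The same point from the shell side; a small sketch, assuming a kernel mount
at /mnt/cephfs:

$ dd if=/dev/zero of=/mnt/cephfs/out bs=1M count=16 conv=fsync  # dd fsyncs the file before exiting
$ sync -f /mnt/cephfs                                           # syncfs on the filesystem containing the path

Without an explicit fsync/fdatasync/syncfs, recently written data may still
be sitting in client caches.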

> or fsync() is called. Otherwise, the data may still be "in flight", e.g., in 
> the OS I/O cache or even the runtime library's cache.
>
> This is not a bug but a feature as this improves performance when appending 
> small bits to a file and the HDD head does not have to move every time 
> something is written and not a full 4kb block has to be written for SSD.
>
> Posix semantics even go further, enforcing certain guarantees if files are 
> written from multiple clients. Recently, something called "lazy I/O" has been 
> introduced [1] in CephFS which allows to explicitly relax certain of these 
> guarantees to improve performance.
>
> I don't think there even is a ceph mount setting that allows you to configure 
> local cache mechanisms as for NFS. For NFS, I have seen setups where two 
> clients saw two different versions of the same -- closed -- file because one 
> had written to the file and this was not yet reflected on the second client. 
> To the best of my knowledge, this will not happen with CephFS.
>
> I'd be happy to learn to be wrong if I'm wrong. ;-)
>
> Best wishes,
> Manuel
>
> [1] https://docs.ceph.com/en/latest/cephfs/lazyio/
>
> On Thu, Dec 8, 2022 at 5:09 PM Charles Hedrick  wrote:
>>
>> thanks. I'm evaluating cephfs for a computer science dept. We have users 
>> that run week-long AI training jobs. They use standard packages, which they 
>> probably don't want to modify. At the moment we use NFS. It uses synchronous 
>> I/O, so if somethings goes wrong, the users' jobs pause until we reboot, and 
>> then continue. However there's an obvious performance penalty for this.
>> 
>> From: Gregory Farnum 
>> Sent: Thursday, December 8, 2022 2:08 AM
>> To: Dhairya Parmar 
>> Cc: Charles Hedrick ; ceph-users@ceph.io 
>> 
>> Subject: Re: [ceph-users] Re: what happens if a server crashes with cephfs?
>>
>> More generally, as Manuel noted you can (and should!) make use of fsync et 
>> al for data safety. Ceph’s async operations are not any different at the 
>> application layer from how data you send to the hard drive can sit around in 
>> volatile caches until a consistency point like fsync is invoked.
>> -Greg
>>
>> On Wed, Dec 7, 2022 at 10:02 PM Dhairya Parmar  wrote:
>> Hi Charles,
>>
>> There are many scenarios where the write/close operation can fail but
>> generally
>> failures/errors are logged (normally every time) to help debug the case.
>> Therefore
>> there are no silent failures as such except you encountered  a very rare
>> bug.
>> - Dhairya
>>
>>
>> > On Wed, Dec 7, 2022 at 11:38 PM Charles Hedrick  wrote:
>>
>> > I believe asynchronous operations are used for some operations in cephfs.
>> > That means the server acknowledges before data has been written to stable
>> > storage. Does that mean there are failure scenarios when a write or close
>> > will return an error? fail silently?
>> >
>> > ___
>> > > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to 
>> > > ceph-users-le...@ceph.io
>> >
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to 
>> ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: what happens if a server crashes with cephfs?

2022-12-07 Thread Gregory Farnum
More generally, as Manuel noted you can (and should!) make use of fsync et
al for data safety. Ceph’s async operations are not any different at the
application layer from how data you send to the hard drive can sit around
in volatile caches until a consistency point like fsync is invoked.
-Greg

On Wed, Dec 7, 2022 at 10:02 PM Dhairya Parmar  wrote:

> Hi Charles,
>
> There are many scenarios where the write/close operation can fail but
> generally
> failures/errors are logged (normally every time) to help debug the case.
> Therefore
> there are no silent failures as such except you encountered  a very rare
> bug.
> - Dhairya
>
>
> On Wed, Dec 7, 2022 at 11:38 PM Charles Hedrick 
> wrote:
>
> > I believe asynchronous operations are used for some operations in cephfs.
> > That means the server acknowledges before data has been written to stable
> > storage. Does that mean there are failure scenarios when a write or close
> > will return an error? fail silently?
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Implications of pglog_hardlimit

2022-11-29 Thread Gregory Farnum
On Tue, Nov 29, 2022 at 1:18 PM Joshua Timmer 
wrote:

> I've got a cluster in a precarious state because several nodes have run
> out of memory due to extremely large pg logs on the osds. I came across
> the pglog_hardlimit flag which sounds like the solution to the issue,
> but I'm concerned that enabling it will immediately truncate the pg logs
> and possibly drop some information needed to recover the pgs. There are
> many in degraded and undersized states right now as nodes are down. Is
> it safe to enable the flag in this state? The cluster is running
> luminous 12.2.13 right now.


The hard limit will truncate the log, but all the data goes into the
backing bluestore/filestore instance at the same time. The pglogs are used
for two things:
1) detecting replayed client operations and sending the same answer back on
replays, so shorter logs means a shorter time window of detection but
shouldn’t be an issue;
2) enabling log-based recovery of pgs where OSDs with overlapping logs can
identify exactly which objects have been modified and only moving them.

So if you set the hard limit, it’s possible you’ll induce more backfill as
fewer logs overlap. But no data will be lost.
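
For the record, the knobs involved look roughly like this. This is a sketch
from memory: double-check the option names, and I believe all OSDs need to be
at least 12.2.11 before the flag can be set (see the release notes). The
values below are only examples:

    # cap the per-PG log length cluster-wide
    ceph osd set pglog_hardlimit

    # optionally shrink the log bounds on the running OSDs
    ceph tell osd.* injectargs '--osd_min_pg_log_entries 1500 --osd_max_pg_log_entries 3000'

Treat this as a pointer to the right option names rather than a recipe.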
-Greg


> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS performance

2022-11-22 Thread Gregory Farnum
In addition to not having resiliency by default, my recollection is
that BeeGFS also doesn't guarantee metadata durability in the event of
a crash or hardware failure like CephFS does. There's not really a way
for us to catch up to their "in-memory metadata IOPS" with our
"on-disk metadata IOPS". :(

If that kind of cached performance is your main concern, CephFS is
probably not going to make you happy.

That said, if you've been happy using CephFS with hard drives and
gigabit ethernet, it will be much faster if you store the metadata on
SSD and can increase the size of the MDS cache in memory. More
specific tuning options than that would depend on your workload.
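
If it helps, the two changes above look roughly like this (rule and pool names
are placeholders for whatever your cluster uses; test before applying in
production):

    # put the CephFS metadata pool on SSDs via a device-class crush rule
    ceph osd crush rule create-replicated meta-ssd default host ssd
    ceph osd pool set cephfs_metadata crush_rule meta-ssd

    # give the MDS a larger cache, e.g. 16 GiB
    ceph config set mds mds_cache_memory_limit 17179869184

Changing the crush rule will move all the metadata objects, so expect some
backfill while that happens.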
-Greg

On Tue, Nov 22, 2022 at 7:28 AM David C  wrote:
>
> My understanding is BeeGFS doesn't offer data redundancy by default,
> you have to configure mirroring. You've not said how your Ceph cluster
> is configured but my guess is you have the recommended 3x replication
> - I wouldn't be surprised if BeeGFS was much faster than Ceph in this
> case. I'd be interested to see your results after ensuring equivalent
> data redundancy between the platforms.
>
> On Thu, Oct 20, 2022 at 9:02 PM quag...@bol.com.br  wrote:
> >
> > Hello everyone,
> > I have some considerations and doubts to ask...
> >
> > I work at an HPC center and my doubts stem from performance in this 
> > environment. All clusters here were suffering from NFS performance problems 
> > and also from the single point of failure it has.
> >
> > At that time, we decided to evaluate some available SDS and the chosen 
> > one was Ceph (first for its resilience and later for its performance).
> > I deployed CephFS in a small cluster: 6 nodes and 1 HDD per machine 
> > with 1Gpbs connection.
> > The performance was as good as a large NFS we have on another cluster 
> > (spending much less). In addition, it was able to evaluate all the benefits 
> > of resiliency that Ceph offers (such as activating an OSD, MDS, MON or MGR 
> > server) and the objects/services to settle on other nodes. All this in a 
> > way that the user did not even notice.
> >
> > Given this information, a new storage cluster was acquired last year 
> > with 6 machines and 22 disks (HDDs) per machine. The need was for the 
> > amount of available GBs. The amount of IOPs was not so important at that 
> > time.
> >
> > Right at the beginning, I had a lot of work to optimize the performance 
> > in the cluster (the main deficiency was in the performance in the 
> > access/write of metadata). The problem was not at the job execution, but 
> > the user's perception of slowness when executing interactive commands (my 
> > perception was in the slowness of Ceph metadata).
> > There were a few months of high loads in which storage was the 
> > bottleneck of the environment.
> >
> > After a lot of research in documentation, I made several optimizations 
> > on the available parameters and currently CephFS is able to reach around 
> > 10k IOPS (using size=2).
> >
> > Anyway, my boss asked for other solutions to be evaluated to verify the 
> > performance issue.
> > First of all, it was suggested to put the metadata on SSD disks for a 
> > higher amount of IOPS.
> > In addition, a test environment was set up and the solution that made 
> > the most difference in performance was with BeeGFS.
> >
> > In some situations, BeeGFS is many times faster than Ceph in the same 
> > tests and under the same hardware conditions. This happens in both the 
> > throuput (BW) and IOPS.
> >
> > We tested it using io500 as follows:
> > 1-) An individual process
> > 2-) 8 processes (4 processes on 2 different machines)
> > 3-) 16 processes (8 processes on 2 different machines)
> >
> > I did tests configuring CephFS to use:
> > * HDD only (for both data and metadata)
> > * Metadata on SSD
> > * Using Linux FSCache features
> > * With some optimizations (increasing MDS memory, client memory, 
> > inflight parameters, etc)
> > * Cache tier with SSD
> >
> > Even so, the benchmark score was lower than the BeeGFS installed 
> > without any optimization. This difference becomes even more evident as the 
> > number of simultaneous accesses increases.
> >
> > The two best results of CephFS were using metadata on SSD and also 
> > doing TIER on SSD.
> >
> > Here is the result of Ceph's performance when compared to BeeGFS:
> >
> > Bandwidth Test (bw is in GB/s):
> >
> > ==
> > |fs|bw|process|
> > ==
> > |beegfs-metassd|0.078933|01|
> > |beegfs-metassd|0.051855|08|
> > |beegfs-metassd|0.039459|16|
> > ==
> > |cephmetassd|0.022489|01|
> > |cephmetassd

[ceph-users] Re: 16.2.11 branch

2022-10-31 Thread Gregory Farnum
On Fri, Oct 28, 2022 at 8:51 AM Laura Flores  wrote:
>
> Hi Christian,
>
> There also is https://tracker.ceph.com/versions/656 which seems to be
> > tracking
> > the open issues tagged for this particular point release.
> >
>
> Yes, thank you for providing the link.
>
> If you don't mind me asking Laura, have those issues regarding the
> > testing lab been resolved yet?
> >
>
> There are currently a lot of folks working to fix the testing lab issues.
> Essentially, disk corruption affected our ability to reach quay.ceph.io.
> We've made progress this morning, but we are still working to understand
> the root cause of the corruption. We expect to re-deploy affected services
> soon so we can resume testing for v16.2.11.

We got a note about this today, so I wanted to clarify:

For Reasons, the sepia lab we run teuthology in currently uses a Red
Hat Enterprise Virtualization stack — meaning, mostly KVM with a lot
of fancy orchestration all packaged up, backed by Gluster. (Yes,
really — a full Ceph integration was never built and at one point this
was deemed the most straightforward solution compared to running
all-up OpenStack backed by Ceph, which would have been the available
alternative.) The disk images stored in Gluster started reporting
corruption last week (though Gluster was claiming to be healthy), and
with David's departure and his backup on vacation it took a while for
the remaining team members to figure out what was going on and
identify strategies to resolve or work around it.

The relevant people have figured out a lot more of what was going on,
and Adam (David's backup) is back now so we're expecting things to
resolve more quickly at this point. And indeed the team's looking at
other options for providing this infrastructure going forward. :)
-Greg

>
> You can follow updates on the two Tracker issues below:
>
>1. https://tracker.ceph.com/issues/57914
>2. https://tracker.ceph.com/issues/57935
>
>
> > There are quite a few bugfixes in the pending release 16.2.11 which we
> > are waiting for. TBH I was about
> > to ask if it would not be sensible to do an intermediate release and not
> > let it grow bigger and
> > bigger (with even more changes / fixes)  going out at once.
> >
>
> Fixes for v16.2.11 are pretty much paused at this point; the bottleneck
> lies in getting some outstanding patches tested before they are backported.
> Whether we stop now or continue to introduce more patches, the timeframe
> for getting things tested remains the same.
>
> I hope this clears up some of the questions.
>
> Thanks,
> Laura Flores
>
>
> On Fri, Oct 28, 2022 at 9:41 AM Christian Rohmann <
> christian.rohm...@inovex.de> wrote:
>
> > On 28/10/2022 00:25, Laura Flores wrote:
> > > Hi Oleksiy,
> > >
> > > The Pacific RC has not been declared yet since there have been problems
> > in
> > > our upstream testing lab. There is no ETA yet for v16.2.11 for that
> > reason,
> > > but the full diff of all the patches that were included will be published
> > > to ceph.io when v16.2.11 is released. There will also be a diff
> > published
> > > in the documentation on this page:
> > > https://docs.ceph.com/en/latest/releases/pacific/
> > >
> > > In the meantime, here is a link to the diff in commits between v16.2.10
> > and
> > > the Pacific branch:
> > https://github.com/ceph/ceph/compare/v16.2.10...pacific
> >
> > There also is https://tracker.ceph.com/versions/656 which seems to be
> > tracking
> > the open issues tagged for this particular point release.
> >
> >
> > If you don't mind me asking Laura, have those issues regarding the
> > testing lab been resolved yet?
> >
> > There are quite a few bugfixes in the pending release 16.2.11 which we
> > are waiting for. TBH I was about
> > to ask if it would not be sensible to do an intermediate release and not
> > let it grow bigger and
> > bigger (with even more changes / fixes)  going out at once.
> >
> >
> >
> > Regards
> >
> >
> > Christian
> >
> >
>
> --
>
> Laura Flores
>
> She/Her/Hers
>
> Software Engineer, Ceph Storage
>
> Red Hat Inc. 
>
> Chicago, IL
>
> lflo...@redhat.com
> M: +17087388804
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow monitor responses for rbd ls etc.

2022-10-18 Thread Gregory Farnum
On Fri, Oct 7, 2022 at 7:53 AM Sven Barczyk  wrote:
>
> Hello,
>
>
>
> we are encountering a strange behavior on our Ceph. (All Ubuntu 20 / All
> mons Quincy 17.2.4 / Oldest OSD Quincy 17.2.0 )
> Administrative commands like rbd ls or create are so slow, that libvirtd is
> running into timeouts and creating new VMs on our Cloudstack, on behalf of
> creating new volumes on our pool, takes up to 10 mins.
>
> Already running VMs are not affected or showing slow responses on their
> Filesystem.
> It is really only limited if services needs to interact with rbd commands.
>
>
> Has anyone encountered some behavoir like this?

What leads you to believe it's the monitor being slow, rather than the
OSDs storing RBD data?
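
One quick way to narrow that down (a rough sketch; substitute your pool name):
compare a command the monitors answer by themselves against one that also has
to read omap data from the OSDs.

    time ceph osd pool ls detail                            # monitor-only
    time rbd ls --pool <your-pool>                          # mon plus rbd_directory omap on the OSDs
    time rados -p <your-pool> listomapvals rbd_directory

If only the rbd/rados commands are slow, the problem is more likely the OSDs
holding the RBD metadata objects than the monitors.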
-Greg

>
>
> Regards
> Sven
>
>
>
> --
>
> BRINGE Informationstechnik GmbH
>
> Zur Seeplatte 12
>
> D-76228 Karlsruhe
>
> Germany
>
>
>
> Fon: +49 721 94246-0
>
> Fax: +49 721 94246-66
>
> Web:   http://www.bringe.de/
>
>
>
> Geschäftsführer: Dipl.-Ing. (FH) Martin Bringe
>
> Ust.Id: DE812936645, HRB 108943 Mannheim
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: disable stretch_mode possible?

2022-10-17 Thread Gregory Farnum
On Mon, Oct 17, 2022 at 4:40 AM Enrico Bocchi  wrote:
>
> Hi,
>
> I have played with stretch clusters a bit but never managed to
> un-stretch them fully.
>
>  From my experience (using Pacific 16.2.9), once the stretch mode is
> enabled, the replicated pools switch to the stretch_rule with size 4,
> min_size 2, and require at least one replica in each datacenter.
> If one of the two DCs fails, you will still be able to change the crush
> rule of your pools back to the traditional replicated rule, and set size
> 3, min_size 2. After recovery, your objects won't be degraded anymore
> and stored 3 replicas in the surviving DC, but the cluster would still
> complain about one DC down (with related osds and mons).
>
> It does not seem to exist a 'ceph mon disable_stretch_mode' and the
> monmaptool does not implement actions to disable stretch mode and remove
> the tiebreaker mon.
>
> I'd be curious to know if it is possible to disable stretch_mode
> completely too.

Unfortunately, turning it off never got implemented.
If somebody's interested, I don't think this would be too hard to
implement; the issue is mostly the QA around the feature. :)
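
For completeness, the partial workaround Enrico describes is just repointing
the pools at a normal rule (pool, rule and size values are examples); the
cluster itself still reports stretch mode as enabled:

    ceph osd pool set <pool> crush_rule replicated_rule
    ceph osd pool set <pool> size 3
    ceph osd pool set <pool> min_size 2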
-Greg

>
> Cheers,
> Enrico
>
>
> On 10/17/22 13:09, Eugen Block wrote:
> > Hi,
> >
> > I didn't have the time to play with it yet, but couldn't you just
> > assign a different ruleset to the pool(s)? Or does ceph complain and
> > prevent that? I'm not sure if stretch mode will be still active after
> > changing the crush rule. But I'm not aware that there's a command to
> > revert this:
> >
> > $ ceph mon enable_stretch_mode e stretch_rule datacenter
> >
> > But it does use the "stretch_rule" defined, so maybe additionally
> > delete that rule? Or start from scratch. ;-)
> >
> > Regards,
> > Eugen
> >
> >
> > Zitat von Manuel Lausch :
> >
> >> Hi,
> >>
> >>
> >> I am playing around with ceph_stretch_mode and now I have one question:
> >>
> >> Is it possible to disable stretchmode again? I didn't found anything
> >> about this in the documentation.
> >>
> >>
> >> -> Ceph pacific 16.2.9
> >>
> >>
> >>
> >> Manuel
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Enrico Bocchi
> CERN European Laboratory for Particle Physics
> IT - Storage & Data Management  - General Storage Services
> Mailbox: G20500 - Office: 31-2-010
> 1211 Genève 23
> Switzerland
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CLT meeting summary 2022-09-28

2022-09-28 Thread Gregory Farnum
On Wed, Sep 28, 2022 at 9:15 AM Adam King  wrote:

> Budget Discussion
>
>- Going to investigate current resources being used, see if any costs
>can be cut
>- What can be moved from virtual environments to internal ones?
>- Need to take inventory of what resources we currently have and what
>their costs are


To be clear, this is specifically about our cloud spend for build and test
resources. :)
-Greg


>
> 17.2.4
>
>- Gibba and LRC cluster have been upgraded
>- Release notes PR open https://github.com/ceph/ceph/pull/48072
>- Release date not yet finalized, but not far off. Next week is likely.
>- (technically from after CLT call) one telemetry regression found in RC
>build (https://tracker.ceph.com/issues/57700). Patch should be small
> and
>low risk. Delay should be minimal.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Gregory Farnum
Recovery from OSDs loses the mds and rgw keys they use to authenticate with
cephx. You need to get those set up again by using the auth commands. I
don’t have them handy but it is discussed in the mailing list archives.
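
Roughly, they look like this. This is a sketch only, with placeholder daemon
names and the keyring paths our packages usually use; check the exact caps in
the docs for your release:

    # recreate the MDS key
    ceph auth get-or-create mds.<name> mon 'allow profile mds' mds 'allow *' osd 'allow rwx' \
        -o /var/lib/ceph/mds/ceph-<name>/keyring

    # recreate the RGW key
    ceph auth get-or-create client.rgw.<name> mon 'allow rw' osd 'allow rwx' \
        -o /var/lib/ceph/radosgw/ceph-rgw.<name>/keyring

Restart the daemons afterwards so they pick up the new keyrings.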
-Greg

On Thu, Sep 15, 2022 at 3:28 PM Jorge Garcia  wrote:

> Yes, I tried restarting them and even rebooting the mds machine. No joy.
> If I try to start ceph-mds by hand, it returns:
>
> 2022-09-15 15:21:39.848 7fc43dbd2700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [2]
> failed to fetch mon config (--no-mon-config to skip)
>
> I found this information online, maybe something to try next:
>
> https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/
>
> But I think maybe the mds needs to be running before that?
>
> On 9/15/22 15:19, Wesley Dillingham wrote:
> > Having the quorum / monitors back up may change the MDS and RGW's
> > ability to start and stay running. Have you tried just restarting the
> > MDS / RGW daemons again?
> >
> > Respectfully,
> >
> > *Wes Dillingham*
> > w...@wesdillingham.com
> > LinkedIn 
> >
> >
> > On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia 
> wrote:
> >
> > OK, I'll try to give more details as I remember them.
> >
> > 1. There was a power outage and then power came back up.
> >
> > 2. When the systems came back up, I did a "ceph -s" and it never
> > returned. Further investigation revealed that the ceph-mon
> > processes had
> > not started in any of the 3 monitors. I looked at the log files
> > and it
> > said something about:
> >
> > ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
> > found 30790637387776 in
> > /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
> >
> > Looking at the internet, I found some suggestions about
> > troubleshooting
> > monitors in:
> >
> >
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
> >
> > I quickly determined that the monitors weren't running, so I found
> > the
> > section where it said "RECOVERY USING OSDS". The description made
> > sense:
> >
> > "But what if all monitors fail at the same time? Since users are
> > encouraged to deploy at least three (and preferably five) monitors
> > in a
> > Ceph cluster, the chance of simultaneous failure is rare. But
> > unplanned
> > power-downs in a data center with improperly configured disk/fs
> > settings
> > could fail the underlying file system, and hence kill all the
> > monitors.
> > In this case, we can recover the monitor store with the information
> > stored in OSDs."
> >
> > So, I did the procedure described in that section, and then made sure
> > the correct keys were in the keyring and restarted the processes.
> >
> > WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL
> > MESSAGE, AND
> > NOW THE MONITORS ARE BACK! I must have missed some step in the
> > middle of
> > my panic.
> >
> > # ceph -s
> >
> >cluster:
> >  id: ----
> >  health: HEALTH_WARN
> >  mons are allowing insecure global_id reclaim
> >
> >services:
> >  mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
> >  mgr: host-b(active, since 19m), standbys: host-a, host-c
> >  osd: 164 osds: 164 up (since 16m), 164 in (since 8h)
> >
> >data:
> >  pools:   14 pools, 2992 pgs
> >  objects: 91.58M objects, 290 TiB
> >  usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
> >  pgs: 2985 active+clean
> >   7active+clean+scrubbing+deep
> >
> > Couple of missing or strange things:
> >
> > 1. Missing mds
> > 2. Missing rgw
> > 3. New warning showing up
> >
> > But overall, better than a couple hours ago. If anybody is still
> > reading
> > and has any suggestions about how to solve the 3 items above, that
> > would
> > be great! Otherwise, back to scanning the internet for ideas...
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Gregory Farnum
I really don't think those options will impact anything; what's likely
going on is that because things are dirty, they need to keep a long
history around to do their peering. But I haven't had to deal with
that in a while, so maybe I'm missing something.
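
On the last question, a couple of read-only ways to see the retained range
(a sketch; field names from memory, so verify against your output):

    ceph report | grep -E 'osdmap_(first|last)_committed'   # the monitors' view
    ceph daemon osd.<id> status                              # oldest_map / newest_map on one OSD

The difference between the two numbers is roughly how many epochs are still
being carried.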

On Wed, Sep 7, 2022 at 8:36 AM Wyll Ingersoll
 wrote:
>
>
> Can we tweak the osdmap pruning parameters to be more aggressive about 
> trimming those osdmaps?   Would that reduce data on the OSDs or only on the 
> MON DB?
> Looking at mon_min_osdmap_epochs (500) and mon_osdmap_full_prune_min (1).
>
> Is there a way to find out how many osdmaps are currently being kept?
> ____
> From: Gregory Farnum 
> Sent: Wednesday, September 7, 2022 10:58 AM
> To: Wyll Ingersoll 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] data usage growing despite data being written
>
> On Wed, Sep 7, 2022 at 7:38 AM Wyll Ingersoll
>  wrote:
> >
> > I'm sure we probably have but I'm not sure what else to do.  We are 
> > desperate to get data off of these 99%+ OSDs and the cluster by itself 
> > isn't doing it.
> >
> > The crushmap appears ok.  we have replicated pools and a large EC pool, all 
> > are using host-based failure domains.  The new osds on the newly added 
> > hosts are slowly filling, just not as much as we expected.
> >
> > We have far too many osds at 99%+ and they continue to fill up.  How do we 
> > remove the excess OSDMap data, is it even possible?
> >
> > If we shouldn't be migrating PGs and we cannot remove data, what are our 
> > options to get it to balance again and stop filling up with OSDMaps and 
> > other internal ceph data?
>
> Well, you can turn things off, figure out the proper mapping, and use
> the ceph-objectstore-tool to migrate PGs to their proper destinations
> (letting the cluster clean up the excess copies if you can afford to —
> deleting things is always scary).
> But I haven't had to help recover a death-looping cluster in around a
> decade, so that's about all the options I can offer up.
> -Greg
>
>
> >
> > thanks!
> >
> >
> >
> > 
> > From: Gregory Farnum 
> > Sent: Wednesday, September 7, 2022 10:01 AM
> > To: Wyll Ingersoll 
> > Cc: ceph-users@ceph.io 
> > Subject: Re: [ceph-users] data usage growing despite data being written
> >
> > On Tue, Sep 6, 2022 at 2:08 PM Wyll Ingersoll
> >  wrote:
> > >
> > >
> > > Our cluster has not had any data written to it externally in several 
> > > weeks, but yet the overall data usage has been growing.
> > > Is this due to heavy recovery activity?  If so, what can be done (if 
> > > anything) to reduce the data generated during recovery.
> > >
> > > We've been trying to move PGs away from high-usage OSDS (many over 99%), 
> > > but it's like playing whack-a-mole, the cluster keeps sending new data to 
> > > already overly full osds making further recovery nearly impossible.
> >
> > I may be missing something, but I think you've really slowed things
> > down by continually migrating PGs around while the cluster is already
> > unhealthy. It forces a lot of new OSDMap generation and general churn
> > (which itself slows down data movement.)
> >
> > I'd also examine your crush map carefully, since it sounded like you'd
> > added some new hosts and they weren't getting the data you expected
> > them to. Perhaps there's some kind of imbalance (eg, they aren't in
> > racks, and selecting those is part of your crush rule?).
> > -Greg
> >
> > >
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> >
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Gregory Farnum
On Wed, Sep 7, 2022 at 7:38 AM Wyll Ingersoll
 wrote:
>
> I'm sure we probably have but I'm not sure what else to do.  We are desperate 
> to get data off of these 99%+ OSDs and the cluster by itself isn't doing it.
>
> The crushmap appears ok.  we have replicated pools and a large EC pool, all 
> are using host-based failure domains.  The new osds on the newly added hosts 
> are slowly filling, just not as much as we expected.
>
> We have far too many osds at 99%+ and they continue to fill up.  How do we 
> remove the excess OSDMap data, is it even possible?
>
> If we shouldn't be migrating PGs and we cannot remove data, what are our 
> options to get it to balance again and stop filling up with OSDMaps and other 
> internal ceph data?

Well, you can turn things off, figure out the proper mapping, and use
the ceph-objectstore-tool to migrate PGs to their proper destinations
(letting the cluster clean up the excess copies if you can afford to —
deleting things is always scary).
But I haven't had to help recover a death-looping cluster in around a
decade, so that's about all the options I can offer up.
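
A very rough sketch of that path, with placeholder IDs and paths; the OSDs
involved must be stopped, and you want backups before touching anything:

    systemctl stop ceph-osd@<src-id>
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<src-id> \
        --op export --pgid <pgid> --file /root/<pgid>.export

    systemctl stop ceph-osd@<dst-id>
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<dst-id> \
        --op import --file /root/<pgid>.export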
-Greg


>
> thanks!
>
>
>
> 
> From: Gregory Farnum 
> Sent: Wednesday, September 7, 2022 10:01 AM
> To: Wyll Ingersoll 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] data usage growing despite data being written
>
> On Tue, Sep 6, 2022 at 2:08 PM Wyll Ingersoll
>  wrote:
> >
> >
> > Our cluster has not had any data written to it externally in several weeks, 
> > but yet the overall data usage has been growing.
> > Is this due to heavy recovery activity?  If so, what can be done (if 
> > anything) to reduce the data generated during recovery.
> >
> > We've been trying to move PGs away from high-usage OSDS (many over 99%), 
> > but it's like playing whack-a-mole, the cluster keeps sending new data to 
> > already overly full osds making further recovery nearly impossible.
>
> I may be missing something, but I think you've really slowed things
> down by continually migrating PGs around while the cluster is already
> unhealthy. It forces a lot of new OSDMap generation and general churn
> (which itself slows down data movement.)
>
> I'd also examine your crush map carefully, since it sounded like you'd
> added some new hosts and they weren't getting the data you expected
> them to. Perhaps there's some kind of imbalance (eg, they aren't in
> racks, and selecting those is part of your crush rule?).
> -Greg
>
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs blocklist recovery and recover_session mount option

2022-09-07 Thread Gregory Farnum
On Tue, Aug 16, 2022 at 3:14 PM Vladimir Brik
 wrote:
>
> Hello
>
> I'd like to understand what is the proper/safe way to
> recover when a cephfs client becomes blocklisted by the MDS.
>
> The man page of mount.ceph talks about recover_session=clean
> option, but it has the following text I am not sure I am
> interpreting correctly:
>
> "After reconnect, file locks become stale because the MDS
> loses track of them. If an inode contains any stale file
> locks, read/write on the inode is not allowed until
> applications release all stale file locks."
>
> Does "application" in this context refer to the cephfs
> kernel client or a user space process?
>

Hmm, I haven't thought about this before. But the application asks for
locks, and those locks need to get released. The kernel client can't
do it on its own.

> Does this mean that if the application terminates while the
> client is blocklisted without releasing locks, the locked
> inodes (and the files they belong to) become inaccessible
> forever?
>
> Is reboot the only safe way to deal with blocklisting,
> assuming applications were writing to files when it happened?

Basically, yes. When you're disconnected, changes can be made by other
clients without regard to any buffered changes that the disconnected
client may have made. And then if a reconnect is forced, the
previously-disconnected node just writes out whatever it has, which
may roll back data from other clients that were alive the whole time.
There are ways to narrow that window that we incrementally work on,
but that fundamental conflict will always be there.
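
For reference, the moving parts look like this (a sketch; "blocklist" is
spelled "blacklist" on pre-Pacific releases, and the mount line is only an
example):

    ceph osd blocklist ls                   # see which client addresses are currently blocked
    ceph osd blocklist rm <addr:port/nonce>

    mount -t ceph <mon-host>:/ /mnt/cephfs -o name=<user>,recover_session=clean

But as above, recover_session=clean only automates the remount; it does not
remove the consistency caveats.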
-Greg


>
>
> Thanks very much
>
> Vlad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: data usage growing despite data being written

2022-09-07 Thread Gregory Farnum
On Tue, Sep 6, 2022 at 2:08 PM Wyll Ingersoll
 wrote:
>
>
> Our cluster has not had any data written to it externally in several weeks, 
> but yet the overall data usage has been growing.
> Is this due to heavy recovery activity?  If so, what can be done (if 
> anything) to reduce the data generated during recovery.
>
> We've been trying to move PGs away from high-usage OSDS (many over 99%), but 
> it's like playing whack-a-mole, the cluster keeps sending new data to already 
> overly full osds making further recovery nearly impossible.

I may be missing something, but I think you've really slowed things
down by continually migrating PGs around while the cluster is already
unhealthy. It forces a lot of new OSDMap generation and general churn
(which itself slows down data movement.)

I'd also examine your crush map carefully, since it sounded like you'd
added some new hosts and they weren't getting the data you expected
them to. Perhaps there's some kind of imbalance (eg, they aren't in
racks, and selecting those is part of your crush rule?).
-Greg

>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Help] Does MSGR2 protocol use openssl for encryption

2022-09-02 Thread Gregory Farnum
We partly rolled our own with AES-GCM. See
https://docs.ceph.com/en/quincy/rados/configuration/msgr2/#connection-modes
and https://docs.ceph.com/en/quincy/dev/msgr2/#frame-format
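
If the follow-on question is how to require the AES-GCM ("secure") mode, the
relevant options are roughly these (a sketch; see the msgr2 page above for the
full semantics before changing a live cluster):

    ceph config set global ms_cluster_mode secure
    ceph config set global ms_service_mode secure
    ceph config set global ms_client_mode secure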
-Greg

On Wed, Aug 24, 2022 at 4:50 PM Jinhao Hu  wrote:
>
> Hi,
>
> I have a question about the MSGR protocol Ceph used for in-transit
> encryption. Does it use openssl for encryption? If not, what tools does it
> use to encrypt the data? or Ceph implemented its own encryption method?
>
> Thanks,
> Jinhao
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS MDS sizing

2022-09-02 Thread Gregory Farnum
On Sun, Aug 28, 2022 at 12:19 PM Vladimir Brik
 wrote:
>
> Hello
>
> Is there a way to query or get an approximate value of an
> MDS's cache hit ratio without using "dump loads" command
> (which seems to be a relatively expensive operation) for
> monitoring and such?
Unfortunately, I'm not seeing one. What problem are you actually
trying to solve with that information? The expensive part of that
command should be dumping all the directories in cache, so a new admin
command to dump just the cache hit rate and related statistics should
be a pretty simple feature — PRs welcome! ;)
-Greg

>
>
> Vlad
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Changing the cluster network range

2022-09-02 Thread Gregory Farnum
On Mon, Aug 29, 2022 at 12:49 AM Burkhard Linke
 wrote:
>
> Hi,
>
>
> some years ago we changed our setup from a IPoIB cluster network to a
> single network setup, which is a similar operation.
>
>
> The OSD use the cluster network for heartbeats and backfilling
> operation; both use standard tcp connection. There is no "global view"
> on the networks involved; OSDs announce their public and private network
> (if present) via an update to the OSD map on OSD boot. OSDs expect to be
> able to create TCP connections to the announced IP addresses and ports.
> Mon and mgr instances do not use the cluster network at all.
>
> If you want to change the networks (either public or private), you need
> to ensure that during the migration TCP connectivity between the old
> networks and the new networks is possible, e.g. via a route on some
> router. Since we had an isolated IPoIB networks without any connections
> to some router, we used one of the ceph hosts as router. Worked fine for
> a migration in live production ;-)

To be a little more explicit about this: Ceph stores the IP addresses
of live OSDs and MDSes in their respective cluster maps, but otherwise
does not care about them at all — they are updated to the daemon's
current IP on every boot. The monitor IP addresses are fixed
identities, so moving them requires either adding new monitors and
removing old ones, or else doing surgery on their databases to change
the IPs by editing the monmap they store (and then updating the local
config for clients and OSDs so they point to the new locations and can
find them on bootup.)
But for the OSDs etc, you're really just worried about the local
configs or your deployment tool. (And, depending on how you arrange
things, you may need to take care to avoid the OSDs moving into new
CRUSH buckets and migrating all their data.)
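
For anyone who does need to move monitor IPs "the messy way", the sequence is
roughly this, per monitor (a sketch; stop the mon and back up its store first,
and substitute your own IDs and addresses):

    ceph-mon -i <mon-id> --extract-monmap /tmp/monmap
    monmaptool --print /tmp/monmap
    monmaptool --rm <mon-id> /tmp/monmap
    monmaptool --add <mon-id> <new-ip>:6789 /tmp/monmap
    ceph-mon -i <mon-id> --inject-monmap /tmp/monmap

Then update ceph.conf / mon_host everywhere so clients and OSDs can find the
new addresses.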
-Greg

>
> Regarding the network size: I'm not sure whether the code requires an
> exact CIDR match for the interface. If in doubt, have a look at the
> source code
>
> As already mentioned in another answer, most setups do not require an
> extra cluster network. It is extra effort both in setup, maintenance and
> operating. Unless your network is the bottleneck you might want to use
> this pending configuration change to switch to a single network setup.
>
>
> Regards,
>
> Burkhard Linke
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Potential bug in cephfs-data-scan?

2022-08-25 Thread Gregory Farnum
On Fri, Aug 19, 2022 at 7:17 AM Patrick Donnelly  wrote:
>
> On Fri, Aug 19, 2022 at 5:02 AM Jesper Lykkegaard Karlsen
>  wrote:
> >
> > Hi,
> >
> > I have recently been scanning the files in a PG with "cephfs-data-scan 
> > pg_files ...".
>
> Why?
>
> > Although, after a long time the scan was still running and the list of 
> > files consumed 44 GB, I stopped it, as something obviously was very wrong.
> >
> > It turns out some users had symlinks that looped and even a user had a 
> > symlink to "/".
>
> Symlinks are not stored in the data pool. This should be irrelevant.

pg_files is the version that tells you what existing files may have
holes in them from lost data. It does this by walking through the tree
and does depend on the MDS.

So yeah, this is a bug. It shouldn't be hard to fix for anybody who
spends a bit of time looking at the code, so feel free to do that and
file a PR, or generate a tracker! :)
-Greg

>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS perforamnce degradation in root directory

2022-08-15 Thread Gregory Farnum
I was wondering if it had something to do with quota enforcement. The other
possibility that occurs to me is if other clients are monitoring the
system, or an admin pane (eg the dashboard) is displaying per-volume or
per-client stats, they may be poking at the mountpoint and interrupting
exclusive client caps?
-Greg

On Mon, Aug 15, 2022 at 8:19 PM Xiubo Li  wrote:

>
> On 8/9/22 4:07 PM, Robert Sander wrote:
> > Hi,
> >
> > we have a cluster with 7 nodes each with 10 SSD OSDs providing CephFS
> > to a CloudStack system as primary storage.
> >
> > When copying a large file into the root directory of the CephFS the
> > bandwidth drops from 500MB/s to 50MB/s after around 30 seconds. We see
> > some MDS activity in the output of "ceph fs status" at the same time.
> >
> > When copying the same file to a subdirectory of the CephFS the
> > performance stays at 500MB/s for the whole time. MDS activity does not
> > seems to influence the performance here.
> >
> > There are appr 270 other files in the root directory. CloudStack
> > stores VM images in qcow2 format there.
> >
> > Is this a known issue?
> > Is there something special with the root directory of a CephFS wrt
> > write performance?
>
> AFAIK there is nothing special about the root dir. From my local test there is
> no difference with a subdir.
>
> BTW, could you test it more than once for the root dir? When you
> are doing this for the first time, Ceph may need to allocate the disk
> space, which will take a little time.
>
> Thanks.
>
> >
> > Regards
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: linux distro requirements for reef

2022-08-10 Thread Gregory Farnum
The immediate driver is both a switch to newer versions of python, and to
newer compilers supporting more C++20 features.

More generally, supporting multiple versions of a distribution is a lot of
work and when Reef comes out next year, CentOS9 will be over a year old. We
generally move new stable releases to the newest long-term release of any
distro we package for. That means CentOS9 for Reef.

We aren’t dropping any distros for Quincy, of course, which is our current
stable release.
-Greg

On Wed, Aug 10, 2022 at 10:27 AM Konstantin Shalygin  wrote:

> Ken, can you please describe what incompatibilities or dependencies are
> causing to not build packages for c8s? It's not obvious from the first
> message, from community side 🙂
>
>
> Thanks,
> k
>
> Sent from my iPhone
>
> > On 10 Aug 2022, at 20:02, Ken Dreyer  wrote:
> >
> > On Wed, Aug 10, 2022 at 11:35 AM Konstantin Shalygin 
> wrote:
> >>
> >> Hi Ken,
> >>
> >> CentOS 8 Stream will continue to receive packages or have some barrires
> for R?
> >
> > No, starting with Reef, we will no longer build nor ship RPMs for
> > CentOS 8 Stream (and debs for Ubuntu Focal) from download.ceph.com.
> > The only CentOS Stream version for Reef+ will be CentOS 9 Stream.
> >
> > - Ken
> >
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from Octopus to Pacific cannot get monitor to join

2022-07-28 Thread Gregory Farnum
On Wed, Jul 27, 2022 at 4:54 PM  wrote:
>
> Currently, all of the nodes are running in docker. The only way to upgrade is 
> to redeploy with docker (ceph orch daemon redeploy), which is essentially 
> making a new monitor. Am I missing something?

Apparently. I don't have any experience with Docker, and unfortunately
very little with containers in general, so I'm not sure what process
you need to follow, though. cephadm certainly manages to do it
properly — you want to maintain the existing disk store.

How do you do it for OSDs? Surely you don't throw away an old
OSD, create a new one, and wait for migration to complete before doing
the next...
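
In other words, the in-place path is just the documented cephadm upgrade,
which swaps each daemon's container while keeping its on-disk state; something
along the lines of:

    ceph orch upgrade start --ceph-version 16.2.9
    ceph orch upgrade status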
-Greg

>
> Is there some prep work I could/should be doing?
>
> I want to do a staggered upgrade as noted here 
> (https://docs.ceph.com/en/pacific/cephadm/upgrade/). That says for a 
> staggered upgrade the order is mgr -> mon, etc. But that was not working for 
> me because it said the --daemon-types was not supported.
>
> Basically I'm confused on what is the 'proper' way to upgrade then. There 
> isn't any way that I see to upgrade the 'code' they are running because it's 
> all in docker containers. But maybe I'm missing something obvious
>
> Thanks
>
>
>
>
> July 27, 2022 4:34 PM, "Gregory Farnum"  wrote:
>
> On Wed, Jul 27, 2022 at 10:24 AM  wrote:
>
> Currently running Octopus 15.2.16, trying to upgrade to Pacific using cephadm.
>
> 3 mon nodes running 15.2.16
> 2 mgr nodes running 16.2.9
> 15 OSD's running 15.2.16
>
> The mon/mgr nodes are running in lxc containers on Ubuntu running docker from 
> the docker repo (not the Ubuntu repo). Using cephadm to remove one of the 
> monitor nodes, and then re-add it back with a 16.2.9 version. The monitor 
> node runs but never joins the cluster. Also, this causes the other 2 mon 
> nodes to start flapping. Also tried adding 2 mon nodes (for a total of 5 
> mons) on bare metal running Ubuntu (with docker running from the docker repo) 
> and the mon's won't join and won't even show up in 'ceph status'
>
> The way you’re phrasing this it sounds like you’re removing existing monitors 
> and adding newly-created ones. That won’t work across major version 
> boundaries like this (at least, without a bit of prep work you aren’t doing) 
> because of how monitors bootstrap themselves and their cluster membership. 
> You need to upgrade the code running on the existing monitors instead, which 
> is the documented upgrade process AFAIK.
> -Greg
>
>
>
> Can't find anything in the logs regarding why it's failing. The docker 
> container starts and seems to try to join the cluster but just sits and 
> doesn't join. The other two start flapping and then eventually I have to stop 
> the new mon. I can add the monitor back by changing the container_image to 
> 15.2.16 and it will re-join the cluster as expected.
>
> The cluster was previously running nautilus installed using ceph-deploy
>
> Tried setting 'mon_mds_skip_sanity true' from reading another post but it 
> doesn't appear to help.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Stretch Cluster - df pool size (Max Avail)

2022-07-28 Thread Gregory Farnum
https://tracker.ceph.com/issues/56650

There's a PR in progress to resolve this issue now. (Thanks, Prashant!)
-Greg

On Thu, Jul 28, 2022 at 7:52 AM Nicolas FONTAINE  wrote:
>
> Hello,
>
> We have exactly the same problem. Did you find an answer or should we
> open a bug report?
>
> Sincerely,
>
> Nicolas.
>
> Le 23/06/2022 à 11:42, Kilian Ries a écrit :
> > Hi Joachim,
> >
> >
> > yes i assigned the stretch rule to the pool (4x replica / 2x min). The rule 
> > says that two replicas should be in every datacenter.
> >
> >
> > $ ceph osd tree
> > ID   CLASS  WEIGHTTYPE NAME   STATUS  REWEIGHT  PRI-AFF
> >   -1 62.87799  root default
> > -17 31.43900  datacenter site1
> > -15 31.43900  rack b7
> >   -3 10.48000  host host01
> >0ssd   1.74699  osd.0   up   1.0  1.0
> >1ssd   1.74699  osd.1   up   1.0  1.0
> >2ssd   1.74699  osd.2   up   1.0  1.0
> >3ssd   1.74699  osd.3   up   1.0  1.0
> >4ssd   1.74699  osd.4   up   1.0  1.0
> >5ssd   1.74699  osd.5   up   1.0  1.0
> >   -5 10.48000  host host02
> >6ssd   1.74699  osd.6   up   1.0  1.0
> >7ssd   1.74699  osd.7   up   1.0  1.0
> >8ssd   1.74699  osd.8   up   1.0  1.0
> >9ssd   1.74699  osd.9   up   1.0  1.0
> >   10ssd   1.74699  osd.10  up   1.0  1.0
> >   11ssd   1.74699  osd.11  up   1.0  1.0
> >   -7 10.48000  host host03
> >   12ssd   1.74699  osd.12  up   1.0  1.0
> >   13ssd   1.74699  osd.13  up   1.0  1.0
> >   14ssd   1.74699  osd.14  up   1.0  1.0
> >   15ssd   1.74699  osd.15  up   1.0  1.0
> >   16ssd   1.74699  osd.16  up   1.0  1.0
> >   17ssd   1.74699  osd.17  up   1.0  1.0
> > -18 31.43900  datacenter site2
> > -16 31.43900  rack h2
> >   -9 10.48000  host host04
> >   18ssd   1.74699  osd.18  up   1.0  1.0
> >   19ssd   1.74699  osd.19  up   1.0  1.0
> >   20ssd   1.74699  osd.20  up   1.0  1.0
> >   21ssd   1.74699  osd.21  up   1.0  1.0
> >   22ssd   1.74699  osd.22  up   1.0  1.0
> >   23ssd   1.74699  osd.23  up   1.0  1.0
> > -11 10.48000  host host05
> >   24ssd   1.74699  osd.24  up   1.0  1.0
> >   25ssd   1.74699  osd.25  up   1.0  1.0
> >   26ssd   1.74699  osd.26  up   1.0  1.0
> >   27ssd   1.74699  osd.27  up   1.0  1.0
> >   28ssd   1.74699  osd.28  up   1.0  1.0
> >   29ssd   1.74699  osd.29  up   1.0  1.0
> > -13 10.48000  host host06
> >   30ssd   1.74699  osd.30  up   1.0  1.0
> >   31ssd   1.74699  osd.31  up   1.0  1.0
> >   32ssd   1.74699  osd.32  up   1.0  1.0
> >   33ssd   1.74699  osd.33  up   1.0  1.0
> >   34ssd   1.74699  osd.34  up   1.0  1.0
> >   35ssd   1.74699  osd.35  up   1.0  1.0
> >
> >
> > So regarding my calculation it should be
> >
> >
> > (6x Nodes * 6x SSD * 1,8TB) / 4 = 16 TB
> >
> >
> > Is this maybe a bug in the stretch mode that i only get displayed half the 
> > size available?
> >
> >
> > Regards,
> >
> > Kilian
> >
> >
> > 
> > Von: Clyso GmbH - Ceph Foundation Member 
> > Gesendet: Mittwoch, 22. Juni 2022 18:20:59
> > An: Kilian Ries; ceph-users(a)ceph.io
> > Betreff: Re: [ceph-users] Ceph Stretch Cluster - df pool size (Max Avail)
> >
> > Hi Kilian,
> >
> > we do not currently use this mode of ceph clustering. but normally you
> > need to assign the crush rule to the pool as well, otherwise it will
> > take rule 0 as default.
> >
> > the following calculation for rule 0 would also work approximately:
> >
> > (3 Nodes *6 x SSD *1,8TB)/4 = 8,1 TB
> >
> > hope it helps, Joachim
> >
> >
> > ___
> > Clyso GmbH - Ceph Foun

[ceph-users] Re: cannot set quota on ceph fs root

2022-07-28 Thread Gregory Farnum
On Thu, Jul 28, 2022 at 1:01 AM Frank Schilder  wrote:
>
> Hi all,
>
> I'm trying to set a quota on the ceph fs file system root, but it fails with 
> "setfattr: /mnt/adm/cephfs: Invalid argument". I can set quotas on any 
> sub-directory. Is this intentional? The documentation 
> (https://docs.ceph.com/en/octopus/cephfs/quota/#quotas) says
>
> > CephFS allows quotas to be set on any directory in the system.
>
> Any includes the fs root. Is the documentation incorrect or is this a bug?

I'm not immediately seeing why we can't set quota on the root, but the
root inode is special in a lot of ways so this doesn't surprise me.
I'd probably regard it as a docs bug.

That said, there's also a good chance that the setfattr is getting
intercepted before Ceph ever sees it, since by setting it on the root
you're necessarily interacting with a mount point in Linux and those
can also be finicky...You could see if it works by using cephfs-shell.
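
For comparison, both paths look roughly like this (the byte value is only an
example):

    # on a subdirectory, through the kernel mount: this is the case that works
    setfattr -n ceph.quota.max_bytes -v 10995116277760 /mnt/adm/cephfs/somedir

    # against the filesystem root, bypassing the local mount point
    cephfs-shell
    # ...then at its prompt:
    setxattr / ceph.quota.max_bytes 10995116277760

If the cephfs-shell version succeeds, the limitation is on the VFS/mount side
rather than in CephFS itself.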
-Greg


>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster running without monitors

2022-07-28 Thread Gregory Farnum
On Thu, Jul 28, 2022 at 5:32 AM Johannes Liebl  wrote:
>
> Hi Ceph Users,
>
>
> I am currently evaluating different cluster layouts and as a test I stopped 
> two of my three monitors while client traffic was running on the nodes.?
>
>
> Only when I restartet an OSD all PGs which were related to that OSD went 
> down, but the rest were still active and serving requests.
>
>
> A second try ran for 5:30 Hours without a hitch after which I aborted the 
> Test since nothing was happening.
>
>
> Now I want to know; Is this behavior by design?
>
> It strikes me as odd that this more or less undefined state is still 
> operational.

Yep, it's on purpose! I would not count on this behavior because a lot
of routine operations can disturb it[1], but Ceph does its best to
continue operating as it can by not relying on the other daemons
whenever possible.

Monitors are required for updates to the cluster maps, but as long as
the cluster is stable and no new maps need to be generated, things
will keep operating until something requires an update and that gets
blocked. As you saw, when an OSD got restarted, that changed the
cluster state and required updates which couldn't get processed, so
the affected PGs couldn't go active.
-Greg
[1]: RBD snapshots go through the monitors; MDSes send beacons to the
monitors and will shut down if those don't get acknowledged so I don't
think CephFS will keep running in this case; CephX does key rotations
which will eventually block access to the OSDs as keys time out; any
kind of PG peering or recovery needs the monitors to update values;
etc.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrade from Octopus to Pacific cannot get monitor to join

2022-07-27 Thread Gregory Farnum
On Wed, Jul 27, 2022 at 10:24 AM  wrote:

> Currently running Octopus 15.2.16, trying to upgrade to Pacific using
> cephadm.
>
> 3 mon nodes running 15.2.16
> 2 mgr nodes running 16.2.9
> 15 OSD's running 15.2.16
>
> The mon/mgr nodes are running in lxc containers on Ubuntu running docker
> from the docker repo (not the Ubuntu repo). Using cephadm to remove one of
> the monitor nodes, and then re-add it back with a 16.2.9 version. The
> monitor node runs but never joins the cluster. Also, this causes the other
> 2 mon nodes to start flapping. Also tried adding 2 mon nodes (for a total
> of 5 mons) on bare metal running Ubuntu (with docker running from the
> docker repo) and the mon's won't join and won't even show up in 'ceph
> status'


The way you’re phrasing this it sounds like you’re removing existing
monitors and adding newly-created ones. That won’t work across major
version boundaries like this (at least, without a bit of prep work you
aren’t doing) because of how monitors bootstrap themselves and their
cluster membership. You need to upgrade the code running on the existing
monitors instead, which is the documented upgrade process AFAIK.
-Greg


>
> Can't find anything in the logs regarding why it's failing. The docker
> container starts and seems to try to join the cluster but just sits and
> doesn't join. The other two start flapping and then eventually I have to
> stop the new mon. I can add the monitor back by changing the
> container_image to 15.2.16 and it will re-join the cluster as expected.
>
> The cluster was previously running nautilus installed using ceph-deploy
>
> Tried setting 'mon_mds_skip_sanity true' from reading another post but it
> doesn't appear to help.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-26 Thread Gregory Farnum
On Tue, Jul 26, 2022 at 3:41 PM Yuri Weinstein  wrote:

> Greg, I started testing this PR.
> What do you want to rerun for it?  Are fs, kcephfs, multimds suites
> sufficient?


We just need to run the mgr/volumes tests — I think those are all in the fs
suite but Kotresh or Ramana can let us know.
-Greg


>
> On Tue, Jul 26, 2022 at 3:16 PM Gregory Farnum  wrote:
> >
> > We can’t do the final release until the recent mgr/volumes security
> fixes get merged in, though.
> > https://github.com/ceph/ceph/pull/47236
> >
> > On Tue, Jul 26, 2022 at 3:12 PM Ramana Krisna Venkatesh Raja <
> rr...@redhat.com> wrote:
> >>
> >> On Thu, Jul 21, 2022 at 10:28 AM Yuri Weinstein 
> wrote:
> >> >
> >> > Details of this release are summarized here:
> >> >
> >> > https://tracker.ceph.com/issues/56484
> >> > Release Notes - https://github.com/ceph/ceph/pull/47198
> >> >
> >> > Seeking approvals for:
> >> >
> >> > rados - Neha, Travis, Ernesto, Adam
> >> > rgw - Casey
> >> > fs, kcephfs, multimds - Venky, Patrick
> >>
> >> fs, kcephfs, multimds approved.
> >>
> >> Review of results,
> >> https://tracker.ceph.com/projects/cephfs/wiki/Octopus#2022-Jul-26
> >>
> >> Thanks,
> >> Ramana
> >>
> >> > rbd - Ilya, Deepika
> >> > krbd  Ilya, Deepika
> >> > ceph-ansible - Brad pls take a look
> >> >
> >> > Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
> >> >
> >> > PS:  some tests are still running but I will be off-line for several
> hours and wanted to start the review process.
> >> >
> >> > Thx
> >> > YuriW
> >> > ___
> >> > Dev mailing list -- d...@ceph.io
> >> > To unsubscribe send an email to dev-le...@ceph.io
> >>
> >> ___
> >> Dev mailing list -- d...@ceph.io
> >> To unsubscribe send an email to dev-le...@ceph.io
> >>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: octopus v15.2.17 QE Validation status

2022-07-26 Thread Gregory Farnum
We can’t do the final release until the recent mgr/volumes security fixes
get merged in, though.
https://github.com/ceph/ceph/pull/47236

On Tue, Jul 26, 2022 at 3:12 PM Ramana Krisna Venkatesh Raja <
rr...@redhat.com> wrote:

> On Thu, Jul 21, 2022 at 10:28 AM Yuri Weinstein 
> wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/56484
> > Release Notes - https://github.com/ceph/ceph/pull/47198
> >
> > Seeking approvals for:
> >
> > rados - Neha, Travis, Ernesto, Adam
> > rgw - Casey
> > fs, kcephfs, multimds - Venky, Patrick
>
> fs, kcephfs, multimds approved.
>
> Review of results,
> https://tracker.ceph.com/projects/cephfs/wiki/Octopus#2022-Jul-26
>
> Thanks,
> Ramana
>
> > rbd - Ilya, Deepika
> > krbd  Ilya, Deepika
> > ceph-ansible - Brad pls take a look
> >
> > Please reply to this email with approval and/or trackers of known
> issues/PRs to address them.
> >
> > PS:  some tests are still running but I will be off-line for several
> hours and wanted to start the review process.
> >
> > Thx
> > YuriW
> > ___
> > Dev mailing list -- d...@ceph.io
> > To unsubscribe send an email to dev-le...@ceph.io
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: LibCephFS Python Mount Failure

2022-07-26 Thread Gregory Farnum
It looks like you’re setting environment variables that force your new
keyring, but you aren’t telling the library to use your new CephX user. So
it opens your new keyring and looks for the default (client.admin) user and
doesn’t get anything.
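
Something along these lines is what I mean: pass the CephX user (and the
keyring path, if it isn't in a default location) to the library itself. The
"monitoring" user and paths here are only examples:

    import cephfs

    fs = cephfs.LibCephFS(
        conffile='/etc/ceph/ceph.conf',
        conf={'keyring': '/etc/ceph/ceph.client.monitoring.keyring'},
        auth_id='monitoring')          # i.e. client.monitoring, not the default client.admin
    fs.mount(filesystem_name=b'cephfs')
    print(fs.getxattr(b'/some/dir', b'ceph.quota.max_bytes'))
    fs.unmount()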
-Greg

On Tue, Jul 26, 2022 at 7:54 AM Adam Carrgilson (NBI) <
adam.carrgil...@nbi.ac.uk> wrote:

> I've disabled the part of the script that catches the Python exception and
> allowed it to print everything out and it looks like the OSError with the
> code 13, is a permissions error:
>
> Traceback (most recent call last):
>   File "./get-ceph-quota-statistics.py", line 274, in 
> main(args)
>   File "./get-ceph-quota-statistics.py", line 30, in main
> cephfs = login() # holds CephFS bindings
>   File "./get-ceph-quota-statistics.py", line 94, in login
> cephfs.mount(filesystem_name=b'cephfs')
>   File "cephfs.pyx", line 684, in cephfs.LibCephFS.mount
>   File "cephfs.pyx", line 676, in cephfs.LibCephFS.init
> cephfs.OSError: error calling ceph_init: Permission denied [Errno 13]
>
> Now I've tested a FUSE mount with the same keyfile and that functions as
> expected, so I'm having to assume that somehow the Python script either
> doesn't have all of the properties I've supplied (which I doubt, because I
> can point it at files with admin credentials and it works fine), something
> within the Python CephFS library might be hardcoded to particular values
> which I'm having problems with, or maybe something else?
>
> Is there a way to interrogate the Python object before I do the
> cephfs.mount, just to confirm the options are as I expect?
>
> Alternatively, python-cephfs wraps around the CephFS library, right?
> Does the CephFS FUSE component utilise the same CephFS library?
> If not, is there a way to call something else on the command line directly
> to rule out problems there?
>
> Many Thanks,
> Adam.
>
> -Original Message-
> From: Adam Carrgilson (NBI) 
> Sent: 25 July 2022 16:24
> To: ceph-users@ceph.io
> Cc: Bogdan Adrian Velica 
> Subject: [ceph-users] Re: LibCephFS Python Mount Failure
>
> Thanks Bogdan,
>
> I’m running this script at the moment as my development system’s root user
> account, I don’t have a particular ceph user on this standalone system, and
> I don’t think I’ll be able to control the user account of the monitoring
> hosts either (I think they might run under a user account dedicated to the
> monitoring) but I’m interested to what you think I should test here?
>
> I can definitely run the code as the root user, it can read my custom
> configuration and key files, when I specify those providing the admin user
> credentials, it works as expected, but when I specify the monitoring
> credentials it errors with that ceph_init message.
>
> My script can open and read the configuration and key files (I print those
> to screen), and I do attempt to pull back the environment before I execute
> the mount, and it does include my addition of the CEPH_ARGS. That said,
> those also work when those particular files are for the admin ceph account,
> and that can’t be picking up anything from the default locations as I’ve
> deliberately removed them from there.
>
> Is there any way to make the Python LibCephFS more verbose to better
> understand its error message?
>
> Adam.
>
> From: Bogdan Adrian Velica 
> Sent: 25 July 2022 14:50
> To: Adam Carrgilson (NBI) 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] LibCephFS Python Mount Failure
>
> Hi Adam,
>
> I think this might be related to the user you are running the script as,
> try running the scrip as ceph user (or the user you are running your ceph
> with). Also make sure the variable os.environ.get is used (i might be
> mistaking here). do a print or something first to see the key is loaded.
> Just my 2 cents...
>
> Best of luck,
>
>  --
> Bogdan Velica
> Ceph Support Engineer
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> 
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io
>
> On Mon, Jul 25, 2022 at 4:06 PM Adam Carrgilson (NBI) <
> adam.carrgil...@nbi.ac.uk> wrote:
> Hi all,
>
> I'm trying to put together a script to gather CephFS quota utilisation.
> I'm using the CephFS Python library from here:
> https://docs.ceph.com/en/latest/cephfs/api/libcephfs-py/
> and I've followed the rather a good guide on how to use it here:
> https://jayjeetc.medium.com/up-and-running-with-libcephfs-7629455f0cdc#934a
>
> I have been able to get this working, however; I want this to be able to
> be portable to run it on our monitoring agents, and specifically, I want to
> be able to use a limited permission account, so read-only permissions and
> network limitations.
> I originally couldn't find a method to specify a custom keyfile to use
> through the library, but with some assistance, I've found that I can use
> the Python command: os.env

[ceph-users] Re: ceph-fs crashes on getfattr

2022-07-12 Thread Gregory Farnum
On Tue, Jul 12, 2022 at 1:46 PM Andras Pataki
 wrote:
>
> We also had a full MDS crash a couple of weeks ago due to what seems to
> be another back-ported feature going wrong.  As soon as I deployed the
> 16.2.9 ceph-fuse client, someone asked for an unknown ceph xattr, which
> crashed the MDS and kept crashing all the standby MDS's as well as they
> got activated.
>
> The reason seems to be that a new message is sent by the 16.2.9
> ceph-fuse (and NOT by the 16.2.7 one) to the MDS for unknown ceph xattr
> (now called vxattr).  The MDS code has a couple of switch statements by
> message type that do an abort when a message is received that the MDS
> does not expect, which is exactly what happened.  Then, as the MDS
> crashed, a standby got activated, the clients got to replay their
> pending requests - and ... the next MDS crashed due to the same message
> - bringing the whole file system down. Finally I found the culprit client
> - killed it, which got the cluster back to working.  I ended up patching
> the 16.2.9 client not to send this message and deployed that change quickly.
>
> My question is - why do we backport significant new features to stable
> releases?  Especially ones that change the messaging API?  I absolutely
> not expect such a crash when updating a point release of a client.

Generally, backports like this occur because it fixes a problem which
is visible to users and is deemed important enough to be worth the
risk. But obviously this one went quite badly wrong — we're going to
do an RCA and figure out what process changes we need to make to
prevent this kind of thing in future.

(Also, the abort-on-unknown-message behavior is going to get removed.
That's not appropriate behavior at this stage in Ceph's life.
https://tracker.ceph.com/issues/56522)
-Greg

>
> Andras
>
> On 7/12/22 07:01, Frank Schilder wrote:
> > Hi Gregory,
> >
> > Thanks for your fast reply. I created 
> > https://tracker.ceph.com/issues/56529
> >   and attached the standard logs. In case you need more, please let me 
> > know. Note that I added some more buggy behaviour, the vxattr handling 
> > seems broken more or less all the way around.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Gregory Farnum 
> > Sent: 11 July 2022 19:14:26
> > To: Frank Schilder
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] ceph-fs crashes on getfattr
> >
> > On Mon, Jul 11, 2022 at 8:26 AM Frank Schilder  wrote:
> >> Hi all,
> >>
> >> we made a very weird observation on our ceph test cluster today. A simple 
> >> getfattr with a misspelled attribute name sends the MDS cluster into a 
> >> crash+restart loop. Something as simple as
> >>
> >>getfattr -n ceph.dir.layout.po /mnt/cephfs
> >>
> >> kills a ceph-fs completely. The problem can be resolved if one executes a 
> >> "umount -f /mnt/cephfs" on the host where the getfattr was executed. The 
> >> MDS daemons need a restart. One might also need to clear the OSD blacklist.
> >>
> >> We observe this with a kernel client on 5.18.6-1.el7.elrepo.x86_64 (Centos 
> >> 7) with mimic and I'm absolutely sure I have not seen this problem with 
> >> mimic on earlier 5.9.X-kernel versions.
> >>
> >> Is this known to be a kernel client bug? Possibly fixed already?
> > That obviously shouldn't happen. Please file a tracker ticket.
> >
> > There's been a fair bit of churn in how we handle the "vxattrs" so my
> > guess is an incompatibility got introduced between newer clients and
> > the old server implementation, but obviously we want it to work and we
> > especially shouldn't be crashing the MDS. Skimming through it I'm
> > actually not seeing what a client *could* do in that path to crash the
> > server so I'm a bit confused...
> > Oh. I think I see it now, but I'd like to confirm. Yeah, please make
> > that tracker ticket and attach the backtrace you get.
> > Thanks,
> > -Greg
> >
> >> Best regards,
> >> =
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-fs crashes on getfattr

2022-07-11 Thread Gregory Farnum
On Mon, Jul 11, 2022 at 8:26 AM Frank Schilder  wrote:
>
> Hi all,
>
> we made a very weird observation on our ceph test cluster today. A simple 
> getfattr with a misspelled attribute name sends the MDS cluster into a 
> crash+restart loop. Something as simple as
>
>   getfattr -n ceph.dir.layout.po /mnt/cephfs
>
> kills a ceph-fs completely. The problem can be resolved if one executes a 
> "umount -f /mnt/cephfs" on the host where the getfattr was executed. The MDS 
> daemons need a restart. One might also need to clear the OSD blacklist.
>
> We observe this with a kernel client on 5.18.6-1.el7.elrepo.x86_64 (Centos 7) 
> with mimic and I'm absolutely sure I have not seen this problem with mimic on 
> earlier 5.9.X-kernel versions.
>
> Is this known to be a kernel client bug? Possibly fixed already?

That obviously shouldn't happen. Please file a tracker ticket.

There's been a fair bit of churn in how we handle the "vxattrs" so my
guess is an incompatibility got introduced between newer clients and
the old server implementation, but obviously we want it to work and we
especially shouldn't be crashing the MDS. Skimming through it I'm
actually not seeing what a client *could* do in that path to crash the
server so I'm a bit confused...
Oh. I think I see it now, but I'd like to confirm. Yeah, please make
that tracker ticket and attach the backtrace you get.
Thanks,
-Greg

>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs client permission restrictions?

2022-06-23 Thread Gregory Farnum
On Thu, Jun 23, 2022 at 8:18 AM Wyll Ingersoll
 wrote:
>
> Is it possible to craft a cephfs client authorization key that will allow the 
> client read/write access to a path within the FS, but NOT allow the client to 
> modify the permissions of that path?
> For example, allow RW access to /cephfs/foo (path=/foo) but prevent the 
> client from modifying permissions on /foo.

Cephx won't do this on its own; it enforces subtree-based access and
can restrict clients to acting as a specific (set of) uid/gids, but it
doesn't add extra stuff on top of that. (Modifying permissions is, you
know, a write.)

This is part of the standard Linux security model though, right? So
you can make somebody else the owner and give your restricted user
access via a group.
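
A rough sketch of that approach (the filesystem, client, mount point, and
group names are all made up here):

  ceph fs authorize cephfs client.limited /foo rw   # rw below /foo only
  chown root:writers /mnt/cephfs/foo                # directory owned by root
  chmod 2775 /mnt/cephfs/foo                        # group members can write,
                                                    # but only the owner can
                                                    # chmod/chown the directory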
-Greg

>
> thanks,
>   Wyllys Ingersoll
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Possible to recover deleted files from CephFS?

2022-06-14 Thread Gregory Farnum
On Tue, Jun 14, 2022 at 8:50 AM Michael Sherman  wrote:
>
> Hi,
>
> We discovered that a number of files were deleted from our cephfs filesystem, 
> and haven’t been able to find current backups or snapshots.
>
> Is it possible to “undelete” a file by modifying metadata? Using 
> `cephfs-journal-tool`, I am able to find the `unlink` event for each file, 
> looking like the following:
>
> $ cephfs-journal-tool --rank cephfs:all event get 
> --path="images/060862a9-a648-4e7e-96e3-5ba3dea29eab" list
> …
> 2022-06-09 17:09:20.123155 0x170da7fc UPDATE:  (unlink_local)
>   stray5/1001fee
>   images/060862a9-a648-4e7e-96e3-5ba3dea29eab
>
> I saw the disaster-recovery-tools mentioned 
> here,
>  but didn’t know if they would be helpful in the case of a deletion.
>
> Thank you in advance for any help.

Once files are unlinked they get moved into the stray directory, and
then into the purge queue when they are truly unused.

The purge queue processes them and deletes the backing objects.

So the first thing you should do is turn off the MDS, as that is what
performs the actual deletions.

If you've already found the unlink events, you know the inode numbers
you want. You can look in rados for the backing objects and just copy
them out (and reassemble them if the file was >4MB). CephFS files are
stored in RADOS with the pattern <inode number in hex>.<object sequence number>. If your cluster isn't too big, you can just:
rados -p  ls | grep 1001fee
for the example file you referenced above. (Or more probably, dump the
listing into a file and search that for the inode numbers).

If listing all the objects takes too long, you can construct the
object names in the other direction, which is simple enough but I
can't recall offhand the number of digits you start out with for the
 portion of the object name, so you'll have to
look at one and figure that out yourself. ;)


The disaster recovery tooling is really meant to recover a broken
filesystem; massaging it to get erroneously-deleted files back into
the tree would be rough. The only way I can think of doing that is
using the procedure to recover into a new metadata pool, and
performing just the cephfs-data-scan bits (because recovering the
metadata would obviously delete all the files again). But then your
tree (while self-consistent) would look strange again with files that
are in old locations and things, so I wouldn't recommend it.
-Greg

> -Mike Sherman
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Feedback/questions regarding cephfs-mirror

2022-06-10 Thread Gregory Farnum
On Wed, Jun 8, 2022 at 12:36 AM Andreas Teuchert
 wrote:
>
>
> Hello,
>
> we're currently evaluating cephfs-mirror.
>
> We have two data centers with one Ceph cluster in each DC. For now, the
> Ceph clusters are only used for CephFS. On each cluster we have one FS
> that contains a directory for customer data and below that there are
> directories for each customer. The customers access their directories
> via CephFS. Generally customers only have data in one data center.
>
> In order to be able to quickly restore the service, we want to mirror
> all customer data in one DC to the Ceph cluster in the other DC. Then,
> in case one Ceph cluster becomes unavailable, we would do the following
> (let's say the clusters are cluster A and cluster B, and cluster A
> became unavailable):
>
> - "Unmount" the broken mount on clients connecting to cluster A.
> - Mount the customer directories from cluster B.
> - Repair/restore cluster A.
> - Break the mirror relation from cluster A to cluster B.
> - Create a mirror relation from cluster B to cluster A (for the data
> that should be on A).
> - Ensure that cluster A is regularly updated with the current data from
> cluster B.
> - In a maintenance window: Unmount the directory on all clients that
> should connect to cluster A, sync all data (that should be on A) from B
> to A, break the mirror relation from B to A, have all clients mount the
> directories from A, create a mirror relation from A to B.
>
>
> Setting up the mirroring looks like this:
>
> # Cluster A: There are two FSs: fs-a for customer data, fs-b-backup for
> data mirrored from cluster B
> # Cluster B: There are two FSs: fs-b for customer data, fs-a-backup for
> data mirrored from cluster A
>
> (cluster-a)# ls /mnt/fs-a/customers
> customer1 customer2 customer3
>
> (cluster-b)# ls /mnt/fs-b/customers
> customer4 customer5 customer6
>
> # Enable mirroring on both clusters
> (cluster-a)# ceph orch apply cephfs-mirror
> (cluster-a)# ceph mgr module enable mirroring
> (cluster-b)# ceph orch apply cephfs-mirror
> (cluster-b)# ceph mgr module enable mirroring
>
> # Setup mirroring of fs-a from cluster A to cluster B
> (cluster-b)# ceph fs authorize fs-a-backup client.mirror_a / rwps
> (cluster-b)# ceph fs snapshot mirror peer_bootstrap create fs-a-backup
> client.mirror_a cluster-a
>
> (cluster-a)# ceph fs snapshot mirror enable fs-a
> (cluster-a)# ceph fs snapshot mirror peer_bootstrap import fs-a 
> (cluster-a)# ceph fs snapshot mirror add fs-a /customers/customer1
> (cluster-a)# ceph fs snapshot mirror add fs-a /customers/customer2
> (cluster-a)# ceph fs snapshot mirror add fs-a /customers/customer3
>
> # Setup mirroring of fs-b from cluster B to cluster A
> (cluster-a)# ceph fs authorize fs-b-backup client.mirror_b / rwps
> (cluster-a)# ceph fs snapshot mirror peer_bootstrap create fs-b-backup
> client.mirror_b cluster-b
>
> (cluster-b)# ceph fs snapshot mirror enable fs-b
> (cluster-b)# ceph fs snapshot mirror peer_bootstrap import fs-b 
> (cluster-b)# ceph fs snapshot mirror add fs-b /customers/customer4
> (cluster-b)# ceph fs snapshot mirror add fs-b /customers/customer5
> (cluster-b)# ceph fs snapshot mirror add fs-b /customers/customer6
>
> # Snapshots for the customer directories are created daily via schedule.
>
> # Result: Customer data from A mirrored to B and vice-versa
> (cluster-a)# ls /mnt/fs-b-backup/customers
> customer4 customer5 customer6
>
> (cluster-b)# ls /mnt/fs-a-backup/customers
> customer1 customer2 customer3
>
>
> In order to fail over from cluster A to cluster B, we do the following:
>
> # Create clients for access to fs-a-backup on cluster B
> # Mount customer directories from fs-a-backup on cluster B
>
> # When cluster A is available again
> (cluster-a)# ceph fs snapshot mirror remove fs-a /customers/customer1
> (cluster-a)# ceph fs snapshot mirror remove fs-a /customers/customer2
> (cluster-a)# ceph fs snapshot mirror remove fs-a /customers/customer3
> (cluster-a)# ceph fs snapshot mirror disable fs-a
>
> (cluster-a)# ceph fs authorize fs-a client.mirror_b_failover / rwps
> (cluster-a)# ceph fs snapshot mirror peer_bootstrap create fs-a
> client.mirror_b_failover cluster-b
> (cluster-a)# rmdir /mnt/fs-a/customers/*/.snap/*
>
> (cluster-b)# setfattr -x ceph.mirror.info /mnt/fs-a-backup
> (cluster-b)# ceph fs snapshot mirror enable fs-a-backup
> (cluster-b)# ceph fs snapshot mirror peer_bootstrap import fs-a-backup
> 
> (cluster-b)# ceph fs snapshot mirror add fs-a-backup /customers/customer1
> (cluster-b)# ceph fs snapshot mirror add fs-a-backup /customers/customer2
> (cluster-b)# ceph fs snapshot mirror add fs-a-backup /customers/customer3
>
>
> Now, all clients connect to cluster B and cluster A is only used as a
> mirror destination.
> At some point we will have a maintenance window and switch all clients
> that should be on cluster A back to cluster A.
>
>
> After testing this setup, there are a couple of issues that we have with
> the current cephfs-mirror implementation:
>
>

[ceph-users] Re: Ceph on RHEL 9

2022-06-10 Thread Gregory Farnum
We aren't building for Centos 9 yet, so I guess the python dependency
declarations don't work with the versions in that release.
I've put updating to 9 on the agenda for the next CLT.

(Do note that we don't test upstream packages against RHEL, so if
Centos Stream does something which doesn't match the RHEL release it
still might get busted.)
-Greg

On Thu, Jun 9, 2022 at 6:57 PM Robert W. Eckert  wrote:
>
> Does anyone have any pointers to install CEPH on Rhel 9?
>
> -Original Message-
> From: Robert W. Eckert 
> Sent: Saturday, May 28, 2022 8:28 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Ceph on RHEL 9
>
> Hi- I started to update my 3 host cluster to RHEL 9, but came across a bit of 
> a stumbling block.
>
> The upgrade process uses the RHEL leapp process, which ran through a few 
> simple things to clean up, and told me everything was hunky dory, but when I 
> kicked off the first server, the server wouldn't boot because I had a ceph 
> filesystem mounted in /etc/fstab; commenting it out let the upgrade happen.
>
> Then I went to check on the ceph client which appears to be uninstalled.
>
> When I tried to install ceph,  I got:
>
> [root@story ~]# dnf install ceph
> Updating Subscription Management repositories.
> Last metadata expiration check: 0:07:58 ago on Sat 28 May 2022 08:06:52 PM 
> EDT.
> Error:
> Problem: package ceph-2:17.2.0-0.el8.x86_64 requires ceph-mgr = 
> 2:17.2.0-0.el8, but none of the providers can be installed
>   - conflicting requests
>   - nothing provides libpython3.6m.so.1.0()(64bit) needed by 
> ceph-mgr-2:17.2.0-0.el8.x86_64 (try to add '--skip-broken' to skip 
> uninstallable packages or '--nobest' to use not only best candidate packages)
>
> This is the content of my /etc/yum.repos.d/ceph.conf
>
> [ceph]
> name=Ceph packages for $basearch
> baseurl=https://download.ceph.com/rpm-quincy/el8/$basearch
> enabled=1
> priority=2
> gpgcheck=1
> gpgkey=https://download.ceph.com/keys/release.asc
>
> [ceph-noarch]
> name=Ceph noarch packages
> baseurl=https://download.ceph.com/rpm-quincy/el8/noarch
> enabled=1
> priority=2
> gpgcheck=1
> gpgkey=https://download.ceph.com/keys/release.asc
>
> [ceph-source]
> name=Ceph source packages
> baseurl=https://download.ceph.com/rpm-quincy/el8/SRPMS
> enabled=0
> priority=2
> gpgcheck=1
> gpgkey=https://download.ceph.com/keys/release.asc
> Is there anything I should change for el9 (I don't see el9 rpms out yet).
>
> Or should I  wait before updating the other two servers?
>
> Thanks,
> Rob
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stretch cluster questions

2022-05-16 Thread Gregory Farnum
I'm not quite clear where the confusion is coming from here, but there
are some misunderstandings. Let me go over it a bit:

On Tue, May 10, 2022 at 1:29 AM Frank Schilder  wrote:
>
> > What you are missing from stretch mode is that your CRUSH rule wouldn't
> > guarantee at least one copy in surviving room (min_size=2 can be
> > achieved with 2 copies in lost room).
>
> I'm afraid this deserves a bit more explanation. How would it be possible 
> that, when both sites are up and with a 4(2) replicated rule, a
> committed write does not guarantee all 4 copies to be present? As far as I 
> understood the description of ceph's IO path, if all members of a PG are up, 
> a write is only acknowledged to a client after all shards/copies have been 
> committed to disk.

So in a perfectly normal PG of size 4, min_size 2, the OSDs are happy
to end peering and go active with only 2 up OSDs. That's what min_size
means. A PG won't serve IO until it's active, and it requires min_size
participants to do so — but once it's active, it acknowledges writes
once the live participants have written them down.

> In other words, with a 4(2) rule with 2 copies per DC, if one DC goes down 
> you *always* have 2 life copies and still read access in the other DC. 
> Setting min-size to 1 would allow write access too, albeit with a risk of 
> data loss (a 4(2) rule is really not secure for a 2DC HA set-up as in 
> degraded state you end up with 2(1) in 1 DC, its much better to use a wide EC 
> profile with m>k to achieve redundant single-site writes).

Nope, there is no read access to a PG which doesn't have min_size
active copies. And if you have 4 *live* copies and lose a DC, yes, you
still have two copies. But consider an alternative scenario:
1) 2 copies in each of 2 DCs.
2) Two OSDs in DC 1 restart, which happens to share PG x.
3) PG x goes active with the remaining two OSDs in DC 2.

Does (3) make sense there?

So now add in step 4:
4) DC 2 gets hit by a meteor.

Now, you have no current copies of PG x because the only current
copies got hit by a meteor.

>
> The only situation I could imagine this not being guaranteed (both DCs 
> holding 2 copies at all times in healthy condition) is that writes happen 
> while one DC is down, the down DC comes up and the other DC goes down before 
> recovery finishes. However, then stretch mode will not help either.

Right, so you just skipped over the part that it helps with: stretch
mode *guarantees* that a PG has OSDs from both DCs in its acting set
before the PG can finish peering. Redoing the scenario from before
1) 2 copies in each of 2 DCs,
2) Two OSDs in DC 1 restart, which happens to share PG x
3) PG x cannot go active because it lacks a replica in DC 1.
4) DC 2 gets hit by a meteor
5) All OSDs in DC 1 come back up
6) All PGs go active

So stretch mode adds another dimension to "the PG can finish peering
and go active" which includes the CRUSH buckets as a requirement, in
addition to a simple count of the replicas.

> My understanding of the useful part is, that stretch mode elects one monitor 
> to be special and act as a tie-breaker in case a DC goes down or a split 
> brain situation occurs between 2DCs. The change of min-size in the 
> stretch-rule looks a bit useless and even dangerous to me. A stretched 
> cluster should be designed to have a secure redundancy scheme per site and, 
> for replicated rules, that would mean size=6, min_size=2 (degraded 3(2)). 
> Much better seems to be something like an EC profile k=4, m=6 with 5 shards 
> per DC, which has only 150% overhead compared with 500% overhead of a 6(2) 
> replicated rule.

Yeah, the min_size change is because you don't want to become
unavailable when rebooting any of your surviving nodes. When you lost
a DC, you effectively go from running an ultra-secure 4-copy system to
a rather-less-secure 2-copy system. And generally people with 2-copy
systems want their data to still be available when doing maintenance.
;)
(Plus, well, hopefully you either get the other data center back
quickly, or you can expand the cluster to get to a nice safe 3-copy
system.)

But, yes. As Maximilian suggests, the use case for stretch mode is
pretty specific. If you're using RGW, you should be better-served by
its multisite feature, and if your application can stomach
asynchronous replication that will be much less expensive. RBD has
both sync and async replication options across clusters.
But sometimes you really just want exactly the same data in exactly
the same place at the same time. That's what stretch mode is for.
-Greg

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: repairing damaged cephfs_metadata pool

2022-05-16 Thread Gregory Farnum
On Tue, May 10, 2022 at 2:47 PM Horvath, Dustin Marshall
 wrote:
>
> Hi there, newcomer here.
>
> I've been trying to figure out if it's possible to repair or recover cephfs 
> after some unfortunate issues a couple of months ago; these couple of nodes 
> have been offline most of the time since the incident.
>
> I'm sure the problem is that I lack the ceph expertise to quite sus out where 
> the broken bits are. This was a 2-node cluster (I know I know) that had a 
> hypervisor primary disk fail, and the entire OS was lost. I reinstalled the 
> hypervisor, rejoined it to the cluster (proxmox), rejoined ceph to the other 
> node, re-added the OSDs. It came back with quorum problems and some PGs were 
> inconsistent and some were lost. Some of that is due to my own fiddling 
> around, which possibly exacerbated things. Eventually I had to edit the 
> monmap down to 1 monitor, which had all kinds of screwy journal issues...it's 
> been a while since I've tried resuscitating this, so the details in my memory 
> are fuzzy.
>
> My cluster health isn't awful. Output is basically this:
> ```
> root@pve02:~# ceph -s
>   cluster:
> id: 8b31840b-5706-4c92-8135-0d6e03976af1
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 filesystem is offline
> 1 mds daemon damaged
> noout flag(s) set
> 16 daemons have recently crashed
>
>   services:
> mon: 1 daemons, quorum pve02 (age 3d)
> mgr: pve01(active, since 4d)
> mds: 0/1 daemons up
> osd: 7 osds: 7 up (since 2d), 7 in (since 7w)
>  flags noout
>
>   data:
>volumes: 0/1 healthy, 1 recovering; 1 damaged
> pools:   5 pools, 576 pgs
> objects: 1.51M objects, 4.0 TiB
> usage:   8.2 TiB used, 9.1 TiB / 17 TiB avail
> pgs: 575 active+clean
>  1   active+clean+scrubbing+deep
>
>   io:
> client:   241 KiB/s wr, 0 op/s rd, 10 op/s wr
> ```
>
> I've tried a couple times running down the steps in here 
> (https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/), but I 
> always hit an error at scan_links, where I get a crash dump of sorts. If I 
> try and mark the cephfs as repaired/joinable, MDS daemons will try and replay 
> and then fail.

Yeah, that generally won't work until the process is fully complete —
otherwise the MDS starts hitting the metadata inconsistencies from
having a halfway-done FS!

> The only occurrences of err/ERR in the MDS logs are a line like this:
> ```
> 2022-05-07T18:31:26.342-0500 7f22b44d8700  1 mds.0.94  waiting for osdmap 
> 301772 (which blocklists prior instance)
> 2022-05-07T18:31:26.346-0500 7f22adccb700 -1 log_channel(cluster) log [ERR] : 
> failed to read JournalPointer: -1 ((1) Operation not permitted)
> 2022-05-07T18:31:26.346-0500 7f22af4ce700  0 mds.0.journaler.pq(ro) error 
> getting journal off disk

That pretty much means the mds log/journal doesn't actually exist. I'm
actually surprised that this is the thing that causes the crash since you
probably did the "cephfs-journal-tool --rank=0 journal reset" command
in that doc.

But as the page says, these are advanced tools which can wreck your
filesystem if you do them wrong, and the details matter. You'll have
to share as much as you can of what's been done to the cluster. Even
if you did some aborted recovery procedures, just running through it
again may work out. We'd need the scan_links error for certain,
though.
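
If you do re-run it, please capture the failing step's output so it can be
attached here or to a tracker ticket, e.g. (with the MDS stopped, as the
recovery page describes):

  cephfs-data-scan scan_links 2>&1 | tee scan_links.log
  ceph-post-file scan_links.log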
-Greg

> ```
>
> I haven't had much luck on the googles with diagnosing that error; seems 
> uncommon. My hope is that the cephfs_data pool is fine. I actually never had 
> any inconsistent PG issues on a pool other than the metadata pool, so that's 
> the only one that suffered actual acute injury during the hardware 
> failure/quorum loss.
> If I had more experience with the rados tools, I'd probably be more helpful. 
> I have plenty of logs lying about and can perform any diagnoses that might 
> help, but I hate to spam too much here right out of the gate.
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete file write/read from Ceph FS

2022-05-06 Thread Gregory Farnum
Do you have any locking which guarantees that nodes don't copy files
which are still in the process of being written?
CephFS will guarantee any readers see the results of writes which are
already reported complete while reading, but I don't see any
guarantees about atomicity in
https://docs.microsoft.com/en-us/dotnet/api/system.io.file.writeallbytes?view=net-6.0,
and those byte counts sound like they may be C# pages which are
getting sent out incrementally.
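
One common way to avoid the problem on a shared filesystem, whatever API is
used, is to write into a temporary name in the same directory and rename the
file into place once the write has finished; rename is atomic, so readers only
ever see either no file or a complete one. A sketch of the idea:

  tmp="${dest}.tmp.$$"    # temporary name in the same directory as $dest
  cp "$src" "$tmp"        # the write can take as long as it likes
  mv "$tmp" "$dest"       # atomically publish the finished file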
-Greg

On Thu, May 5, 2022 at 1:30 PM Kiran Ramesh  wrote:
>
> We have a kubernetes cluster with 18 nodes (1 master + 17 workers) with ceph 
> cluster setup with rook operator.
>
> There are 3 monitors. Each of the worker nodes expose a disk as OSD, so 17 
> OSDs. There is a one active MDS and data is set to have 3 replicas.
>
> A PVC from the ceph file storage is mounted as a volume in 12 of the worker 
> nodes.
>
> This storage is used as a shared cache. Each of the worker checks for a file 
> in its local disk which if misses, looks for the file in the shared cache. If 
> present, pulls it into the local cache. If the file is not present in the 
> shared cache, file is downloaded from the source into the local disk and then 
> copied over to the shared cache.
>
> The file read/write operations is done through the C# System.IO : 
> File.ReadAllBytes and File.WriteAllBytes methods.
>
> When the worker has a shared cache hit and is copied over from the shared 
> cache to local disk, occasionally 0, 512, 1024 or 2048 bytes alone are copied 
> by C# file operations.
>
> Thanks,
> Kiran
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [progress WARNING root] complete: ev ... does not exist, oh my!

2022-05-06 Thread Gregory Farnum
On Fri, May 6, 2022 at 5:58 AM Harry G. Coin  wrote:
>
> I tried searching for the meaning of a ceph Quincy all caps WARNING
> message, and failed.  So I need help.   Ceph tells me my cluster is
'healthy', yet emits a bunch of 'progress WARNING root] complete ev' ...
> messages.  Which I score right up there with the helpful dmesg "yama,
> becoming mindful",
>
> Should I care, and if I should, what is to be done?Here's the log snip:

Well, I've never seen it before (and I don't work on that code), but
this error came from the progress module, which is the thing that
gives you pretty little charts about how long until stuff like
rebalancing finishes.

It seems to happen when it gets a completion report for an event that
the module itself doesn't remember. So you should probably grab
(or generate and grab) some mgr logs and create a ticket for it so the
team can prevent that happening, but I also don't think it's something
to worry about.
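
A minimal sketch of generating those logs (the log path assumes the default
location; containerized daemons may put their logs elsewhere):

  ceph config set mgr debug_mgr 20   # much more verbose mgr/module logging
  # ...wait for a few of the warnings to appear, then revert to the default:
  ceph config rm mgr debug_mgr
  ceph-post-file /var/log/ceph/ceph-mgr.noc3.sybsfb.log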
-Greg


>
> May  6 07:48:51 noc3 bash[3206]: cluster 2022-05-06T12:48:49.294641+
> mgr.noc3.sybsfb (mgr.14574839) 20656 : cluster [DBG] pgmap v19338: 1809
> pgs: 2 active+clean+scrubbing+deep, 1807 active+clean; 16 TiB data, 41
> TiB used, 29 TiB / 70 TiB avail; 469 KiB/s rd, 4.7 KiB/s wr, 2 op/s
> May  6 07:48:51 noc3 bash[3206]: audit 2022-05-06T12:48:49.313491+
> mon.noc1 (mon.3) 336 : audit [DBG] from='mgr.14574839
> [fc00:1002:c7::43]:0/501702592' entity='mgr.noc3.sybsfb' cmd=[{"prefix":
> "config dump", "format": "json"}]: dispatch
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> dc5810d7-7a30-4c8f-bafa-3158423c49f3 does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> c81b591e-6498-41bd-98bb-edbf80c690f8 does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> a9632817-10e7-4a60-ae5c-a4220d7ca00b does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> 29a7ca4d-6e2a-423a-9530-3f61c0dcdbfe does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> 68de11a0-92a4-48b6-8420-752bcdd79182 does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> a9437122-8ff8-4de9-a048-8a3c0262b02c does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> f15c0540-9089-4a96-884e-d75668f84796 does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> eeaf605a-9c55-44c9-9c69-8c7c35ca7591 does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> ba0ff860-4fc5-4c84-b337-1c8c616b5fbd does not exist
> May  6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+
> 7f2e20629700  0 [progress WARNING root] complete: ev
> 656fcf28-3ce1-4d6d-8ec2-eac5b6f0a233 does not exist
> May  6 07:48:52 noc3 bash[3203]: :::10.12.112.66 - -
> [06/May/2022:12:48:52] "GET /metrics HTTP/1.1" 200 421310 ""
> "Prometheus/2.33.4"
> May  6 07:48:53 noc3 bash[3206]: audit 2022-05-06T12:48:51.273954+
> mon.noc1 (mon.3) 337 : audit [INF] from='mgr.14574839
> [fc00:1002:c7::43]:0/501702592' entity='mgr.noc3.sybsfb' cmd=[{"prefix":
> "config rm", "format": "json", "who": "client", "name":
> "mon_cluster_log_file_level"}]: dispatch
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stretch cluster questions

2022-05-06 Thread Gregory Farnum
On Fri, May 6, 2022 at 3:21 AM Eneko Lacunza  wrote:

> Hi,
>
> Just made some basic tests, feature works nicely as far as I have tested :)
>
> I created 2 aditional pools each with a matching stretch rule:
> - size=2/min=1 (not advised I know)
> - size=6/min=3 (some kind of paranoid config)
>
> When entering degraded stretch mode, the following changed where made
> automatically:
> - size=4/min=2 -> size=4/min=1
> - size=2/min=1 -> size=2/min=0 (!)
> - size=6/min=3 -> size=6/min=1 (!)
>
> Not really sure about what calc is performed here, but:
> - It would be better to check not decrement min value below 1?
> - Changing min=3 to min=2 would be better (safer)?
>

Hah, probably. The calculation is just:
newp.min_size = pgi.second.min_size / 2; // only support 2 zones now

And then...

>
> Also, when stretch bucket was back online and after recovery was complete:
> - size=4/min=2 -> size=4/min=1 -> size=4/min=2
> - size=2/min=1 -> size=2/min=0 -> size=2/min=2
> - size=6/min=3 -> size=6/min=1 -> size=6/min=2
>
> This time it seems recovery is setting min=2 as a fixed value.
>

newp.min_size = g_conf().get_val("mon_stretch_pool_min_size");

>
> Would it make sense that when entering degraded mode, min was set to
> min(round(size/2)-1, 1); and after recovery it was set to round(size/2)?
>

Yeah, that seems more sensible. I made a ticket:
https://tracker.ceph.com/issues/55573

Thanks!
-Greg


>
> Otherwise awesome feature really! :-)
>
> Cheers
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stretch cluster questions

2022-05-04 Thread Gregory Farnum
On Wed, May 4, 2022 at 1:25 AM Eneko Lacunza  wrote:

> Hi Gregory,
>
> El 3/5/22 a las 22:30, Gregory Farnum escribió:
>
> On Mon, Apr 25, 2022 at 12:57 AM Eneko Lacunza  
>  wrote:
>
> We're looking to deploy a stretch cluster for a 2-CPD deployment. I have
> read the following 
> docs:https://docs.ceph.com/en/latest/rados/operations/stretch-mode/#stretch-clusters
>
> I have some questions:
>
> - Can we have multiple pools in a stretch cluster?
>
> Yes.
>
>
> - Can we have multiple different crush rules in a stretch cluster? I'm
> asking this because the command for stretch mode activation asks for a
> rule...
>
> Right, so what happens there is any pool with a default rule gets
> switched to the specified CRUSH rule. That doesn't stop you from
> changing the rule after engaging stretch mode, or giving it a
> non-default rule ahead of time. You just have to be careful to make
> sure it satisfies the stretch mode rules about placing across data
> centers.
>
> So the only purpose of the "stretch_rule" param to "enable_stretch_mode"
> is replacing default replicated rule?
>
> If so, when stretch mode is activated, the described special behaviour
> applies to all crush rules/pools:
>
> - the OSDs will only take PGs active when they peer across data centers
> (or whatever other CRUSH bucket type you specified), assuming both are alive
>
> - Pools will increase in size from the default 3 to 4, expecting 2 copies
> in each site (for size=2 pools will it increase to 4 too?)
>
> Is this accurate?
>

Yep! AFAIK most clusters just use the default pool for replication, barring
mixed media types. *shrug*


>
> We want to have different purpose pools on this Ceph cluster:
>
> - Important VM disks, with 2 copies in each DC (SSD class)
> - Ephemeral VM disks, with just 2 copies overall (SSD class)
> - Backup data in just one DC (HDD class).
>
> Objective of the 2-DC deployment is disaster recovery, HA isn't
> required, but I'll take it if deployment is reasonable :-) .
>
> I'm leery of this for the reasons described in the docs — if you don't
> have 2 replicas per site, you lose data availability every time an OSD
> goes down for any reason (or else you have a window while recovery
> happens where the data is not physically available in both sites,
> which rather negates the purpose).
>
>
> Is this because the following: "the OSDs will only take PGs active when
> they peer across data centers (or whatever other CRUSH bucket type you
> specified), assuming both are alive"?
>

>
> This conclusion wasn't obvious to me, but after your reply and a third
> read, now seems expected :-)
>

Right!


>
> Thanks for your comments. I'll be doing some tests, will write back if I
> find something unexpected (to me?) :-)
>
> Cheers
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 | https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [CephFS, MDS] internal MDS internal heartbeat is not healthy!

2022-05-03 Thread Gregory Farnum
Okay, so you started out with 2 active MDSes and then they failed on a restart?
And in an effort to fix it you changed max_mds to 3? (That was a bad
idea, but *probably* didn't actually hurt anything this time — adding
new work to scale out a system which already can't turn on just
overloads it more!)

The logs here are not very apparent about what's going on. You should
set "debug ms = 1" and "debug mds = 20" on your MDSes, restart them
all, and then use ceph-post-file to upload them for analysis. The logs
here are very sparse and if the MDS internal heartbeat is unhealthy
there's something wrong in the depths that unfortunately isn't being
output in what's visible.
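
A sketch of that sequence (daemon names follow your output; with the
containerized deployment the logs may end up inside the containers or in
journald rather than /var/log/ceph):

  ceph config set mds debug_ms 1
  ceph config set mds debug_mds 20
  # restart the MDS daemons and let them fail again, then upload, e.g.:
  ceph-post-file /var/log/ceph/ceph-mds.fh_ceph_a.log
  # and revert the logging afterwards:
  ceph config rm mds debug_ms
  ceph config rm mds debug_mds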
-Greg

On Tue, May 3, 2022 at 1:25 PM Wagner-Kerschbaumer
 wrote:
>
> Hi All!
> My CephFS data pool on a 15.2.12 stopped working overnight.
> I have too much data on there that I planned to migrate today. (Not
> possible now that I can't get CephFS back up.)
>
> Something is very off, and I cant pinpoint what. the mds keeps failing
>
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200
> 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after
> 15
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.343+0200
> 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to
> monitors (last acked 24.0037s ago
> ); MDS internal heartbeat is not healthy!
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200
> 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after
> 15
> May 03 11:58:40 fh_ceph_a conmon[4835]: 2022-05-03T11:58:40.843+0200
> 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to
> monitors (last acked 24.5037s ago); MDS internal heartbeat is not
> healthy!
> May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200
> 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank' had timed out after
> 15
> May 03 11:58:41 fh_ceph_a conmon[4835]: 2022-05-03T11:58:41.343+0200
> 7fffe4bb0700  0 mds.beacon.fh_ceph_a Skipping beacon heartbeat to
> monitors (last acked 25.0037s ago); MDS internal heartbeat is not
> healthy!
> [root@fh_ceph_b /]#free -h
>   totalusedfree  shared  buff/cache
> available
> Mem:  251Gi   168Gi75Gi   4.0Gi   7.1Gi
> 70Gi
> Swap: 4.0Gi  0B   4.0Gi
> [root@fh_ceph_b /]# ceph -s
>   cluster:
> id: deadbeef-7d25-40ec-abc4-202104a6d54a
> health: HEALTH_WARN
> 1 filesystem is degraded
> 1 nearfull osd(s)
> 13 pool(s) nearfull
>
>   services:
> mon: 3 daemons, quorum fh_ceph_a,fh_ceph_b,fh_ceph_c (age 5M)
> mgr: fh_ceph_b(active, since 5M), standbys: fh_ceph_a, fh_ceph_c,
> fh_ceph_d
> mds: cephfs:2/2 {0=fh_ceph_c=up:resolve,1=fh_ceph_a=up:replay} 1
> up:standby
> osd: 40 osds: 40 up (since 5M), 40 in (since 5M)
> rgw: 4 daemons active (fh_ceph_a.rgw0, fh_ceph_b.rgw0,
> fh_ceph_c.rgw0, fh_ceph_d.rgw0)
>
>   task status:
>
>   data:
> pools:   13 pools, 1929 pgs
> objects: 48.08M objects, 122 TiB
> usage:   423 TiB used, 215 TiB / 638 TiB avail
> pgs: 1922 active+clean
>  7active+clean+scrubbing+deep
>
>   io:
> client:   6.2 MiB/s rd, 2 op/s rd, 0 op/s wr
>
> after setting ceph fs set cephfs max_mds 3 and some time the state on
> one changed at least to resolve
>
> (example)
> [root@fh_ceph_a ~] :date ; podman exec ceph-mon-fh_ceph_a ceph fs
> status cephfs
> Tue  3 May 12:14:12 CEST 2022
> cephfs - 40 clients
> ==
> RANK   STATE  MDS ACTIVITY   DNSINOS
>  0resolve  fh_ceph_c27.0k  27.0k
>  1 replay  fh_ceph_d   0  0
>   POOL TYPE USED  AVAIL
> cephfs_metadata  metadata  48.7G  17.5T
>   cephfs_data  data 367T  17.5T
> STANDBY MDS
>  fh_ceph_b
>  fh_ceph_a
> MDS version: ceph version 15.2.12
> (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable)
>
> logs of failing mds (journalctl -f -u ceph-mds@$(hostname).service --
> since "5 minutes ago")
> May 03 11:59:37 fh_ceph_b conmon[12777]:-20> 2022-05-
> 03T11:59:36.068+0200 7fffe63b3700 10 monclient: _check_auth_rotating
> have uptodate secrets (they expire after 2022-05-
> 03T11:59:06.069985+0200)
> May 03 11:59:37 fh_ceph_b conmon[12777]:-19> 2022-05-
> 03T11:59:36.085+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:-18> 2022-05-
> 03T11:59:36.085+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping
> beacon heartbeat to monitors (last acked 51.0078s ago); MDS internal
> heartbeat is not healthy!
> May 03 11:59:37 fh_ceph_b conmon[12777]:-17> 2022-05-
> 03T11:59:36.585+0200 7fffe4bb0700  1 heartbeat_map is_healthy 'MDSRank'
> had timed out after 15
> May 03 11:59:37 fh_ceph_b conmon[12777]:-16> 2022-05-
> 03T11:59:36.585+0200 7fffe4bb0700  0 mds.beacon.fh_ceph_b Skipping
> beacon heartbeat to monitors (last acked 51.5078s ago); M

[ceph-users] Re: Stretch cluster questions

2022-05-03 Thread Gregory Farnum
On Mon, Apr 25, 2022 at 12:57 AM Eneko Lacunza  wrote:
>
> Hi all,
>
> We're looking to deploy a stretch cluster for a 2-CPD deployment. I have
> read the following docs:
> https://docs.ceph.com/en/latest/rados/operations/stretch-mode/#stretch-clusters
>
> I have some questions:
>
> - Can we have multiple pools in a stretch cluster?

Yes.

> - Can we have multiple different crush rules in a stretch cluster? I'm
> asking this because the command for stretch mode activation asks for a
> rule...

Right, so what happens there is any pool with a default rule gets
switched to the specified CRUSH rule. That doesn't stop you from
changing the rule after engaging stretch mode, or giving it a
non-default rule ahead of time. You just have to be careful to make
sure it satisfies the stretch mode rules about placing across data
centers.
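
As a rough sketch, a rule that satisfies that requirement looks something like
this (the bucket and site names are illustrative; see the stretch mode docs
for the full procedure):

  rule stretch_rule {
          id 1
          type replicated
          step take site1
          step chooseleaf firstn 2 type host
          step emit
          step take site2
          step chooseleaf firstn 2 type host
          step emit
  }

Once it's compiled into the CRUSH map, that rule name is what gets passed to
"ceph mon enable_stretch_mode <tiebreaker-mon> stretch_rule datacenter".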


> We want to have different purpose pools on this Ceph cluster:
>
> - Important VM disks, with 2 copies in each DC (SSD class)
> - Ephemeral VM disks, with just 2 copies overall (SSD class)
> - Backup data in just one DC (HDD class).
>
> Objective of the 2-DC deployment is disaster recovery, HA isn't
> required, but I'll take it if deployment is reasonable :-) .

I'm leery of this for the reasons described in the docs — if you don't
have 2 replicas per site, you lose data availability every time an OSD
goes down for any reason (or else you have a window while recovery
happens where the data is not physically available in both sites,
which rather negates the purpose).
-Greg

>
> Alternative would be a size=4/min=3 pool for important VM disks in a
> no-stretch cluster...
>
> Thanks
>
> Eneko Lacunza
> Zuzendari teknikoa | Director técnico
> Binovo IT Human Project
>
> Tel. +34 943 569 206 |https://www.binovo.es
> Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun
>
> https://www.youtube.com/user/CANALBINOVO
> https://www.linkedin.com/company/37269706/
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stop Rebalancing

2022-04-13 Thread Gregory Farnum
On Wed, Apr 13, 2022 at 10:01 AM Dan van der Ster  wrote:
>
> I would set the pg_num, not pgp_num. In older versions of ceph you could
> manipulate these things separately, but in pacific I'm not confident about
> what setting pgp_num directly will do in this exact scenario.
>
> To understand, the difference between these two depends on if you're
> splitting or merging.
> First, definitions: pg_num is the number of PGs and pgp_num is the number
> used for placing objects.
>
> So if pgp_num < pg_num, then at steady state only pgp_num pgs actually
> store data, and the other pg_num-pgp_num PGs are sitting empty.

Wait, what? That's not right! pgp_num is pg *placement* number; it
controls how we map PGs to OSDs. But the full pg still exists as its
own thing on the OSD and has its own data structures and objects. If
currently the cluster has reduced pgp_num it has changed the locations
of PGs, but it hasn't merged any PGs together. Changing the pg_num and
causing merges will invoke a whole new workload which can be pretty
substantial.
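
A quick way to see both values on a live pool, using the pool name from this
thread:

  ceph osd pool get pool7 pg_num    # PGs that actually exist on the OSDs
  ceph osd pool get pool7 pgp_num   # PGs used for placing data
  ceph osd pool ls detail           # also shows the *_target values while a change is in flight

Lowering pgp_num only remaps data; lowering pg_num additionally merges PGs,
which is the extra workload described above.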
-Greg

>
> To merge PGs, Ceph decreases pgp_num to squeeze the objects into fewer pgs,
> then decreases pg_num as the PGs are emptied to actually delete the now
> empty PGs.
>
> Splitting is similar but in reverse: first, Ceph creates new empty PGs by
> increasing pg_num. Then it gradually increases pgp_num to start sending
> data to the new PGs.
>
> That's the general idea, anyway.
>
> Long story short, set pg_num to something close to the current
> pgp_num_target.
>
> .. Dan
>
>
> On Wed., Apr. 13, 2022, 18:43 Ray Cunningham, 
> wrote:
>
> > Thank you so much, Dan!
> >
> > Can you confirm for me that for pool7, which has 2048/2048 for pg_num and
> > 883/2048 for pgp_num, we should change pg_num or pgp_num? And can they be
> > different for a single pool, or does pg_num and pgp_num have to always be
> > the same?
> >
> > IF we just set pgp_num to 890 we will have pg_num at 2048 and pgp_num at
> > 890, is that ok? Because if we reduce the pg_num by 1200 it will just start
> > a whole new load of misplaced object rebalancing. Won't it?
> >
> > Thank you,
> > Ray
> >
> >
> > -Original Message-
> > From: Dan van der Ster 
> > Sent: Wednesday, April 13, 2022 11:11 AM
> > To: Ray Cunningham 
> > Cc: ceph-users@ceph.io
> > Subject: Re: [ceph-users] Stop Rebalancing
> >
> > Hi, Thanks.
> >
> > norebalance/nobackfill are useful to pause ongoing backfilling, but aren't
> > the best option now to get the PGs to go active+clean and let the mon db
> > come back under control. Unset those before continuing.
> >
> > I think you need to set the pg_num for pool1 to something close to but
> > less than 926. (Or whatever the pg_num_target is when you run the command
> > below).
> > The idea is to let a few more merges complete successfully but then once
> > all PGs are active+clean to take a decision about the other interventions
> > you want to carry out.
> > So this ought to be good:
> > ceph osd pool set pool1 pg_num 920
> >
> > Then for pool7 this looks like splitting is ongoing. You should be able to
> > pause that by setting the pg_num to something just above 883.
> > I would do:
> > ceph osd pool set pool7 pg_num 890
> >
> > It may even be fastest to just set those pg_num values to exactly what the
> > current pgp_num_target is. You can try it.
> >
> > Once your cluster is stable again, then you should set those to the
> > nearest power of two.
> > Personally I would wait for #53729 to be fixed before embarking on future
> > pg_num changes.
> > (You'll have to mute a warning in the meantime -- check the docs after the
> > warning appears).
> >
> > Cheers, dan
> >
> > On Wed, Apr 13, 2022 at 5:16 PM Ray Cunningham <
> > ray.cunning...@keepertech.com> wrote:
> > >
> > > Perfect timing, I was just about to reply. We have disabled autoscaler
> > on all pools now.
> > >
> > > Unfortunately, I can't just copy and paste from this system...
> > >
> > > `ceph osd pool ls detail` only 2 pools have any difference.
> > > pool1:  pgnum 940, pgnum target 256, pgpnum 926, pgpnum target 256
> > > pool7:  pgnum 2048, pgnum target 2048, pgpnum 883, pgpnum target 2048
> > >
> > > ` ceph osd pool autoscale-status`
> > > Size is defined
> > > target size is empty
> > > Rate is 7 for all pools except pool7, which is 1.333730697632 Raw
> > > capacity is defined Ratio for pool1 is .0177, pool7 is .4200 and all
> > > others is 0 Target and Effective Ratio is empty Bias is 1.0 for all
> > > PG_NUM: pool1 is 256, pool7 is 2048 and all others are 32.
> > > New PG_NUM is empty
> > > Autoscale is now off for all
> > > Profile is scale-up
> > >
> > >
> > > We have set norebalance and nobackfill and are watching to see what
> > happens.
> > >
> > > Thank you,
> > > Ray
> > >
> > > -Original Message-
> > > From: Dan van der Ster 
> > > Sent: Wednesday, April 13, 2022 10:00 AM
> > > To: Ray Cunningham 
> > > Cc: ceph-users@ceph.io
> > > Subject: Re: [ceph-users] Stop Rebalancing
> > >
> > > One mor

[ceph-users] Re: Cephfs default data pool (inode backtrace) no longer a thing?

2022-03-21 Thread Gregory Farnum
The backtraces are written out asynchronously by the MDS to those
objects, so there can be a delay between file creation and when they
appear. In fact I think backtraces only get written when the inode in
question is falling out of the MDS journal, so if you have a
relatively small number of files which are consistently getting
updated, they can go arbitrarily long without the backtrace object
getting written out to RADOS objects.
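
If you want to look at one, the backtrace lives in the "parent" xattr of a
file's first object; a rough sketch (the inode and pool names are
illustrative):

  rados -p cephfs_default_data getxattr 10000000000.00000000 parent > /tmp/parent
  ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json

For files whose data lives in another pool, the default data pool just holds a
zero-sized object carrying that xattr.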

Side note: I think you should be able to specify your EC pool as the
default data pool for a filesystem, which would prevent you from
needing a separate replicated pool storing those backtraces. Unless we
still have a programmed limit from back when EC pools couldn't handle
omap?
-Greg

On Thu, Mar 17, 2022 at 1:15 PM Vladimir Brik
 wrote:
>
> Never mind. I don't know what changed, but I am seeing
> 0-size objects in the default pool now.
>
> Vlad
>
> On 3/16/22 11:02, Vladimir Brik wrote:
> >  > Are you sure there are no objects?
> > Yes. In the 16.2.7 cluster "ceph df" reports no objects in
> > the default data pool. I am wondering if I need to something
> > special ensure that recovery data is stored in a fast pool
> > and not together with data in the EC pool.
> >
> > In my other cluster that was deployed when Ceph was on
> > version 14 (but upgraded to 15 since), there are a lot of
> > 0-size objects in the default data pool.
> >
> > Vlad
> >
> >
> > On 3/16/22 04:16, Frank Schilder wrote:
> >> Are you sure there are no objects? Here is what it looks
> >> on our FS:
> >>
> >>  NAME            ID  USED     %USED  MAX AVAIL  OBJECTS
> >>  con-fs2-meta1   12  474 MiB   0.04    1.0 TiB   35687606
> >>  con-fs2-meta2   13      0 B      0    1.0 TiB  300163323
> >>
> >> Meta1 is the meta-data pool and meta2 the default data
> >> pool. It shows 0 bytes, but contains 10x the objects that
> >> sit in the meta data pool. These objects contain only meta
> >> data. That's why no actual usage is reported (at least on
> >> mimic).
> >>
> >> The data in this default data pool is a serious challenge
> >> for recovery. I put it on fast SSDs, but the large number
> >> of objects requires aggressive recovery options. With the
> >> default settings recovery of this pool takes longer than
> >> the rebuild of data in the EC data pools on HDD. I also
> >> allocated lots of PGs to it to reduce the object count per
> >> PG. Having this data on fast drives with tuned settings
> >> helps a lot with overall recovery and snaptrim.
> >>
> >> Best regards,
> >> =
> >> Frank Schilder
> >> AIT Risø Campus
> >> Bygning 109, rum S14
> >>
> >> 
> >> From: Vladimir Brik 
> >> Sent: 15 March 2022 20:53:25
> >> To: ceph-users
> >> Subject: [ceph-users] Cephfs default data pool (inode
> >> backtrace) no longer a thing?
> >>
> >> Hello
> >>
> >> https://docs.ceph.com/en/latest/cephfs/createfs/ mentions a
> >> "default data pool" that is used for "inode backtrace
> >> information, which is used for hard link management and
> >> disaster recovery", and "all CephFS inodes have at least one
> >> object in the default data pool".
> >>
> >> I noticed that when I create a volume using "ceph fs volume
> >> create" and then add the EC data pool where my files
> >> actually are, the default pool remains empty (no objects).
> >>
> >> Does this mean that the recommendation from the link above
> >> "If erasure-coded pools are planned for file system data, it
> >> is best to configure the default as a replicated pool" is no
> >> longer applicable, or do I need to configure something to
> >> avoid a performance hit when using EC data pools?
> >>
> >>
> >> Thanks
> >>
> >> Vlad
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Not able to start MDS after upgrade to 16.2.7

2022-02-11 Thread Gregory Farnum
On Fri, Feb 11, 2022 at 10:53 AM Izzy Kulbe  wrote:

> Hi,
>
> If the MDS host has enough spare memory, setting
> > `mds_cache_memory_limit`[*] to 9GB (or more if it permits) would get
> > rid of this warning. Could you check if that improves the situation?
> > Normally, the MDS starts trimming its cache when it overshoots the
> > cache limit.
> >
>
> That won't work. I tried setting it to 48GB and it still uses as much. The
> 9GB was just what it used at the time I ran the command; the MDS will run until memory
> exhaustion (I'd rather not wait until that happens, given that it's then down to
> the OOM killer to hopefully stop the right service).
>
> BTW, you earlier mentioned MDS using 150+GB of memory. That does not
> > seem to be happening now (after Dan's config suggestion?).
> >
>
> As said above, it will still happen; it will happily take up all RAM + Swap. I
> tried increasing swap but the MDS daemon will just balloon further until
> the OOM stops it.
>
> You don't seem to have a standby MDS configured? Is that intentional?
> >
>
> I stopped the service on the second host where the standby would run since
> it would just go into an endless loop of the first and second server
> switching roles until the active node runs into memory exhaustion and gets
> OOM killed. I figured that having a more controlled setup with only one MDS
> and with the standby deactivated would yield me a better setup for
> debugging in my situation.
>
> At this point the cluster does have an active MDS (probably with an
> > oversized cache). Could you try the suggestions I mentioned earlier?
> >
>
> If you're referring to the larger MDS Cache size, I've tried that. Also
> tried mds_oft_prefetch_dirfrags as per Zheng Yan's suggestion.
>
> Also a little update on what else I've tried, since I'm running out of time
> on a full reset(we kinda need that cluster so I don't have unlimited time
> to play around with the broken version of it):
>
> I stopped the MDS, then tried the commands below and finally started the
> MDS again. The result was the same - the MDS logs are full of "mds.0.24343
> mds has 1 queued contexts" and "mds.0.cache adjust_subtree_auth" getting
> logged a lot.
>
> cephfs-journal-tool --rank=backupfs:0 journal reset
> ceph fs reset backupfs --yes-i-really-mean-it
> cephfs-table-tool all reset session
> cephfs-table-tool all reset snap
> cephfs-table-tool all reset inode
> ceph config set mds mds_wipe_sessions 1 (it would run into an assert error
> without)


At this point you really can’t do anything with this filesystem except copy
the data out and delete it, then. You’ve destroyed lots of metadata it
needs to avoid doing things like copying over existing files when you start
writing new ones.
-Greg


>
> I'd then start the MDS and it would be listed as active in the ceph fs
> status:
>
> ceph fs status
> backupfs - 2 clients
> 
> RANK  STATE   MDS                      ACTIVITY     DNS    INOS   DIRS  CAPS
>  0    active  backupfs.SERVER.fsrhfw   Reqs: 0 /s  2926k  2926k   1073     0
>          POOL           TYPE      USED   AVAIL
>  cephfs.backupfs.meta  metadata   198G    623G
>  cephfs.backupfs.data    data    97.9T   30.4T
>
>
> On Fri, 11 Feb 2022 at 17:05, Izzy Kulbe  wrote:
>
> > Hi,
> >
> > at the moment no clients should be connected to the MDS(since the MDS
> > doesn't come up) and the cluster only serves these MDS. The MDS also
> didn't
> > start properly with mds_wipe_sessions = true.
> >
> > ceph health detail with the MDS trying to run:
> >
> > HEALTH_WARN 1 failed cephadm daemon(s); 3 large omap objects; 1 MDSs
> > report oversized cache; insufficient standby MDS daemons available
> > [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
> > daemon mds.backupfs.SERVER.fsrhfw on SERVER is in error state
> > [WRN] LARGE_OMAP_OBJECTS: 3 large omap objects
> > 3 large objects found in pool 'cephfs.backupfs.meta'
> > Search the cluster log for 'Large omap object found' for more
> details.
> > [WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
> > mds.backupfs.SERVER.fsrhfw(mds.0): MDS cache is too large (9GB/4GB);
> 0
> > inodes in use by clients, 0 stray files
> > [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons
> available
> > have 0; want 1 more
> >
> >
> > ceph versions
> > {
> > "mon": {
> > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> > pacific (stable)": 5
> > },
> > "mgr": {
> > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> > pacific (stable)": 2
> > },
> > "osd": {
> > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> > pacific (stable)": 19
> > },
> > "mds": {
> > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> > pacific (stable)": 1
> > },
> > "overall": {
> > "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> > pacific (stable)": 27
> > }
> > }
> >
> >
> > This is what the logs o

[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Gregory Farnum
“Up” is the set of OSDs which are alive from the calculated crush mapping.
“Acting” includes those extras which have been added in to bring the PG up
to proper size. So the PG does have 3 live OSDs serving it.

But perhaps the safety check *is* looking at up instead of acting? That
seems like a plausible bug. (Also, if crush is failing to map properly,
that’s not a great sign for your cluster health or design.)
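
If someone wants to check that by hand, something along these lines should
show the discrepancy (PG and OSD ids here are placeholders):

  ceph osd ok-to-stop 3                   # lists PGs that would become offline
  ceph pg 2.1f query | jq '.up, .acting'  # compare the up set with the acting set
  ceph pg dump pgs_brief                  # up/acting for every PG at once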

On Thu, Feb 10, 2022 at 11:26 AM 胡 玮文  wrote:

> I believe this is the reason.
>
> I mean number of OSDs in the “up” set should be at least 1 greater than
> the min_size for the upgrade to proceed. Or once any OSD is stopped, it can
> drop below min_size, and prevent the pg from becoming active. So just
> cleanup the misplaced and the upgrade should proceed automatically.
>
> But I’m a little confused. I think if you have only 2 up OSD in a
> replicate x3 pool, it should in degraded state, and should give you a
> HEALTH_WARN.
>
> On Feb 11, 2022, at 03:06, Zach Heise (SSCC)  wrote:
>
> 
>
> Hi Weiwen, thanks for replying.
>
> All of my replicated pools, including the newest ssdpool I made most
> recently, have a min_size of 2. My other two EC pools have a min_size of 3.
>
> Looking at pg dump output again, it does look like the two EC pools have
> exactly 4 OSDs listed in the "Acting" column, and everything else has 3
> OSDs in Acting. So that's as it should be, I believe?
>
> I do have some 'misplaced' objects on 8 different PGs (the
> active+clean+remapped ones listed in my original ceph -s output), that only
> have 2 "up" OSDs listed, but in the "Acting" columns each have 3 OSDs as
> they should. Apparently these 231 misplaced objects aren't enough to cause
> ceph to drop out of HEALTH_OK status.
>
> Zach
>
>
> On 2022-02-10 12:41 PM, huw...@outlook.com
> wrote:
>
> Hi Zach,
>
> How about your min_size setting? Have you checked the number of OSDs in
> the acting set of every PG is at least 1 greater than the min_size of the
> corresponding pool?
>
> Weiwen Hu
>
>
>
> On Feb 10, 2022, at 05:02, Zach Heise (SSCC) <he...@ssc.wisc.edu> wrote:
>
> Hello,
>
> ceph health detail says my 5-node cluster is healthy, yet when I ran ceph
> orch upgrade start --ceph-version 16.2.7 everything seemed to go fine until
> we got to the OSD section, now for the past hour, every 15 seconds a new
> log entry of  'Upgrade: unsafe to stop osd(s) at this time (1 PGs are or
> would become offline)' appears in the logs.
>
> ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything too.
> Yet somehow 1 PG is (apparently) holding up all the OSD upgrades and not
> letting the process finish. Should I stop the upgrade and try it again? (I
> haven't done that before so was just nervous to try it). Any other ideas?
>
>   cluster:
>     id:     9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
>     health: HEALTH_OK
>
>   services:
>     mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
>     mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>
>   data:
>     volumes: 1/1 healthy
>     pools:   7 pools, 193 pgs
>     objects: 3.72k objects, 14 GiB
>     usage:   43 GiB used, 64 TiB / 64 TiB avail
>     pgs:     231/11170 objects misplaced (2.068%)
>              185 active+clean
>              8   active+clean+remapped
>
>   io:
>     client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>
>   progress:
>     Upgrade to 16.2.7 (5m)
>       [=...] (remaining: 24m)
>
> --
> Zach
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io ceph-users-le...@ceph.io>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cluster healthy, but 16.2.7 osd daemon upgrade says its unsafe to stop them?

2022-02-10 Thread Gregory Farnum
I don’t know how to get better errors out of cephadm, but the only way I
can think of for this to happen is if your crush rule is somehow placing
multiple replicas of a pg on a single host that cephadm wants to upgrade.
So check your rules, your pool sizes, and osd tree?
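
Concretely, these are the standard read-only views for that kind of check:

  ceph osd tree               # host/rack hierarchy and device classes
  ceph osd pool ls detail     # size, min_size and crush_rule per pool
  ceph osd crush rule dump    # failure domain each rule chooses across
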
-Greg

On Thu, Feb 10, 2022 at 8:25 AM Zach Heise (SSCC) 
wrote:

> It could be an issue with the devicehealthpool as you are correct, it is a
> single PG - but when the cluster is reporting that everything is healthy,
> it's difficult where to go from there. What I don't understand is why its
> refusing to upgrade ANY of the osd daemons; I have 33 of them, why would a
> single PG going offline be a problem for all of them?
>
> I did try stopping the upgrade and restarting it, but it just picks up at
> the same place (11/56 daemons upgraded) and immediately reports the same
> issue.
>
> Is there any way to at least tell which PG is the problematic one?
>
>
> Zach
>
>
> On 2022-02-09 4:19 PM, anthony.da...@gmail.com wrote:
>
> Speculation:  might the devicehealth pool be involved?  It seems to typically 
> have just 1 PG.
>
>
>
>
> On Feb 9, 2022, at 1:41 PM, Zach Heise (SSCC)  
>  wrote:
>
> Good afternoon, thank you for your reply. Yes I know you are right, 
> eventually we'll switch to an odd number of mons rather than even. We're 
> still in 'testing' mode right now and only my coworkers and I are using the 
> cluster.
>
> Of the 7 pools, all but 2 are replica x3. The last two are EC 2+2.
>
> Zach Heise
>
>
> On 2022-02-09 3:38 PM, sascha.art...@gmail.com wrote:
>
> Hello,
>
> all your pools running replica > 1?
> also having 4 monitors is pretty bad for split brain situations..
>
> Zach Heise (SSCC)  wrote on Wed, Feb 9, 2022 at 22:02:
>
>Hello,
>
>ceph health detail says my 5-node cluster is healthy, yet when I ran
>ceph orch upgrade start --ceph-version 16.2.7 everything seemed to go
>fine until we got to the OSD section, now for the past hour, every 15
>seconds a new log entry of  'Upgrade: unsafe to stop osd(s) at
>this time
>(1 PGs are or would become offline)' appears in the logs.
>
>ceph pg dump_stuck (unclean, degraded, etc) shows "ok" for everything
>too. Yet somehow 1 PG is (apparently) holding up all the OSD upgrades
>and not letting the process finish. Should I stop the upgrade and
>try it
>again? (I haven't done that before so was just nervous to try it).
>Any
>other ideas?
>
>   cluster:
> id: 9aa000e8-b999-11eb-82f2-ecf4bbcc0ac0
> health: HEALTH_OK
>
>   services:
> mon: 4 daemons, quorum ceph05,ceph04,ceph01,ceph03 (age 92m)
> mgr: ceph03.futetp(active, since 97m), standbys: ceph01.fblojp
> mds: 1/1 daemons up, 1 hot standby
> osd: 33 osds: 33 up (since 2h), 33 in (since 4h); 9 remapped pgs
>
>   data:
> volumes: 1/1 healthy
> pools:   7 pools, 193 pgs
> objects: 3.72k objects, 14 GiB
> usage:   43 GiB used, 64 TiB / 64 TiB avail
> pgs: 231/11170 objects misplaced (2.068%)
>  185 active+clean
>  8   active+clean+remapped
>
>   io:
> client:   1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
>
>   progress:
> Upgrade to 16.2.7 (5m)
>   [=...] (remaining: 24m)
>
>-- Zach
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs: [ERR] loaded dup inode

2022-02-08 Thread Gregory Farnum
On Tue, Feb 8, 2022 at 7:30 AM Dan van der Ster  wrote:
>
> On Tue, Feb 8, 2022 at 1:04 PM Frank Schilder  wrote:
> > The reason for this seemingly strange behaviour was an old static snapshot 
> > taken in an entirely different directory. Apparently, ceph fs snapshots are 
> > not local to an FS directory sub-tree but always global on the entire FS 
> > despite the fact that you can only access the sub-tree in the snapshot, 
> > which easily leads to the wrong conclusion that only data below the 
> > directory is in the snapshot. As a consequence, the static snapshot was 
> > accumulating the garbage from the rotating snapshots even though these 
> > sub-trees were completely disjoint.
>
> So are you saying that if I do this I'll have 1M files in stray?

No, happily.

The thing that's happening here post-dates my main previous stretch on
CephFS and I had forgotten it, but there's a note in the developer
docs: https://docs.ceph.com/en/latest/dev/cephfs-snapshots/#hard-links
(I fortuitously stumbled across this from an entirely different
direction/discussion just after seeing this thread and put the pieces
together!)

Basically, hard links are *the worst*. For everything in filesystems.
I spent a lot of time trying to figure out how to handle hard links
being renamed across snapshots[1] and never managed it, and the
eventual "solution" was to give up and do the degenerate thing:
If there's a file with multiple hard links, that file is a member of
*every* snapshot.

Doing anything about this will take a lot of time. There's probably an
opportunity to improve it for users of the subvolumes library, as
those subvolumes do get tagged a bit, so I'll see if we can look into
that. But for generic CephFS, I'm not sure what the solution will look
like at all.

Sorry folks. :/
-Greg

[1]: The issue is that, if you have a hard linked file in two places,
you would expect it to be snapshotted whenever a snapshot covering
either location occurs. But in CephFS the file can only live in one
location, and the other location has to just hold a reference to it
instead. So say you have inode Y at path A, and then hard link it in
at path B. Given how snapshots work, when you open up Y from A, you
would need to check all the snapshots that apply from both A and B's
trees. But 1) opening up other paths is a challenge all on its own,
and 2) without an inode and its backtrace to provide a lookup resolve
point, it's impossible to maintain a lookup that scales and is
possible to keep consistent.
(Oh, I did just have one idea, but I'm not sure if it would fix every
issue or just that scalable backtrace lookup:
https://tracker.ceph.com/issues/54205)

>
> mkdir /a
> cd /a
> for i in {1..1000000}; do touch $i; done  # create 1M files in /a
> cd ..
> mkdir /b
> mkdir /b/.snap/testsnap  # create a snap in the empty dir /b
> rm -rf /a/
>
>
> Cheers, Dan
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Performance very bad even in Memory?!

2022-01-31 Thread Gregory Farnum
There's a lot going on here. Some things I noticed you should be aware
of in relation to the tests you performed:

* Ceph may not have the performance ceiling you're looking for. A
write IO takes about half a millisecond of CPU time, which used to be
very fast and is now pretty slow compared to an NVMe device. Crimson
will reduce this but is not ready for real users yet. Of course, it
scales out quite well, which your test is not going to explore with a
single client and 4 OSDs.

* If you are seeing reported 4k IO latencies measured in seconds,
something has gone horribly wrong. Ceph doesn't do that, unless you
simply queue up so much work it can't keep up (and it tries to prevent
you from doing that).

* I don't know what the *current* numbers are, but in the not-distant
past, 40k IOPs was about as much as a single rbd device could handle
on the client side. So whatever else is happening, there's a good
chance that's the limit you're actually exposing in this test.

* Your ram disks may not be as fast as you think they are under a
non-trivial load. Network IO, moving data between kernel and userspace
that Ceph has to do and local FIO doesn't, etc will all take up
roughly equivalent portions of that 10GB/s bandwidth you saw and split
up the streams, which may slow it down. Once your CPU has to do
anything else, it will be able to feed the RAM less quickly because
it's doing other things. Etc etc etc (Memory bandwidth is a *really*
complex topic.)

There are definitely proprietary distributed storage systems that can
go faster than Ceph, and there may be open-source ones — but most of
them don't provide the durability and consistency guarantees you'd
expect under a lot of failure scenarios.
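
If you want to explore the scale-out side rather than the single-client
limits mentioned above, one rough approach is to run several bench clients
in parallel against a throwaway pool and add up the results, for example:

  # run concurrently from several client machines, then sum the reported IOPS
  rados bench -p testpool 30 write -b 4096 -t 64 --no-cleanup
  rados -p testpool cleanup
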
-Greg


On Sat, Jan 29, 2022 at 8:42 PM sascha a.  wrote:
>
> Hello,
>
> I'm currently in the process of setting up a production ceph cluster on a 40
> gbit network (for sure 40gb internal and public network).
>
> Did a lot of machine/linux tweaking already:
>
> - cpupower state disable
> - lowlatency kernel
> - kernel tweaks
> - rx buffer optimize
> - affinity mappings
> - correct bus mapping of pcie cards
> - mtu..
> + many more
>
> My machines are connected over a switch which is capable of doing multiple
> TBit/s.
>
> iperf result between two machines:
> single connection ~20 Gbit/s (reaching single core limits)
> multiple connections ~39 Gbit/s
>
> Perfect starting point i would say. Let's assume we only reach 20 Gbit/s as
> network speed.
>
> Now i wanted to check how much overhead ceph has and what's the absolute
> maximum i could get out of my cluster.
>
> Using 8 Servers (16 cores + HT), (all the same cpus/ram/network cards).
> Having plenty of RAM which I'm using for my tests. Simple pool by using 3x
> replication.
> For that reason and to prove that my servers have high quality and speed I
> created 70 GB RAM-drives on all of them.
>
> The RAM-drives were created by using the kernel module "brd".
> Benchmarking the RAM-drives gave the following result by using fio:
>
> 5m IO/s read@4k
> 4m IO/s write@4k
>
> read and write | latency below 10 us@QD=16
> read and write | latency below 50 us@QD=256
>
> 1,5 GB/s sequential read@4k (QD=1)
> 1,0 GB/s sequential write@4k (QD=1)
>
> 15 GB/s  read@4k (QD=256)
> 10 GB/s  write@4k (QD=256)
>
> Pretty impressive, disks we are all dreaming about I would say.
>
> Making sure i don't bottleneck anything i created following Setup:
>
> - 3 Servers, each running a mon,mgr and mds (all running in RAM including
> their dirs in /var/lib/ceph.. by using tmpfs/brd ramdisk)
> - 4 Servers mapping their ramdrive as OSDs, created with bluestore by using
> ceph-volume raw or lvm.
> - 1 Server using rbd-nbd to map one rbd as drive and to benchmark it
>
> I would in this scenario expect impressive results, the only
> bottleneck between the servers is the 20 Gbit network speed.
> Everything else is running completely in low latency ECC memory and should
> be blazing fast until the network speed is reached.
>
> The benchmark was monitored by using this tool here:
> https://github.com/ceph/ceph/blob/master/src/tools/histogram_dump.py also
> by looking at the raw data of  "ceph daemon osd.7 perf dump".
>
>
> *Result:*
> Either there is something really going wrong or ceph has huge bottlenecks
> inside the software which should be solved...
>
> histogram_dump often showed latency spikes between "1M and 51M", ... which, if
> I read it correctly, is seconds?!
> How is that possible? This should always be between 0-99k..
>
> The result of perf dump was also crazy slow:
> https://pastebin.com/ukV0LXWH
>
> especially kpis here:
> - op_latency
> - op_w_latency
> - op_w_prepare_latency
> - state_deferred_cleanup_lat
>
> all in areas of milliseconds?
>
> FIO reached against the nbd drive 40k IOPs@4k with QD=256 and latency which
> is in 1-7 milliseconds..
>
> Calculating with 40gbit, 4k of data flows through the network
> about 4*1024/400/8 = 128ns, lets say its 2us cause of the overhead,
> multipli

[ceph-users] Re: Ideas for Powersaving on archive Cluster ?

2022-01-21 Thread Gregory Farnum
I would not recommend this on Ceph. There was a project where somebody
tried to make RADOS amenable to spinning down drives, but I don't
think it ever amounted to anything.

The issue is just that the OSDs need to do disk writes whenever they
get new OSDMaps, there's a lot of random stuff that updates them, and
nothing tries to constrain it to restrict writes in mostly-idle
clusters. So they wake up constantly to do internal maintenance and
heartbeats even if the cluster is idle.

If you *really* don't use the data often, the best approach is
probably just to turn it all off. You'll need to make sure it turns on
fast enough, but if you do a clean shutdown of everything with the
right settings applied (you may or may not need things like
nodown/noup when changing states, to prevent a lot of map churn) you
should be able to make it work.
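
A sketch of what "the right settings" usually means in practice (set before
the shutdown, clear again once everything is powered back up):

  ceph osd set noout          # don't mark stopped OSDs out and trigger backfill
  ceph osd set norebalance
  ceph osd set nodown         # optional: limits map churn while daemons stop
  # ...cleanly stop the daemons / power off, later power back on, then:
  ceph osd unset nodown
  ceph osd unset norebalance
  ceph osd unset noout
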
-Greg

On Fri, Jan 21, 2022 at 7:40 AM Sebastian Mazza  wrote:
>
>
>
> > On 21.01.2022, at 14:36, Marc  wrote:
> >
> >>
> >> I wonder if it is possible to let the HDDs sleep or if the OSD daemons
> >> prevent a halt of the spindle motors. Or can it even create some problems
> >> for the OSD daemon if the HDD spins down?
> >> However, it should be easy to check on a cluster without any load and
> >> optimally on a cluster that is not in production, by something like:
> >>
> >
> > From what I can remember was always the test result of spinning down/up 
> > drives that it causes more wear/damage then just leaving them spinning.
> >
>
> If you do a spin down / up every 20 minutes or so the wear/damage of the 
> motors is probably a problem. But Christoph stated that the cluster is not 
> used for several days and I don't think one spin up/down per day generates 
> enough spin ups of the spindle motor to be concerned about that.
> I have backup storage servers (no ceph) that are running for many years now. 
> The HDDs in this server are spinning only for one or two hours per day and 
> compared to HDDs in production servers that read and write 24/7, they
> hardly ever fail. So I wouldn't worry about wear and tear of the motors from
> spin ups on an archive system that is only used once every few days.
> However, it could be that it heavily depends on the drives and I was only
> extraordinarily lucky with all the WD, HGST and Seagate drives in our backup
> machines.
>
> Best regards,
> Sebastian
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v16.2.7 Pacific released

2022-01-11 Thread Gregory Farnum
On Tue, Jan 11, 2022 at 5:29 AM Dan van der Ster  wrote:
>
> Hi,
>
> Yes it's confusing -- the release notes are normally only published in
> master, which is shown as "latest", and are rarely backported to a
> release branch.
> The notes you're looking for are here:
> https://docs.ceph.com/en/latest/releases/pacific/#v16-2-7-pacific
>
> Zac is in cc -- maybe we can make this more friendly?

Yeah, we need to push minor-version release notes to their relevant
branch so users see it in the docs that are relevant to their version.
-Greg

>
> -- dan
>
> On Tue, Jan 11, 2022 at 2:11 PM Frank Schilder  wrote:
> >
> > Hi Dan,
> >
> > I seem to have a problem finding correct release notes and upgrade 
> > instructions on the ceph docs. If I open https://docs.ceph.com/en/pacific/ 
> > and go to "Ceph Releases", pacific is not even listed even though these doc 
> > pages are supposedly about pacific. I also can't find the "upgrading from 
> > nautilus or octopus" pages anywhere. If I go to v:latest, I can see such 
> > information. Is there a link error that, when selecting version pacific 
> > shows the pages for octopus? This would be very confusing, because the link 
> > "https://docs.ceph.com"; forwards to "https://docs.ceph.com/en/pacific/"; 
> > which would show information on octopus.
> >
> > Best regards,
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Dan van der Ster 
> > Sent: 09 December 2021 17:58:22
> > To: David Galloway
> > Cc: ceph-annou...@ceph.io; Ceph Users; Ceph Developers; 
> > ceph-maintain...@ceph.io; Patrick Donnelly
> > Subject: [ceph-users] Re: v16.2.7 Pacific released
> >
> > Hi all,
> >
> > The release notes are missing an upgrade step that is needed only for
> > clusters *not* managed by cephadm.
> > This was noticed in
> > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7KSPSUE4VO274H5XQYNFCT7HKWT75BCY/
> >
> > If you are not using cephadm, you must disable FSMap sanity checks
> > *before starting the upgrade*:
> >
> >  ceph config set mon mon_mds_skip_sanity 1
> >
> > After the upgrade is finished and the cluster is stable, please remove
> > that setting:
> >
> > ceph config rm mon mon_mds_skip_sanity
> >
> > Clusters upgraded by cephadm take care of this step automatically.
> >
> > Best Regards,
> >
> > Dan
> >
> >
> >
> > On Wed, Dec 8, 2021 at 1:12 AM David Galloway  wrote:
> > >
> > > We're happy to announce the 7th backport release in the Pacific series.
> > > We recommend all users upgrade to this release.
> > >
> > > Notable Changes
> > > ---
> > >
> > > * Critical bug in OMAP format upgrade is fixed. This could cause data
> > > corruption (improperly formatted OMAP keys) after pre-Pacific cluster
> > > upgrade if bluestore-quick-fix-on-mount parameter is set to true or
> > > ceph-bluestore-tool's quick-fix/repair commands are invoked. Relevant
> > > tracker: https://tracker.ceph.com/issues/53062.
> > > bluestore-quick-fix-on-mount continues to be set to false, by default.
> > >
> > > * MGR: The pg_autoscaler will use the 'scale-up' profile as the default
> > > profile. 16.2.6 changed the default profile to 'scale-down' but we ran
> > > into issues with the device_health_metrics pool consuming too many PGs,
> > > which is not ideal for performance. So we will continue to use the
> > > 'scale-up' profile by default,  until we implement a limit on the number
> > > of PGs default pools should consume, in combination with the
> > > 'scale-down' profile.
> > >
> > > * Cephadm & Ceph Dashboard: NFS management has been completely reworked
> > > to ensure that NFS exports are managed consistently across the different
> > > Ceph components. Prior to this, there were 3 incompatible
> > > implementations for configuring the NFS exports: Ceph-Ansible/OpenStack
> > > Manila, Ceph Dashboard and 'mgr/nfs' module. With this release the
> > > 'mgr/nfs' way becomes the official interface, and the remaining
> > > components (Cephadm and Ceph Dashboard) adhere to it. While this might
> > > require manually migrating from the deprecated implementations, it will
> > > simplify the user experience for those heavily relying on NFS exports.
> > >
> > > * Dashboard: "Cluster Expansion Wizard". After the 'cephadm bootstrap'
> > > step, users that log into the Ceph Dashboard will be presented with a
> > > welcome screen. If they choose to follow the installation wizard, they
> > > will be guided through a set of steps to help them configure their Ceph
> > > cluster: expanding the cluster by adding more hosts, detecting and
> > > defining their storage devices, and finally deploying and configuring
> > > the different Ceph services.
> > >
> > > * OSD: When using mclock_scheduler for QoS, there is no longer a need to
> > > run any manual benchmark. The OSD now automatically sets an appropriate
> > > value for osd_mclock_max_capacity_iops by running a simple benchmark
> > > dur

[ceph-users] Re: OSD write op out of order

2021-12-27 Thread Gregory Farnum
On Mon, Dec 27, 2021 at 9:12 AM gyfelectric  wrote:

>
> Hi all,
>
> Recently, the problem of OSD disorder has often appeared in my
> environment(14.2.5) and my Fuse Client borken
> due to "FAILED assert(ob->last_commit_tid < tid)”. My application can’t
> work normally now.
>
> The time series that triggered this problem is like this:
> note:
> a. my datapool is: EC 4+2
> b. osd(osd.x) of pg_1 is down
>
> Event Sequences:
> t1: op_1(write) send to OSD and send 5 shards to 5 osds. only return 4
> shards except primary osd because there is osd(osd.x) down.
> t2: many other operations have occurred in this pg and record in pg_log
> t3: op_n(write) send to OSD and send 5 shards to 5 osds. only return 4
> shards except primary osd because there is osd(osd.x) down.
> t4: the peer osd report osd.x timeout to monitor and osd.x is marked down
> t5: pg_1 start canceling and requeueing op_1, op_2 … op_n to osd op_wq
> t6: pg_1 start peering and op_1 is trimmed from pg_log and dup map in this
> process
>

Unless I’m misunderstanding, either you have more ops that haven’t been
committed+acked than the length of the pg log dup tracking, or else there’s
a bug here and it’s trimming farther than it should.

Can you clarify which case? Because if you’re sending more ops than the pg
log length, this is an expected failure and not one that’s feasible to
resolve. You just need to spend the money to have enough memory for longer
logs and dup detection.
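
For reference, these are roughly the knobs involved (the values here are only
illustrative, and raising them costs OSD memory):

  ceph config set osd osd_min_pg_log_entries 10000
  ceph config set osd osd_max_pg_log_entries 20000
  ceph config set osd osd_pg_log_dups_tracked 20000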

-Greg

t7: pg_1 become active and start reprocessing the op_1, op_2 … op_n
> t8: op_1 is not found in pg_log and dup map, so redo it.
> t9: op_n is found in pg_log or dup map and be considered completed, so
> return osd reply to client directly with tid_op_n
> t10: op_1 complete and return to client with tid_op_1. client will break
> down due to "assert(ob->last_commit_tid < tid)”
>
> I found some relative issues in https://tracker.ceph.com/issues/23827
>  which have some discussions
> about this problem.
> But i didn’t find an effective method to avoid this problem.
>
> I think the current mechanism to prevent non-idempotent op from being
> repeated is flawed; maybe we should redesign it.
> How do you think about it? And if my idea is wrong, what should i do to
> avoid this problem?
>
> Any response is very grateful, thank you!
>
> gyfelectric
> gyfelect...@gmail.com
>
> 
> Signature customized by NetEase Mail Master
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph-mon pacific doesn't enter to quorum of nautilus cluster

2021-12-15 Thread Gregory Farnum
Hmm that ticket came from the slightly unusual scenario where you were
deploying a *new* Pacific monitor against an Octopus cluster.

Michael, is your cluster deployed with cephadm? And is this a new or
previously-existing monitor?

On Wed, Dec 15, 2021 at 12:09 AM Michael Uleysky  wrote:
>
> Thanks!
>
> As far as I can see, this is the same problem as mine.
>
> On Wed, Dec 15, 2021 at 16:49, Chris Dunlop  wrote:
>
> > On Wed, Dec 15, 2021 at 02:05:05PM +1000, Michael Uleysky wrote:
> > > I try to upgrade three-node nautilus cluster to pacific. I am updating
> > ceph
> > > on one node and restarting daemons. OSD ok, but monitor cannot enter
> > quorum.
> >
> > Sounds like the same thing as:
> >
> > Pacific mon won't join Octopus mons
> > https://tracker.ceph.com/issues/52488
> >
> > Unforutunately there's no resolution.
> >
> > For a bit more background, see also the thread starting:
> >
> > New pacific mon won't join with octopus mons
> > https://www.spinics.net/lists/ceph-devel/msg52181.html
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph container image repos

2021-12-14 Thread Gregory Farnum
I generated a quick doc PR so this doesn't trip over other users:
https://github.com/ceph/ceph/pull/44310. Thanks all!
-Greg

On Mon, Dec 13, 2021 at 10:59 AM John Petrini  wrote:
>
> "As of August 2021, new container images are pushed to quay.io
> registry only. Docker hub won't receive new content for that specific
> image but current images remain available.As of August 2021, new
> container images are pushed to quay.io registry only. Docker hub won't
> receive new content for that specific image but current images remain
> available."
>
> https://hub.docker.com/r/ceph/ceph
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Support for alternative RHEL derivatives

2021-12-13 Thread Gregory Farnum
On Mon, Dec 13, 2021 at 7:02 AM Benoit Knecht  wrote:
>
> Hi,
>
> As we're getting closer to CentOS 8 EOL, I'm sure plenty of Ceph users are
> looking to migrate from CentOS 8 to CentOS Stream 8 or one of the new RHEL
> derivatives, e.g. Rocky and Alma.
>
> The question of upstream support has already been raised in the past, but at
> the time Rocky and Alma were pretty much clones of CentOS. However now they're
> about to diverge in subtle ways, so I'm wondering if
>
> 1. The upstream Ceph project plans on building and QA testing against Rocky
>and/or Alma;
> 2. Specific packages will be provided on https://download.ceph.com/ for Rocky
>and/or Alma, or if the packages built for CentOS Stream are expected to be
>compatible (which goes back to the QA testing question above);
> 3. Any Ceph developers have been in touch with Rocky and/or Alma Storage SIGs;
> 4. Any Ceph users have already or are planning to migrate to Rocky or Alma.

When this has been discussed in the CLT, we've generally planned to
stay with CentOS Stream as the rpm/RHEL-family distro we build and
test on, unless problems arise or something else happens to make us
re-evaluate. I know not everybody is going to be happy with
containers, but we've never been very successful trying to package for
more than 2 or 3 distros (generally Ubuntu and CentOS; frequently
Debian if the build dependencies are tenable), and between those two
and containers that you can deploy "anywhere" we're pretty comfortable
with the coverage we are providing directly. Some other distros have
their own packages and I'd expect that maintaining distro packages for
a RHEL clone will be pretty simple if somebody wants to take on that
task.

I'm not familiar with any direct SIG contact, but there are lots of
things I don't hear. ;)

And as you allude to, distribution builds and testing are a good topic
for the Dev/User monthly meeting if you have specific thoughts or
insight.
-Greg

>
> If anyone else is interested in this topic, maybe it can be added to the 
> agenda
> for the next users+dev meetup.
>
> Cheers,
>
> --
> Ben
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck in stopping state

2021-12-13 Thread Gregory Farnum
Hmm. Glad it's working now, at least.

On Mon, Dec 13, 2021 at 9:10 AM Frank Schilder  wrote:
>
> Dear Gregory,
>
> thanks for your fast response. The situation started worsening shortly after 
> I sent my e-mail and I had to take action. More operations got stuck in the 
> active MDS, leading to a failure of journal trimming. I more or less went 
> through all combinations of what you wrote. In the end, the last option was 
> to fail rank 0. This also crashed rank 1, but then things started to clear 
> up. There were more blocked ops warnings every now and then, but these were 
> temporary. After a bit of watching the 2 MDSes struggle, the "stopping" one 
> finally became standby and everything is up and clean right now.
>
> The reason for this exercise is a (rare?) race condition between dirfrag 
> subtree exports (multi-active MDS) and certain client operations (rename, 
> create), which leads to a complete standstill of the file system (the fs 
> journal size eventually reaches the pool quota). It seems to be caused by
> different clients (servers) accessing the same sub-dir on the file system 
> concurrently. It is only one workload where we observe this, no useful 
> information so far.

Please let us know or file a ticket in the tracker if you get enough
details for us to spend some time debugging that. That's definitely
supposed to work fine, of course. :/
-Greg

>
> Thanks again and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Gregory Farnum 
> Sent: 13 December 2021 17:39:55
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: [ceph-users] MDS stuck in stopping state
>
> This looks awkward — just from the ops, it seems mds.1 is trying to
> move some stray items (presumably snapshots of since-deleted files,
> from what you said?) into mds0's stray directory, and then mds.0 tries
> to get auth pins from mds.1 but that fails for some reason which isn't
> apparent from the dump.
>
> Somebody might be able to get farther along by tracing logs of mds.1
> rebooting, but my guess is that rebooting both servers will clear it
> up. You might also try increasing max_mds to 2 and seeing if that jogs
> things loose; I'm not sure what would be less disruptive for you.
> -Greg
>
>
> On Mon, Dec 13, 2021 at 5:37 AM Frank Schilder  wrote:
> >
> > Hi all, I needed to reduce the number of active MDS daemons from 4 to 1. 
> > Unfortunately, the last MDS to stop is stuck in stopping state. Ceph 
> > version is mimic 13.2.10. Each MDS has 3 blocked OPS, that seem to be 
> > related to deleted snapshots; more info below. I failed the MDS in stopping 
> > state already several times in the hope that the operations get flushed 
> > out. Before failing rank 0, I would appreciate if someone could look at 
> > this issue and advise on how to proceed safely.
> >
> > Some diagnostic info:
> >
> > # ceph fs status
> > con-fs2 - 1659 clients
> > ===
> > +--+--+-+---+---+---+
> > | Rank |  State   |   MDS   |Activity   |  dns  |  inos |
> > +--+--+-+---+---+---+
> > |  0   |  active  | ceph-08 | Reqs:  176 /s | 2844k | 2775k |
> > |  1   | stopping | ceph-17 |   | 27.7k |   59  |
> > +--+--+-+---+---+---+
> > +-+--+---+---+
> > | Pool|   type   |  used | avail |
> > +-+--+---+---+
> > |con-fs2-meta1| metadata |  555M | 1261G |
> > |con-fs2-meta2|   data   |0  | 1261G |
> > | con-fs2-data|   data   | 1321T | 5756T |
> > | con-fs2-data-ec-ssd |   data   |  252G | 4035G |
> > |con-fs2-data2|   data   |  389T | 5233T |
> > +-+--+---+---+
> > +-+
> > | Standby MDS |
> > +-+
> > |   ceph-09   |
> > |   ceph-24   |
> > |   ceph-14   |
> > |   ceph-16   |
> > |   ceph-12   |
> > |   ceph-23   |
> > |   ceph-10   |
> > |   ceph-15   |
> > |   ceph-13   |
> > |   ceph-11   |
> > +-+
> > MDS version: ceph version 13.2.10 
> > (564bdc4ae87418a232fc901524470e1a0f76d641) mimic (stable)
> >
> > # ceph status
> >   cluster:
> > id:
> > health: HEALTH_WARN
> > 2 MDSs report slow requests
> >
> >   services:
> > mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26
> > mgr: ceph-01(

[ceph-users] Re: MDS stuck in stopping state

2021-12-13 Thread Gregory Farnum
This looks awkward — just from the ops, it seems mds.1 is trying to
move some stray items (presumably snapshots of since-deleted files,
from what you said?) into mds0's stray directory, and then mds.0 tries
to get auth pins from mds.1 but that fails for some reason which isn't
apparent from the dump.

Somebody might be able to get farther along by tracing logs of mds.1
rebooting, but my guess is that rebooting both servers will clear it
up. You might also try increasing max_mds to 2 and seeing if that jogs
things loose; I'm not sure what would be less disruptive for you.
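
In command form that is roughly (filesystem name taken from your fs status
output above):

  ceph fs set con-fs2 max_mds 2     # let the stopping rank become active again
  ceph mds fail con-fs2:1           # or restart the ranks one at a time
  ceph mds fail con-fs2:0
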
-Greg


On Mon, Dec 13, 2021 at 5:37 AM Frank Schilder  wrote:
>
> Hi all, I needed to reduce the number of active MDS daemons from 4 to 1. 
> Unfortunately, the last MDS to stop is stuck in stopping state. Ceph version 
> is mimic 13.2.10. Each MDS has 3 blocked OPS, that seem to be related to 
> deleted snapshots; more info below. I failed the MDS in stopping state 
> already several times in the hope that the operations get flushed out. Before 
> failing rank 0, I would appreciate if someone could look at this issue and 
> advise on how to proceed safely.
>
> Some diagnostic info:
>
> # ceph fs status
> con-fs2 - 1659 clients
> ===
> +--+--+-+---+---+---+
> | Rank |  State   |   MDS   |Activity   |  dns  |  inos |
> +--+--+-+---+---+---+
> |  0   |  active  | ceph-08 | Reqs:  176 /s | 2844k | 2775k |
> |  1   | stopping | ceph-17 |   | 27.7k |   59  |
> +--+--+-+---+---+---+
> +-+--+---+---+
> | Pool|   type   |  used | avail |
> +-+--+---+---+
> |con-fs2-meta1| metadata |  555M | 1261G |
> |con-fs2-meta2|   data   |0  | 1261G |
> | con-fs2-data|   data   | 1321T | 5756T |
> | con-fs2-data-ec-ssd |   data   |  252G | 4035G |
> |con-fs2-data2|   data   |  389T | 5233T |
> +-+--+---+---+
> +-+
> | Standby MDS |
> +-+
> |   ceph-09   |
> |   ceph-24   |
> |   ceph-14   |
> |   ceph-16   |
> |   ceph-12   |
> |   ceph-23   |
> |   ceph-10   |
> |   ceph-15   |
> |   ceph-13   |
> |   ceph-11   |
> +-+
> MDS version: ceph version 13.2.10 (564bdc4ae87418a232fc901524470e1a0f76d641) 
> mimic (stable)
>
> # ceph status
>   cluster:
> id:
> health: HEALTH_WARN
> 2 MDSs report slow requests
>
>   services:
> mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26
> mgr: ceph-01(active), standbys: ceph-02, ceph-03, ceph-25, ceph-26
> mds: con-fs2-2/2/1 up  {0=ceph-08=up:active,1=ceph-17=up:stopping}, 10 
> up:standby
> osd: 1051 osds: 1050 up, 1050 in
>
>   data:
> pools:   13 pools, 17374 pgs
> objects: 1.01 G objects, 1.9 PiB
> usage:   2.3 PiB used, 9.2 PiB / 11 PiB avail
> pgs: 17352 active+clean
>  20active+clean+scrubbing+deep
>  2 active+clean+scrubbing
>
>   io:
> client:   129 MiB/s rd, 175 MiB/s wr, 2.57 kop/s rd, 2.77 kop/s wr
>
> # ceph health detail
> HEALTH_WARN 2 MDSs report slow requests
> MDS_SLOW_REQUEST 2 MDSs report slow requests
> mdsceph-08(mds.0): 3 slow requests are blocked > 30 secs
> mdsceph-17(mds.1): 3 slow requests are blocked > 30 secs
>
> # ssh ceph-08 ceph daemon mds.ceph-08 dump_blocked_ops
> {
> "ops": [
> {
> "description": "client_request(mds.1:126521 rename 
> #0x100/stray5/1000eec35f7 #0x101/stray5/1000eec35f7 caller_uid=0, 
> caller_gid=0{})",
> "initiated_at": "2021-12-13 13:08:59.430597",
> "age": 5034.983083,
> "duration": 5034.983109,
> "type_data": {
> "flag_point": "acquired locks",
> "reqid": "mds.1:126521",
> "op_type": "client_request",
> "client_info": {
> "client": "mds.1",
> "tid": 126521
> },
> "events": [
> {
> "time": "2021-12-13 13:08:59.430597",
> "event": "initiated"
> },
> {
> "time": "2021-12-13 13:08:59.430597",
> "event": "header_read"
> },
> {
> "time": "2021-12-13 13:08:59.430597",
> "event": "throttled"
> },
> {
> "time": "2021-12-13 13:08:59.430601",
> "event": "all_read"
> },
> {
> "time": "2021-12-13 13:09:00.730197",
> "event": "dispatched"
> },
> {
> "time": "2021-12-

[ceph-users] Re: CephFS Metadata Pool bandwidth usage

2021-12-13 Thread Gregory Farnum
0,
> "precalc_pgid": 0,
> "last_sent": "1121127.442264s",
> "age": 0.0085206,
> "attempts": 1,
> "snapid": "head",
> "snap_context": "0=[]",
> "mtime": "2021-12-10T08:35:34.592387+",
> "osd_ops": [
> "write 1747542~13387 [fadvise_dontneed] in=13387b"
> ]
> }
> ],
> "linger_ops": [],
> "pool_ops": [],
> "pool_stat_ops": [],
> "statfs_ops": [],
> "command_ops": []
> }
>
> Any suggestions would be much appreciated.
>
> Kind regards,
>
> András
>
>
> On Thu, Dec 9, 2021 at 7:48 PM Andras Sali  wrote:
>>
>> Hi Greg,
>>
>> Much appreciated for the reply, the image is also available at: 
>> https://tracker.ceph.com/attachments/download/5808/Bytes_per_op.png
>>
>> How the graph is generated: we back the cephfs metadata pool with Azure 
>> ultrassd disks. Azure reports for the disks each minute the average 
>> read/write iops (operations per sec) and average read/write throughput (in 
>> bytes per sec).
>>
>> We then divide the write throughput with the write IOPS number - this is the 
>> average write bytes / operation (we plot this in the above graph). We 
>> observe that this increases up to around 300kb, whilst after resetting the 
>> MDS, it stays around 32kb for some time (then starts increasing). The read 
>> bytes / operation are constantly much smaller.
>>
>> The issue is that once we are in the "high" regime, for the same operation 
>> that does for example 1000 IOPS, we need 300MB throughput, instead of 30MB 
>> throughput that we observe after a restart. The high throughput often 
>> results in reaching the VM level limits in Azure and after this the queue 
>> depth explodes and operations begin stalling.
>>
>> We will do the dump and report it as well once we have it.
>>
>> Thanks again for any ideas on this.
>>
>> Kind regards,
>>
>> Andras
>>
>>
>> On Thu, Dec 9, 2021, 15:07 Gregory Farnum  wrote:
>>>
>>> Andras,
>>>
>>> Unfortunately your attachment didn't come through the list. (It might
>>> work if you embed it inline? Not sure.) I don't know if anybody's
>>> looked too hard at this before, and without the image I don't know
>>> exactly what metric you're using to say something's 320KB in size. Can
>>> you explain more?
>>>
>>> It might help if you dump the objecter_requests from the MDS and share
>>> those — it'll display what objects are being written to with what
>>> sizes.
>>> -Greg
>>>
>>>
>>> On Wed, Dec 8, 2021 at 9:00 AM Andras Sali  wrote:
>>> >
>>> > Hi All,
>>> >
>>> > We have been observing that if we let our MDS run for some time, the
>>> > bandwidth usage of the disks in the metadata pool starts increasing
>>> > significantly (whilst IOPS is about constant), even though the number of
>>> > clients, the workloads or anything else doesn't change.
>>> >
>>> > However, after restarting the MDS, the issue goes away for some time and
>>> > the same workloads require 1/10th of the metadata disk bandwidth whilst
>>> > doing the same IOPS.
>>> >
>>> > We run our CephFS cluster in a cloud environment where the disk throughput
>>> > / bandwidth capacity is quite expensive to increase and we are hitting
>>> > bandwidth / throughput limits, even though we still have a lot of IOPS
>>> > capacity left.
>>> >
>>> > We suspect that somehow the journaling of the MDS becomes more extensive
>>> > (i.e. larger journal updates for each operation), but we couldn't really
>>> > pin down which parameter might affect this.
>>> >
>>> > I attach a plot of how the Bytes / Operation (throughput in MBps / IOPS)
>>> > evolves over time, when we restart the MDS, it drops to around 32kb (even
>>> > though the min block size for the metadata pool OSDs is 4kb in our
>>> > settings) and then increases over time to around 300kb.
>>> >
>>> > Any ideas on how to "fix" this and have a significantly lower bandwidth
>>> > usage would be really-really appreciated!
>>> >
>>> > Thank you and kind regards,
>>> >
>>> > Andras
>>> > ___
>>> > ceph-users mailing list -- ceph-users@ceph.io
>>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>> >
>>>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS Metadata Pool bandwidth usage

2021-12-09 Thread Gregory Farnum
Andras,

Unfortunately your attachment didn't come through the list. (It might
work if you embed it inline? Not sure.) I don't know if anybody's
looked too hard at this before, and without the image I don't know
exactly what metric you're using to say something's 320KB in size. Can
you explain more?

It might help if you dump the objecter_requests from the MDS and share
those — it'll display what objects are being written to with what
sizes.
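
The dump itself comes from the MDS admin socket, e.g.:

  ceph daemon mds.<name> objecter_requests
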
-Greg


On Wed, Dec 8, 2021 at 9:00 AM Andras Sali  wrote:
>
> Hi All,
>
> We have been observing that if we let our MDS run for some time, the
> bandwidth usage of the disks in the metadata pool starts increasing
> significantly (whilst IOPS is about constant), even though the number of
> clients, the workloads or anything else doesn't change.
>
> However, after restarting the MDS, the issue goes away for some time and
> the same workloads require 1/10th of the metadata disk bandwidth whilst
> doing the same IOPS.
>
> We run our CephFS cluster in a cloud environment where the disk throughput
> / bandwidth capacity is quite expensive to increase and we are hitting
> bandwidth / throughput limits, even though we still have a lot of IOPS
> capacity left.
>
> We suspect that somehow the journaling of the MDS becomes more extensive
> (i.e. larger journal updates for each operation), but we couldn't really
> pin down which parameter might affect this.
>
> I attach a plot of how the Bytes / Operation (throughput in MBps / IOPS)
> evolves over time, when we restart the MDS, it drops to around 32kb (even
> though the min block size for the metadata pool OSDs is 4kb in our
> settings) and then increases over time to around 300kb.
>
> Any ideas on how to "fix" this and have a significantly lower bandwidth
> usage would be really-really appreciated!
>
> Thank you and kind regards,
>
> Andras
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recursive delete hangs on cephfs

2021-11-22 Thread Gregory Farnum
Oh, I misread your initial email and thought you were on hard drives.
These do seem slow for SSDs.

You could try tracking down where the time is spent; perhaps run
strace and see which calls are taking a while, and go through the op
tracker on the MDS and see if it has anything that's obviously taking
a long time.
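
Concretely, something like (the MDS name is a placeholder):

  strace -c -f rm -rf somedir                 # summarize which syscalls dominate
  ceph daemon mds.<name> dump_ops_in_flight   # current ops with elapsed times
  ceph daemon mds.<name> dump_historic_ops    # recently completed slow ops
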
-Greg

On Wed, Nov 17, 2021 at 8:00 PM Sasha Litvak
 wrote:
>
> Gregory,
> Thank you for your reply, I do understand that a number of serialized lookups 
> may take time.  However if 3.25 sec is OK,  11.2 seconds sounds long, and I 
> had once removed a large subdirectory which took over 20 minutes to complete. 
>  I attempted to use nowsync mount option with kernel 5.15 and it seems to 
> hide latency (i.e. it is almost immediately returns prompt after recursive 
> directory removal.  However, I am not sure whether nowsync is safe to use 
> with kernel >= 5.8.  I also have kernel 5.3 on one of the client clusters and 
> nowsync there is not supported, however all rm operations happen reasonably 
> fast.  So the second question is, does 5.3's libceph behave differently on 
> recursing rm compared to 5.4 or 5.8?
>
>
> On Wed, Nov 17, 2021 at 9:52 AM Gregory Farnum  wrote:
>>
>> On Sat, Nov 13, 2021 at 5:25 PM Sasha Litvak
>>  wrote:
>> >
>> > I continued looking into the issue and have no idea what hinders the
>> > performance yet. However:
>> >
>> > 1. A client operating with kernel 5.3.0-42 (ubuntu 18.04) has no such
>> > problems.  I delete a directory with hashed subdirs (00 - ff) and total
>> > space taken by files ~707MB spread across those 256 in 3.25 s.
>>
>> Recursive rm first requires the client to get capabilities on the
>> files in question, and the MDS to read that data off disk.
>> Newly-created directories will be cached, but old ones might not be.
>>
>> So this might just be the consequence of having to do 256 serialized
>> disk lookups on hard drives. 3.25 seconds seems plausible to me.
>>
>> The number of bytes isn't going to have any impact on how long it
>> takes to delete from the client side — that deletion is just marking
>> it in the MDS, and then the MDS does the object removals in the
>> background.
>> -Greg
>>
>> >
>> > 2. A client operating with kernel 5.8.0-53 (ubuntu 20.04) processes a
>> > similar directory with less space taken ~ 530 MB spread across 256 subdirs
>> > in 11.2 s.
>> >
>> > 3.Yet another client with kernel 5.4.156 has similar latency removing
>> > directories as in line 2.
>> >
>> > In all scenarios, mounts are set with the same options, i.e.
>> > noatime,secret-file,acl.
>> >
>> > Client 1 has luminous, client 2 has octopus, client 3 has nautilus.   While
>> > they are all on the same LAN, ceph -s on 2 and 3 returns in ~ 800 ms and on
>> > client in ~300 ms.
>> >
>> > Any ideas are appreciated,
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Nov 12, 2021 at 8:44 PM Sasha Litvak 
>> > wrote:
>> >
>> > > The metadata pool is on the same type of drives as other pools; every 
>> > > node
>> > > uses SATA SSDs.  They are all read / write mix DC types.  Intel and 
>> > > Seagate.
>> > >
>> > > On Fri, Nov 12, 2021 at 8:02 PM Anthony D'Atri 
>> > > wrote:
>> > >
>> > >> MDS RAM cache vs going to the metadata pool?  What type of drives is 
>> > >> your
>> > >> metadata pool on?
>> > >>
>> > >> > On Nov 12, 2021, at 5:30 PM, Sasha Litvak 
>> > >> > 
>> > >> wrote:
>> > >> >
>> > >> > I am running Pacific 16.2.4 cluster and recently noticed that rm -rf
>> > >> >  visibly hangs on the old directories.  Cluster is healthy,
>> > >> has a
>> > >> > light load, and any newly created directories deleted immediately 
>> > >> > (well
>> > >> rm
>> > >> > returns command prompt immediately).  The directories in question have
>> > >> 10 -
>> > >> > 20 small text files so nothing should be slow when removing them.
>> > >> >
>> > >> > I wonder if someone can please give me a hint on where to start
>> > >> > troubleshooting as I see no "big bad bear" yet.
>> > >> > ___
>> > >> > ceph-users mailing list -- ceph-users@ceph.io
>> > >> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >>
>> > >>
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>> >
>>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recursive delete hangs on cephfs

2021-11-17 Thread Gregory Farnum
On Sat, Nov 13, 2021 at 5:25 PM Sasha Litvak
 wrote:
>
> I continued looking into the issue and have no idea what hinders the
> performance yet. However:
>
> 1. A client operating with kernel 5.3.0-42 (ubuntu 18.04) has no such
> problems.  I delete a directory with hashed subdirs (00 - ff) and total
> space taken by files ~707MB spread across those 256 in 3.25 s.

Recursive rm first requires the client to get capabilities on the
files in question, and the MDS to read that data off disk.
Newly-created directories will be cached, but old ones might not be.

So this might just be the consequence of having to do 256 serialized
disk lookups on hard drives. 3.25 seconds seems plausible to me.

The number of bytes isn't going to have any impact on how long it
takes to delete from the client side — that deletion is just marking
it in the MDS, and then the MDS does the object removals in the
background.
-Greg

>
> 2. A client operating with kernel 5.8.0-53 (ubuntu 20.04) processes a
> similar directory with less space taken ~ 530 MB spread across 256 subdirs
> in 11.2 s.
>
> 3.Yet another client with kernel 5.4.156 has similar latency removing
> directories as in line 2.
>
> In all scenarios, mounts are set with the same options, i.e.
> noatime,secret-file,acl.
>
> Client 1 has luminous, client 2 has octopus, client 3 has nautilus.   While
> they are all on the same LAN, ceph -s on 2 and 3 returns in ~ 800 ms and on
> client in ~300 ms.
>
> Any ideas are appreciated,
>
>
>
>
>
>
>
>
>
> On Fri, Nov 12, 2021 at 8:44 PM Sasha Litvak 
> wrote:
>
> > The metadata pool is on the same type of drives as other pools; every node
> > uses SATA SSDs.  They are all read / write mix DC types.  Intel and Seagate.
> >
> > On Fri, Nov 12, 2021 at 8:02 PM Anthony D'Atri 
> > wrote:
> >
> >> MDS RAM cache vs going to the metadata pool?  What type of drives is your
> >> metadata pool on?
> >>
> >> > On Nov 12, 2021, at 5:30 PM, Sasha Litvak 
> >> wrote:
> >> >
> >> > I am running Pacific 16.2.4 cluster and recently noticed that rm -rf
> >> >  visibly hangs on the old directories.  Cluster is healthy,
> >> has a
> >> > light load, and any newly created directories deleted immediately (well
> >> rm
> >> > returns command prompt immediately).  The directories in question have
> >> 10 -
> >> > 20 small text files so nothing should be slow when removing them.
> >> >
> >> > I wonder if someone can please give me a hint on where to start
> >> > troubleshooting as I see no "big bad bear" yet.
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


  1   2   >