[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-13 Thread Frédéric Nass
Hello,

'ceph osd deep-scrub 5' deep-scrubs all PGs for which osd.5 is primary (and 
only those).

You can check that from ceph-osd.5.log by running:
for pg in $(grep 'deep-scrub starts' /var/log/ceph/*/ceph-osd.5.log | awk '{print $8}') ; do
    echo "pg: $pg, primary osd is osd.$(ceph pg $pg query -f json | jq '.info.stats.acting_primary')"
done

while

'ceph osd deep-scrub all' instructs all OSDs to start deep-scrubbing all the PGs 
they're primary for, so in the end, all of the cluster's PGs.

So if the data you overwrote on osd.5 with 'dd' was part of a PG for which 
osd.5 was not the primary OSD, then it wasn't deep-scrubbed.
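
If you want to quickly cross-check which PGs osd.5 is primary for, something like this should work (a sketch, assuming these pg subcommands are available in your release):

ceph pg ls-by-primary osd.5     # the PGs that 'ceph osd deep-scrub 5' will deep-scrub
ceph pg ls-by-osd osd.5         # every PG that has osd.5 in its acting set, primary or not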

The ceph(8) man page could rather say:

   Subcommand deep-scrub initiates deep scrub on all PGs osd <id> is
primary for.

   Usage:

  ceph osd deep-scrub <id>

Regards,
Frédéric.

- On 10 Jun 24, at 16:51, Petr Bena petr@bena.rocks wrote:

> Most likely it wasn't, the ceph help or documentation is not very clear about
> this:
> 
> osd deep-scrub <who>
> initiate deep scrub on osd <who>, or use <all|any> to deep scrub all
> 
> It doesn't say anything like "initiate deep scrub of primary PGs on osd"
> 
> I assumed it just runs a scrub of everything on given OSD.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-07 Thread Frédéric Nass
Hello Petr,

- On 4 Jun 24, at 12:13, Petr Bena petr@bena.rocks wrote:

> Hello,
> 
> I wanted to try out (lab ceph setup) what exactly is going to happen
> when parts of data on OSD disk gets corrupted. I created a simple test
> where I was going through the block device data until I found something
> that resembled user data (using dd and hexdump) (/dev/sdd is a block
> device that is used by OSD)
> 
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 |
> hexdump -C
>   6e 20 69 64 3d 30 20 65  78 65 3d 22 2f 75 73 72  |n id=0
> exe="/usr|
> 0010  2f 73 62 69 6e 2f 73 73  68 64 22 20 68 6f 73 74 |/sbin/sshd"
> host|
> 
> Then I deliberately overwrote 32 bytes using random data:
> 
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/urandom of=/dev/sdd bs=32
> count=1 seek=33920
> 
> INFRA [root@ceph-vm-lab5 ~]# dd if=/dev/sdd bs=32 count=1 skip=33920 |
> hexdump -C
>   25 75 af 3e 87 b0 3b 04  78 ba 79 e3 64 fc 76 d2
>|%u.>..;.x.y.d.v.|
> 0010  9e 94 00 c2 45 a5 e1 d2  a8 86 f1 25 fc 18 07 5a
>|E..%...Z|
> 
> At this point I would expect some sort of data corruption. I restarted
> the OSD daemon on this host to make sure it flushes any potentially
> buffered data. It restarted OK without noticing anything, which was
> expected.
> 
> Then I ran
> 
> ceph osd scrub 5
> 
> ceph osd deep-scrub 5
> 
> And waiting for all scheduled scrub operations for all PGs to finish.
> 
> No inconsistency was found. No errors reported, scrubs just finished OK,
> data are still visibly corrupt via hexdump.
> 
> Did I just hit some block of data that WAS used by OSD, but was marked
> deleted and therefore no longer used or am I missing something?

Possibly, if you deep-scrubbed all PGs. I remember marking bad sectors in the 
past and still getting an fsck success from ceph-bluestore-tool fsck.

To be sure, you could overwrite the very same sector, stop the OSD and then:

$ ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-X/

or (in containerized environment)

$ cephadm shell --name osd.X ceph-bluestore-tool fsck --deep yes --path /var/lib/ceph/osd/ceph-X/

osd.X being the OSD associated with drive /dev/sdd.
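
If the deep fsck does report errors, a repair attempt can be made with the same tool, and the inconsistent objects can be listed once the OSD is back up and the PG deep-scrubbed (a sketch; the path and pgid are placeholders to adjust):

$ ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-X/
$ rados list-inconsistent-obj <pgid> --format=json-pretty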

Regards,
Frédéric.


> I would expect CEPH to detect disk corruption and automatically replace the
> invalid data with a valid copy?
> 
> I use only replica pools in this lab setup, for RBD and CephFS.
> 
> Thanks
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Excessively Chatty Daemons RHCS v5

2024-06-07 Thread Frédéric Nass
Hi Joshua,

These messages actually deserve more attention than you might think, I believe. You 
may be hitting this issue [1], which Mark (comment #4) also hit with 16.2.10 (RHCS 5).
The PR is here: https://github.com/ceph/ceph/pull/51669

Could you try raising osd_max_scrubs to 2 or 3 (now defaults to 3 in quincy and 
reef) and see if these logs disappear over the next hours/days?
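
As a sketch (adjust the value to your liking):

ceph config set osd osd_max_scrubs 2
ceph config get osd osd_max_scrubs
ceph tell osd.0 config get osd_max_scrubs    # check what a running daemon actually uses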

Regards,
Frédéric.

- On 4 Jun 24, at 18:39, Joshua Arulsamy jarul...@uwyo.edu wrote:

> Hi,
> 
> I recently upgraded my RHCS cluster from v4 to v5 and moved to containerized
> daemons (podman) along the way. I noticed that there are a huge number of logs
> going to journald on each of my hosts. I am unsure why there are so many.
> 
> I tried changing the logging level at runtime with commands like these (from 
> the
> ceph docs):
> 
> ceph tell osd.\* config set debug_osd 0/5
> 
> I tried adjusting several different subsystems (also with 0/0) but I noticed
> that logs seem to come at the same rate/content. I'm not sure what to try 
> next?
> Is there a way to trace where logs are coming from?
> 
> Some of the sample log entries are events like this on the OSD nodes:
> 
> Jun 04 10:34:02 pf-osd1 ceph-osd-0[182875]: 2024-06-04T10:34:02.470-0600
> 7fc049c03700 -1 osd.0 pg_epoch: 703151 pg[35.39s0( v 703141'789389
> (701266'780746,703141'789389] local-lis/les=702935/702936 n=48162 
> ec=63726/27988
> lis/c=702935/702935 les/c/f=702936/702936/0 sis=702935)
> [0,194,132,3,177,159,83,18,149,14,145]p0(0) r=0 lpr=702935 crt=703141'789389
> lcod 703141'789388 mlcod 703141'789388 active+clean planned 
> DEEP_SCRUB_ON_ERROR]
> scrubber : handle_scrub_reserve_grant: received unsolicited
> reservation grant from osd 177(4) (0x55fdea6c4000)
> 
> These are very verbose messages and occur roughly every 0.5 second per daemon.
> On a cluster with 200 daemons this is getting unmanageable and is flooding my
> syslog servers.
> 
> Any advice on how to tame all the logs would be greatly appreciated!
> 
> Best,
> 
> Josh
> 
> Joshua Arulsamy
> HPC Systems Architect
> Advanced Research Computing Center
> University of Wyoming
> jarul...@uwyo.edu
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to setup NVMeoF?

2024-05-30 Thread Frédéric Nass
Hello Robert,

You could try:

ceph config set mgr mgr/cephadm/container_image_nvmeof "quay.io/ceph/nvmeof:1.2.13"

or whatever image tag you need (1.2.13 is the current latest).
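
You can then verify the setting and redeploy the service so it picks up the new image (a sketch; replace the service name with yours from 'ceph orch ls'):

ceph config get mgr mgr/cephadm/container_image_nvmeof
ceph orch redeploy nvmeof.nvmeof_pool01.test    # hypothetical service name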

Another way to run the image is by editing the unit.run file of the service or 
by directly running the container with podman run (you'll need to adjust names, 
cluster fsid, etc.):

/usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM 
--authfile=/etc/ceph/podman-auth.json --net=host --init --name 
ceph-aa558815-042c-4fce-ac37-80c0255bf3c0-nvmeof-nvmeof_pool01-test-lis04h02-baakhx
 --pids-limit=-1 --ulimit memlock=-1:-1 --ulimit nofile=10240 
--cap-add=SYS_ADMIN --cap-add=CAP_SYS_NICE --log-driver journald 
--conmon-pidfile 
/run/ceph-aa558815-042c-4fce-ac37-80c0255bf3c0@nvmeof.nvmeof_pool01.test-lis04h02.baakhx.service-pid
 --cidfile 
/run/ceph-aa558815-042c-4fce-ac37-80c0255bf3c0@nvmeof.nvmeof_pool01.test-lis04h02.baakhx.service-cid
 --cgroups=split -e CONTAINER_IMAGE=quay.io/ceph/nvmeof:1.2.13 -e 
NODE_NAME=test-lis04h02.peta.libe.dc.univ-lorraine.fr -e 
CEPH_USE_RANDOM_NONCE=1 -v 
/var/lib/ceph/aa558815-042c-4fce-ac37-80c0255bf3c0/nvmeof.nvmeof_pool01.test-lis04h02.baakhx/config:/etc/ceph/ceph.conf:z
 -v 
/var/lib/ceph/aa558815-042c-4fce-ac37-80c0255bf3c0/nvmeof.nvmeof_pool01.test-lis04h02.baakhx/keyring:/etc/ceph/keyring:z
 -v 
/var/lib/ceph/aa558815-042c-4fce-ac37-80c0255bf3c0/nvmeof.nvmeof_pool01.test-lis04h02.baakhx/ceph-nvmeof.conf:/src/ceph-nvmeof.conf:z
 -v 
/var/lib/ceph/aa558815-042c-4fce-ac37-80c0255bf3c0/nvmeof.nvmeof_pool01.test-lis04h02.baakhx/configfs:/sys/kernel/config
 -v /dev/hugepages:/dev/hugepages -v /dev/vfio/vfio:/dev/vfio/vfio -v 
/etc/hosts:/etc/hosts:ro --mount 
type=bind,source=/lib/modules,destination=/lib/modules,ro=true 
quay.io/ceph/nvmeof:1.2.13

The commands I wrote here [1] in February should still work I believe.

Regards,
Frédéric.

[1] https://github.com/ceph/ceph-nvmeof/issues/459

- On 30 May 24, at 13:03, Robert Sander r.san...@heinlein-support.de wrote:

> Hi,
> 
> On 5/30/24 11:58, Robert Sander wrote:
> 
>> I am trying to follow the documentation at
>> https://docs.ceph.com/en/reef/rbd/nvmeof-target-configure/ to deploy an
>> NVMe over Fabric service.
> 
> It looks like the cephadm orchestrator in this 18.2.2 cluster uses the image
> quay.io/ceph/nvmeof:0.0.2 which is 9 months old.
> 
> When I try to redeploy the daemon with the latest image
> ceph orch daemon redeploy nvmeof.nvme01.cephtest29.gookea --image
> quay.io/ceph/nvmeof:latest
> it tells me:
> 
> Error EINVAL: Cannot redeploy nvmeof.nvme01.cephtest29.gookea with a new 
> image:
> Supported types are: mgr, mon, crash, osd, mds, rgw, rbd-mirror, 
> cephfs-mirror,
> ceph-exporter, iscsi, nfs
> 
> How do I set the container image for this service?
> 
> ceph config set nvmeof container_image quay.io/ceph/nvmeof:latest
> 
> does not work with Error EINVAL: unrecognized config target 'nvmeof'
> 
> Regards
> --
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: User + Dev Meetup Tomorrow!

2024-05-24 Thread Frédéric Nass
Hello Sebastian,

I just checked the survey and you're right, the issue was within the question. 
Got me a bit confused when I read it but I clicked anyway. Who doesn't like 
clicking? :-D

What best describes your deployment target? *
1/ Bare metal (RPMs/Binary)
2/ Containers (cephadm/Rook)
3/ Both

How funny is that.

Apart from that, I was thinking of some users who reported finding the orchestrator 
a little obscure in its operation/decisions, particularly with regard to the 
creation of OSDs.

A nice feature would be to have a history of what the orchestrator did with the 
result of its action and the reason (in case of failure).
A 'ceph orch history' for example (or ceph orch status --details or --history 
or whatever). It would be much easier to read than the MGR's very verbose 
ceph.cephadm.log.

Like for example:

$ ceph orch history
DATE/TIME                     TASK                                                         HOSTS                         RESULT
2024-05-24T10:40:44.866148Z   Applying tuned-profile latency-performance                   voltaire,lafontaine,rimbaud   SUCCESS
2024-05-24T10:39:44.866148Z   Applying mds.cephfs spec                                     verlaine,hugo                 SUCCESS
2024-05-24T10:33:44.866148Z   Applying service osd.osd_nodes_fifteen on host lamartine...  lamartine                     FAILED (host has _no_schedule label)
2024-05-24T10:28:44.866148Z   Applying service rgw.s31 spec                                eluard,baudelaire             SUCCESS

We'd just have to "watch ceph orch history" and see what the orchestrator does 
in real time.
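
In the meantime, the closest thing I know of is the cephadm cluster log channel, which is already queryable, just far more verbose (a sketch):

ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm                    # follow what the orchestrator does, live
ceph log last 100 debug cephadm    # or look back at the last events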

Cheers,
Frédéric.

- On 24 May 24, at 15:07, Sebastian Wagner sebastian.wag...@croit.io wrote:

> Hi Frédéric,
> 
> I agree. Maybe we should re-frame things? Containers can run on
> bare-metal and containers can run virtualized. And distribution packages
> can run bare-metal and virtualized as well.
> 
> What about asking independently about:
> 
>  * Do you run containers or distribution packages?
>  * Do you run bare-metal or virtualized?
> 
> Best,
> Sebastian
> 
> Am 24.05.24 um 12:28 schrieb Frédéric Nass:
>> Hello everyone,
>>
>> Nice talk yesterday. :-)
>>
>> Regarding containers vs RPMs and orchestration, and the related discussion 
>> from
>> yesterday, I wanted to share a few things (which I wasn't able to share
>> yesterday on the call due to a headset/bluetooth stack issue) to explain why 
>> we
>> use cephadm and ceph orch these days with bare-metal clusters even though, as
>> someone said, cephadm was not supposed to work with (nor support) bare-metal
>> clusters (which actually surprised me since cephadm is all about managing
>> containers on a host, regardless of its type). I also think this explains the
>> observation that was made that half of the reports (iirc) are supposedly 
>> using
>> cephadm with bare-metal clusters.
>>
>> Over the years, we've deployed and managed bare-metal clusters with 
>> ceph-deploy
>> in Hammer, then switched to ceph-ansible (take-over-existing-cluster.yml) 
>> with
>> Jewel (or was it Luminous?), and then moved to cephadm, cephadm-ansible and
>> ceph-orch with Pacific, to manage the exact same bare-metal cluster. I guess
>> this explains why some bare-metal cluster today are managed using cephadm.
>> These are not new clusters deployed with Rook in K8s environments, but 
>> existing
>> bare-metal clusters that continue to servce brilliantly 10 years after
>> installation.
>>
>> Regarding rpms vs containers, as mentioned during the call, not sure why one
>> would still want to use rpms vs containers considering the simplicity and
>> velocity that containers offer regarding upgrades with ceph orch clever
>> automation. Some reported performance reasons between rpms vs containers,
>> meaning rpms binaries would perform better than containers. Is there any
>> evidence of that?
>>
>> Perhaps the reason why people still use RPMs is instead that they have 
>> invested
>> a lot of time and effort into developing automation tools/scripts/playbooks 
>> for
>> RPMs installations and they consider the transition to ceph orch and
>> containerized environments as a significant challenge.
>>
>> Regarding containerized Ceph, I remember asking Sage for a minimalist CephOS
>> back in 2018 (there was no containers by that time). IIRC, he said 
>> maintaining
>> a ceph-specific Linux distro would take too much time and resources, so it 
>> was
>> not something considered at that time. Now that Ceph is all containers, I
>> really hope that a minimalist rolling Ceph distro comes out one day. ceph 

[ceph-users] Re: User + Dev Meetup Tomorrow!

2024-05-24 Thread Frédéric Nass
Hello everyone,

Nice talk yesterday. :-)

Regarding containers vs RPMs and orchestration, and the related discussion from 
yesterday, I wanted to share a few things (which I wasn't able to share 
yesterday on the call due to a headset/bluetooth stack issue) to explain why we 
use cephadm and ceph orch these days with bare-metal clusters even though, as 
someone said, cephadm was not supposed to work with (nor support) bare-metal 
clusters (which actually surprised me since cephadm is all about managing 
containers on a host, regardless of its type). I also think this explains the 
observation that was made that half of the reports (iirc) are supposedly using 
cephadm with bare-metal clusters.

Over the years, we've deployed and managed bare-metal clusters with ceph-deploy 
in Hammer, then switched to ceph-ansible (take-over-existing-cluster.yml) with 
Jewel (or was it Luminous?), and then moved to cephadm, cephadm-ansible and 
ceph-orch with Pacific, to manage the exact same bare-metal cluster. I guess 
this explains why some bare-metal clusters today are managed using cephadm. 
These are not new clusters deployed with Rook in K8s environments, but existing 
bare-metal clusters that continue to serve brilliantly 10 years after 
installation.

Regarding RPMs vs containers, as mentioned during the call, I'm not sure why one 
would still want to use RPMs over containers, considering the simplicity and 
velocity that containers offer for upgrades with ceph orch's clever automation. 
Some have reported performance reasons for RPMs vs containers, meaning 
RPM-installed binaries would perform better than containerized ones. Is there any 
evidence of that?

Perhaps the reason why people still use RPMs is instead that they have invested 
a lot of time and effort into developing automation tools/scripts/playbooks for 
RPM installations, and they consider the transition to ceph orch and 
containerized environments a significant challenge.

Regarding containerized Ceph, I remember asking Sage for a minimalist CephOS 
back in 2018 (there were no containers back then). IIRC, he said maintaining 
a ceph-specific Linux distro would take too much time and resources, so it was 
not something considered at that time. Now that Ceph is all containers, I 
really hope that a minimalist rolling Ceph distro comes out one day. ceph orch 
could even handle rare distro upgrades such as kernel upgrades as well as 
ordered reboots. This would make Ceph clusters much easier to maintain over 
time (compared to the last complicated upgrade path from non-containerized 
RHEL7+RHCS4.3 to containerized RHEL9+RHCS5.2 that we had to follow a year ago).

Bests,
Frédéric.

- On 23 May 24, at 15:58, Laura Flores lflo...@redhat.com wrote:

> Hi all,
> 
> The meeting will be starting shortly! Join us at this link:
> https://meet.jit.si/ceph-user-dev-monthly
> 
> - Laura
> 
> On Wed, May 22, 2024 at 2:55 PM Laura Flores  wrote:
> 
>> Hi all,
>>
>> The User + Dev Meetup will be held tomorrow at 10:00 AM EDT. We will be
>> discussing the results of the latest survey, and users who attend will have
>> the opportunity to provide additional feedback in real time.
>>
>> See you there!
>> Laura Flores
>>
>> Meeting Details:
>> https://www.meetup.com/ceph-user-group/events/300883526/
>>
>> --
>>
>> Laura Flores
>>
>> She/Her/Hers
>>
>> Software Engineer, Ceph Storage 
>>
>> Chicago, IL
>>
>> lflo...@ibm.com | lflo...@redhat.com 
>> M: +17087388804
>>
>>
>>
> 
> --
> 
> Laura Flores
> 
> She/Her/Hers
> 
> Software Engineer, Ceph Storage 
> 
> Chicago, IL
> 
> lflo...@ibm.com | lflo...@redhat.com 
> M: +17087388804
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem with take-over-existing-cluster.yml playbook

2024-05-14 Thread Frédéric Nass
Vlad,

Can you double-check that you've set public_network correctly in the all.yaml file 
and removed it from ceph_conf_overrides? It looks like it can't find the MONs' IPs 
in the public_network range.

Most of the time, you'd get around this by adding/removing/changing 
settings in all.yaml and/or group_vars/*.yaml files.

You can also try adding multiple -v flags to the ansible-playbook command and see 
if you get something useful.
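
For example (a sketch, assuming the default inventory location):

ansible-playbook -i /etc/ansible/hosts -vvv take-over-existing-cluster.yml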

Regards,
Frédéric.



From: vladimir franciz blando 
Sent: Tuesday, May 14, 2024 21:23
To: Frédéric Nass
Cc: Eugen Block; ceph-users 
Subject: Re: [ceph-users] Re: Problem with take-over-existing-cluster.yml 
playbook

Yes. I copied and ran the take over script on the root dir of ceph-ansible. 

Regards,

Vladimir Franciz S. Blando 
about.me/vblando
***
Sent from Mobile Gmail


On Wed, May 15, 2024 at 3:16 AM Frédéric Nass  
wrote:
>
> Vlad,
>
> Can you make sure take-over-existing-cluster.yml is in the root directory of 
> ceph-ansible (/usr/share/ceph-ansible) when you run it (as per step 10. of 
> the documentation)?
>
> Regards,
> Frédéric.
>
> - Le 14 Mai 24, à 19:31, vladimir franciz blando 
>  a écrit :
>>
>> Hi,
>>
>> That didn't work either.
>>
>> Regards,
>> Vlad Blando
>>
>> On Tue, May 14, 2024 at 4:10 PM Frédéric Nass 
>>  wrote:
>>>
>>>
>>> Hello Vlad,
>>>
>>> We've seen this before a while back. Not sure to recall how we got around 
>>> this but you might want to try setting 'ip_version: ipv4' in your all.yaml 
>>> file since this seems to be a condition to the facts setting.
>>>
>>> - name: Set_fact _monitor_addresses - ipv4
>>>   ansible.builtin.set_fact:
>>>     _monitor_addresses: "{{ _monitor_addresses | default({}) | 
>>> combine({item: hostvars[item]['ansible_facts']['all_ipv4_addresses'] | 
>>> ips_in_ranges(hostvars[item]['public_network'].split(',')) | first}) }}"
>>>   with_items: "{{ groups.get(mon_group_name, []) }}"
>>>   when:
>>>     - ip_version == 'ipv4'
>>>
>>> I can see we set it in our old all.yaml file.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>> - Le 13 Mai 24, à 14:19, vladimir franciz blando 
>>> vladimir.bla...@gmail.com a écrit :
>>>
>>> > Hi,
>>> >
>>> > If I follow the guide, it only says to define the mons on the ansible 
>>> > hosts
>>> > files under the section [mons] which I did with this example (not real ip)
>>> >
>>> > [mons]
>>> > vlad-ceph1 monitor_address=192.168.1.1 ansible_user=ceph
>>> > vlad-ceph2 monitor_address=192.168.1.2 ansible_user=ceph
>>> > vlad-ceph3 monitor_address=192.168.1.3 ansible_user=ceph
>>> >
>>> >
>>> > Regards,
>>> > Vlad Blando <https://about.me/vblando>
>>> >
>>> >
>>> > On Wed, May 8, 2024 at 6:22 PM Eugen Block  wrote:
>>> >
>>> >> Hi,
>>> >>
>>> >> I'm not familiar with ceph-ansible. I'm not sure if I understand it
>>> >> correctly, according to [1] it tries to get the public IP range to
>>> >> define monitors (?). Can you verify if your mon sections in
>>> >> /etc/ansible/hosts are correct?
>>> >>
>>> >> ansible.builtin.set_fact:
>>> >>      _monitor_addresses: "{{ _monitor_addresses | default({}) |
>>> >> combine({item: hostvars[item]['ansible_facts']['all_ipv4_addresses'] |
>>> >> ips_in_ranges(hostvars[item]['public_network'].split(',')) | first}) }}"
>>> >>
>>> >> [1]
>>> >>
>>> >> https://github.com/ceph/ceph-ansible/blob/878cce5b4847a9a112f9d07c0fd651aa15f1e58b/roles/ceph-facts/tasks/set_monitor_address.yml
>>> >>
>>> >> Zitat von vladimir franciz blando :
>>> >>
>>> >> > I know that only a few are using this script but just trying my luck 
>>> >> > here
>>> >> > if someone has the same issue as mine.
>>> >> >
>>> >> > But first, who has successfully used this script and what version did 
>>> >> > you
>>> >> > use? Im using this guide on my test environment (
>>> >> >
>>> >> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/importing-an-existing-ceph-cluster-to-ansible

[ceph-users] Re: Problem with take-over-existing-cluster.yml playbook

2024-05-14 Thread Frédéric Nass
Vlad, 

Can you make sure take-over-existing-cluster.yml is in the root directory of 
ceph-ansible (/usr/share/ceph-ansible) when you run it (as per step 10. of the 
documentation)? 

Regards, 
Frédéric. 

- On 14 May 24, at 19:31, vladimir franciz blando wrote: 

> Hi,

> That didn't work either.

> Regards,
> [ https://about.me/vblando | Vlad Blando ]

> On Tue, May 14, 2024 at 4:10 PM Frédéric Nass < [
> mailto:frederic.n...@univ-lorraine.fr | frederic.n...@univ-lorraine.fr ] >
> wrote:

>> Hello Vlad,

>> We've seen this before a while back. Not sure to recall how we got around 
>> this
>> but you might want to try setting 'ip_version: ipv4' in your all.yaml file
>> since this seems to be a condition to the facts setting.

>> - name: Set_fact _monitor_addresses - ipv4
>> ansible.builtin.set_fact:
>> _monitor_addresses: "{ { _monitor_addresses | default({}) | combine({item:
>> hostvars[item]['ansible_facts']['all_ipv4_addresses'] |
>> ips_in_ranges(hostvars[item]['public_network'].split(',')) | first}) }}"
>> with_items: "{ { groups.get(mon_group_name, []) }}"
>> when:
>> - ip_version == 'ipv4'

>> I can see we set it in our old all.yaml file.

>> Regards,
>> Frédéric.

>> - Le 13 Mai 24, à 14:19, vladimir franciz blando [
>> mailto:vladimir.bla...@gmail.com | vladimir.bla...@gmail.com ] a écrit :

>> > Hi,

>> > If I follow the guide, it only says to define the mons on the ansible hosts
>> > files under the section [mons] which I did with this example (not real ip)

>> > [mons]
>> > vlad-ceph1 monitor_address=192.168.1.1 ansible_user=ceph
>> > vlad-ceph2 monitor_address=192.168.1.2 ansible_user=ceph
>> > vlad-ceph3 monitor_address=192.168.1.3 ansible_user=ceph


>> > Regards,
>> > Vlad Blando < [ https://about.me/vblando | https://about.me/vblando ] >


>>> On Wed, May 8, 2024 at 6:22 PM Eugen Block < [ mailto:ebl...@nde.ag |
>> > ebl...@nde.ag ] > wrote:

>> >> Hi,

>> >> I'm not familiar with ceph-ansible. I'm not sure if I understand it
>> >> correctly, according to [1] it tries to get the public IP range to
>> >> define monitors (?). Can you verify if your mon sections in
>> >> /etc/ansible/hosts are correct?

>> >> ansible.builtin.set_fact:
>> >> _monitor_addresses: "{ { _monitor_addresses | default({}) |
>> >> combine({item: hostvars[item]['ansible_facts']['all_ipv4_addresses'] |
>> >> ips_in_ranges(hostvars[item]['public_network'].split(',')) | first}) }}"

>> >> [1]

>>>> [
>>>> https://github.com/ceph/ceph-ansible/blob/878cce5b4847a9a112f9d07c0fd651aa15f1e58b/roles/ceph-facts/tasks/set_monitor_address.yml
>>>> |
>>>> https://github.com/ceph/ceph-ansible/blob/878cce5b4847a9a112f9d07c0fd651aa15f1e58b/roles/ceph-facts/tasks/set_monitor_address.yml
>> >> ]

>>>> Zitat von vladimir franciz blando < [ mailto:vladimir.bla...@gmail.com |
>> >> vladimir.bla...@gmail.com ] >:

>> >> > I know that only a few are using this script but just trying my luck 
>> >> > here
>> >> > if someone has the same issue as mine.

>> >> > But first, who has successfully used this script and what version did 
>> >> > you
>> >> > use? Im using this guide on my test environment (

>>>> [
>>>> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/importing-an-existing-ceph-cluster-to-ansible
>>>> |
>>>> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/importing-an-existing-ceph-cluster-to-ansible
>> >> ]
>> >> > )

>> >> > Error encountered
>> >> > ---
>> >> > TASK [Generate ceph configuration file]
>> >> > **


>> >> ***
>> >> > fatal: [vladceph-1]: FAILED! =>
>> >> > msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is
>> >> > undefined'
>> >> > fatal: [vladceph-3]: FAILED! =>
>> >> > msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is
>> >> > undefined'
>> >> > fatal: [vladceph-2]: FAILED! =>
>> >> > msg: '''_monitor_addresses'' is undefined. ''_monitor_

[ceph-users] Re: Problem with take-over-existing-cluster.yml playbook

2024-05-14 Thread Frédéric Nass

Hello Vlad,

We've seen this before a while back. I don't quite recall how we got around it, but 
you might want to try setting 'ip_version: ipv4' in your all.yaml file, since this 
seems to be a condition for the fact to be set.

- name: Set_fact _monitor_addresses - ipv4
  ansible.builtin.set_fact:
    _monitor_addresses: "{{ _monitor_addresses | default({}) | combine({item: hostvars[item]['ansible_facts']['all_ipv4_addresses'] | ips_in_ranges(hostvars[item]['public_network'].split(',')) | first}) }}"
  with_items: "{{ groups.get(mon_group_name, []) }}"
  when:
    - ip_version == 'ipv4'

I can see we set it in our old all.yaml file.
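
As a sketch, the relevant bits of all.yaml would look something like this (the subnet is an example value to adapt to your public network):

ip_version: ipv4
public_network: 192.168.1.0/24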

Regards,
Frédéric.

- On 13 May 24, at 14:19, vladimir franciz blando vladimir.bla...@gmail.com 
wrote:

> Hi,
> 
> If I follow the guide, it only says to define the mons on the ansible hosts
> files under the section [mons] which I did with this example (not real ip)
> 
> [mons]
> vlad-ceph1 monitor_address=192.168.1.1 ansible_user=ceph
> vlad-ceph2 monitor_address=192.168.1.2 ansible_user=ceph
> vlad-ceph3 monitor_address=192.168.1.3 ansible_user=ceph
> 
> 
> Regards,
> Vlad Blando 
> 
> 
> On Wed, May 8, 2024 at 6:22 PM Eugen Block  wrote:
> 
>> Hi,
>>
>> I'm not familiar with ceph-ansible. I'm not sure if I understand it
>> correctly, according to [1] it tries to get the public IP range to
>> define monitors (?). Can you verify if your mon sections in
>> /etc/ansible/hosts are correct?
>>
>> ansible.builtin.set_fact:
>>  _monitor_addresses: "{{ _monitor_addresses | default({}) |
>> combine({item: hostvars[item]['ansible_facts']['all_ipv4_addresses'] |
>> ips_in_ranges(hostvars[item]['public_network'].split(',')) | first}) }}"
>>
>> [1]
>>
>> https://github.com/ceph/ceph-ansible/blob/878cce5b4847a9a112f9d07c0fd651aa15f1e58b/roles/ceph-facts/tasks/set_monitor_address.yml
>>
>> Zitat von vladimir franciz blando :
>>
>> > I know that only a few are using this script but just trying my luck here
>> > if someone has the same issue as mine.
>> >
>> > But first, who has successfully used this script and what version did you
>> > use? Im using this guide on my test environment (
>> >
>> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/importing-an-existing-ceph-cluster-to-ansible
>> > )
>> >
>> > Error encountered
>> > ---
>> > TASK [Generate ceph configuration file]
>> > **
>> >
>> >
>> ***
>> > fatal: [vladceph-1]: FAILED! =>
>> >   msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is
>> > undefined'
>> > fatal: [vladceph-3]: FAILED! =>
>> >   msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is
>> > undefined'
>> > fatal: [vladceph-2]: FAILED! =>
>> >   msg: '''_monitor_addresses'' is undefined. ''_monitor_addresses'' is
>> > undefined'
>> > ---
>> >
>> >
>> >
>> > Regards,
>> > Vlad Blando 
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS crash

2024-04-26 Thread Frédéric Nass
Hello,

'almost all diagnostic ceph subcommands hang!' -> this rang a bell. We've had a 
similar issue with many ceph commands hanging due to a missing L3 ACL between 
the MGRs and a new MDS machine that we had added to the cluster.

I second Eugen's analysis: network issue, whatever the OSI layer.
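
A quick sanity check from the MGR hosts towards the new MDS machine could look like this (a sketch; the hostname is a placeholder):

ping -c 3 <mds-host>
nc -zv <mds-host> 6800    # Ceph daemons bind to ports in the 6800-7300 range by default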

Regards,
Frédéric.

- On 26 Apr 24, at 9:31, Eugen Block ebl...@nde.ag wrote:

> Hi, it's unlikely that all OSDs fail at the same time, it seems like a
> network issue. Do you have an active MGR? Just a couple of days ago
> someone reported incorrect OSD stats because no MGR was up. Although
> your 'ceph health detail' output doesn't mention that, there are still
> issues when MGR processes are active according to ceph but don't
> respond anymore.
> I would probably start with basic network debugging, e. g. iperf,
> pings on public and cluster networks (if present) and so on.
> 
> Regards,
> Eugen
> 
> Zitat von Alexey GERASIMOV :
> 
>> Colleagues, I have the update.
>>
>> Starting from yestrerday the situation with ceph health is much
>> worse than it was previously.
>> We found that
>> - ceph -s informs us that some PGs are in stale state
>> -  almost all diagnostic ceph subcommands hang! For example, "ceph
>> osd ls" , "ceph osd dump",  "ceph osd tree", "ceph health detail"
>> provide the output - but "ceph osd status", all the commands "ceph
>> pg ..." and other ones hang.
>>
>> So, it looks that the crashes of MDS daemons were the first signs of
>> problems only.
>> I read that "stale" state for PGs means that all nodes storing this
>> placement group may be down - but it's wrong, all osd daemons are up
>> on all three nodes:
>>
>> --- ceph osd tree
>> ID  CLASS  WEIGHTTYPE NAME STATUS  REWEIGHT  PRI-AFF
>> -1 68.05609  root default
>> -3 22.68536  host asrv-dev-stor-1
>>  0hdd   5.45799  osd.0 up   1.0  1.0
>>  1hdd   5.45799  osd.1 up   1.0  1.0
>>  2hdd   5.45799  osd.2 up   1.0  1.0
>>  3hdd   5.45799  osd.3 up   1.0  1.0
>> 12ssd   0.42670  osd.12up   1.0  1.0
>> 13ssd   0.42670  osd.13up   1.0  1.0
>> -5 22.68536  host asrv-dev-stor-2
>>  4hdd   5.45799  osd.4 up   1.0  1.0
>>  5hdd   5.45799  osd.5 up   1.0  1.0
>>  6hdd   5.45799  osd.6 up   1.0  1.0
>>  7hdd   5.45799  osd.7 up   1.0  1.0
>> 14ssd   0.42670  osd.14up   1.0  1.0
>> 15ssd   0.42670  osd.15up   1.0  1.0
>> -7 22.68536  host asrv-dev-stor-3
>>  8hdd   5.45799  osd.8 up   1.0  1.0
>> 10hdd   5.45799  osd.10up   1.0  1.0
>> 11hdd   5.45799  osd.11up   1.0  1.0
>> 18hdd   5.45799  osd.18up   1.0  1.0
>> 16ssd   0.42670  osd.16up   1.0  1.0
>> 17ssd   0.42670  osd.17up   1.0  1.0
>>
>> May it be the physical problem with our drives? "smartctl -a"
>> informs nothing wrong.  We started the surface check using dd
>> command also but it will be 7 hours per drive at least...
>>
>> What should we do also?
>>
>> The output of  "ceph health detail":
>>
>> ceph health detail
>> HEALTH_ERR 1 MDSs report damaged metadata; insufficient standby MDS
>> daemons available; Reduced data availability: 50 pgs stale; 90
>> daemons have recently crashed; 3 mgr modules have recently crashed
>> [ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>> mds.asrv-dev-stor-2(mds.0): Metadata damage detected
>> [WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
>> have 0; want 1 more
>> [WRN] PG_AVAILABILITY: Reduced data availability: 50 pgs stale
>> pg 5.0 is stuck stale for 67m, current state stale+active+clean,
>> last acting [4,1,11]
>> pg 5.13 is stuck stale for 67m, current state
>> stale+active+clean, last acting [4,0,10]
>> pg 5.18 is stuck stale for 67m, current state
>> stale+active+clean, last acting [4,11,2]
>> pg 5.19 is stuck stale for 67m, current state
>> stale+active+clean, last acting [4,3,10]
>> pg 5.1e is stuck stale for 10h, current state
>> stale+active+clean, last acting [0,7,11]
>> pg 5.22 is stuck stale for 10h, current state
>> stale+active+clean, last acting [0,6,18]
>> pg 5.26 is stuck stale for 67m, current state
>> stale+active+clean, last acting [4,1,18]
>> pg 5.29 is stuck stale for 10h, current state
>> stale+active+clean, last acting [0,11,6]
>> pg 5.2b is stuck stale for 10h, current state
>> stale+active+clean, last acting [0,18,6]
>> pg 5.30 is stuck stale 

[ceph-users] Re: Impact of large PG splits

2024-04-25 Thread Frédéric Nass

Hello Eugen,

Thanks for sharing the good news. Did you have to raise mon_osd_nearfull_ratio 
temporarily? 

Frédéric.

- On 25 Apr 24, at 12:35, Eugen Block ebl...@nde.ag wrote:

> For those interested, just a short update: the split process is
> approaching its end, two days ago there were around 230 PGs left
> (target are 4096 PGs). So far there were no complaints, no cluster
> impact was reported (the cluster load is quite moderate, but still
> sensitive). Every now and then a single OSD (not the same) reaches 85%
> nearfull ratio, but that was expected since the first nearfull OSD was
> the root cause of this operation. I expect the balancer to kick in as
> soon as the backfill has completed or when there are less than 5%
> misplaced objects.
> 
> Zitat von Anthony D'Atri :
> 
>> One can up the ratios temporarily but it's all too easy to forget to
>> reduce them later, or think that it's okay to run all the time with
>> reduced headroom.
>>
>> Until a host blows up and you don't have enough space to recover into.
>>
>>> On Apr 12, 2024, at 05:01, Frédéric Nass
>>>  wrote:
>>>
>>>
>>> Oh, and yeah, considering "The fullest OSD is already at 85% usage"
>>> best move for now would be to add new hardware/OSDs (to avoid
>>> reaching the backfill too full limit), prior to start the splitting
>>> PGs before or after enabling upmap balancer depending on how the
>>> PGs got rebalanced (well enough or not) after adding new OSDs.
>>>
>>> BTW, what ceph version is this? You should make sure you're running
>>> v16.2.11+ or v17.2.4+ before splitting PGs to avoid this nasty bug:
>>> https://tracker.ceph.com/issues/53729
>>>
>>> Cheers,
>>> Frédéric.
>>>
>>> - Le 12 Avr 24, à 10:41, Frédéric Nass
>>> frederic.n...@univ-lorraine.fr a écrit :
>>>
>>>> Hello Eugen,
>>>>
>>>> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph
>>>> daemon osd.0
>>>> config show | grep osd_op_queue)
>>>>
>>>> If WPQ, you might want to tune osd_recovery_sleep* values as they
>>>> do have a real
>>>> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
>>>> before doing that.
>>>> If mClock scheduler then you might want to use a specific mClock profile as
>>>> suggested by Gregory (as osd_recovery_sleep* are not considered when using
>>>> mClock).
>>>>
>>>> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
>>>> cluster only has 240, increasing osd_max_backfills to any values
>>>> higher than
>>>> 2-3 will not help much with the recovery/backfilling speed.
>>>>
>>>> All the way, you'll have to be patient. :-)
>>>>
>>>> Cheers,
>>>> Frédéric.
>>>>
>>>> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
>>>>
>>>>> Thank you for input!
>>>>> We started the split with max_backfills = 1 and watched for a few
>>>>> minutes, then gradually increased it to 8. Now it's backfilling with
>>>>> around 180 MB/s, not really much but since client impact has to be
>>>>> avoided if possible, we decided to let that run for a couple of hours.
>>>>> Then reevaluate the situation and maybe increase the backfills a bit
>>>>> more.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Zitat von Gregory Orange :
>>>>>
>>>>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>>>>>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>>>>>> objects. We are splitting for the same reason as you - improved
>>>>>> balance. We also thought long and hard before we began, concerned
>>>>>> about impact, stability etc.
>>>>>>
>>>>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>>>>> retain some control and stop it again fairly quickly if we weren't
>>>>>> happy with the behaviour. It also serves to limit the performance
>>>>>> impact on the cluster, but unfortunately it also makes the whole
>>>>>> process slower.
>>>>>>
>>>>>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>>>>>> issues with the cluster. We could go higher, but are not in a rush

[ceph-users] Re: Orchestrator not automating services / OSD issue

2024-04-24 Thread Frédéric Nass
Hello Michael,

You can try this:

1/ Check that the host shows up in ceph orch host ls with the right label 'osds'.
2/ Check that the host is OK with ceph cephadm check-host <hostname>. It should 
look like:
<hostname> (None) ok
podman (/usr/bin/podman) version 4.6.1 is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Hostname "<hostname>" matches what is expected.
Host looks OK
3/ Double-check your service_type 'osd' with ceph orch ls --service-type osd 
--export (an example spec is sketched below).
It should show the correct placement and spec (drive sizes, etc.)
4/ Enable debugging with ceph config set mgr mgr/cephadm/log_to_cluster_level 
debug
5/ Open a terminal and observe ceph -W cephadm --watch-debug
6/ ceph mgr fail
7/ ceph orch device ls --hostname=<hostname> --wide --refresh (should show 
local block devices as Available and trigger the creation of the OSDs)

If your service_type 'osd' is correct, the orchestrator should deploy OSDs on 
the node.
If it does not, look for the reason in the ceph -W cephadm --watch-debug 
output.
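
For reference, a minimal OSD spec as exported in step 3 could look something like this (a sketch; service_id and placement will differ in your cluster):

service_type: osd
service_id: osd_nodes
placement:
  label: osds
spec:
  data_devices:
    all: true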

Regards,
Frédéric.

- On 24 Apr 24, at 3:22, Michael Baer c...@mikesoffice.com wrote:

> Hi,
> 
> This problem started with trying to add a new storage server into a
> quincy v17.2.6 ceph cluster. Whatever I did, I could not add the drives
> on the new host as OSDs: via dashboard, via cephadm shell, by setting
> osd unmanaged to false.
> 
> But what I started realizing is that orchestrator will also no longer
> automatically manage services. I.e. if a service is set to manage by
> labels, removing and adding labels to different hosts for that service
> has no affect. Same if I set a service to be manage via hostnames. Same
> if I try to drain a host (the services/podman containers just keep
> running). Although, I am able to add/rm services via 'cephadm shell ceph
> orch daemon add/rm'. But Ceph will not manage automatically using
> labels/hostnames.
> 
> This apparently includes OSD daemons. I can not create and on the new
> host either automatically or manually, but I'm hoping the services/OSD
> issues are related and not two issues.
> 
> I haven't been able to find any obvious errors in /var/log/ceph,
> /var/log/syslog, logs , etc. I have been able to get 'slow
> ops' errors on monitors by trying to add OSDs manually (and having to
> restart the monitor). I've also gotten cephadm shell to hang. And had to
> restart managers. I'm not an expert and it could be something obvious,
> but I haven't been able to figure out a solution. If anyone has any
> suggestions, I would greatly appreciate them.
> 
> Thanks,
> Mike
> 
> --
> Michael Baer
> c...@mikesoffice.com
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-23 Thread Frédéric Nass
Exactly, strong consistency is why we chose Ceph over other SDS solutions back 
in 2014 (and disabled any non-persistent cache along the IO path, like HDD disk 
cache).
A major power outage in our town a few years back (a few days before Christmas) 
and a UPS malfunction proved us right.

Another reason to adopt Ceph today is that a cluster you build today to match a 
specific workload (let's say capacity) will accommodate any future workloads 
(for example performance) you may have tomorrow, simply by adding suitable 
nodes to the cluster, whatever the hardware looks like in the decades to come.

Regards,
Frédéric.

- On 23 Apr 24, at 13:04, Janne Johansson icepic...@gmail.com wrote:

> Den tis 23 apr. 2024 kl 11:32 skrev Frédéric Nass
> :
>> Ceph is strongly consistent. Either you read/write objects/blocs/files with 
>> an
>> insured strong consistency OR you don't. Worst thing you can expect from 
>> Ceph,
>> as long as it's been properly designed, configured and operated is a 
>> temporary
>> loss of access to the data.
> 
> This is often more important than you think. All centralized storage
> systems will have to face some kind of latency when sending data over
> the network, when splitting the data into replicas or erasure coding
> shards, when waiting for all copies/shards are actually finished
> written (perhaps via journals) to the final destination and then
> lastly for the write to be acknowledged back to the writing client. If
> some vendor says that "because of our special code, this part takes
> zero time", they are basically telling you that they are lying about
> the status of the write in order to finish more quickly, because this
> wins them contracts or wins competitions.
> 
> It will not win you any smiles when there is an incident and data that
> was ACKed to be on disk suddenly isn't because some write cache lost
> power at the same time as the storage box and now some database have
> half-written transactions in it. Ceph is by no means the fastest
> possible way to store data on a network, but it is very good while
> still retaining the strong consistencies mentioned by Frederic above
> allowing for many clients to do many IOs in parallel against the
> cluster.
> 
> --
> May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-23 Thread Frédéric Nass
Hello,

My turn ;-)

Ceph is strongly consistent. Either you read/write objects/blocks/files with 
ensured strong consistency OR you don't. The worst thing you can expect from Ceph, 
as long as it's been properly designed, configured and operated, is a temporary 
loss of access to the data.

There are now a few companies in the world with deep knowledge of Ceph that are 
designing, deploying and operating Ceph clusters in the best way for their 
customers, contributing to the leadership and development of Ceph at the 
highest level, some of them even offering their own downstream version of Ceph, 
ensuring customers are operating the most up-to-date, stable and best 
performing version of Ceph.

In the long term, it is more interesting to invest in software and in reliable, 
responsive support, attentive to customers, capable of pushing certain 
developments to improve Ceph and match customers' needs than to buy overpriced 
hardware, with limited functionalities and lifespan, from vendors not always 
paying attention to how customers use their products.

Regards,
Frédéric.


- On 17 Apr 24, at 17:06, sebcio t sebci...@o2.pl wrote:

> Hi,
> I have problem to answer to this question:
> Why CEPH is better than other storage solutions?
> 
> I know this high level texts about
> - scalability,
> - flexibility,
> - distributed,
> - cost-Effectiveness
> 
> What convince me, but could be received also against, is ceph as a product has
> everything what I need it mean:
> block storage (RBD),
> file storage (CephFS),
> object storage (S3, Swift)
> and "plugins" to run NFS, NVMe over Fabric, NFS on object storage.
> 
> Also many other features which are usually sold as a option (mirroring, geo
> replication, etc) in paid solutions.
> I have problem to write it done piece by piece.
> I want convince my managers we are going in good direction.
> 
> Why not something from robin.io or purestorage, netapp, dell/EMC. From
> opensource longhorn or openEBS.
> 
> If you have ideas please write it.
> 
> Thanks,
> S.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm custom jinja2 service templates

2024-04-17 Thread Frédéric Nass
Hello Felix, 

You can download haproxy.cfg.j2 and keepalived.conf.j2 from here [1], tweak 
them to your needs and set them via: 

ceph config-key set mgr/cephadm/services/ingress/haproxy.cfg -i haproxy.cfg.j2 
ceph config-key set mgr/cephadm/services/ingress/keepalived.conf -i 
keepalived.conf.j2 

Then redeploy the Ingress service: 

ceph orch redeploy <ingress-service-name> 
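
You can check what is currently stored, or remove the keys to fall back to the built-in templates (a sketch):

ceph config-key get mgr/cephadm/services/ingress/haproxy.cfg
ceph config-key rm mgr/cephadm/services/ingress/haproxy.cfg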

Regards, 
Frédéric. 

[1] 
https://github.com/ceph/ceph/tree/main/src/pybind/mgr/cephadm/templates/services/ingress
 

- On 17 Apr 24, at 16:31, Stolte, Felix wrote: 

> Hi folks,
> I would like to use a custom jina2 template for an ingress service for 
> rendering
> the keepalived and haproxy config. Can someone tell me how to override the
> default templates?

> Best regards
> Felix

> -
> -
> Forschungszentrum Juelich GmbH
> 52425 Juelich
> Sitz der Gesellschaft: Juelich
> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
> Vorsitzender des Aufsichtsrats: MinDir Stefan Müller
> Geschaeftsfuehrung: Prof. Dr. Astrid Lambrecht (Vorsitzende),
> Karsten Beneke (stellv. Vorsitzender), Dr. Ir. Pieter Jansens, Prof. Dr. 
> Frauke
> Melchior
> -
> -

> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent

2024-04-12 Thread Frédéric Nass


- On 12 Apr 24, at 15:17, Albert Shih albert.s...@obspm.fr wrote:

> Le 12/04/2024 à 12:56:12+0200, Frédéric Nass a écrit
>> 
> Hi,
> 
>> 
>> Have you check the hardware status of the involved drives other than with
>> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for 
>> DELL
>> hardware for example).
> 
> Yes, all my disk are «under» periodic check with smartctl + icinga.

Actually, I meant lower level tools (drive / server vendor tools).

> 
>> If these tools don't report any media error (that is bad blocs on disks) then
>> you might just be facing the bit rot phenomenon. But this is very rare and
>> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
>> professional poker player's lifetime. ;-)
>> 
>> If no media error is reported, then you might want to check and update the
>> firmware of all drives.
> 
> You're perfectly right.
> 
> It's just a newbie error, I check on the «main» osd of the PG (meaning the
> first in the list) but forget to check on other.
> 

Ok.

> On when server I indeed get some error on a disk.
> 
> But strangely smartctl report nothing. I will add a check with dmesg.

That's why I pointed you to the drive / server vendor tools earlier as 
sometimes smartctl is missing the information you want.

> 
>> 
>> Once you figured it out, you may enable osd_scrub_auto_repair=true to have 
>> these
>> inconsistencies repaired automatically on deep-scrubbing, but make sure 
>> you're
>> using the alert module [1] so to at least get informed about the scrub 
>> errors.
> 
> Thanks. I will look into because we got already icinga2 on site so I use
> icinga2 to check the cluster.
> 
> Is they are a list of what the alert module going to check ?

Basically the module checks for ceph status (ceph -s) changes.

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py

Regards,
Frédéric.

> 
> 
> Regards
> 
> JAS
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> ven. 12 avril 2024 15:13:13 CEST
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent

2024-04-12 Thread Frédéric Nass

Hello Albert,

Have you checked the hardware status of the involved drives other than with 
smartctl? Like with the manufacturer's tools / WebUI (iDRAC / perccli for DELL 
hardware, for example).
If these tools don't report any media error (that is, bad blocks on disks) then 
you might just be facing the bit rot phenomenon. But this is very rare and 
should happen in a sysadmin's lifetime about as often as a Royal Flush hand in a 
professional poker player's lifetime. ;-)

If no media error is reported, then you might want to check and update the 
firmware of all drives.

Once you've figured it out, you may enable osd_scrub_auto_repair=true to have 
these inconsistencies repaired automatically on deep-scrubbing, but make sure 
you're using the alerts module [1] so you at least get informed about the scrub 
errors.
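
As a sketch (the SMTP settings are placeholders to adapt):

ceph config set osd osd_scrub_auto_repair true
ceph mgr module enable alerts
ceph config set mgr mgr/alerts/smtp_host smtp.example.com
ceph config set mgr mgr/alerts/smtp_destination ceph-alerts@example.com
ceph config set mgr mgr/alerts/smtp_sender ceph-cluster@example.com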

Regards,
Frédéric.

[1] https://docs.ceph.com/en/latest/mgr/alerts/

- On 12 Apr 24, at 11:59, Albert Shih albert.s...@obspm.fr wrote:

> Hi everyone.
> 
> I got a warning with
> 
> root@cthulhu1:/etc/ceph# ceph -s
>  cluster:
>id: 9c5bb196-c212-11ee-84f3-c3f2beae892d
>health: HEALTH_ERR
>1 scrub errors
>Possible data damage: 1 pg inconsistent
> 
> So I find the pg with the issue, and launch a pg repair (still waiting)
> 
> But I try to find «why» so I check all the OSD related on this pg and
> didn't find anything, no error from osd daemon, no errors from smartctl, no
> error from the kernel message.
> 
> So I just like to know if that's «normal» or should I scratch deeper.
> 
> JAS
> --
> Albert SHIH 嶺 
> France
> Heure locale/Local time:
> ven. 12 avril 2024 11:51:37 CEST
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Frédéric Nass

Oh, and yeah, considering "The fullest OSD is already at 85% usage", the best move 
for now would be to add new hardware/OSDs (to avoid reaching the backfillfull 
limit) prior to starting the PG splits, and to enable the upmap balancer before or 
after, depending on how well (or not) the PGs get rebalanced after adding the new 
OSDs.

BTW, what ceph version is this? You should make sure you're running v16.2.11+ 
or v17.2.4+ before splitting PGs to avoid this nasty bug: 
https://tracker.ceph.com/issues/53729

Cheers,
Frédéric.

- On 12 Apr 24, at 10:41, Frédéric Nass frederic.n...@univ-lorraine.fr wrote:

> Hello Eugen,
> 
> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon 
> osd.0
> config show | grep osd_op_queue)
> 
> If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
> real
> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
> before doing that.
> If mClock scheduler then you might want to use a specific mClock profile as
> suggested by Gregory (as osd_recovery_sleep* are not considered when using
> mClock).
> 
> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
> cluster only has 240, increasing osd_max_backfills to any values higher than
> 2-3 will not help much with the recovery/backfilling speed.
> 
> All the way, you'll have to be patient. :-)
> 
> Cheers,
> Frédéric.
> 
> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
> 
>> Thank you for input!
>> We started the split with max_backfills = 1 and watched for a few
>> minutes, then gradually increased it to 8. Now it's backfilling with
>> around 180 MB/s, not really much but since client impact has to be
>> avoided if possible, we decided to let that run for a couple of hours.
>> Then reevaluate the situation and maybe increase the backfills a bit
>> more.
>> 
>> Thanks!
>> 
>> Zitat von Gregory Orange :
>> 
>>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>>> objects. We are splitting for the same reason as you - improved
>>> balance. We also thought long and hard before we began, concerned
>>> about impact, stability etc.
>>>
>>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>>> retain some control and stop it again fairly quickly if we weren't
>>> happy with the behaviour. It also serves to limit the performance
>>> impact on the cluster, but unfortunately it also makes the whole
>>> process slower.
>>>
>>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>>> issues with the cluster. We could go higher, but are not in a rush
>>> at this point. Sometimes nearfull osd warnings get high and MAX
>>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>>> interrupt it. So, we set pg_num to whatever the current value is
>>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>>> gets to work once the misplaced objects drop below the ratio, and
>>> things balance out. Nearfull osds drop usually to zero, and MAX
>>> AVAIL goes up again.
>>>
>>> The above behaviour is because while they share the same threshold
>>> setting, the autoscaler only runs every minute, and it won't run
>>> when misplaced are over the threshold. Meanwhile, checks for the
>>> next PG to split happen much more frequently, so the balancer never
>>> wins that race.
>>>
>>>
>>> We didn't know how long to expect it all to take, but decided that
>>> any improvement in PG size was worth starting. We now estimate it
>>> will take another 2-3 weeks to complete, for a total of 4-5 weeks
>>> total.
>>>
>>> We have lost a drive or two during the process, and of course
>>> degraded objects went up, and more backfilling work got going. We
>>> paused splits for at least one of those, to make sure the degraded
>>> objects were sorted out as quick as possible. We can't be sure it
>>> went any faster though - there's always a long tail on that sort of
>>> thing.
>>>
>>> Inconsistent objects are found at least a couple of times a week,
>>> and to get them repairing we disable scrubs, wait until they're
>>> stopped, then set the repair going and reenable scrubs. I don't know
>>> if this is special to the current higher splitting load, but we
>>> haven't noticed it before.
>>>
>>> HTH,
>>> Greg.
>>>
>>>

[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Frédéric Nass

Hello Eugen,

Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon osd.0 
config show | grep osd_op_queue)

If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
real impact on the recovery/backfilling speed. Just lower osd_max_backfills to 
1 before doing that.
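
With WPQ, that would look something like this (a sketch; the defaults are 0.1 for HDD, 0 for SSD and 0.025 for hybrid, lower values meaning faster recovery and more client impact):

ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_sleep_hdd 0.05
ceph config set osd osd_recovery_sleep_hybrid 0.01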
If mClock scheduler then you might want to use a specific mClock profile as 
suggested by Gregory (as osd_recovery_sleep* are not considered when using 
mClock).

Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this 
cluster only has 240, increasing osd_max_backfills to any values higher than 
2-3 will not help much with the recovery/backfilling speed.

Either way, you'll have to be patient. :-)

Cheers,
Frédéric.

- Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :

> Thank you for input!
> We started the split with max_backfills = 1 and watched for a few
> minutes, then gradually increased it to 8. Now it's backfilling with
> around 180 MB/s, not really much but since client impact has to be
> avoided if possible, we decided to let that run for a couple of hours.
> Then reevaluate the situation and maybe increase the backfills a bit
> more.
> 
> Thanks!
> 
> Zitat von Gregory Orange :
> 
>> We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
>> with NVME RocksDB, used exclusively for RGWs, holding about 60b
>> objects. We are splitting for the same reason as you - improved
>> balance. We also thought long and hard before we began, concerned
>> about impact, stability etc.
>>
>> We set target_max_misplaced_ratio to 0.1% initially, so we could
>> retain some control and stop it again fairly quickly if we weren't
>> happy with the behaviour. It also serves to limit the performance
>> impact on the cluster, but unfortunately it also makes the whole
>> process slower.
>>
>> We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
>> issues with the cluster. We could go higher, but are not in a rush
>> at this point. Sometimes nearfull osd warnings get high and MAX
>> AVAIL on the data pool in `ceph df` gets low enough that we want to
>> interrupt it. So, we set pg_num to whatever the current value is
>> (ceph osd pool ls detail), and let it stabilise. Then the balancer
>> gets to work once the misplaced objects drop below the ratio, and
>> things balance out. Nearfull osds drop usually to zero, and MAX
>> AVAIL goes up again.
>>
>> The above behaviour is because while they share the same threshold
>> setting, the autoscaler only runs every minute, and it won't run
>> when misplaced are over the threshold. Meanwhile, checks for the
>> next PG to split happen much more frequently, so the balancer never
>> wins that race.
>>
>>
>> We didn't know how long to expect it all to take, but decided that
>> any improvement in PG size was worth starting. We now estimate it
>> will take another 2-3 weeks to complete, for a total of 4-5 weeks
>> total.
>>
>> We have lost a drive or two during the process, and of course
>> degraded objects went up, and more backfilling work got going. We
>> paused splits for at least one of those, to make sure the degraded
>> objects were sorted out as quick as possible. We can't be sure it
>> went any faster though - there's always a long tail on that sort of
>> thing.
>>
>> Inconsistent objects are found at least a couple of times a week,
>> and to get them repairing we disable scrubs, wait until they're
>> stopped, then set the repair going and reenable scrubs. I don't know
>> if this is special to the current higher splitting load, but we
>> haven't noticed it before.
>>
>> HTH,
>> Greg.
>>
>>
>> On 10/4/24 14:42, Eugen Block wrote:
>>> Thank you, Janne.
>>> I believe the default 5% target_max_misplaced_ratio would work as
>>> well, we've had good experience with that in the past, without the
>>> autoscaler. I just haven't dealt with such large PGs, I've been
>>> warning them for two years (when the PGs were only almost half this
>>> size) and now they finally started to listen. Well, they would
>>> still ignore it if it wouldn't impact all kinds of things now. ;-)
>>>
>>> Thanks,
>>> Eugen
>>>
>>> Zitat von Janne Johansson :
>>>
 Den tis 9 apr. 2024 kl 10:39 skrev Eugen Block :
> I'm trying to estimate the possible impact when large PGs are
> splitted. Here's one example of such a PG:
>
> PG_STAT  OBJECTS  BYTES OMAP_BYTES*  OMAP_KEYS*  LOG
> DISK_LOG    UP
> 86.3ff    277708  414403098409    0   0  3092
> 3092
> [187,166,122,226,171,234,177,163,155,34,81,239,101,13,117,8,57,111]

 If you ask for small increases of pg_num, it will only split that many
 PGs at a time, so while there will be a lot of data movement, (50% due
 to half of the data needs to go to another newly made PG, and on top
 of that, PGs per OSD will change, but also the balancing can now work
 better) it will not be affecting the whole cluster if you increase
 with 

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-23 Thread Frédéric Nass

Considering 
https://github.com/ceph/ceph/blob/f6edcef6efe209e8947887752bd2b833d0ca13b7/src/osd/OSD.cc#L10086,
 the OSD:

- always sets and updates its per-OSD osd_mclock_max_capacity_iops_{hdd,ssd} 
when the benchmark runs and the measured IOPS is below or equal to 
osd_mclock_iops_capacity_threshold_{hdd,ssd}
but
- doesn't remove osd_mclock_max_capacity_iops_{hdd,ssd} when the measured IOPS 
exceeds osd_mclock_iops_capacity_threshold_{hdd,ssd} (500 for HDD and 80,000 
for SSD) and the current value of osd_mclock_max_capacity_iops_{hdd,ssd} is 
set below its default (315 for HDD and 21500 for SSD)

Thus, a per-OSD osd_mclock_max_capacity_iops_hdd can end up set as low as 
0.145327 (as per Michel's post) and never be updated afterwards, leading to 
performance issues.
The idea of a minimum threshold below which 
osd_mclock_max_capacity_iops_{hdd,ssd} should not be set seems relevant.
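
For reference, a quick way to spot such leftovers on a cluster (osd.29 is just the example from 
Michel's post):

$ ceph config dump | grep osd_mclock_max_capacity_iops          # per-OSD values persisted in the config database
$ ceph config get osd.29 osd_mclock_max_capacity_iops_hdd       # e.g. 0.145327 in Michel's case
$ ceph config get osd osd_mclock_iops_capacity_threshold_hdd    # 500 by default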

CC'ing Sridhar to have his thoughts.

Cheers,
Frédéric.

- Le 22 Mar 24, à 19:37, Kai Stian Olstad ceph+l...@olstad.com a écrit :

> On Fri, Mar 22, 2024 at 06:51:44PM +0100, Frédéric Nass wrote:
>>
>>> The OSD run bench and update osd_mclock_max_capacity_iops_{hdd,ssd} every 
>>> time
>>> the OSD is started.
>>> If you check the OSD log you'll see it does the bench.
>> 
>>Are you sure about the update on every start? Does the update happen only if 
>>the
>>benchmark result is < 500 iops?
>> 
>>Looks like the OSD does not remove any set configuration when the benchmark
>>result is > 500 iops. Otherwise, the extremely low value that Michel reported
>>earlier (less than 1 iops) would have been updated over time.
>>I guess.
> 
> I'm not completely sure, it's been a couple of months since I used mclock, I have
> switched back to wpq because of a nasty bug in mclock that can freeze cluster I/O.
> 
> It could be because I was testing osd_mclock_force_run_benchmark_on_init.
> The OSD had DB on SSD and data on HDD, so the bench measured about 1700 IOPS and
> it was ignored because of the 500 limit.
> So only the SSD got the osd_mclock_max_capacity_iops_ssd set.
> 
> --
> Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass
 
 
 
  
 
> The OSD run bench and update osd_mclock_max_capacity_iops_{hdd,ssd} every 
> time the OSD is started. 
> If you check the OSD log you'll see it does the bench.  
  
Are you sure about the update on every start? Does the update happen only if 
the benchmark result is < 500 iops? 
  
Looks like the OSD does not remove any set configuration when the benchmark 
result is > 500 iops. Otherwise, the extremely low value that Michel reported 
earlier (less than 1 iops) would have been updated over time. 
I guess. 
  
 
 
Frédéric.  

 
 
 
 

-Message original-

De: Kai 
à: Frédéric 
Cc: Michel ; Pierre ; 
ceph-users 
Envoyé: vendredi 22 mars 2024 18:32 CET
Sujet : Re: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed 
for 1 month

On Fri, Mar 22, 2024 at 04:29:21PM +0100, Frédéric Nass wrote: 
>A/ these incredibly low values were calculated a while back with an immature 
>version of the code or under some specific hardware conditions and you can 
>hope this won't happen again 

The OSD run bench and update osd_mclock_max_capacity_iops_{hdd,ssd} every time 
the OSD is started. 
If you check the OSD log you'll see it does the bench. 

-- 
Kai Stian Olstad 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass
 
 
 
 
 
Michel, 
  
Log says that osd.29 is providing 2792 '4k' iops at 10.910 MiB/s. These figures 
suggest that a controller write-back cache is in use along the IO path. Is that 
right? 
  
Since 2792 is above 500, osd_mclock_max_capacity_iops_hdd falls back to 315 and 
the OSD suggests running a benchmark and setting 
osd_mclock_max_capacity_iops_[hdd|ssd] accordingly. 
Removing any per-OSD osd_mclock_max_capacity_iops_hdd, restarting all 
concerned OSDs, and checking that no osd_mclock_max_capacity_iops_hdd is set 
anymore should be enough for the time being. 
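
Something along these lines (a sketch, using osd.29 as the example; repeat for each OSD that has an 
override):

$ ceph config rm osd.29 osd_mclock_max_capacity_iops_hdd
$ ceph orch daemon restart osd.29
$ ceph config dump | grep osd_mclock_max_capacity_iops_hdd    # should eventually come back empty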
  
Not sure why these OSDs had such bad performance in the past. Maybe a 
controller firmware issue at that time. 
  
Regarding the write-back cache, be careful not to set 
osd_mclock_max_capacity_iops_hdd too high, as OSDs may not always benefit from 
the controller's write-back cache, especially during large IO workloads filling 
up the cache, or should this cache be disabled due to the controller's battery 
becoming defective. 
  
I'll be interested in what you decide for osd_mclock_max_capacity_iops_hdd in 
such configuration. 
  
Cheers, 
Frédéric.

 
 
 
 

-Message original-

De: Michel 
à: ceph-users 
Envoyé: vendredi 22 mars 2024 17:20 CET
Sujet : [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 
month

Hi, 

The attempt to rerun the bench was not really a success. I got the 
following messages: 

- 

Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: osd.29 83873 
maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth 
(MiB/sec): 10.910 iops: 2792.876 elapsed_sec: 1.074 
Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: log_channel(cluster) log 
[WRN] : OSD bench result of 2792.876456 IOPS exceeded the threshold 
limit of 500.00 IOPS for osd.29. IOPS capacity is unchanged at 
0.00 IOPS. The recommendation is to establish the osd's IOPS 
capacity using other benchmark tools (e.g. Fio) and then override 
osd_mclock_max_capacity_iops_[hdd|ssd]. 
- 

I decided as a first step to raise the osd_mclock_max_capacity_iops_hdd 
for the suspect OSD to 50. It was magic! I already managed to get 16 
over 17 scrubs/deep scrubs to be run and the last one is in progress. 

I now have to understand why this OSD had such bad perfs that 
osd_mclock_max_capacity_iops_hdd was set to such a low value... I have 
12 OSDs with an entry for their osd_mclock_max_capacity_iops_hdd and 
they are mostly on one server (with 2 OSDs on another one). I suspect 
there was a problem on these servers at some points. It is unclear why 
it is not enough to just rerun the benchmark and why a crazy value for 
an HDD is found... 

Best regards, 

Michel 

Le 22/03/2024 à 14:44, Michel Jouvin a écrit : 
> Hi Frédéric, 
> 
> I think you raise the right point, sorry if I misunderstood Pierre's 
> suggestion to look at OSD performances. Just before reading your 
> email, I was implementing Pierre's suggestion for max_osd_scrubs and I 
> saw the osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those 
> with a value different from the default). For the suspect OSD, the 
> value is very low, 0.145327, and I suspect it is the cause of the 
> problem. A few others have a value ~5 which I find also very low (all 
> OSDs are using the same recent HW/HDD). 
> 
> Thanks for these informations. I'll follow your suggestions to rerun 
> the benchmark and report if it improved the situation. 
> 
> Best regards, 
> 
> Michel 
> 
> Le 22/03/2024 à 12:18, Frédéric Nass a écrit : 
>> Hello Michel, 
>> 
>> Pierre also suggested checking the performance of this OSD's 
>> device(s) which can be done by running a ceph tell osd.x bench. 
>> 
>> One thing I can think of is how the scrubbing speed of this very OSD 
>> could be influenced by mclock scheduling, should the max iops capacity 
>> calculated by this OSD during its initialization be significantly 
>> lower than other OSDs'. 
>> 
>> What I would do is check (from this OSD's log) the calculated value 
>> for max iops capacity and compare it to other OSDs. If needed, force 
>> a recalculation by setting 'ceph config set osd.x 
>> osd_mclock_force_run_benchmark_on_init true' and restart this OSD. 
>> 
>> Also I would: 
>> 
>> - compare running OSD's mclock values (cephadm shell ceph daemon 
>> osd.x config show | grep mclock) to other OSDs's. 
>> - compare ceph tell osd.x bench to other OSDs's benchmarks. 
>> - compare the rotational status of this OSD's db and data devices to 
>> other OSDs, to make sure things are in order. 
>> 
>> Bests, 
>> Frédéric. 
>> 
>> PS: If mclock is the culprit here, then setting osd_op_queue back to 
>> wpq for this OSD only would probably reveal it. Not sure about the 
>> implication of having a single OSD running a diff

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass
 
 
 
Michel, 
  
Glad to know that was it. 
  
I was wondering when a per-OSD osd_mclock_max_capacity_iops_hdd value would be 
set in the cluster's config database, since I don't have any set in my lab. 
Turns out the per-OSD osd_mclock_max_capacity_iops_hdd is only set when the 
calculated value is below osd_mclock_iops_capacity_threshold_hdd; otherwise the 
OSD uses the default value of 315. 
  
Probably to rule out any insanely high calculated values. Would have been nice 
to also rule out any insanely low measured values. :-) 
  
Now either: 
  
A/ these incredibly low values were calculated a while back with an immature 
version of the code or under some specific hardware conditions, and you can hope 
this won't happen again 
  
OR 
  
B/ you don't want to rely on hope too much and you'd rather disable the 
automatic calculation (osd_mclock_skip_benchmark = true) and set 
osd_mclock_max_capacity_iops_[hdd,ssd] yourself (globally or using a 
rack/host mask) after a precise evaluation of the performance of your OSDs. 
  
B/ would be more deterministic :-) 
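
A sketch of B/ (the values below are only placeholders to replace with what your own fio benchmarks 
report):

$ ceph config set osd osd_mclock_skip_benchmark true
$ ceph config set osd/class:hdd osd_mclock_max_capacity_iops_hdd 315
$ ceph config set osd/class:ssd osd_mclock_max_capacity_iops_ssd 21500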
  
Cheers, 
Frédéric.   
 
 
 
 
 

-Message original-

De: Michel 
à: Frédéric 
Cc: Pierre ; ceph-users 
Envoyé: vendredi 22 mars 2024 14:44 CET
Sujet : Re: [ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed 
for 1 month

Hi Frédéric, 

I think you raise the right point, sorry if I misunderstood Pierre's 
suggestion to look at OSD performances. Just before reading your email, 
I was implementing Pierre's suggestion for max_osd_scrubs and I saw the 
osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those with a 
value different from the default). For the suspect OSD, the value is 
very low, 0.145327, and I suspect it is the cause of the problem. A few 
others have a value ~5 which I find also very low (all OSDs are using 
the same recent HW/HDD). 

Thanks for these informations. I'll follow your suggestions to rerun the 
benchmark and report if it improved the situation. 

Best regards, 

Michel 

Le 22/03/2024 à 12:18, Frédéric Nass a écrit : 
> Hello Michel, 
> 
> Pierre also suggested checking the performance of this OSD's device(s) which 
> can be done by running a ceph tell osd.x bench. 
> 
> One thing I can think of is how the scrubbing speed of this very OSD could be 
> influenced by mclock scheduling, should the max iops capacity calculated by 
> this OSD during its initialization be significantly lower than other OSDs'. 
> 
> What I would do is check (from this OSD's log) the calculated value for max 
> iops capacity and compare it to other OSDs. If needed, force a recalculation 
> by setting 'ceph config set osd.x osd_mclock_force_run_benchmark_on_init 
> true' and restart this OSD. 
> 
> Also I would: 
> 
> - compare running OSD's mclock values (cephadm shell ceph daemon osd.x config 
> show | grep mclock) to other OSDs's. 
> - compare ceph tell osd.x bench to other OSDs's benchmarks. 
> - compare the rotational status of this OSD's db and data devices to other 
> OSDs, to make sure things are in order. 
> 
> Bests, 
> Frédéric. 
> 
> PS: If mclock is the culprit here, then setting osd_op_queue back to wpq for 
> this OSD only would probably reveal it. Not sure about the implication of 
> having a single OSD running a different scheduler in the cluster though. 
> 
> 
> - Le 22 Mar 24, à 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr a 
> écrit : 
> 
>> Pierre, 
>> 
>> Yes, as mentioned in my initial email, I checked the OSD state and found 
>> nothing wrong either in the OSD logs or in the system logs (SMART errors). 
>> 
>> Thanks for the advice of increasing osd_max_scrubs, I may try it, but I 
>> doubt it is a contention problem because it really only affects a fixed 
>> set of PGs (no new PGS have a "stucked scrub") and there is a 
>> significant scrubbing activity going on continuously (~10K PGs in the 
>> cluster). 
>> 
>> Again, it is not a problem for me to try to kick out the suspect OSD and 
>> see it fixes the issue but as this cluster is pretty simple/low in terms 
>> of activity and I see nothing that may explain why we have this 
>> situation on a pretty new cluster (9 months, created in Quincy) and not 
>> on our 2 other production clusters, much more used, one of them being 
>> the backend storage of a significant OpenStack clouds, a cluster created 
>> 10 years ago with Infernalis and upgraded since then, a better candidate 
>> for this kind of problems! So, I'm happy to contribute to 
>> troubleshooting a potential issue in Reef if somebody finds it useful 
>> and can help. Else I'll try the approach that worked for Gunnar. 
>> 
>> Best regards, 
>> 
>> Michel 
>> 
>> Le 22/03/2024 à 09:59, Pierre Riteau a écrit : 

[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Frédéric Nass

Hello Michel,

Pierre also suggested checking the performance of this OSD's device(s) which 
can be done by running a ceph tell osd.x bench.

One thing I can think of is how the scrubbing speed of this very OSD could be 
influenced by mclock scheduling, should the max iops capacity calculated by this 
OSD during its initialization be significantly lower than other OSDs'.

What I would do is check (from this OSD's log) the calculated value for max 
iops capacity and compare it to other OSDs. If needed, force a recalculation by 
setting 'ceph config set osd.x osd_mclock_force_run_benchmark_on_init true' and 
restart this OSD.

Also I would:

- compare running OSD's mclock values (cephadm shell ceph daemon osd.x config 
show | grep mclock) to other OSDs's.
- compare ceph tell osd.x bench to other OSDs's benchmarks.
- compare the rotational status of this OSD's db and data devices to other 
OSDs, to make sure things are in order.
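
For instance (using osd.29, the OSD that later turned out to be the suspect one in this thread, as 
the example):

$ cephadm shell ceph daemon osd.29 config show | grep mclock
$ ceph tell osd.29 bench
$ ceph osd metadata 29 | grep rotational    # rotational flags of the data and db devices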

Bests,
Frédéric.

PS: If mclock is the culprit here, then setting osd_op_queue back to wpq for 
this OSD only would probably reveal it. Not sure about the implication of 
having a single OSD running a different scheduler in the cluster though.


- Le 22 Mar 24, à 10:11, Michel Jouvin michel.jou...@ijclab.in2p3.fr a 
écrit :

> Pierre,
> 
> Yes, as mentioned in my initial email, I checked the OSD state and found
> nothing wrong either in the OSD logs or in the system logs (SMART errors).
> 
> Thanks for the advice of increasing osd_max_scrubs, I may try it, but I
> doubt it is a contention problem because it really only affects a fixed
> set of PGs (no new PGS have a "stucked scrub") and there is a
> significant scrubbing activity going on continuously (~10K PGs in the
> cluster).
> 
> Again, it is not a problem for me to try to kick out the suspect OSD and
> see it fixes the issue but as this cluster is pretty simple/low in terms
> of activity and I see nothing that may explain why we have this
> situation on a pretty new cluster (9 months, created in Quincy) and not
> on our 2 other production clusters, much more used, one of them being
> the backend storage of a significant OpenStack clouds, a cluster created
> 10 years ago with Infernalis and upgraded since then, a better candidate
> for this kind of problems! So, I'm happy to contribute to
> troubleshooting a potential issue in Reef if somebody finds it useful
> and can help. Else I'll try the approach that worked for Gunnar.
> 
> Best regards,
> 
> Michel
> 
> Le 22/03/2024 à 09:59, Pierre Riteau a écrit :
>> Hello Michel,
>>
>> It might be worth mentioning that the next releases of Reef and Quincy
>> should increase the default value of osd_max_scrubs from 1 to 3. See
>> the Reef pull request: https://github.com/ceph/ceph/pull/55173
>> You could try increasing this configuration setting if you
>> haven't already, but note that it can impact client I/O performance.
>>
>> Also, if the delays appear to be related to a single OSD, have you
>> checked the health and performance of this device?
>>
>> On Fri, 22 Mar 2024 at 09:29, Michel Jouvin
>>  wrote:
>>
>> Hi,
>>
>> As I said in my initial message, I'd in mind to do exactly the
>> same as I
>> identified in my initial analysis that all the PGs with this problem
>> where sharing one OSD (but only 20 PGs had the problem over ~200
>> hosted
>> by the OSD). But as I don't feel I'm in an urgent situation, I was
>> wondering if collecting more information on the problem may have some
>> value and which one... If it helps, I add below the `pg dump` for
>> the 17
>> PGs still with a "stucked scrub".
>>
>> I observed the "stucked scrubs" is lowering very slowly. In the
>> last 12
>> hours, 1 more PG was successfully scrubbed/deep scrubbed. In case
>> it was
>> not clear in my initial message, the lists of PGs with a too old
>> scrub
>> and too old deep scrub are the same.
>>
>> Without an answer, next week i may consider doing what you did:
>> remove
>> the suspect OSD (instead of just restarting it) and see it
>> unblocks the
>> stucked scrubs.
>>
>> Best regards,
>>
>> Michel
>>
>> - "ceph pg dump pgs" for the 17
>> PGs with
>> a too old scrub and deep scrub (same list)
>> 
>>
>> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED UNFOUND
>> BYTES    OMAP_BYTES*  OMAP_KEYS*  LOG    LOG_DUPS DISK_LOG  STATE
>> STATE_STAMP  VERSION   REPORTED
>> UP UP_PRIMARY  ACTING ACTING_PRIMARY
>> LAST_SCRUB    SCRUB_STAMP LAST_DEEP_SCRUB
>> DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION
>> SCRUB_SCHEDULING OBJECTS_SCRUBBED  OBJECTS_TRIMMED
>> 29.7e3   260   0 0  0 0
>> 1090519040    0   0   1978   500
>> 1978 

[ceph-users] Leaked clone objects

2024-03-19 Thread Frédéric Nass

 
 
  
Hello, 
  
Over the last few weeks, we have observed an abnormal increase of a pool's data 
usage (by a factor of 2). It turns out that we are hit by this bug [1]. 
  
In short, if you happened to take pool snapshots and removed them by using the 
following command 
  
'ceph osd pool rmsnap {pool-name} {snap-name}' 
  
instead of using this command 
  
'rados -p {pool-name} rmsnap {snap-name}' 
  
then you may have leaked clone objects (not trimmed) in your cluster, occupying 
space that you can't reclaim. 
  
You may have such leaked objects if (not exclusively): 
  
- 'rados df' reports CLONES for pools with no snapshots 
- 'rgw-orphan-list' (if RGW pools) reports objects that you can't 'stat' but 
for which 'listsnaps' shows a cloneid. 
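 
For example (a sketch; replace {pool-name} and {object-name} with yours): 
 
$ rados df | grep -e CLONES -e {pool-name}        # a pool without snapshots should report 0 clones
$ rados -p {pool-name} stat {object-name}         # fails on such an orphaned object
$ rados -p {pool-name} listsnaps {object-name}    # while listsnaps still shows a cloneid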
  
'ceph osd pool force-remove-snap {pool-name}' should have the OSDs re-trim 
these leaked clone objects when [2] makes it to quincy, reef, squid (and 
hopefully pacific). 
  
Hope this helps, 
  
Regards, 
Frédéric. 
  
[1] https://tracker.ceph.com/issues/64646 
[2] https://github.com/ceph/ceph/pull/53545  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-03-16 Thread Frédéric Nass
 
  
Hello Van Diep, 
  
 
I read this after you got out of trouble. 
  
According to your ceph osd tree, it looks like your problems started when the 
ceph orchestrator created osd.29 on node 'cephgw03' because it looks very 
unlikely that you created a 100MB OSD on a node that's named after "GW". 
  
You may have added the 'osds' label to the 'cephgw03' node and/or played with 
the service_type:osd and/or added the cephgw03 node to the crushmap, which 
triggered the creation of osd.29 by the orchestrator. 
The cephgw03 node being part of the 'default' root bucket, other OSDs legitimately 
started to send objects to osd.29, way too small to accommodate them, with PGs then 
becoming 'backfill_toofull'. 
  
To get out of this situation, you could have: 
  
$ ceph osd crush add-bucket closet root 
$ ceph osd crush move cephgw03 root=closet 
  
This would have moved 'cephgw03' node out of the 'default' root and probably 
fixed your problem instantly.  
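 
A quick check afterwards (a sketch): 
 
$ ceph osd tree | grep -A2 closet    # cephgw03 and its osd.29 should now sit under the new root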
 
Regards,  
 
Frédéric.  

   

-Message original-

De: Anthony 
à: nguyenvandiep 
Cc: ceph-users 
Envoyé: samedi 24 février 2024 16:24 CET
Sujet : [ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

There ya go. 

You have 4 hosts, one of which appears to be down and have a single OSD that is 
so small as to not be useful. Whatever cephgw03 is, it looks like a mistake. 
OSDs much smaller than, say, 1TB often aren’t very useful. 

Your pools appear to be replicated, size=3. 

So each of your cephosd* hosts stores one replica of each RADOS object. 

You added the 10TB spinners to only two of your hosts, which means that they’re 
only being used as though they were 4TB OSDs. That’s part of what’s going on. 

You want to add a 10TB spinner to cephosd02. That will help your situation 
significantly. 

After that, consider adding a cephosd04 host. Having at least one more failure 
domain than replicas lets you better use uneven host capacities. 




> On Feb 24, 2024, at 10:06 AM, nguyenvand...@baoviet.com.vn wrote: 
> 
> Hi Mr Anthony, 
> 
> pls check the output 
> 
> https://anotepad.com/notes/s7nykdmc 
> ___ 
> ceph-users mailing list -- ceph-users@ceph.io 
> To unsubscribe send an email to ceph-users-le...@ceph.io 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How does mclock work?

2024-01-16 Thread Frédéric Nass

Sridhar, 
  
Thanks a lot for this explanation. It's clearer now. 
  
So at the end of the day (at least with balanced profile) it's a lower bound 
and no upper limit and a balanced distribution between client and cluster IOPS. 
   
 
Regards, 
Frédéric.  

   

-Message original-

De: Sridhar 
à: Frédéric 
Cc: ceph-users 
Envoyé: mercredi 10 janvier 2024 08:15 CET
Sujet : Re: [ceph-users] How does mclock work?

  
Hello Frédéric, 
  
Please see answers below. 
    
Could someone please explain how mclock works regarding reads and writes? Does 
mclock intervene on both read and write iops? Or only on reads or only on 
writes?  
  
mClock schedules both read and write ops. 
    
And what type of underlying hardware performance is calculated and considered 
by mclock? Seems to be only write performance.  
  
Random write performance is considered for setting the maximum IOPS capacity of 
an OSD. This along with the sequential bandwidth 
capability of the OSD is used to calculate the cost per IO that is internally 
used by mClock for scheduling Ops. In addition, the mClock 
profiles use the capacity information to allocate reservation and limit for 
different classes of service (for e.g., client, background-recovery, 
scrub, snaptrim etc.). 
  
The write performance is used to set a lower bound on the amount of bandwidth 
to be allocated for different classes of services. For e.g., 
the 'balanced' profile allocates 50% of the OSD's IOPS capacity to client ops. 
In other words, a minimum guarantee of 50% of the OSD's 
bandwidth is allocated to client ops (read or write). If you look at the 
'balanced' profile, there is no upper limit set for client ops (i.e. set to 
MAX) which means that reads can potentially use the maximum possible bandwidth 
(i.e., not constrained by max IOPS capacity) if there 
are no other competing ops.  
  
Please see 
https://docs.ceph.com/en/reef/rados/configuration/mclock-config-ref/#built-in-profiles
 for more information about mClock profiles. 
    
The mclock documentation shows HDDs and SSDs specific configuration options 
(capacity and sequential bandwidth) but nothing regarding hybrid setups and 
these configuration options do not distinguish reads and writes. But read and 
write performance are often not on par for a single drive and even less when 
using hybrid setups. 
  
With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if mclock 
only considers write performance, it may fail to properly schedule read iops 
(does mclock schedule read iops?) as the calculated iops capacity would be way 
too high for reads. 
  
With HDD only setups (RocksDB+WAL+Data on HDD), if mclock only considers write 
performance, the OSD may not take advantage of higher read performance. 
  
Can someone please shed some light on this?  
  
As mentioned above, as long as there are no competing ops, the mClock profiles 
ensure that there is nothing constraining client 
ops from using the full available bandwidth of an OSD for both reads and writes 
regardless of the type of setup (hybrid, HDD, 
SSD) being employed. The important aspect is to ensure that the set IOPS 
capacity for the OSD reflects a fairly accurate 
representation of the underlying device capability. This is because the 
reservation criteria based on the IOPS capacity helps 
maintain an acceptable level of performance with other active competing ops. 
  
You could run some synthetic benchmarks to ensure that read and write 
performance are along expected lines with the 
default mClock profile to confirm the above. 
  
I hope this helps. 
  
-Sridhar
     
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-12 Thread Frédéric Nass

Samuel, 
  
Hard to tell for sure since this bug hit different major versions of the 
kernel, at least RHEL's from what I know. The only way to tell is to check for 
num_cgroups in /proc/cgroups:

 
 
$ cat /proc/cgroups | grep -e subsys -e blkio | column -t 
   #subsys_name  hierarchy  num_cgroups  enabled 
   blkio         4          1099         1  
Otherwise, you'd have to check the sources of the kernel you're using against 
the patch that fixed this bug. Unfortunately, I can't spot the upstream patch 
that fixed this issue since RH BZs related to this bug are private. Maybe 
someone here can spot it. 
   
 
Regards, 
Frédéric.  

  

-Message original-

De: huxiaoyu 
à: Frédéric 
Cc: ceph-users 
Envoyé: vendredi 12 janvier 2024 09:25 CET
Sujet : Re: Re: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

 
Dear Frederic, 
  
Thanks a lot for the suggestions. We are using the vanilla Linux 4.19 LTS 
version. Do you think we may be suffering from the same bug? 
  
best regards, 
  
Samuel 
  
huxia...@horebdata.cn 
 
From: Frédéric Nass 
Date: 2024-01-12 09:19 
To: huxiaoyu 
CC: ceph-users 
Subject: Re: [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak? 
 
Hello, 
 
We've had a similar situation recently where OSDs would use way more memory than 
osd_memory_target and get OOM killed by the kernel. It was due to a kernel bug 
related to cgroups [1]. 
 
If num_cgroups below keeps increasing then you may hit this bug.
 
  
$ cat /proc/cgroups | grep -e subsys -e blkio | column -t 
   #subsys_name  hierarchy  num_cgroups  enabled 
   blkio         4          1099         1 
  
If you hit this bug, upgrading OSDs nodes kernels should get you through. If 
you can't access the Red Hat KB [1], let me know your current nodes kernel 
version and I'll check for you. 
  Regards,
Frédéric. 
 
 
  
[1] https://access.redhat.com/solutions/7014337 
De: huxiaoyu 
à: ceph-users 
Envoyé: mercredi 10 janvier 2024 19:21 CET
Sujet : [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

Dear Ceph folks, 

I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version, one 
with replication 3, and the other with EC 4+2. After around 400 days running 
quietly and smoothly, the two clusters recently ran into similar problems: 
some OSDs consume ca. 18 GB while the memory target is set at 2 GB. 

What could be going wrong in the background? Does it mean there are slow OSD memory leak 
issues with 14.2.22 which I do not know about yet? 

I would highly appreciate it if someone provides any clues, ideas, or comments 
.. 

best regards, 

Samuel 



huxia...@horebdata.cn 
___ 
ceph-users mailing list -- ceph-users@ceph.io 
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 DC with 4+5 EC not quite working

2024-01-12 Thread Frédéric Nass

Hello Torkil, 
  
We're using the same EC scheme as yours, with k=5 and m=4 over 3 DCs, with the 
rule below: 
  
 
rule ec54 { 
        id 3 
        type erasure 
        min_size 3 
        max_size 9 
        step set_chooseleaf_tries 5 
        step set_choose_tries 100 
        step take default class hdd 
        step choose indep 0 type datacenter 
        step chooseleaf indep 3 type host 
        step emit 
} 
  
Works fine. The only difference I see with your EC rule is the fact that we set 
min_size and max_size but I doubt this has anything to do with your situation. 
  
Since the cluster still complains about "Pool cephfs.hdd.data has 1024 
placement groups, should have 2048", did you run "ceph osd pool set 
cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set 
cephfs.hdd.data pg_num 2048"? [1] 
  
Might be that the pool still has 1024 PGs. 
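 
You can verify with something like: 
 
$ ceph osd pool get cephfs.hdd.data pg_num
$ ceph osd pool get cephfs.hdd.data pgp_num
$ ceph osd pool ls detail | grep cephfs.hdd.data    # should also show the pg_num/pgp_num targets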
    
 
Regards,
Frédéric. 
  
[1] 
https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups
  

   

-Message original-

De: Torkil 
à: ceph-users 
Cc: Ruben 
Envoyé: vendredi 12 janvier 2024 09:00 CET
Sujet : [ceph-users] 3 DC with 4+5 EC not quite working

We are looking to create a 3 datacenter 4+5 erasure coded pool but can't 
quite get it to work. Ceph version 17.2.7. These are the hosts (there 
will eventually be 6 hdd hosts in each datacenter): 

-33 886.00842 datacenter 714 
-7 209.93135 host ceph-hdd1 

-69 69.86389 host ceph-flash1 
-6 188.09579 host ceph-hdd2 

-3 233.57649 host ceph-hdd3 

-12 184.54091 host ceph-hdd4 
-34 824.47168 datacenter DCN 
-73 69.86389 host ceph-flash2 
-2 201.78067 host ceph-hdd5 

-81 288.26501 host ceph-hdd6 

-31 264.56207 host ceph-hdd7 

-36 1284.48621 datacenter TBA 
-77 69.86389 host ceph-flash3 
-21 190.83224 host ceph-hdd8 

-29 199.08838 host ceph-hdd9 

-11 193.85382 host ceph-hdd10 

-9 237.28154 host ceph-hdd11 

-26 187.19536 host ceph-hdd12 

-4 206.37102 host ceph-hdd13 

We did this: 

ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd 
plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default 
crush-failure-domain=datacenter crush-device-class=hdd 

ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd 
ceph osd pool set cephfs.hdd.data allow_ec_overwrites true 
ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn 

Didn't quite work: 

" 
[WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg 
incomplete 
pg 33.0 is creating+incomplete, acting 
[104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool 
cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for 
'incomplete') 
" 

I then manually changed the crush rule from this: 

" 
rule cephfs.hdd.data { 
id 7 
type erasure 
step set_chooseleaf_tries 5 
step set_choose_tries 100 
step take default class hdd 
step chooseleaf indep 0 type datacenter 
step emit 
} 
" 

To this: 

" 
rule cephfs.hdd.data { 
id 7 
type erasure 
step set_chooseleaf_tries 5 
step set_choose_tries 100 
step take default class hdd 
step choose indep 0 type datacenter 
step chooseleaf indep 3 type host 
step emit 
} 
" 

Based on some testing and dialogue I had with Red Hat support last year 
when we were on RHCS, and it seemed to work. Then: 

ceph fs add_data_pool cephfs cephfs.hdd.data 
ceph fs subvolumegroup create hdd --pool_layout cephfs.hdd.data 

I started copying data to the subvolume and increased pg_num a couple of 
times: 

ceph osd pool set cephfs.hdd.data pg_num 256 
ceph osd pool set cephfs.hdd.data pg_num 2048 

But at some point it failed to activate new PGs eventually leading to this: 

" 
[WARN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs 
mds.cephfs.ceph-flash1.agdajf(mds.0): 64 slow metadata IOs are 
blocked > 30 secs, oldest blocked for 25455 secs 
[WARN] MDS_TRIM: 1 MDSs behind on trimming 
mds.cephfs.ceph-flash1.agdajf(mds.0): Behind on trimming 
(997/128) max_segments: 128, num_segments: 997 
[WARN] PG_AVAILABILITY: Reduced data availability: 5 pgs inactive 
pg 33.6f6 is stuck inactive for 8h, current state 
activating+remapped, last acting [50,79,116,299,98,219,164,124,421] 
pg 33.6fa is stuck inactive for 11h, current state 
activating+undersized+degraded+remapped, last acting 
[17,408,NONE,196,223,290,73,39,11] 
pg 33.705 is stuck inactive for 11h, current state 
activating+undersized+degraded+remapped, last acting 
[33,273,71,NONE,411,96,28,7,161] 
pg 33.721 is stuck inactive for 7h, current state 
activating+remapped, last acting [283,150,209,423,103,325,118,142,87] 
pg 33.726 is stuck inactive for 11h, current state 
activating+undersized+degraded+remapped, last acting 
[234,NONE,416,121,54,141,277,265,19] 
[WARN] PG_DEGRADED: Degraded data redundancy: 1818/1282640036 objects 
degraded (0.000%), 3 pgs degraded, 3 pgs undersized 
pg 33.6fa is stuck undersized for 7h, current state 
activating+undersized+degraded+remapped, last acting 
[17,408,NONE,196,223,290,73,39,11] 
pg 33.705 is stuck undersized for 

[ceph-users] Re: Ceph Nautilous 14.2.22 slow OSD memory leak?

2024-01-12 Thread Frédéric Nass

Hello, 
  
We've had a similar situation recently where OSDs would use way more memory 
than osd_memory_target and get OOM killed by the kernel. 
It was due to a kernel bug related to cgroups [1]. 
  
If num_cgroups below keeps increasing then you may hit this bug.
 
  
$ cat /proc/cgroups | grep -e subsys -e blkio | column -t 
   #subsys_name  hierarchy  num_cgroups  enabled 
   blkio         4          1099         1 
  
If you hit this bug, upgrading OSDs nodes kernels should get you through. If 
you can't access the Red Hat KB [1], let me know your current nodes kernel 
version and I'll check for you. 
  Regards,
Frédéric.  
 
  
[1] https://access.redhat.com/solutions/7014337 

-Message original-

De: huxiaoyu 
à: ceph-users 
Envoyé: mercredi 10 janvier 2024 19:21 CET
Sujet : [ceph-users] Ceph Nautilous 14.2.22 slow OSD memory leak?

Dear Ceph folks, 

I am responsible for two Ceph clusters, running the Nautilus 14.2.22 version, one 
with replication 3, and the other with EC 4+2. After around 400 days running 
quietly and smoothly, the two clusters recently ran into similar problems: 
some OSDs consume ca. 18 GB while the memory target is set at 2 GB. 

What could be going wrong in the background? Does it mean there are slow OSD memory leak 
issues with 14.2.22 which I do not know about yet? 

I would highly appreciate it if someone provides any clues, ideas, or comments 
.. 

best regards, 

Samuel 



huxia...@horebdata.cn 
___ 
ceph-users mailing list -- ceph-users@ceph.io 
To unsubscribe send an email to ceph-users-le...@ceph.io   
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How does mclock work?

2024-01-09 Thread Frédéric Nass

  
 
Hello, 
  
Could someone please explain how mclock works regarding reads and writes? Does 
mclock intervene on both read and write iops? Or only on reads or only on 
writes? 
  
And what type of underlying hardware performance is calculated and considered 
by mclock? Seems to be only write performance. 
  
The mclock documentation shows HDDs and SSDs specific configuration options 
(capacity and sequential bandwidth) but nothing regarding hybrid setups and 
these configuration options do not distinguish reads and writes. But read and 
write performance are often not on par for a single drive and even less when 
using hybrid setups. 
  
With hybrid setups (RocksDB+WAL on SSDs or NVMes and Data on HDD), if mclock 
only considers write performance, it may fail to properly schedule read iops 
(does mclock schedule read iops?) as the calculated iops capacity would be way 
too high for reads. 
  
With HDD only setups (RocksDB+WAL+Data on HDD), if mclock only considers write 
performance, the OSD may not take advantage of higher read performance. 
  
Can someone please shed some light on this? 
  
Best regards,   
 
Frédéric Nass 

Sous-direction Infrastructures et Services
Direction du Numérique 
Université de Lorraine
Tél : +33 3 72 74 11 35  
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation

2023-09-01 Thread Frédéric Nass
Hello, 

This message is to inform you that DELL has released new firmwares for these SSD 
drives to fix the 70,000 POH issue: 

[ 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f=w12r2=poweredge-r730xd
 | Toshiba A3B4 for model number(s) PX02SMF020, PX02SMF040, PX02SMF080 and 
PX02SMB160. ] 
[ 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=31jmh=rt
 | Toshiba A4B4 for model number(s) PX02SSF010, PX02SSF020, PX02SSF040 and 
PX02SSB080. ] [ 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=69j5f=w12r2=poweredge-r730xd
 ] 
[ 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=tc8kc=rt
 | Toshiba A5B4 for model number(s) PX03SNF020, PX03SNF080 and PX03SNB160. ] 

Based on our recent experience, this firmware gets dead SSD drives back to life 
with their data (after the upgrade, you may need to import foreign config by 
pressing 'F' key on the next start) 

Many thanks to DELL French TAMs and DELL engineering for providing this 
firmware in a short time. 

Best regards, 
Frédéric. 

- Le 19 Juin 23, à 10:46, Frédéric Nass  a 
écrit : 

> Hello,

> This message does not concern Ceph itself but a hardware vulnerability which 
> can
> lead to permanent loss of data on a Ceph cluster equipped with the same
> hardware in separate fault domains.

> The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD 
> drives
> of the 13G generation of DELL servers are subject to a vulnerability which
> renders them unusable after 70,000 hours of operation, i.e. approximately 7
> years and 11 months of activity.

> This topic has been discussed here:
> https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438

> The risk is all the greater since these disks may die at the same time in the
> same server leading to the loss of all data in the server.

> To date, DELL has not provided any firmware fixing this vulnerability, the
> latest firmware version being "A3B3" released on Sept. 12, 2016:
> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k

> If you have servers running these drives, check their uptime. If they are 
> close
> to the 70,000 hour limit, replace them immediately.

> The smartctl tool does not report the uptime for these SSDs, but if you have
> HDDs in the server, you can query their SMART status and get their uptime,
> which should be about the same as the SSDs.
> The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the
> drive's SCSI device ID on the MegaRAID controller).

> We have informed DELL about this but have no information yet on the arrival 
> of a
> fix.

> We have lost 6 disks, in 3 different servers, in the last few weeks. Our
> observation shows that the drives don't survive full shutdown and restart of
> the machine (power off then power on in iDrac), but they may also die during a
> single reboot (init 6) or even while the machine is running.

> Fujitsu released a corrective firmware in June 2021 but this firmware is most
> certainly not applicable to DELL drives:
> https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf

> Regards,
> Frederic

> Sous-direction Infrastructure and Services
> Direction du Numérique
> Université de Lorraine
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation

2023-06-19 Thread Frédéric Nass

Hello, 

This message does not concern Ceph itself but a hardware vulnerability which 
can lead to permanent loss of data on a Ceph cluster equipped with the same 
hardware in separate fault domains. 

The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives 
of the 13G generation of DELL servers are subject to a vulnerability which 
renders them unusable after 70,000 hours of operation, i.e. approximately 7 
years and 11 months of activity. 

This topic has been discussed here: 
https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438
 

The risk is all the greater since these disks may die at the same time in the 
same server leading to the loss of all data in the server. 

To date, DELL has not provided any firmware fixing this vulnerability, the 
latest firmware version being "A3B3" released on Sept. 12, 2016: 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k 

If you have servers running these drives, check their uptime. If they are 
close to the 70,000 hour limit, replace them immediately. 

The smartctl tool does not report the uptime for these SSDs, but if you have 
HDDs in the server, you can query their SMART status and get their uptime, 
which should be about the same as the SSDs. 
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the 
drive's SCSI device ID on the MegaRAID controller). 

We have informed DELL about this but have no information yet on the arrival of 
a fix. 

We have lost 6 disks, in 3 different servers, in the last few weeks. Our 
observation shows that the drives don't survive full shutdown and restart of 
the machine (power off then power on in iDrac), but they may also die during a 
single reboot (init 6) or even while the machine is running. 

Fujitsu released a corrective firmware in June 2021 but this firmware is most 
certainly not applicable to DELL drives: 
https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf 

Regards, 
Frederic 

Sous-direction Infrastructure and Services 
Direction du Numérique 
Université de Lorraine 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crushmap rule for multi-datacenter erasure coding

2023-04-05 Thread Frédéric Nass
Hello Michel,

What you need is:

step choose indep 0 type datacenter
step chooseleaf indep 2 type host
step emit

I think you're right about the need to tweak the crush rule by editing the 
crushmap directly.
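
The usual round trip looks like this (file names are arbitrary):

$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
# edit the rule in crushmap.txt, then recompile and inject it:
$ crushtool -c crushmap.txt -o crushmap.new.bin
$ ceph osd setcrushmap -i crushmap.new.bin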

Regards
Frédéric.

- Le 3 Avr 23, à 18:34, Michel Jouvin michel.jou...@ijclab.in2p3.fr a écrit 
:

> Hi,
> 
> We have a 3-site Ceph cluster and would like to create a 4+2 EC pool
> with 2 chunks per datacenter, to maximise the resilience in case of 1
> datacenter being down. I have not found a way to create an EC profile
> with this 2-level allocation strategy. I created an EC profile with a
> failure domain = datacenter but it doesn't work as, I guess, it would
> like to ensure it has always 5 OSDs up (to ensure that the pools remains
> R/W) where with a failure domain = datacenter, the guarantee is only 4.
> My idea was to create a 2-step allocation and a failure domain=host to
> achieve our desired configuration, with something like the following in
> the crushmap rule:
> 
> step choose indep 3 datacenter
> step chooseleaf indep x host
> step emit
> 
> Is it the right approach? If yes, what should be 'x'? Would 0 work?
> 
> From what I have seen, there is no way to create such a rule with the
> 'ceph osd crush' commands: I have to download the current CRUSHMAP, edit
> it and upload the modified version. Am I right?
> 
> Thanks in advance for your help or suggestions. Best regards,
> 
> Michel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: iscsi target lun error

2023-01-12 Thread Frédéric Nass
Hi Xiubo, Randy,

This is due to ' host.containers.internal' being added to the 
container's /etc/hosts since Podman 4.1+.

The workaround consists of either downgrading the Podman package to v4.0 (on RHEL8, 
dnf downgrade podman-4.0.2-6.module+el8.6.0+14877+f643d2d6) or adding the 
--no-hosts option to the 'podman run' command in /var/lib/ceph/$(ceph 
fsid)/iscsi.iscsi.test-iscsi1.xx/unit.run and restarting the iscsi container 
service.
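
Roughly, on the node running the iscsi daemon (a sketch; the daemon name below is the placeholder 
from the path above, and the unit name assumes cephadm's usual ceph-$(ceph fsid)@<daemon>.service 
naming):

$ vi /var/lib/ceph/$(ceph fsid)/iscsi.iscsi.test-iscsi1.xx/unit.run    # append --no-hosts to the 'podman run' options
$ systemctl restart ceph-$(ceph fsid)@iscsi.iscsi.test-iscsi1.xx.service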

[1] and [2] could well have the same cause. RHCS Block Device Guide [3] quotes 
RHEL 8.4 as a prerequisite. I don't know what the version of Podman in 
RHEL 8.4 was at the time, but with RHEL 8.7 and Podman 4.2, it's broken.

I'll open a RHCS case today to have it fixed and have other containers like 
grafana, prometheus, etc. being checked against this new podman behavior.

Regards,
Frédéric.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1979449
[2] https://tracker.ceph.com/issues/57018
[3] 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/5/html-single/block_device_guide/index#prerequisites_9

- Le 21 Nov 22, à 6:45, Xiubo Li xiu...@redhat.com a écrit :

> On 15/11/2022 23:44, Randy Morgan wrote:
>> You are correct I am using the cephadm to create the iscsi portals.
>> The cluster had been one I was learning a lot with and I wondered if
>> it was because of the number of creations and deletions of things, so
>> I rebuilt the cluster, now I am getting this response even when
>> creating my first iscsi target.   Here is the output of the gwcli ls:
>>
>> sh-4.4# gwcli ls
>> o- /
>> 
>> [...]
>>   o- cluster
>> 
>> [Clusters: 1]
>>   | o- ceph
>> .
>> [HEALTH_WARN]
>>   |   o- pools
>> .
>> [Pools: 8]
>>   |   | o- .rgw.root
>>  [(x3),
>> Commit: 0.00Y/71588776M (0%), Used: 1323b]
>>   |   | o- cephfs_data
>> .. [(x3),
>> Commit: 0.00Y/71588776M (0%), Used: 1639b]
>>   |   | o- cephfs_metadata
>> .. [(x3), Commit:
>> 0.00Y/71588776M (0%), Used: 3434b]
>>   |   | o- default.rgw.control
>> .. [(x3), Commit:
>> 0.00Y/71588776M (0%), Used: 0.00Y]
>>   |   | o- default.rgw.log
>> .. [(x3), Commit:
>> 0.00Y/71588776M (0%), Used: 3702b]
>>   |   | o- default.rgw.meta
>> .. [(x3), Commit:
>> 0.00Y/71588776M (0%), Used: 382b]
>>   |   | o- device_health_metrics
>>  [(x3), Commit:
>> 0.00Y/71588776M (0%), Used: 0.00Y]
>>   |   | o- rhv-ceph-ssd
>> . [(x3), Commit:
>> 0.00Y/7868560896K (0%), Used: 511746b]
>>   |   o- topology
>> ..
>> [OSDs: 36,MONs: 3]
>>   o- disks
>> ..
>> [0.00Y, Disks: 0]
>>   o- iscsi-targets
>> ..
>> [DiscoveryAuth: None, Targets: 1]
>>     o- iqn.2001-07.com.ceph:1668466555428
>> ... [Auth:
>> None, Gateways: 1]
>>   o- disks
>> .
>> [Disks: 0]
>>   o- gateways
>> ...
>> [Up: 1/1, Portals: 1]
>>   | o- host.containers.internal
>> 
>> [192.168.105.145 (UP)]
> 
> Please manually remove this gateway before doing further steps.
> 
> It should be a bug in cephadm and you can raise one tracker for this.
> 
> Thanks
> 
> 
>> o- host-groups
>> .
>> [Groups : 0]
>>   o- hosts
>> ..
>> [Auth: ACL_ENABLED, Hosts: 0]
>> sh-4.4#
>>
>> Randy
>>
>> On 11/9/2022 6:36 PM, Xiubo Li wrote:
>>>
>>> On 10/11/2022 02:21, Randy Morgan wrote:
 I am trying to create a second iscsi target and I keep getting an
 error when I create the second target:


    Failed to update target 'iqn.2001-07.com.ceph:1667946365517'

 disk 

[ceph-users] Re: Increase the recovery throughput

2022-12-26 Thread Frédéric Nass
Hi Monish,

You might also want to check the values of osd_recovery_sleep_* if they are not 
the default ones.
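
A quick way to check (a sketch, querying one OSD as an example):

$ ceph config get osd osd_recovery_sleep_hdd
$ ceph config get osd osd_recovery_sleep_hybrid
$ ceph daemon osd.0 config show | grep osd_recovery_sleep    # run on the host of osd.0, shows what it actually uses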

Regards,
Frédéric.

- Le 12 Déc 22, à 11:32, Monish Selvaraj mon...@xaasability.com a écrit :

> Hi Eugen,
> 
> We tried that already. the osd_max_backfills is in 24 and the
> osd_recovery_max_active is in 20.
> 
> On Mon, Dec 12, 2022 at 3:47 PM Eugen Block  wrote:
> 
>> Hi,
>>
>> there are many threads dicussing recovery throughput, have you tried
>> any of the solutions? First thing to try is to increase
>> osd_recovery_max_active and osd_max_backfills. What are the current
>> values in your cluster?
>>
>>
>> Zitat von Monish Selvaraj :
>>
>> > Hi,
>> >
>> > Our ceph cluster consists of 20 hosts and 240 osds.
>> >
>> > We used the erasure-coded pool with cache-pool concept.
>> >
>> > Some time back 2 hosts went down and the pg are in a degraded state. We
>> got
>> > the 2 hosts back up in some time. After the pg is started recovering but
>> it
>> > takes a long time ( months )  . While this was happening we had the
>> cluster
>> > with 664.4 M objects and 987 TB data. The recovery status is not changed;
>> > it remains 88 pgs degraded.
>> >
>> > During this period, we increase the pg size from 256 to 512 for the
>> > data-pool ( erasure-coded pool ).
>> >
>> > We also observed (one week ) the recovery to be very slow, the current
>> > recovery around 750 Mibs.
>> >
>> > Is there any way to increase this recovery throughput ?
>> >
>> > *Ceph-version : quincy*
>> >
>> > [image: image.png]
>>
>>
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Do not use VMware Storage I/O Control with Ceph iSCSI GWs!

2022-01-26 Thread Frédéric Nass

Hi,

For anyone using VMware ESXi (6.7) with Ceph iSCSI GWs (Nautilus), I 
thought you might benefit from our experience: I have finally identified 
what was causing a permanent ~500 MB/s and ~4k iops load on our cluster, 
specifically on one of our RBD images used as a VMware Datastore, and it 
was Storage I/O Control. Not sure whether this is a bug that could be 
taken care of on the ceph side (as a misinterpretation of a SCSI 
instruction that the ESXi would replay madly) but disabling Storage I/O 
Control definitely solved the problem. By disabling I mean choosing 
"Disable Storage I/O Control **and** statistics collection" on each 
Datastore.


Regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Frédéric Nass


Le 25/01/2022 à 18:28, Casey Bodley a écrit :

On Tue, Jan 25, 2022 at 11:59 AM Frédéric Nass
 wrote:


Le 25/01/2022 à 14:48, Casey Bodley a écrit :

On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
 wrote:

Hello,

I've just heard about storage classes and imagined how we could use them
to migrate all S3 objects within a placement pool from an ec pool to a
replicated pool (or vice-versa) for data resiliency reasons, not to save
space.

It looks possible since ;

1. data pools are associated to storage classes in a placement pool
2. bucket lifecycle policies can take care of moving data from a storage
class to another
3. we can set a user's default_storage_class to have all new objects
written by this user reach the new storage class / data pool.
4. after all objects have been transitioned to the new storage class, we
can delete the old storage class, rename the new storage class to
STANDARD so that it's been used by default and unset any user's
default_storage_class setting.

i don't think renaming the storage class will work the way you're
hoping. this storage class string is stored in each object and used to
locate its data, so renaming it could render the transitioned objects
unreadable

Hello Casey,

Thanks for pointing that out.

Do you believe this scenario would work if stopped at step 3.? (keeping
default_storage_class set on users's profiles and not renaming the new
storage class to STANDARD. Could we delete the STANDARD storage class
btw since we would not use it anymore?).

If there is no way to define the default storage class of a placement
pool without naming it STANDARD could we imaging transitioning all
objects again by:

4. deleting the storage class named STANDARD
5. creating a new one named STANDARD (using a ceph pool of the same data
placement scheme than the one used by the temporary storage class
created above)

instead of deleting/recreating STANDARD, you could probably just
modify it's data pool. only do this once you're certain that there are
no more objects in the old data pool. you might need to wait for
garbage collection to clean up the tail objects there too (or force it
with 'radosgw-admin gc process --include-all')


Interesting scenario. So in the end we'd have objects named after both 
storage classes in the same ceph pool: the old ones named after the new 
storage class and the new ones written under the STANDARD 
storage class, right?





6. transitioning all objects again to the new STANDARD storage class.
Then delete the temporary storage class.

i think this step 6 would run into the
https://tracker.ceph.com/issues/50974 that Konstantin shared - if the
two storage classes have the same pool name, the transition doesn't
actually take effect. you might consider leaving this 'temporary'
storage class around, but pointing the defaults back at STANDARD


Well, in step 6., I'd thought about using another new pool for the 
recreated STANDARD storage class (to avoid the issue shared by 
Konstantin, thanks to him btw) and moving all objects to this new pool 
again in a new global transition.


But I understand you'd recommend avoiding deleting/recreating STANDARD 
and just modifying the STANDARD data pool after GC execution, am I right?
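
For the record, that data pool switch would presumably be something 
along these lines (just a sketch, zone and pool names are placeholders):

radosgw-admin zone placement modify --rgw-zone default \
    --placement-id default-placement --storage-class STANDARD \
    --data-pool <new-replicated-pool>
radosgw-admin period update --commit    # only when running with a realm/multisite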


Frédéric.




?

Best regards,

Frédéric.


Would that work?

Anyone tried this with success yet?

Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Frédéric Nass


Le 25/01/2022 à 14:48, Casey Bodley a écrit :

On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
 wrote:

Hello,

I've just heard about storage classes and imagined how we could use them
to migrate all S3 objects within a placement pool from an ec pool to a
replicated pool (or vice-versa) for data resiliency reasons, not to save
space.

It looks possible since ;

1. data pools are associated to storage classes in a placement pool
2. bucket lifecycle policies can take care of moving data from a storage
class to another
3. we can set a user's default_storage_class to have all new objects
written by this user reach the new storage class / data pool.
4. after all objects have been transitioned to the new storage class, we
can delete the old storage class, rename the new storage class to
STANDARD so that it's been used by default and unset any user's
default_storage_class setting.

i don't think renaming the storage class will work the way you're
hoping. this storage class string is stored in each object and used to
locate its data, so renaming it could render the transitioned objects
unreadable


Hello Casey,

Thanks for pointing that out.

Do you believe this scenario would work if we stopped at step 3? (keeping 
default_storage_class set on users' profiles and not renaming the new 
storage class to STANDARD. Could we delete the STANDARD storage class, 
btw, since we would not use it anymore?)


If there is no way to define the default storage class of a placement 
pool without naming it STANDARD, could we imagine transitioning all 
objects again by:


4. deleting the storage class named STANDARD
5. creating a new one named STANDARD (using a ceph pool with the same data 
placement scheme as the one used by the temporary storage class 
created above)
6. transitioning all objects again to the new STANDARD storage class. 
Then delete the temporary storage class.


?

Best regards,

Frédéric.




Would that work?

Anyone tried this with success yet?

Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS keyrings for K8s

2022-01-25 Thread Frédéric Nass


Le 25/01/2022 à 12:09, Frédéric Nass a écrit :


Hello Michal,

With cephfs and a single filesystem shared across multiple k8s 
clusters, you should subvolumegroups to limit data exposure. You'll 
find an example of how to use subvolumegroups in the ceph-csi-cephfs 
helm chart [1]. Essentially you just have to set the subvolumeGroup to 
whatever you like and then create the associated cephfs keyring with 
the following caps:


ceph auth get-or-create client.cephfs.k8s-cluster-1.admin mon "allow 
r" osd "allow rw tag cephfs *=*" mds "allow rw 
path=/volumes/csi-k8s-cluster-1" mgr "allow rw" -o 
/etc/ceph/client.cephfs.k8s-cluster-1.admin.keyring


    caps: [mds] allow rw path=/volumes/csi-k8s-cluster-1
    caps: [mgr] allow rw
    caps: [mon] allow r
    caps: [osd] allow rw tag cephfs *=*

The subvolume group will be created by ceph-csi-cephfs if I remember 
correctly but you can also take care of this on the ceph side with 
'ceph fs subvolumegroup create cephfs csi-k8s-cluster-1'.
PVs will then be created as subvolumes in this subvolumegroup. To list 
them, use 'ceph fs subvolume ls cephfs --group_name=csi-k8s-cluster-1'.


To achieve the same goal with RBD images, you should use rados 
namespaces. The current helm chart [2] seems to lack information about 
the radosNamespace setting but it works effectively considering you 
set it as below:


csiConfig:
  - clusterID: ""
    monitors:
  - ""
  - ""
    radosNamespace: "k8s-cluster-1"

ceph auth get-or-create client.rbd.name.admin mon "profile rbd" osd 
"allow rwx pool  object_prefix rbd_info, allow rwx pool 
 namespace k8s-cluster-1" mgr "profile rbd 
pool= namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.admin.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool= namespace=k8s-cluster-1


Sorry, the admin caps should read:

    caps: [mgr] profile rbd pool=<pool-name> namespace=k8s-cluster-1
    caps: [mon] profile rbd
    caps: [osd] allow rwx pool <pool-name> object_prefix rbd_info, 
allow rwx pool <pool-name> namespace k8s-cluster-1


Regards,

Frédéric.



ceph auth get-or-create client.rbd.name.user mon "profile rbd" osd 
"allow class-read object_prefix rbd_children, allow rwx 
pool= namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.user.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool= namespace=k8s-cluster-1


Capabilities required for ceph-csi-cephfs and ceph-csi-rbd are 
described here [3].


This should get you started. Let me know if you see any clever/safer 
caps to use.


Regards,

Frédéric.

[1] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-cephfs/values.yaml#L20
[2] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-rbd/values.yaml#L20

[3] https://github.com/ceph/ceph-csi/blob/devel/docs/capabilities.md


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS keyrings for K8s

2022-01-25 Thread Frédéric Nass

Hello Michal,

With cephfs and a single filesystem shared across multiple k8s clusters, 
you should use subvolumegroups to limit data exposure. You'll find an 
example of how to use subvolumegroups in the ceph-csi-cephfs helm chart 
[1]. Essentially you just have to set the subvolumeGroup to whatever you 
like and then create the associated cephfs keyring with the following caps:


ceph auth get-or-create client.cephfs.k8s-cluster-1.admin mon "allow r" 
osd "allow rw tag cephfs *=*" mds "allow rw 
path=/volumes/csi-k8s-cluster-1" mgr "allow rw" -o 
/etc/ceph/client.cephfs.k8s-cluster-1.admin.keyring


    caps: [mds] allow rw path=/volumes/csi-k8s-cluster-1
    caps: [mgr] allow rw
    caps: [mon] allow r
    caps: [osd] allow rw tag cephfs *=*

The subvolume group will be created by ceph-csi-cephfs if I remember 
correctly but you can also take care of this on the ceph side with 'ceph 
fs subvolumegroup create cephfs csi-k8s-cluster-1'.
PVs will then be created as subvolumes in this subvolumegroup. To list 
them, use 'ceph fs subvolume ls cephfs --group_name=csi-k8s-cluster-1'.
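
For reference, here is a minimal sketch of what the per-cluster csiConfig 
entry can look like in the cephfs chart values referenced in [1] (cluster 
ID and monitor addresses are placeholders, and I'm assuming a recent 
ceph-csi release where the subvolume group is set under the cephFS key):

csiConfig:
  - clusterID: "<cluster-id>"
    monitors:
      - "<mon1-address>"
      - "<mon2-address>"
    cephFS:
      subvolumeGroup: "csi-k8s-cluster-1"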


To achieve the same goal with RBD images, you should use rados 
namespaces. The current helm chart [2] seems to lack documentation about 
the radosNamespace setting, but it works as long as you set 
it as below:


csiConfig:
  - clusterID: "<cluster-id>"
    monitors:
      - "<mon1-address>"
      - "<mon2-address>"
    radosNamespace: "k8s-cluster-1"

ceph auth get-or-create client.rbd.name.admin mon "profile rbd" osd 
"allow rwx pool <pool-name> object_prefix rbd_info, allow rwx pool 
<pool-name> namespace k8s-cluster-1" mgr "profile rbd 
pool=<pool-name> namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.admin.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=<pool-name> namespace=k8s-cluster-1


ceph auth get-or-create client.rbd.name.user mon "profile rbd" osd 
"allow class-read object_prefix rbd_children, allow rwx 
pool=<pool-name> namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.user.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=<pool-name> namespace=k8s-cluster-1
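
One more detail worth noting (an assumption on my part, so double-check 
against your ceph-csi version): the rados namespace itself has to exist 
on the Ceph side before images can be created in it, e.g.:

rbd namespace create --pool <pool-name> --namespace k8s-cluster-1
rbd namespace ls --pool <pool-name>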


Capabilities required for ceph-csi-cephfs and ceph-csi-rbd are described 
here [3].


This should get you started. Let me know if you see any clever/safer 
caps to use.


Regards,

Frédéric.

[1] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-cephfs/values.yaml#L20
[2] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-rbd/values.yaml#L20

[3] https://github.com/ceph/ceph-csi/blob/devel/docs/capabilities.md

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

Le 20/01/2022 à 09:26, Michal Strnad a écrit :

Hi,

We are using CephFS in our Kubernetes clusters and now we are trying 
to optimize permissions/caps in keyrings. Every guide which we found 
contains something like - Create the file system by specifying the 
desired settings for the metadata pool, data pool and admin keyring 
with access to the entire file system ... Is there better way where we 
don't need admin key, but restricted key only? What are you using in 
your environments?


Multiple file systems isn't option for us.

Thanks for your help

Regards,
Michal Strnad


___
ceph-users mailing list --ceph-users@ceph.io
To unsubscribe send an email toceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Frédéric Nass

Hello,

I've just heard about storage classes and imagined how we could use them 
to migrate all S3 objects within a placement pool from an ec pool to a 
replicated pool (or vice-versa) for data resiliency reasons, not to save 
space.


It looks possible since:

1. data pools are associated to storage classes in a placement pool
2. bucket lifecycle policies can take care of moving data from a storage 
class to another
3. we can set a user's default_storage_class to have all new objects 
written by this user reach the new storage class / data pool.
4. after all objects have been transitioned to the new storage class, we 
can delete the old storage class, rename the new storage class to 
STANDARD so that it's used by default, and unset any user's 
default_storage_class setting (a rough sketch of the commands involved 
is shown right below).
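
To make the idea more concrete, here is a minimal sketch of what steps 1. 
to 3. could look like on the CLI. Zone, placement, pool, bucket and user 
names are placeholders, and the lifecycle part assumes any standard S3 
client pointed at the RGW endpoint (aws-cli here):

# 1. declare a new storage class backed by a replicated pool in the
#    existing placement target
radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id default-placement --storage-class REPLICATED
radosgw-admin zone placement add --rgw-zone default \
    --placement-id default-placement --storage-class REPLICATED \
    --data-pool default.rgw.buckets.replicated
radosgw-admin period update --commit    # only when running with a realm/multisite

# 2. transition existing objects with a bucket lifecycle rule
aws --endpoint-url http://<rgw-endpoint> s3api put-bucket-lifecycle-configuration \
    --bucket <bucket> --lifecycle-configuration '{"Rules": [{"ID": "to-replicated",
    "Status": "Enabled", "Filter": {"Prefix": ""},
    "Transitions": [{"Days": 1, "StorageClass": "REPLICATED"}]}]}'

# 3. have new objects written by a user land in the new storage class by default
#    (if your radosgw-admin build doesn't accept --storage-class here, the
#    default_storage_class field can be edited via 'radosgw-admin metadata' instead)
radosgw-admin user modify --uid <user-id> \
    --placement-id default-placement --storage-class REPLICATED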


Would that work?

Anyone tried this with success yet?

Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: osd_memory_target=level0 ?

2021-09-30 Thread Frédéric Nass

Hi,

As Christian said, osd_memory_target has nothing to do with rocksdb 
levels and will certainly not decide when overspilling occurs. With that 
said, I doubt any of us here ever gave 32GB of RAM to any OSD, so in 
case you're not sure that OSDs can handle that much memory correctly, I 
would advise you to lower this value to something more conservative like 
4GB or 8GB of RAM. Just make sure your system doesn't make use of swap. 
Also, since your clients do a lot of reads, check the value of 
bluefs_buffered_io. Its default value changed a few times in the past 
and got back to true recently. It might really help to have it set to true.
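
A quick way to check what a running OSD is actually using, and how much 
of its DB has already spilled over to the slow device (osd.0 is just an 
example, run this on the host carrying that OSD):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep -E 'bluefs_buffered_io|osd_memory_target'
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump bluefs | grep -E '"db_|"slow_'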


Regarding overspilling, unless you tuned bluestore_rocksdb_options with 
custom max_bytes_for_level_base and max_bytes_for_level_multiplier, I 
think levels should still be roughly 3GB, 30GB and 300GB. I suppose you 
gave 600GB+ NVMe block.db partitions to each one of the 6 SSDs so you'd 
be good with that for most workloads I guess.


Have you checked bluestore_min_alloc_size, bluestore_min_alloc_size_hdd, 
bluestore_min_alloc_size_ssd of your OSDs ? If I'm not mistaken, the 
default 32k value has now changed to 4k. If your OSDs were created with 
32k alloc size then it might explain the unexpected overspilling with a 
lot of objects in the cluster.


Hope that helps,

Regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

Le 30/09/2021 à 10:02, Szabo, Istvan (Agoda) a écrit :

Hi Christian,

Yes, I very clearly know what is spillover, read that github leveled document 
in the last couple of days every day multiple time. (Answers for your questions 
are after the cluster background information).

About the cluster:
- users are doing continuously put/head/delete operations
- cluster iops: 10-50k read, 5000 write iops
- throughput: 142MiB/s  write and 662 MiB/s read
- Not containerized deployment, 3 cluster in multisite
- 3x mon/mgr/rgw (5 rgw in each mon, altogether 15 behind haproxy vip)

7 nodes and in each node the following config:
- 1x 1.92TB nvme for index pool
- 6x 15.3 TB osd SAS SSD (hpe VO015360JWZJN read intensive ssd, SKU P19911-B21 
in this document:https://h20195.www2.hpe.com/v2/getpdf.aspx/a1288enw.pdf)
- 2x 1.92TB nvme  block.db for the 6 ssd (model: HPE KCD6XLUL1T92 SKU: 
P20131-B21 in this 
documenthttps://h20195.www2.hpe.com/v2/getpdf.aspx/a1288enw.pdf)
- osd deployed with dmcrypt
- data pool is on ec 4:2 other pools are on the ssds with 3 replica

Config file that we have on all nodes + on the mon nodes has the rgw definition 
also:
[global]
cluster network = 192.168.199.0/24
fsid = 5a07ec50-4eee-4336-aa11-46ca76edcc24
mon host = 
[v2:10.118.199.1:3300,v1:10.118.199.1:6789],[v2:10.118.199.2:3300,v1:10.118.199.2:6789],[v2:10.118.199.3:3300,v1:10.118.199.3:6789]
mon initial members = mon-2s01,mon-2s02,mon-2s03
osd pool default crush rule = -1
public network = 10.118.199.0/24
rgw_relaxed_s3_bucket_names = true
rgw_dynamic_resharding = false
rgw_enable_apis = s3, s3website, swift, swift_auth, admin, sts, iam, pubsub, 
notifications
#rgw_bucket_default_quota_max_objects = 1126400

[mon]
mon_allow_pool_delete = true
mon_pg_warn_max_object_skew = 0
mon_osd_nearfull_ratio = 70

[osd]
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_op_priority = 1
osd_memory_target = 31490694621
# due to osd reboots, the below configs has been added to survive the suicide 
timeout
osd_scrub_during_recovery = true
osd_op_thread_suicide_timeout=3000
osd_op_thread_timeout=120

Stability issue that I mean:
- Pg increase still in progress, hasn’t been finished from 32-128 on the 
erasure coded data pool. 103 currently and the degraded objects are always 
stuck almost when finished, but at the end osd dies and start again the 
recovery process.
- compaction is happening all the time, so all the nvme drives are generating 
iowait continuously because it is 100% utilized (iowait is around 1-3). If I 
try to compact with ceph tell osd.x compact that is impossible, it will never 
finish, only with ctrl+c.
- At the beginning when we didn't have so much spilledover disks, I didn't mind 
it actually I was happy of the spillover because the underlaying ssd can take 
some load from the nvme, but after the osds started to reboot and I'd say 
started to collapse 1 by 1. When I monitor which osds are collapsing, it was 
always the one which was spillovered. This op thread and suicide timeout can 
keep a bit longer the osds up.
- Now ALL rgw started to die once 1 specific osd goes down, and this make total 
outage. In the logs there isn't anything about this, neither message, nor rgw 
log just like timeout the connections. This is unacceptable from the user's 
perspective that thay need to wait 1.5 hour until my manual compaction finished 
and I can start the osd.

Current cluster state ceph -s:
health: HEALTH_ERR
 12 OSD(s) experiencing BlueFS spillover
 4

[ceph-users] Re: Cephfs metadata and MDS on same node

2021-03-26 Thread Frédéric Nass

Hi Jesper,

It could make sense only if:

1. the metadata the client's asking for was not already cached in RAM

2. the metadata pool was hosted on very low latency devices like NVMes

3. you could make sure that each client's metadata requests would be 
served from PGs for which the primary OSD is local to the MDS the 
client is talking to, which in real life is impossible to achieve as you 
cannot pin cephfs trees and their related metadata objects to specific PGs.


Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass

Direction du Numérique
Sous-Direction Infrastructures et Services
Université de Lorraine.

Le 09/03/2021 à 16:03, Jesper Lykkegaard Karlsen a écrit :

Dear Ceph’ers

I am about to upgrade MDS nodes for Cephfs in the Ceph cluster (erasure code 
8+3 ) I am administrating.

Since they will get plenty of memory and CPU cores, I was wondering if it would 
be a good idea to move metadata OSDs (NVMe's currently on OSD nodes together 
with cephfs_data ODS (HDD)) to the MDS nodes?

Configured as:

4 x MDS with each a metadata OSD and configured with 4 x replication

so each metadata OSD would have a complete copy of metadata.

I know MDS, stores al lot of metadata in RAM, but if metadata OSDs were on MDS 
nodes, would that not bring down latency?

Anyway, I am just asking for your opinion on this? Pros and cons or even better 
somebody who actually have tried this?

Best regards,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Gustav Wieds Vej 10
8000 Aarhus C

E-mail: je...@mbg.au.dk<mailto:je...@mbg.au.dk>
Tlf:+45 50906203

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Cordialement,

Frédéric Nass

Direction du Numérique
Sous-Direction Infrastructures et Services
Université de Lorraine.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS max_file_size

2021-03-25 Thread Frédéric Nass

Hi,

While looking at something else in the documentation, I came across 
this: 
https://docs.ceph.com/en/latest/cephfs/administration/#maximum-file-sizes-and-performance


"CephFS enforces the maximum file size limit at the point of appending 
to files or setting their size. It does not affect how anything is 
stored. When users create a file of an enormous size (without 
necessarily writing any data to it), some operations (such as deletes) 
cause the MDS to have to do a large number of operations to check if any 
of the RADOS objects within the range that could exist (according to the 
file size) really existed. The max_file_size setting prevents users from 
creating files that appear to be eg. exabytes in size, causing load on 
the MDS as it tries to enumerate the objects during operations like 
stats or deletes."


Thought it might help.
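
For reference, checking and raising the limit is a one-liner each (a 
sketch, assuming the filesystem is named cephfs):

ceph fs get cephfs | grep max_file_size
ceph fs set cephfs max_file_size 17592186044416    # 16 TiB, value is in bytes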

--
Cordialement,

Frédéric Nass

Direction du Numérique
Sous-Direction Infrastructures et Services
Université de Lorraine.

Le 11/12/2020 à 20:41, Paul Mezzanini a écrit :

 From how I understand it, that setting is a rev-limiter to prevent users from 
creating HUGE sparse files and then wasting cluster resources firing off 
deletes.

We have ours set to 32T and haven't seen any issues with large files.

--
Paul Mezzanini
Sr Systems Administrator / Engineer, Research Computing
Information & Technology Services
Finance & Administration
Rochester Institute of Technology
o:(585) 475-3245 | pfm...@rit.edu

CONFIDENTIALITY NOTE: The information transmitted, including attachments, is
intended only for the person(s) or entity to which it is addressed and may
contain confidential and/or privileged material. Any review, retransmission,
dissemination or other use of, or taking of any action in reliance upon this
information by persons or entities other than the intended recipient is
prohibited. If you received this in error, please contact the sender and
destroy any copies of this information.



From: Adam Tygart 
Sent: Friday, December 11, 2020 1:59 PM
To: Mark Schouten
Cc: ceph-users; Patrick Donnelly
Subject: [ceph-users] Re: CephFS max_file_size

I've had this set to 16TiB for several years now.

I've not seen any ill effects.

--
Adam

On Fri, Dec 11, 2020 at 12:56 PM Patrick Donnelly  wrote:

Hi Mark,

On Fri, Dec 11, 2020 at 4:21 AM Mark Schouten  wrote:

There is a default limit of 1TiB for the max_file_size in CephFS. I altered 
that to 2TiB, but I now got a request for storing a file up to 7TiB.

I'd expect the limit to be there for a reason, but what is the risk of setting 
that value to say 10TiB?

There is no known downside. Let us know how it goes!

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


--
Cordialement,

Frédéric Nass

Direction du Numérique
Sous-Direction Infrastructures et Services
Université de Lorraine.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Outage (Nautilus) - 14.2.11

2020-12-16 Thread Frédéric Nass

Hi Suresh,

24 HDDs backed by only 2 NVMes looks like a high ratio. What rings a 
bell in your post is "upgraded from Luminous to Nautilus" and 
"Elasticsearch", which mainly reads to index data, and also "memory leak".


You might want to take a look at the current value of bluefs_buffered_io 
: ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep 
bluefs_buffered_io


In Nautilus, the default value of bluefs_buffered_io changed to false to 
avoid unexplained excessive swap usage (when true) by the OSDs, but this 
change induced a very high load on the fast devices (SSDs/NVMes) hosting 
RocksDB and the WAL.


Considering the high ratio of HDDs per NVMe that you're using, I 
wouldn't be surprised if your NVMes were topping out on an iostat (%util) due 
to bluefs_buffered_io now being false. If so, change it to true (and 
lower osd_memory_target and vm.swappiness), and keep an eye on swap usage.
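
A minimal sketch of the checks and of the change itself (device name and 
OSD id are placeholders):

iostat -dmx 1 /dev/nvme0n1      # watch %util on the DB/WAL device under load
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep bluefs_buffered_io
ceph tell osd.* injectargs '--bluefs_buffered_io=true'    # runtime change
ceph config set osd bluefs_buffered_io true               # persist it in the config database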


Regards,

Frédéric.

Le 15/12/2020 à 21:44, Suresh Rama a écrit :

Dear All,

We have a 38 node HP Apollo cluster with 24 3.7T Spinning disk and 2 NVME
for journal. This is one of our 13 clusters which was upgraded from
Luminous to Nautilus (14.2.11).  When one of our openstack customers  uses
elastic search (they offer Logging as a Service) to their end users
reported IO latency issues, our SME rebooted two nodes that he felt were
doing memory leak.  The reboot didn't help rather worsen the situation and
he went ahead and recycled the entire cluster one node a time as to fix the
slow ops reported by OSDs.   This caused a huge issue and MONS were not
able to withstand the spam and started crashing.

1) We audited the network (inspecting TOR, iperf, MTR) and nothing was
indicating any issue but OSD logs were keep complaining about
BADAUTHORIZER

  2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >>
v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >>
v1:10.146.127.122:6809/1803700 conn(0x7fea3c1c1e20 0x7fea3c1bcdf0 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER
2020-12-13 15:32:31.607 7fea5e3a2700  0 --1- 10.146.126.200:0/464096978 >>
v1:10.146.127.122:6809/1803700 conn(0x7fea3c1ba990 0x7fea3c1bf600 :-1
s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2
connect got BADAUTHORIZER

2) Made sure no clock skew and we use timesyncd.   After taking out a
couple of OSDs that were indicating slow in the ceph health response the
situation didn't improve.  After 3 days of troubleshooting we upgraded the
MONS to 14.2.15 and seems the situation improved a little but still
reporting  61308 slow ops which we really struggled to isolate with bad
OSDs as moving a couple of them didn't improve.  One of the MON(2) failed
to join the cluster and always doing compact and never was able to join
(see the size below).  I suspect that could be because the key value store
information between 1 and 3 is not up to date with 2.At times, we had
to stop and start to compress to get a better response from Ceph MON
(keeping them running in one single MON).

root@pistoremon-as-c01:~# du -sh /var/lib/ceph/mon
391G /var/lib/ceph/mon
root@pistoremon-as-c03:~# du -sh /var/lib/ceph/mon
337G /var/lib/ceph/mon
root@pistoremon-as-c02:~# du -sh /var/lib/ceph/mon
13G /var/lib/ceph/mon

  cluster:
 id: bac20301-d458-4828-9dd9-a8406acf5d0f
 health: HEALTH_WARN
 noout,noscrub,nodeep-scrub flag(s) set
 1 pools have many more objects per pg than average
 10969 pgs not deep-scrubbed in time
 46 daemons have recently crashed
 61308 slow ops, oldest one blocked for 2572 sec, daemons
[mon.pistoremon-as-c01,mon.pistoremon-as-c03] have slow ops.
 mons pistoremon-as-c01,pistoremon-as-c03 are using a lot of
disk space
 1/3 mons down, quorum pistoremon-as-c01,pistoremon-as-c03

   services:
 mon: 3 daemons, quorum pistoremon-as-c01,pistoremon-as-c03 (age 52m),
out of quorum: pistoremon-as-c02
 mgr: pistoremon-as-c01(active, since 2h), standbys: pistoremon-as-c03,
pistoremon-as-c02
 osd: 911 osds: 888 up (since 68m), 888 in
  flags noout,noscrub,nodeep-scrub
 rgw: 2 daemons active (pistorergw-as-c01, pistorergw-as-c02)

   task status:

   data:
 pools:   17 pools, 32968 pgs
 objects: 62.98M objects, 243 TiB
 usage:   748 TiB used, 2.4 PiB / 3.2 PiB avail
 pgs: 32968 active+clean

   io:
 client:   56 MiB/s rd, 95 MiB/s wr, 1.78k op/s rd, 4.27k op/s wr

3) When looking through ceph.log on the mon with tailf, I was getting a lot
of different time stamp reported in the ceph logs in MON1 which is master.
Confused on why the live log report various timestamps?

stat,write 2166784~4096] snapc 0=[] ondisk+write+known_if_redirected
e951384) 

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-16 Thread Frédéric Nass
Regarding RocksDB compaction, if you were in a situation where RocksDB 
had overspilled to HDDs (if your cluster is using a hybrid setup), the 
compaction should have moved the bits back to the fast devices. So it might 
have helped in this situation too.


Regards,

Frédéric.

Le 16/12/2020 à 09:57, Frédéric Nass a écrit :

Hi Sefan,

This has me thinking that the issue your cluster may be facing is 
probably with bluefs_buffered_io set to true, as this has been 
reported to induce excessive swap usage (and OSDs flapping or OOMing 
as consequences) in some versions starting from Nautilus I believe.


Can you check the value of bluefs_buffered_io that OSDs are currently 
using ? : ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config 
show | grep bluefs_buffered_io


Can you check the kernel value of vm.swappiness ? : sysctl 
vm.swappiness (default value is 30)


And describe your OSD nodes ? # of HDDS and SSDs/NVMes and HDD/SSD 
ratio, and how much memory they have ?


You should be able to avoid swap usage by setting bluefs_buffered_io 
to false but your cluster / workload might not allow that performance 
and stability wise.
Or you may be able to workaround the excessive swap usage (when 
bluefs_buffered_io is set to true) by lowering vm.swappiness or 
disabling the swap.


Regards,

Frédéric.

Le 14/12/2020 à 22:12, Stefan Wild a écrit :

Hi Frédéric,

Thanks for the additional input. We are currently only running RGW on 
the cluster, so no snapshot removal, but there have been plenty of 
remappings with the OSDs failing (all of them at first during and 
after the OOM incident, then one-by-one). I haven't had a chance to 
look into or test the bluefs_buffered_io setting, but will do that 
next. Initial results from compacting all OSDs' RocksDBs look 
promising (thank you, Igor!). Things have been stable for the past 
two hours, including the two OSDs with issues (one in reboot loop, 
the other with some heartbeats missed), while 15 degraded PGs are 
backfilling.


The ballooning of each OSD to over 15GB memory right after the 
initial crash was even with osd_memory_target set to 2GB. The only 
thing that helped at that point was to temporarily add enough swap 
space to fit 12 x 15GB and let them do their thing. Once they had all 
booted, memory usage went back down to normal levels.


I will report back here with more details when the cluster is 
hopefully back to a healthy state.


Thanks,
Stefan



On 12/14/20, 3:35 PM, "Frédéric Nass" 
 wrote:


 Hi Stefan,

 Initial data removal could also have resulted from a snapshot 
removal
 leading to OSDs OOMing and then pg remappings leading to more 
removals

 after OOMed OSDs rejoined the cluster and so on.

 As mentioned by Igor : "Additionally there are users' reports that
 recent default value's modification for bluefs_buffered_io 
setting has
 negative impact (or just worsen existing issue with massive 
removal) as

 well. So you might want to switch it back to true."

 We're some of them. Our cluster suffered from a severe 
performance drop

 during snapshot removal right after upgrading to Nautilus, due to
 bluefs_buffered_io being set to false by default, with slow 
requests

 observed around the cluster.
 Once back to true (can be done with ceph tell osd.* injectargs
 '--bluefs_buffered_io=true') snap trimming would be fast again 
so as

 before the upgrade, with no more slow requests.

 But of course we've seen the excessive memory swap usage 
described here

 : https://github.com/ceph/ceph/pull/34224
 So we lower osd_memory_target from 8MB to 4MB and haven't 
observed any

 swap usage since then. You can also have a look here :
 https://github.com/ceph/ceph/pull/38044

 What you need to look at to understand if your cluster would 
benefit

 from changing bluefs_buffered_io back to true is the %util of your
 RocksDBD devices on an iostat. Run an iostat -dmx 1 /dev/sdX (if 
you're

 using SSD RocksDB devices) and look at the %util of the device with
 bluefs_buffered_io=false and with bluefs_buffered_io=true. If with
 bluefs_buffered_io=false, the %util is over 75% most of the 
time, then

 you'd better change it to true. :-)

 Regards,

 Frédéric.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-16 Thread Frédéric Nass

Hi Stefan,

This has me thinking that the issue your cluster may be facing is 
probably with bluefs_buffered_io set to true, as this has been reported 
to induce excessive swap usage (and OSDs flapping or OOMing as 
consequences) in some versions starting from Nautilus I believe.


Can you check the value of bluefs_buffered_io that OSDs are currently 
using ? : ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show 
| grep bluefs_buffered_io


Can you check the kernel value of vm.swappiness ? : sysctl vm.swappiness 
(default value is 30)


And describe your OSD nodes ? # of HDDs and SSDs/NVMes and HDD/SSD 
ratio, and how much memory they have ?


You should be able to avoid swap usage by setting bluefs_buffered_io to 
false, but your cluster / workload might not allow that, performance- and 
stability-wise.
Or you may be able to work around the excessive swap usage (when 
bluefs_buffered_io is set to true) by lowering vm.swappiness or 
disabling the swap.


Regards,

Frédéric.

Le 14/12/2020 à 22:12, Stefan Wild a écrit :

Hi Frédéric,

Thanks for the additional input. We are currently only running RGW on the 
cluster, so no snapshot removal, but there have been plenty of remappings with 
the OSDs failing (all of them at first during and after the OOM incident, then 
one-by-one). I haven't had a chance to look into or test the bluefs_buffered_io 
setting, but will do that next. Initial results from compacting all OSDs' 
RocksDBs look promising (thank you, Igor!). Things have been stable for the 
past two hours, including the two OSDs with issues (one in reboot loop, the 
other with some heartbeats missed), while 15 degraded PGs are backfilling.

The ballooning of each OSD to over 15GB memory right after the initial crash 
was even with osd_memory_target set to 2GB. The only thing that helped at that 
point was to temporarily add enough swap space to fit 12 x 15GB and let them do 
their thing. Once they had all booted, memory usage went back down to normal 
levels.

I will report back here with more details when the cluster is hopefully back to 
a healthy state.

Thanks,
Stefan



On 12/14/20, 3:35 PM, "Frédéric Nass"  wrote:

 Hi Stefan,

 Initial data removal could also have resulted from a snapshot removal
 leading to OSDs OOMing and then pg remappings leading to more removals
 after OOMed OSDs rejoined the cluster and so on.

 As mentioned by Igor : "Additionally there are users' reports that
 recent default value's modification for bluefs_buffered_io setting has
 negative impact (or just worsen existing issue with massive removal) as
 well. So you might want to switch it back to true."

 We're some of them. Our cluster suffered from a severe performance drop
 during snapshot removal right after upgrading to Nautilus, due to
 bluefs_buffered_io being set to false by default, with slow requests
 observed around the cluster.
 Once back to true (can be done with ceph tell osd.* injectargs
 '--bluefs_buffered_io=true') snap trimming would be fast again so as
 before the upgrade, with no more slow requests.

 But of course we've seen the excessive memory swap usage described here
 : https://github.com/ceph/ceph/pull/34224
 So we lower osd_memory_target from 8MB to 4MB and haven't observed any
 swap usage since then. You can also have a look here :
 https://github.com/ceph/ceph/pull/38044

 What you need to look at to understand if your cluster would benefit
 from changing bluefs_buffered_io back to true is the %util of your
 RocksDBD devices on an iostat. Run an iostat -dmx 1 /dev/sdX (if you're
 using SSD RocksDB devices) and look at the %util of the device with
 bluefs_buffered_io=false and with bluefs_buffered_io=true. If with
 bluefs_buffered_io=false, the %util is over 75% most of the time, then
 you'd better change it to true. :-)

 Regards,

 Frédéric.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Frédéric Nass
I forgot to mention "If with bluefs_buffered_io=false, the %util is over 
75% most of the time ** during data removal (like snapshot removal) **, 
then you'd better change it to true."


Regards,

Frédéric.

Le 14/12/2020 à 21:35, Frédéric Nass a écrit :

Hi Stefan,

Initial data removal could also have resulted from a snapshot removal 
leading to OSDs OOMing and then pg remappings leading to more removals 
after OOMed OSDs rejoined the cluster and so on.


As mentioned by Igor : "Additionally there are users' reports that 
recent default value's modification for bluefs_buffered_io setting has 
negative impact (or just worsen existing issue with massive removal) 
as well. So you might want to switch it back to true."


We're some of them. Our cluster suffered from a severe performance 
drop during snapshot removal right after upgrading to Nautilus, due to 
bluefs_buffered_io being set to false by default, with slow requests 
observed around the cluster.
Once back to true (can be done with ceph tell osd.* injectargs 
'--bluefs_buffered_io=true') snap trimming would be fast again so as 
before the upgrade, with no more slow requests.


But of course we've seen the excessive memory swap usage described 
here : https://github.com/ceph/ceph/pull/34224
So we lower osd_memory_target from 8MB to 4MB and haven't observed any 
swap usage since then. You can also have a look here : 
https://github.com/ceph/ceph/pull/38044


What you need to look at to understand if your cluster would benefit 
from changing bluefs_buffered_io back to true is the %util of your 
RocksDBD devices on an iostat. Run an iostat -dmx 1 /dev/sdX (if 
you're using SSD RocksDB devices) and look at the %util of the device 
with bluefs_buffered_io=false and with bluefs_buffered_io=true. If 
with bluefs_buffered_io=false, the %util is over 75% most of the time, 
then you'd better change it to true. :-)


Regards,

Frédéric.

Le 14/12/2020 à 12:47, Stefan Wild a écrit :

Hi Igor,

Thank you for the detailed analysis. That makes me hopeful we can get 
the cluster back on track. No pools have been removed, but yes, due 
to the initial crash of multiple OSDs and the subsequent issues with 
individual OSDs we’ve had substantial PG remappings happening 
constantly.


I will look up the referenced thread(s) and try the offline DB 
compaction. It would be amazing if that does the trick.


Will keep you posted, here.

Thanks,
Stefan


From: Igor Fedotov 
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io 

Subject: Re: [ceph-users] Re: OSD reboot loop after running out of 
memory


Hi Stefan,

given the crash backtrace in your log I presume some data removal is in
progress:

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4:
(KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5:
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
char*)+0x674) [0x5587b9328cb4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19:
(BlueStore::_do_omap_clear(BlueStore::TransContext*,
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20:
(BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21:
(BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24:
(ObjectStore::queue_transaction(boost::intrusive_ptr&, 


ceph::os::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25:
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26:
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
[0x5587b8fd6ede]
...

Did you initiate some large pool removal recently? Or may be data
rebalancing triggered PG migration (and hence source PG removal) for 
you?


Highly likely you're facing a well known issue with RocksDB/BlueFS
performance issues caused by massive data removal.

So your OSDs are just processing I/O very slowly which triggers suicide
timeout.

We've had multiple threads on the issue in this mailing list - the
latest one is at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/ 



For now the good enough workaround is manual offline DB compaction for
all the OSDs (this might have temporary effect though as the removal
proceeds).

Additionally there are users' reports that recent default value's
modification  for bluefs_buffered_io se

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Frédéric Nass

Hi Stefan,

Initial data removal could also have resulted from a snapshot removal 
leading to OSDs OOMing and then pg remappings leading to more removals 
after OOMed OSDs rejoined the cluster and so on.


As mentioned by Igor : "Additionally there are users' reports that 
recent default value's modification for bluefs_buffered_io setting has 
negative impact (or just worsen existing issue with massive removal) as 
well. So you might want to switch it back to true."


We're some of them. Our cluster suffered from a severe performance drop 
during snapshot removal right after upgrading to Nautilus, due to 
bluefs_buffered_io being set to false by default, with slow requests 
observed around the cluster.
Once back to true (this can be done with ceph tell osd.* injectargs 
'--bluefs_buffered_io=true'), snap trimming would be fast again, as it was 
before the upgrade, with no more slow requests.


But of course we've seen the excessive memory swap usage described here: 
https://github.com/ceph/ceph/pull/34224
So we lowered osd_memory_target from 8GB to 4GB and haven't observed any 
swap usage since then. You can also have a look here: 
https://github.com/ceph/ceph/pull/38044


What you need to look at to understand if your cluster would benefit 
from changing bluefs_buffered_io back to true is the %util of your 
RocksDB devices on an iostat. Run an iostat -dmx 1 /dev/sdX (if you're 
using SSD RocksDB devices) and look at the %util of the device with 
bluefs_buffered_io=false and with bluefs_buffered_io=true. If with 
bluefs_buffered_io=false, the %util is over 75% most of the time, then 
you'd better change it to true. :-)


Regards,

Frédéric.

Le 14/12/2020 à 12:47, Stefan Wild a écrit :

Hi Igor,

Thank you for the detailed analysis. That makes me hopeful we can get the 
cluster back on track. No pools have been removed, but yes, due to the initial 
crash of multiple OSDs and the subsequent issues with individual OSDs we’ve had 
substantial PG remappings happening constantly.

I will look up the referenced thread(s) and try the offline DB compaction. It 
would be amazing if that does the trick.

Will keep you posted, here.

Thanks,
Stefan


From: Igor Fedotov 
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild ; ceph-users@ceph.io 
Subject: Re: [ceph-users] Re: OSD reboot loop after running out of memory

Hi Stefan,

given the crash backtrace in your log I presume some data removal is in
progress:

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  3:
(KernelDevice::direct_read_unaligned(unsigned long, unsigned long,
char*)+0xd8) [0x5587b9364a48]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  4:
(KernelDevice::read_random(unsigned long, unsigned long, char*,
bool)+0x1b3) [0x5587b93653e3]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  5:
(BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long,
char*)+0x674) [0x5587b9328cb4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  19:
(BlueStore::_do_omap_clear(BlueStore::TransContext*,
boost::intrusive_ptr&)+0xa2) [0x5587b922f0e2]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  20:
(BlueStore::_do_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr)+0xc65) [0x5587b923b555]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  21:
(BlueStore::_remove(BlueStore::TransContext*,
boost::intrusive_ptr&,
boost::intrusive_ptr&)+0x64) [0x5587b923c3b4]
...

Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  24:
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
ceph::os::Transaction&&, boost::intrusive_ptr,
ThreadPool::TPHandle*)+0x85) [0x5587b8dcf745]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  25:
(PG::do_delete_work(ceph::os::Transaction&)+0xb2e) [0x5587b8e269ee]
Dec 12 21:58:38 ceph-tpa-server1 bash[784256]:  26:
(PeeringState::Deleting::react(PeeringState::DeleteSome const&)+0x3e)
[0x5587b8fd6ede]
...

Did you initiate some large pool removal recently? Or may be data
rebalancing triggered PG migration (and hence source PG removal) for you?

Highly likely you're facing a well known issue with RocksDB/BlueFS
performance issues caused by massive data removal.

So your OSDs are just processing I/O very slowly which triggers suicide
timeout.

We've had multiple threads on the issue in this mailing list - the
latest one is at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YBHNOSWW72ZVQ6PD5NABEEYRDMX7OZTT/

For now the good enough workaround is manual offline DB compaction for
all the OSDs (this might have temporary effect though as the removal
proceeds).

Additionally there are users' reports that recent default value's
modification  for bluefs_buffered_io setting has negative impact (or
just worsen existing issue with massive removal) as well. So you might
want to switch it back to true.

As for OSD.10 - can't say for sure as I haven't seen its' logs but I
think it's experiencing the same issue which might eventually lead it
into unresponsive state as well. Just grep its log for 

[ceph-users] Re: NoSuchKey on key that is visible in s3 list/radosgw bk

2020-11-23 Thread Frédéric Nass

Hi Denis,

You might want to look at rgw_gc_obj_min_wait from [1] and try 
increasing the default value of 7200s (2 hours) to whatever suits your 
need < 2^64.
Just keep in mind that at some point you'll have to get these objects 
processed by the gc, or manually through the API [2].
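
A quick sketch of the knobs involved (the config set target assumes your 
radosgw daemons pick up options from the config database under a 
client.rgw section; otherwise set it in ceph.conf and restart them):

radosgw-admin gc list --include-all | head     # what is currently enlisted for deletion
radosgw-admin gc process --include-all         # force a gc pass now
ceph config set client.rgw rgw_gc_obj_min_wait 86400    # raise the grace delay (in seconds)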


One thing that comes to mind regarding last night's missing object 
is that maybe it was re-written as multipart, the re-write failed somehow, 
and the object was then enlisted by the gc. But that supposes this 
particular object sometimes gets re-written, which may not be the case.


Regards,

Frédéric.

[1] https://docs.ceph.com/en/latest/radosgw/config-ref/
[2] 
https://docs.ceph.com/en/latest/dev/radosgw/admin/adminops_nonimplemented/#manually-processes-garbage-collection-items


Le 18/11/2020 à 11:27, Denis Krienbühl a écrit :

By the way, since there’s some probability that this is a GC refcount issue, 
would it be possible and sane to somehow slow the GC down or disable it 
altogether? Is that something we could implement on our end as a stop-gap 
measure to prevent dataloss?


On 18 Nov 2020, at 10:46, Denis Krienbühl  wrote:

I can now confirm that last night’s missing object was a multi-part file.


On 18 Nov 2020, at 10:01, Janek Bevendorff  
wrote:

Sorry, it's radosgw-admin object stat --bucket=BUCKETNAME --object=OBJECTNAME (forgot the 
"object" there)

On 18/11/2020 09:58, Janek Bevendorff wrote:

The object, a Docker layer, that went missing has not been touched in 2 months. 
It worked for a while, but then suddenly went missing.

Was the object a multipart object? You can check by running radosgw-admin stat --bucket=BUCKETNAME --object=OBJECTNAME. 
It should say something "ns": "multipart" in the output. If it says "ns": 
"shadow", it's a single-part object.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OverlayFS with Cephfs to mount a snapshot read/write

2020-11-13 Thread Frédéric Nass

Hi Jeff,

I understand the idea behind patch [1] but it breaks the operation of overlayfs 
with cephfs. Should the patch be abandoned and tests be modified or should 
overlayfs code be adapted to work with cephfs, if that's possible?

Either way, it'd be nice if overlayfs could work again with cephfs out of the 
box without requiring users to patch and build their own kernels past 5.4+.

Regards,

Frédéric.

[1] https://www.spinics.net/lists/ceph-devel/msg46183.html (CCing Greg)

PS : Please forgive me if you received this message twice. My previous message 
was flagged as spam due to the dynamic IP address of my router that was seen in a 
spam campaign in the past, so I sent it again.

- Le 9 Nov 20, à 19:52, Jeff Layton jlay...@kernel.org a écrit :

> Yes, you'd have to apply the patch to that kernel yourself. No RHEL7
> kernels have that patch (so far). Newer RHEL8 kernels _do_ if that's an
> option for you.
> -- Jeff
> 
> On Mon, 2020-11-09 at 19:21 +0100, Frédéric Nass wrote:
>> I feel lucky to have you on this one. ;-) Do you mean applying a
>> specific patch on 3.10 kernel? Or is this one too old to have it working
>> anyways.
>> 
>> Frédéric.
>> 
>> Le 09/11/2020 à 19:07, Luis Henriques a écrit :
>> > Frédéric Nass  writes:
>> > 
>> > > Hi Luis,
>> > > 
>> > > Thanks for your help. Sorry I forgot about the kernel details. This is 
>> > > latest
>> > > RHEL 7.9.
>> > > 
>> > > ~/ uname -r
>> > > 3.10.0-1160.2.2.el7.x86_64
>> > > 
>> > > ~/ grep CONFIG_TMPFS_XATTR /boot/config-3.10.0-1160.2.2.el7.x86_64
>> > > CONFIG_TMPFS_XATTR=y
>> > > 
>> > > upper directory /upperdir is using xattrs
>> > > 
>> > > ~/ ls -l /dev/mapper/vg0-racine
>> > > lrwxrwxrwx 1 root root 7  6 mars   2020 /dev/mapper/vg0-racine -> ../dm-0
>> > > 
>> > > ~/ cat /proc/fs/ext4/dm-0/options | grep xattr
>> > > user_xattr
>> > > 
>> > > ~/ setfattr -n user.name -v upperdir /upperdir
>> > > 
>> > > ~/ getfattr -n user.name /upperdir
>> > > getfattr: Suppression des « / » en tête des chemins absolus
>> > > # file: upperdir
>> > > user.name="upperdir"
>> > > 
>> > > Are you able to modify the content of a snapshot directory using 
>> > > overlayfs on
>> > > your side?
>> > [ Cc'ing Jeff ]
>> > 
>> > Yes, I'm able to do that using a *recent* kernel.  I got curious and after
>> > some digging I managed to reproduce the issue with kernel 5.3.  The
>> > culprit was commit e09580b343aa ("ceph: don't list vxattrs in
>> > listxattr()"), in 5.4.
>> > 
>> > Getting a bit more into the whole rabbit hole, it looks like
>> > ovl_copy_xattr() will try to copy all the ceph-related vxattrs.  And that
>> > won't work (for ex. for ceph.dir.entries).
>> > 
>> > Can you try cherry-picking this commit into your kernel to see if that
>> > fixes it for you?
>> > 
>> > Cheers,
> 
> --
> Jeff Layton 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OverlayFS with Cephfs to mount a snapshot read/write

2020-11-11 Thread Frédéric Nass

Hi Jeff,

I understand the idea behind patch [1] but it breaks the operation of 
overlayfs with cephfs. Should the patch be abandoned and tests be 
modified or should overlayfs code be adapted to work with cephfs, if 
that's possible?


Either way, it'd be nice if overlayfs could work again with cephfs out 
of the box without requiring users to patch and build their own kernels 
past 5.4+. :-)


Regards,

Frédéric.

[1] https://www.spinics.net/lists/ceph-devel/msg46183.html

Le 09/11/2020 à 19:52, Jeff Layton a écrit :

Yes, you'd have to apply the patch to that kernel yourself. No RHEL7
kernels have that patch (so far). Newer RHEL8 kernels _do_ if that's an
option for you.
-- Jeff

On Mon, 2020-11-09 at 19:21 +0100, Frédéric Nass wrote:

I feel lucky to have you on this one. ;-) Do you mean applying a
specific patch on 3.10 kernel? Or is this one too old to have it working
anyways.

Frédéric.

Le 09/11/2020 à 19:07, Luis Henriques a écrit :

Frédéric Nass  writes:


Hi Luis,

Thanks for your help. Sorry I forgot about the kernel details. This is latest
RHEL 7.9.

~/ uname -r
3.10.0-1160.2.2.el7.x86_64

~/ grep CONFIG_TMPFS_XATTR /boot/config-3.10.0-1160.2.2.el7.x86_64
CONFIG_TMPFS_XATTR=y

upper directory /upperdir is using xattrs

~/ ls -l /dev/mapper/vg0-racine
lrwxrwxrwx 1 root root 7  6 mars   2020 /dev/mapper/vg0-racine -> ../dm-0

~/ cat /proc/fs/ext4/dm-0/options | grep xattr
user_xattr

~/ setfattr -n user.name -v upperdir /upperdir

~/ getfattr -n user.name /upperdir
getfattr: Suppression des « / » en tête des chemins absolus
# file: upperdir
user.name="upperdir"

Are you able to modify the content of a snapshot directory using overlayfs on
your side?

[ Cc'ing Jeff ]

Yes, I'm able to do that using a *recent* kernel.  I got curious and after
some digging I managed to reproduce the issue with kernel 5.3.  The
culprit was commit e09580b343aa ("ceph: don't list vxattrs in
listxattr()"), in 5.4.

Getting a bit more into the whole rabbit hole, it looks like
ovl_copy_xattr() will try to copy all the ceph-related vxattrs.  And that
won't work (for ex. for ceph.dir.entries).

Can you try cherry-picking this commit into your kernel to see if that
fixes it for you?

Cheers,

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OverlayFS with Cephfs to mount a snapshot read/write

2020-11-09 Thread Frédéric Nass
I feel lucky to have you on this one. ;-) Do you mean applying a 
specific patch to the 3.10 kernel? Or is this one too old to have it working 
anyway?


Frédéric.

Le 09/11/2020 à 19:07, Luis Henriques a écrit :

Frédéric Nass  writes:


Hi Luis,

Thanks for your help. Sorry I forgot about the kernel details. This is latest
RHEL 7.9.

~/ uname -r
3.10.0-1160.2.2.el7.x86_64

~/ grep CONFIG_TMPFS_XATTR /boot/config-3.10.0-1160.2.2.el7.x86_64
CONFIG_TMPFS_XATTR=y

upper directory /upperdir is using xattrs

~/ ls -l /dev/mapper/vg0-racine
lrwxrwxrwx 1 root root 7  6 mars   2020 /dev/mapper/vg0-racine -> ../dm-0

~/ cat /proc/fs/ext4/dm-0/options | grep xattr
user_xattr

~/ setfattr -n user.name -v upperdir /upperdir

~/ getfattr -n user.name /upperdir
getfattr: Suppression des « / » en tête des chemins absolus
# file: upperdir
user.name="upperdir"

Are you able to modify the content of a snapshot directory using overlayfs on
your side?

[ Cc'ing Jeff ]

Yes, I'm able to do that using a *recent* kernel.  I got curious and after
some digging I managed to reproduce the issue with kernel 5.3.  The
culprit was commit e09580b343aa ("ceph: don't list vxattrs in
listxattr()"), in 5.4.

Getting a bit more into the whole rabbit hole, it looks like
ovl_copy_xattr() will try to copy all the ceph-related vxattrs.  And that
won't work (for ex. for ceph.dir.entries).

Can you try cherry-picking this commit into your kernel to see if that
fixes it for you?

Cheers,

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OverlayFS with Cephfs to mount a snapshot read/write

2020-11-09 Thread Frédéric Nass

Luis,

I gave RHEL 8 and kernel 4.18 a try and it's working perfectly! \o/

Same commands, same mount options. Does anyone know why, and if there's 
any chance I can get this working with CentOS/RHEL 7 and the 3.10 kernel?


Best regards,

Frédéric.

Le 09/11/2020 à 15:04, Frédéric Nass a écrit :

Hi Luis,

Thanks for your help. Sorry I forgot about the kernel details. This is 
latest RHEL 7.9.


~/ uname -r
3.10.0-1160.2.2.el7.x86_64

~/ grep CONFIG_TMPFS_XATTR /boot/config-3.10.0-1160.2.2.el7.x86_64
CONFIG_TMPFS_XATTR=y

upper directory /upperdir is using xattrs

~/ ls -l /dev/mapper/vg0-racine
lrwxrwxrwx 1 root root 7  6 mars   2020 /dev/mapper/vg0-racine -> ../dm-0

~/ cat /proc/fs/ext4/dm-0/options | grep xattr
user_xattr

~/ setfattr -n user.name -v upperdir /upperdir

~/ getfattr -n user.name /upperdir
getfattr: Suppression des « / » en tête des chemins absolus
# file: upperdir
user.name="upperdir"

Are you able to modify the content of a snapshot directory using 
overlayfs on your side?


Frédéric.

On 09/11/2020 at 12:39, Luis Henriques wrote:

Frédéric Nass  writes:


Hello,

I would like to use a cephfs snapshot as a read/write volume without having to
clone it first as the cloning operation is - if I'm not mistaken - still
inefficient as of now. This is for a data restore use case with Moodle
application needing a writable data directory to start.

The idea that came to mind was to use overlayFS with cephfs set up as a
read-only lower layer and a writable local directory set up as an upper
layer. With this set up, any modifications to the read-only .snap/testsnap
directory would normally go to the upper directory, making the snapshot
directory somehow writable to the Moodle application. While this works fine
when a local read-only filesystem is set up as the lower layer, it fails when
cephfs is set up as the lower layer. Any modifications to the .snap/testsnap
tree in the /cephfs-snap directory fail with an "Operation not supported".

$ mkdir /cephfs /upperdir /workdir /cephfs-snap

$ mount -t ceph 100.74.191.129:/volumes/group1/subvolume1/ /cephfs -o
name=admin,secretfile=/etc/ceph/admin.secret

$ mount -t overlay overlay -o
redirect_dir=on,lowerdir=/cephfs/.snap/testsnap,upperdir=/upperdir,workdir=/workdir
/cephfs-snap

$ ls /cephfs-snap
usr

$ touch /cephfs-snap/foo.txt    < writing outside the lowerdir succeeds

$ ls /cephfs-snap
foo.txt  usr

$ ls /usr/etc

$ touch /cephfs-snap/usr/etc/foo    < writing inside the lowerdir fails
touch: cannot touch '/cephfs-snap/usr/etc/foo': Operation not supported

I tried to mount the whole cephfs tree read-only (-o ro) and to disable ACLs
(-o noacl) as seen here [1], but to no avail. Mounting with ceph-fuse didn't
help either. There's been a recent discussion about this here [2] between Greg
and Robert but with no real solution.

I just commented on that bug tracker and, although I'm not really 100%
sure, I suspect that the tmpfs on that system has been compiled without
xattr support.


Did someone manage to do this?

I couldn't reproduce your problem.  Is it possible that your upper dir
doesn't support xattrs either?  Also, kernel client details would help.

Cheers,
--
Luis


Regards,

Frédéric.

[1] https://blog.fai-project.org/posts/overlayfs/
[2] https://tracker.ceph.com/issues/44821
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OverlayFS with Cephfs to mount a snapshot read/write

2020-11-09 Thread Frédéric Nass

Hi Luis,

Thanks for your help. Sorry I forgot about the kernel details. This is 
latest RHEL 7.9.


~/ uname -r
3.10.0-1160.2.2.el7.x86_64

~/ grep CONFIG_TMPFS_XATTR /boot/config-3.10.0-1160.2.2.el7.x86_64
CONFIG_TMPFS_XATTR=y

upper directory /upperdir is using xattrs

~/ ls -l /dev/mapper/vg0-racine
lrwxrwxrwx 1 root root 7 Mar  6  2020 /dev/mapper/vg0-racine -> ../dm-0

~/ cat /proc/fs/ext4/dm-0/options | grep xattr
user_xattr

~/ setfattr -n user.name -v upperdir /upperdir

~/ getfattr -n user.name /upperdir
getfattr: Removing leading '/' from absolute path names
# file: upperdir
user.name="upperdir"
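
For completeness, since overlayfs stores its own metadata in trusted.* xattrs
(the trusted.overlay.* namespace), user_xattr alone may not tell the whole
story. Here is an additional check I can run as root if useful, using a
throwaway attribute name that is removed right after:

~/ setfattr -n trusted.test -v 1 /upperdir
~/ getfattr -n trusted.test --absolute-names /upperdir
~/ setfattr -x trusted.test /upperdir

If that setfattr failed with "Operation not supported", the upper filesystem
couldn't store the trusted.overlay.* attributes that some overlayfs features
rely on.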

Are you able to modify the content of a snapshot directory using 
overlayfs on your side?


Frédéric.

On 09/11/2020 at 12:39, Luis Henriques wrote:

Frédéric Nass  writes:


Hello,

I would like to use a cephfs snapshot as a read/write volume without having to
clone it first as the cloning operation is - if I'm not mistaken - still
inefficient as of now. This is for a data restore use case with Moodle
application needing a writable data directory to start.

The idea that came to mind was to use overlayFS with cephfs set up as a
read-only lower layer and a writable local directory set up as an upper
layer. With this set up, any modifications to the read-only .snap/testsnap
directory would normally go to the upper directory making the snapshot directory
somehow writable to the Moodle application. While this works fine when a local
read-only filesystem is set up as the lower layer, it fails when cephfs is set
up as the lower layer. Any modifications to the .snap/testsnap tree in the
/cephfs-snap directory fail with an "Operation not supported".

$ mkdir /cephfs /upperdir /workdir /cephfs-snap

$ mount -t ceph 100.74.191.129:/volumes/group1/subvolume1/ /cephfs -o
name=admin,secretfile=/etc/ceph/admin.secret

$ mount -t overlay overlay -o
redirect_dir=on,lowerdir=/cephfs/.snap/testsnap,upperdir=/upperdir,workdir=/workdir
/cephfs-snap

$ ls /cephfs-snap
usr

$ touch /cephfs-snap/foo.txt    < writing outside the lowerdir
succeeds

$ ls /cephfs-snap
foo.txt  usr

$ ls /usr/etc

$ touch /cephfs-snap/usr/etc/foo    < writing inside the lowerdir
fails
touch: cannot touch '/cephfs-snap/usr/etc/foo': Operation not supported

I tried to mount the whole cephfs tree read-only (-o ro) and to disable ACLs
(-o noacl) as seen here [1], but to no avail. Mounting with ceph-fuse didn't help
either. There's been a recent discussion about this here [2] between Greg and
Robert but with no real solution.

I just commented on that bug tracker and, although I'm not really 100%
sure, I suspect that the tmpfs on that system has been compiled without
xattr support.
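
(If it helps, that can also be checked at runtime rather than from the build
config, with a throwaway tmpfs mount -- the path below is just an example:

$ mkdir -p /mnt/tmpfs-xattr-test
$ mount -t tmpfs tmpfs /mnt/tmpfs-xattr-test
$ setfattr -n trusted.test -v 1 /mnt/tmpfs-xattr-test && echo "xattrs OK"
$ setfattr -x trusted.test /mnt/tmpfs-xattr-test
$ umount /mnt/tmpfs-xattr-test && rmdir /mnt/tmpfs-xattr-test

If the setfattr returns "Operation not supported", that tmpfs was built
without CONFIG_TMPFS_XATTR.)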


Did someone manage to do this?

I couldn't reproduce your problem.  Is it possible that your upper dir
doesn't support xattrs either?  Also, kernel client details would help.

Cheers,
--
Luis


Regards,

Frédéric.

[1] https://blog.fai-project.org/posts/overlayfs/
[2] https://tracker.ceph.com/issues/44821
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OverlayFS with Cephfs to mount a snapshot read/write

2020-11-09 Thread Frédéric Nass

Hello,

I would like to use a cephfs snapshot as a read/write volume without 
having to clone it first as the cloning operation is - if I'm not 
mistaken - still inefficient as of now. This is for a data restore use 
case with Moodle application needing a writable data directory to start.


The idea that came to mind was to use overlayFS with cephfs set up as a 
read-only lower layer and a writable local directory set up as an upper 
layer. With this set up, any modifications to the read-only 
.snap/testsnap directory would normally go to the upper directory making 
the snapshot directory somehow writable to the Moodle application. While 
this works fine when a local read-only filesystem is set up as the lower 
layer, it fails when cephfs is set up as the lower layer. Any 
modifications to the .snap/testsnap tree in the /cephfs-snap directory 
fails with an "Operation not supported".


$ mkdir /cephfs /upperdir /workdir /cephfs-snap

$ mount -t ceph 100.74.191.129:/volumes/group1/subvolume1/ /cephfs -o 
name=admin,secretfile=/etc/ceph/admin.secret


$ mount -t overlay overlay -o 
redirect_dir=on,lowerdir=/cephfs/.snap/testsnap,upperdir=/upperdir,workdir=/workdir 
/cephfs-snap


$ ls /cephfs-snap
usr

$ touch /cephfs-snap/foo.txt    < writing outside the lowerdir succeeds


$ ls /cephfs-snap
foo.txt  usr

$ ls /usr/etc

$ touch /cephfs-snap/usr/etc/foo    < writing inside the lowerdir fails
touch: cannot touch '/cephfs-snap/usr/etc/foo': Operation not supported


I tried to mount the whole cephfs tree read-only (-o ro) and to disable
ACLs (-o noacl) as seen here [1], but to no avail. Mounting with ceph-fuse
didn't help either. There's been a recent discussion about this here [2]
between Greg and Robert but with no real solution.


Did someone manage to do this?

Regards,

Frédéric.

[1] https://blog.fai-project.org/posts/overlayfs/
[2] https://tracker.ceph.com/issues/44821
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io