Re: [ceph-users] Package availability for Debian / Ubuntu

2019-05-16 Thread Christian Balzer


Hello,

It's now May and nothing has changed here or in the tracker for the
related Bionic issue.

At this point in time it feels like Red Hat/DIY or bust, and neither is a very
enticing prospect.

Definitely not going to deploy a Stretch and Luminous cluster next in July.

Christian

On Thu, 20 Dec 2018 16:19:54 + Matthew Vernon wrote:

> Hi,
> 
> Since the "where are the bionic packages for Luminous?" question remains 
> outstanding, I thought I'd look at the question a little further.
> 
> The TL;DR is:
> 
> Jewel: built for Ubuntu trusty & xenial ; Debian jessie & stretch
> 
> Luminous: built for Ubuntu trusty & xenial ; Debian jessie & stretch
> 
> Mimic: built for Ubuntu xenial & bionic ; no Debian releases
> 
> (in the other cases, a single ceph-deploy package is shipped).
> 
> I don't _think_ this is what you're trying to achieve? In particular, do 
> you really only want to provide bionic packages for Mimic? It feels like 
> your build machinery isn't quite doing what you want here, given you've 
> previously spoken about building bionic packages for Luminous...
> 
> In more detail:
> 
> Packages for Ceph jewel:
> precise has 1 Packages. No ceph package found
> trusty has 47 Packages. Ceph version 10.2.11-1trusty
> xenial has 47 Packages. Ceph version 10.2.11-1xenial
> bionic has 1 Packages. No ceph package found
> wheezy has 1 Packages. No ceph package found
> jessie has 47 Packages. Ceph version 10.2.11-1~bpo80+1
> stretch has 47 Packages. Ceph version 10.2.11-1~bpo90+1
> 
> Packages for Ceph luminous:
> precise has 1 Packages. No ceph package found
> trusty has 63 Packages. Ceph version 12.2.10-1trusty
> xenial has 63 Packages. Ceph version 12.2.10-1xenial
> bionic has 1 Packages. No ceph package found
> wheezy has 1 Packages. No ceph package found
> jessie has 63 Packages. Ceph version 12.2.10-1~bpo80+1
> stretch has 63 Packages. Ceph version 12.2.10-1~bpo90+1
> 
> Packages for Ceph mimic:
> precise has 1 Packages. No ceph package found
> trusty has 1 Packages. No ceph package found
> xenial has 63 Packages. Ceph version 13.2.2-1xenial
> bionic has 63 Packages. Ceph version 13.2.2-1bionic
> wheezy has 1 Packages. No ceph package found
> jessie has 1 Packages. No ceph package found
> stretch has 1 Packages. No ceph package found
> 
> If you want to re-run these tests, the attached hacky shell script does it.
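(The attached script is not included here; below is a rough equivalent, assuming the
standard download.ceph.com/debian-<release> apt repository layout. The URLs and exact
output format are assumptions, not the original script.)

#!/bin/sh
# For each release/distro, fetch the apt Packages index and report the ceph version.
for rel in jewel luminous mimic; do
  echo "Packages for Ceph $rel:"
  for dist in precise trusty xenial bionic wheezy jessie stretch; do
    url="https://download.ceph.com/debian-$rel/dists/$dist/main/binary-amd64/Packages"
    idx=$(curl -sf "$url")
    count=$(printf '%s\n' "$idx" | grep -c '^Package: ')
    ver=$(printf '%s\n' "$idx" | awk '/^Package: ceph$/{f=1} f && /^Version: /{print $2; exit}')
    if [ -n "$ver" ]; then
      echo "$dist has $count Packages. Ceph version $ver"
    else
      echo "$dist has $count Packages. No ceph package found"
    fi
  done
done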
> 
> Regards,
> 
> Matthew
> 
> 
> 
> -- 
>  The Wellcome Sanger Institute is operated by Genome Research 
>  Limited, a charity registered in England with number 1021457 and a 
>  company registered in England with number 2742969, whose registered 
>  office is 215 Euston Road, London, NW1 2BE. 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Crashing 14.2.1

2019-05-16 Thread Adam Tygart
I ended up backing up the journals of the MDS ranks, running recover_dentries for both 
of them, and resetting the journals and the session table. It is back up. The 
recover_dentries stage didn't show any errors, so I'm not even sure why the MDS was 
asserting about duplicate inodes.
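(For the archives: the sequence described above roughly corresponds to the following
commands, sketched with a placeholder filesystem name "cephfs" and ranks 0/1; these steps
are destructive, so check the upstream disaster-recovery documentation before copying them.)

cephfs-journal-tool --rank=cephfs:0 journal export backup.rank0.bin
cephfs-journal-tool --rank=cephfs:1 journal export backup.rank1.bin
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-journal-tool --rank=cephfs:1 journal reset
cephfs-table-tool all reset session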

--
Adam

On Thu, May 16, 2019, 13:52 Adam Tygart <mo...@ksu.edu> 
wrote:
Hello all,

The rank 0 mds is still asserting. Is this duplicate inode situation
one that I should be considering using the cephfs-journal-tool to
export, recover dentries and reset?

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mo...@ksu.edu> wrote:
>
> Hello all,
>
> I've got a 30 node cluster serving up lots of CephFS data.
>
> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight
> one of the metadata daemons crashed with the following several times:
>
> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CIn
> ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
> 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
> ack_allow_loading_invalid_metadata"))
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent, and I thought it might be related to auto
> balancing between the metadata servers.
>
> This caused one MDS to fail, the other crashed, and now rank 0 loads,
> goes active and then crashes with the following:
> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void M
> DCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 
> 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
> then became rank one after the crash and attempted drop to one active
> MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
> and crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Samba vfs_ceph or kernel client

2019-05-16 Thread Maged Mokhtar


Thanks a lot for the clarification.  /Maged


On 16/05/2019 17:23, David Disseldorp wrote:

Hi Maged,

On Fri, 10 May 2019 18:32:15 +0200, Maged Mokhtar wrote:


What is the recommended way for Samba gateway integration: using
vfs_ceph or mounting CephFS via kernel client ? i tested the kernel
solution in a ctdb setup and gave good performance, does it have any
limitations relative to vfs_ceph ?

At this stage kernel-backed and vfs_ceph-backed shares are pretty
similar feature wise. ATM kernel backed shares have the performance
advantage of page-cache + async vfs_default dispatch. vfs_ceph will
likely gain more features in future as cross-protocol share-mode locks
and leases can be supported without the requirement for a kernel
interface.

Cheers, David


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost OSD from PCIe error, recovered, HOW to restore OSD process

2019-05-16 Thread Mark Lehrer
> Steps 3-6 are to get the drive lvm volume back

How much longer will we have to deal with LVM?  If we can migrate non-LVM
drives from earlier versions, how about we give ceph-volume the ability to
create non-LVM OSDs directly?



On Thu, May 16, 2019 at 1:20 PM Tarek Zegar  wrote:

> FYI for anyone interested, below is how to recover from someone removing
> an NVMe drive (the first two steps show how mine were removed and brought
> back)
> Steps 3-6 are to get the drive lvm volume back AND get the OSD daemon
> running for the drive
>
> 1. echo 1 > /sys/block/nvme0n1/device/device/remove
> 2. echo 1 > /sys/bus/pci/rescan
> 3. vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
> ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
> 4. ceph auth add osd.122 osd 'allow *' mon 'allow rwx' -i
> /var/lib/ceph/osd/ceph-122/keyring
> 5. ceph-volume lvm activate --all
> 6. You should see the drive somewhere in the ceph tree, move it to the
> right host
>
> Tarek
>
>
>
>
> From: "Tarek Zegar" 
> To: Alfredo Deza 
> Cc: ceph-users 
> Date: 05/15/2019 10:32 AM
> Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
> to restore OSD process
> Sent by: "ceph-users" 
> --
>
>
>
> TLDR; I activated the drive successfully but the daemon won't start, looks
> like it's complaining about mon config, idk why (there is a valid ceph.conf
> on the host). Thoughts? I feel like it's close. Thank you
>
> I executed the command:
> ceph-volume lvm activate --all
>
>
> It found the drive and activated it:
> --> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a
> 
> --> ceph-volume lvm activate successful for osd ID: 122
>
>
>
> However, systemd would not start the OSD process 122:
> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
> 14:16:13.862 71970700 -1 monclient(hunting): handle_auth_bad_method
> server allowed_methods [2] but i only support [2]
> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
> 14:16:13.862 7116f700 -1 monclient(hunting): handle_auth_bad_method
> server allowed_methods [2] but i only support [2]
> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch
> mon config (--no-mon-config to skip)
> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
> Main process exited, code=exited, status=1/FAILURE
> May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
> Failed with result 'exit-code'.
> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
> Service hold-off time over, scheduling restart.
> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
> Scheduled restart job, restart counter is at 3.
> -- Subject: Automatic restarting of a unit has been scheduled
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> 
> --
> -- Automatic restarting of the unit ceph-osd@122.service has been
> scheduled, as the result for
> -- the configured Restart= setting for the unit.
> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object
> storage daemon osd.122.
> -- Subject: Unit ceph-osd@122.service has finished shutting down
> -- Defined-By: systemd
> -- Support: http://www.ubuntu.com/support
> 
> --
> -- Unit ceph-osd@122.service has finished shutting down.
> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
> Start request repeated too quickly.
> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
> Failed with result 'exit-code'.
> May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph
> object storage daemon osd.122
>
>
>
>
> From: Alfredo Deza 
> To: Bob R 
> Cc: Tarek Zegar , ceph-users  >
> Date: 05/15/2019 08:27 AM
> Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
> to restore OSD process
> --
>
>
>
> On Tue, May 14, 2019 at 7:24 PM Bob R  wrote:
> >
> > Does 'ceph-volume lvm list' show it? If so you can try to activate it
> with 'ceph-volume lvm activate 122 74b01ec2--124d--427d--9812--e437f90261d4'
>
> Good suggestion. If `ceph-volume lvm list` can see it, it can probably
> activate it again. You can activate it with the OSD ID + OSD FSID, or
> do:
>
> ceph-volume lvm activate --all
>
> You didn't say if the OSD wasn't coming up after trying to start it
> (the systemd unit 

Re: [ceph-users] Lost OSD from PCIe error, recovered, HOW to restore OSD process

2019-05-16 Thread Tarek Zegar

FYI for anyone interested, below is how to recover from someone removing
an NVMe drive (the first two steps show how mine were removed and brought
back)
Steps 3-6 are to get the drive lvm volume back AND get the OSD daemon
running for the drive

 1. echo 1 > /sys/block/nvme0n1/device/device/remove
 2. echo 1 > /sys/bus/pci/rescan
 3. vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
 4. ceph auth add osd.122 osd 'allow *' mon 'allow rwx'
-i /var/lib/ceph/osd/ceph-122/keyring
 5. ceph-volume lvm activate --all
 6. You should see the drive somewhere in the ceph tree, move it to the
right host
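(The same steps as a plain command sequence, for copy/paste; the device, VG name, OSD id
and keyring path below are from this particular cluster and need to be substituted:)

echo 1 > /sys/block/nvme0n1/device/device/remove    # how the drive was removed
echo 1 > /sys/bus/pci/rescan                        # bring the PCIe device back
vgcfgrestore ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
vgchange -ay ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841
ceph auth add osd.122 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-122/keyring
ceph-volume lvm activate --all
# then move the OSD back under the correct host in the CRUSH tree (step 6)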

Tarek





From:   "Tarek Zegar" 
To: Alfredo Deza 
Cc: ceph-users 
Date:   05/15/2019 10:32 AM
Subject:[EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error,
recovered, to restore OSD process
Sent by:"ceph-users" 



TLDR; I activated the drive successfully but the daemon won't start, looks
like it's complaining about mon config, idk why (there is a valid ceph.conf
on the host). Thoughts? I feel like it's close. Thank you

I executed the command:
ceph-volume lvm activate --all


It found the drive and activated it:
--> Activating OSD ID 122 FSID a151bea5-d123-45d9-9b08-963a511c042a

--> ceph-volume lvm activate successful for osd ID: 122



However, systemd would not start the OSD process 122:
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
14:16:13.862 71970700 -1 monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: 2019-05-15
14:16:13.862 7116f700 -1 monclient(hunting): handle_auth_bad_method
server allowed_methods [2] but i only support [2]
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 ceph-osd[757237]: failed to fetch
mon config (--no-mon-config to skip)
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Main process exited, code=exited, status=1/FAILURE
May 15 14:16:13 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Service hold-off time over, scheduling restart.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Scheduled restart job, restart counter is at 3.
-- Subject: Automatic restarting of a unit has been scheduled
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Automatic restarting of the unit ceph-osd@122.service has been
scheduled, as the result for
-- the configured Restart= setting for the unit.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Stopped Ceph object
storage daemon osd.122.
-- Subject: Unit ceph-osd@122.service has finished shutting down
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- Unit ceph-osd@122.service has finished shutting down.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Start request repeated too quickly.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: ceph-osd@122.service:
Failed with result 'exit-code'.
May 15 14:16:14 pok1-qz1-sr1-rk001-s20 systemd[1]: Failed to start Ceph
object storage daemon osd.122




From: Alfredo Deza 
To: Bob R 
Cc: Tarek Zegar , ceph-users 
Date: 05/15/2019 08:27 AM
Subject: [EXTERNAL] Re: [ceph-users] Lost OSD from PCIe error, recovered,
to restore OSD process



On Tue, May 14, 2019 at 7:24 PM Bob R  wrote:
>
> Does 'ceph-volume lvm list' show it? If so you can try to activate it
with 'ceph-volume lvm activate 122
74b01ec2--124d--427d--9812--e437f90261d4'

Good suggestion. If `ceph-volume lvm list` can see it, it can probably
activate it again. You can activate it with the OSD ID + OSD FSID, or
do:

ceph-volume lvm activate --all

You didn't say if the OSD wasn't coming up after trying to start it
(the systemd unit should still be there for ID 122), or if you tried
rebooting and that OSD didn't come up.

The systemd unit is tied to both the ID and FSID of the OSD, so it
shouldn't matter if the underlying device changed since ceph-volume
ensures it is the right one every time it activates.
>
> Bob
>
> On Tue, May 14, 2019 at 7:35 AM Tarek Zegar  wrote:
>>
>> Someone nuked and OSD that had 1 replica PGs. They accidentally did echo
1 > /sys/block/nvme0n1/device/device/remove
>> We got it back doing a echo 1 > /sys/bus/pci/rescan
>> However, it reenumerated as a different drive number (guess we didn't
have udev rules)
>> They restored the LVM volume (vgcfgrestore
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841 ; vgchange -ay
ceph-8c81b2a3-6c8e-4cae-a3c0-e2d91f82d841)
>>
>> lsblk
>> nvme0n2 259:9 0 1.8T 0 disk
>>

Re: [ceph-users] MDS Crashing 14.2.1

2019-05-16 Thread Adam Tygart
Hello all,

The rank 0 mds is still asserting. Is this duplicate inode situation
one that I should be considering using the cephfs-journal-tool to
export, recover dentries and reset?

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart  wrote:
>
> Hello all,
>
> I've got a 30 node cluster serving up lots of CephFS data.
>
> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight
> one of the metadata daemons crashed with the following several times:
>
> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CIn
> ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
> 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
> ack_allow_loading_invalid_metadata"))
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent, and I thought it might be related to auto
> balancing between the metadata servers.
>
> This caused one MDS to fail, the other crashed, and now rank 0 loads,
> goes active and then crashes with the following:
> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void M
> DCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 
> 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
> then became rank one after the crash and attempted drop to one active
> MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
> and crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is it possible to hide slow ops resulting from bugs?

2019-05-16 Thread Jean-Philippe Méthot
Hi,

Lately we've had to deal with https://tracker.ceph.com/issues/24531, which constantly 
triggers slow ops warning messages in our ceph health. As per the bug report, these 
appear to be only cosmetic and in no way affect the workings of the cluster. Is there a 
way to deactivate this alert in particular, so that ceph health can report actual 
issues and, as a result, prevent Nagios from throwing alerts every time this bug 
happens? I would deactivate them only until this bug is fixed.

Best regards,

Jean-Philippe Méthot
Openstack system administrator
Administrateur système Openstack
PlanetHoster inc.




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Uwe Sauter
You could also edit your ceph-mon@.service (assuming systemd) to depend on chrony and add a line 
"ExecStartPre=/usr/bin/sleep 30" to stall the startup to give chrony a chance to sync before the Mon is started.




Am 16.05.19 um 17:38 schrieb Stefan Kooman:

Quoting Jan Kasprzak (k...@fi.muni.cz):


OK, many responses (thanks for them!) suggest chrony, so I tried it:
With all three mons running chrony and being in sync with my NTP server
with offsets under 0.0001 second, I rebooted one of the mons:

There still was the HEALTH_WARN clock_skew message as soon as
the rebooted mon starts responding to ping. The cluster returns to
HEALTH_OK about 95 seconds later.

According to "ntpdate -q my.ntp.server", the initial offset
after reboot is about 0.6 s (which is the reason of HEALTH_WARN, I think),
but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds
of HEALTH_WARN is inside Ceph, with mons being already synchronized.

So the result is that chrony indeed synchronizes faster,
but nevertheless I still have about 95 seconds of HEALTH_WARN "clock skew
detected".

I guess the workaround for now is to ignore the warning, and wait
for two minutes before rebooting another mon.


You can tune the "mon_timecheck_skew_interval" which by default is set
to 30 seconds. See [1] and look for "timecheck" to find the different
options.

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Stefan Kooman
Quoting Jan Kasprzak (k...@fi.muni.cz):

>   OK, many responses (thanks for them!) suggest chrony, so I tried it:
> With all three mons running chrony and being in sync with my NTP server
> with offsets under 0.0001 second, I rebooted one of the mons:
> 
>   There still was the HEALTH_WARN clock_skew message as soon as
> the rebooted mon starts responding to ping. The cluster returns to
> HEALTH_OK about 95 seconds later.
> 
>   According to "ntpdate -q my.ntp.server", the initial offset
> after reboot is about 0.6 s (which is the reason of HEALTH_WARN, I think),
> but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds
> of HEALTH_WARN is inside Ceph, with mons being already synchronized.
> 
>   So the result is that chrony indeed synchronizes faster,
> but nevertheless I still have about 95 seconds of HEALTH_WARN "clock skew
> detected".
> 
>   I guess the workaround for now is to ignore the warning, and wait
> for two minutes before rebooting another mon.

You can tune the "mon_timecheck_skew_interval" which by default is set
to 30 seconds. See [1] and look for "timecheck" to find the different
options.
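For example (a sketch; on Mimic or later this can be set centrally, on older releases
put the option in the [mon] section of ceph.conf and restart the mons):

ceph config set mon mon_timecheck_skew_interval 60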

Gr. Stefan

[1]:
http://docs.ceph.com/docs/master/rados/configuration/mon-config-ref/

-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Samba vfs_ceph or kernel client

2019-05-16 Thread David Disseldorp
Hi Maged,

On Fri, 10 May 2019 18:32:15 +0200, Maged Mokhtar wrote:

> What is the recommended way for Samba gateway integration: using 
> vfs_ceph or mounting CephFS via kernel client ? i tested the kernel 
> solution in a ctdb setup and gave good performance, does it have any 
> limitations relative to vfs_ceph ?

At this stage kernel-backed and vfs_ceph-backed shares are pretty
similar feature wise. ATM kernel backed shares have the performance
advantage of page-cache + async vfs_default dispatch. vfs_ceph will
likely gain more features in future as cross-protocol share-mode locks
and leases can be supported without the requirement for a kernel
interface.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Huge rebalance after rebooting OSD host (Mimic)

2019-05-16 Thread kas
huang jun wrote:
: do you have osd's crush location changed after reboot?

I am not sure which reboot you mean, but to sum up what I wrote
in previous messages in this thread, it probably went as follows:

- reboot of the OSD server
- the server goes up with wrong hostname "localhost"
- new CRUSH node host=localhost root=default gets created
- OSDs on this host get moved under this new node, outside the previous
CRUSH hierarchy
- cluster starts rebalancing
- one more reboot of that OSD server
- this time it got up with correct hostname
- OSDs on this host get moved under the correct CRUSH host node
- rebalance continues (don't know why), some PGs get stuck in
"activating" state.
- after a while, I restarted the OSD processes on that host, and PGs
get unstuck
- several hours ago, rebalance finished.

The problem of the incorrect hostname is probably somewhere in my configuration,
outside Ceph. However, I think PGs should not get stuck in the "activating"
state indefinitely.
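(One way to guard against a mis-reported hostname moving OSDs around is to pin the CRUSH
location in the host-local ceph.conf instead of relying on the boot-time hostname; a
sketch, with an example host name, to be set per OSD host:)

[osd]
crush location = host=osd-host-01 root=default
# or, to stop OSDs from ever relocating themselves on start:
# osd crush update on start = false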

-Yenya

: kas  于2019年5月15日周三 下午10:39写道:
: >
: > kas wrote:
: > :   Marc,
: > :
: > : Marc Roos wrote:
: > : : Are you sure your osd's are up and reachable? (run ceph osd tree on
: > : : another node)
: > :
: > :   They are up, because all three mons see them as up.
: > : However, ceph osd tree provided the hint (thanks!): The OSD host went back
: > : with hostname "localhost" instead of the correct one for some reason.
: > : So the OSDs moved themselves to a new HOST=localhost CRUSH node directly
: > : under the CRUSH root. I rebooted the OSD host once again, and it went up
: > : again with the correct hostname, and the "ceph osd tree" output looks sane
: > : now. So I guess we have a reason for such a huge rebalance.
: > :
: > :   However, even though the OSD tree is back in the normal state,
: > : the rebalance is still going on, and there are even inactive PGs,
: > : with some Ceph clients being stuck seemingly forever:
: > :
: > : health: HEALTH_ERR
: > : 1964645/3977451 objects misplaced (49.395%)
: > : Reduced data availability: 11 pgs inactive
: >
: > Wild guessing what to do, I went to the rebooted OSD host and ran
: > systemctl restart ceph-osd.target
: > - restarting all OSD processes. The previously inactive (activating) pgs
: > went to the active state, and Ceph clients got unstuck. Now I see
: > HEALTH_ERR with backfill_toofull only, which I consider a normal state
: > during Ceph Mimic rebalance.
: >
: > It would be interesting to know why some of the PGs went stuck,
: > and why did restart help. FWIW, I have a "ceph pg query" output for
: > one of the 11 inactive PGs.
: >
: > -Yenya
: >
: > ---
: > # ceph pg 23.4f5 query
: > {
: > "state": "activating+remapped",
: > "snap_trimq": "[]",
: > "snap_trimq_len": 0,
: > "epoch": 104015,
: > "up": [
: > 70,
: > 72,
: > 27
: > ],
: > "acting": [
: > 25,
: > 27,
: > 79
: > ],
: > "backfill_targets": [
: > "70",
: > "72"
: > ],
: > "acting_recovery_backfill": [
: > "25",
: > "27",
: > "70",
: > "72",
: > "79"
: > ],
: > "info": {
: > "pgid": "23.4f5",
: > "last_update": "103035'4667973",
: > "last_complete": "103035'4667973",
: > "log_tail": "102489'4664889",
: > "last_user_version": 4667973,
: > "last_backfill": "MAX",
: > "last_backfill_bitwise": 1,
: > "purged_snaps": [],
: > "history": {
: > "epoch_created": 406,
: > "epoch_pool_created": 406,
: > "last_epoch_started": 103086,
: > "last_interval_started": 103085,
: > "last_epoch_clean": 96881,
: > "last_interval_clean": 96880,
: > "last_epoch_split": 0,
: > "last_epoch_marked_full": 0,
: > "same_up_since": 103095,
: > "same_interval_since": 103095,
: > "same_primary_since": 95398,
: > "last_scrub": "102517'4667556",
: > "last_scrub_stamp": "2019-05-15 01:07:28.978979",
: > "last_deep_scrub": "102491'4666011",
: > "last_deep_scrub_stamp": "2019-05-08 07:20:08.253942",
: > "last_clean_scrub_stamp": "2019-05-15 01:07:28.978979"
: > },
: > "stats": {
: > "version": "103035'4667973",
: > "reported_seq": "2116838",
: > "reported_epoch": "104015",
: > "state": "activating+remapped",
: > "last_fresh": "2019-05-15 16:19:44.530005",
: > "last_change": "2019-05-15 14:56:04.248887",
: > "last_active": "2019-05-15 14:56:02.579506",
: > "last_peered": "2019-05-15 14:56:01.401941",
: > "last_clean": "2019-05-15 14:53:39.291350",
: > "last_became_active": "2019-05-15 14:55:54.163102",

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Yan, Zheng
On Thu, May 16, 2019 at 4:10 PM Frank Schilder  wrote:
>
> Dear Yan and Stefan,
>
> thanks for the additional information, it should help reproducing the issue.
>
> The pdsh command executes a bash script that echoes a few values to stdout. 
> Access should be read-only, however, we still have the FS mounted with atime 
> enabled, so there is probably meta data write and synchronisation per access. 
> Files accessed are ssh auth-keys in .ssh and the shell script. The shell 
> script was located in the home-dir of the user and, following your 
> explanations, to reproduce the issue I will create a directory with many 
> entries and execute a test with the many-clients single-file-read load on it.
>

try setting mds_bal_split_rd and mds_bal_split_wr to very large values,
which prevents the MDS from splitting hot dirfrags.
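For example (a sketch for Mimic; the values are arbitrary, the point is only to push the
split thresholds out of reach):

ceph config set mds mds_bal_split_rd 1000000000
ceph config set mds mds_bal_split_wr 1000000000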

Regards
Yan, Zheng

> I hope it doesn't take too long.
>
> Thanks for your input!
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Yan, Zheng 
> Sent: 16 May 2019 09:35
> To: Frank Schilder
> Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
> bug?)
>
> On Thu, May 16, 2019 at 2:52 PM Frank Schilder  wrote:
> >
> > Dear Yan,
> >
> > OK, I will try to trigger the problem again and dump the information 
> > requested. Since it is not easy to get into this situation and I usually 
> > need to resolve it fast (its not a test system), is there anything else 
> > worth capturing?
> >
>
> just
>
> ceph daemon mds.x dump_ops_in_flight
> ceph daemon mds.x dump cache /tmp/cachedump.x
>
> > I will get back as soon as it happened again.
> >
> > In the meantime, I would be grateful if you could shed some light on the 
> > following questions:
> >
> > - Is there a way to cancel an individual operation in the queue? It is a 
> > bit harsh to have to fail an MDS for that.
>
> no
>
> > - What is the fragmentdir operation doing in a single MDS setup? I thought 
> > this was only relevant if multiple MDS daemons are active on a file system.
> >
>
> It splits large directory to smaller parts.
>
>
> > =
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > 
> > From: Yan, Zheng 
> > Sent: 16 May 2019 05:50
> > To: Frank Schilder
> > Cc: Stefan Kooman; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops 
> > (MDS bug?)
> >
> > > [...]
> > > This time I captured the MDS ops list (log output does not really contain 
> > > more info than this list). It contains 12 ops and I will include it here 
> > > in full length (hope this is acceptable):
> > >
> >
> > Your issues were caused by stuck internal op fragmentdir.  Can you
> > dump mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Frank Schilder
Dear Yan,

it is difficult to push the MDS to err in this special way. Is it advisable or 
not to increase the likelihood and frequency of dirfrag operations by tweaking 
some of the parameters mentioned here: 
http://docs.ceph.com/docs/mimic/cephfs/dirfrags/. If so, what would reasonable 
values be, keeping in mind that we are in a pilot production phase already and 
need to maintain integrity of user data?

Is there any counter showing if such operations happened at all?
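(One possible place to look, as an assumption rather than a verified answer: the MDS perf
counters include dirfrag split/merge counts, e.g.

ceph daemon mds.<name> perf dump | grep -E 'dir_split|dir_merge'

and watching those before/after a test run should show whether any splits happened.)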

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 09:35
To: Frank Schilder
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Thu, May 16, 2019 at 2:52 PM Frank Schilder  wrote:
>
> Dear Yan,
>
> OK, I will try to trigger the problem again and dump the information 
> requested. Since it is not easy to get into this situation and I usually need 
> to resolve it fast (its not a test system), is there anything else worth 
> capturing?
>

just

ceph daemon mds.x dump_ops_in_flight
ceph daemon mds.x dump cache /tmp/cachedump.x

> I will get back as soon as it happened again.
>
> In the meantime, I would be grateful if you could shed some light on the 
> following questions:
>
> - Is there a way to cancel an individual operation in the queue? It is a bit 
> harsh to have to fail an MDS for that.

no

> - What is the fragmentdir operation doing in a single MDS setup? I thought 
> this was only relevant if multiple MDS daemons are active on a file system.
>

It splits large directory to smaller parts.


> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Yan, Zheng 
> Sent: 16 May 2019 05:50
> To: Frank Schilder
> Cc: Stefan Kooman; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
> bug?)
>
> > [...]
> > This time I captured the MDS ops list (log output does not really contain 
> > more info than this list). It contains 12 ops and I will include it here in 
> > full length (hope this is acceptable):
> >
>
> Your issues were caused by stuck internal op fragmentdir.  Can you
> dump mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Marc Roos


Hmmm, looks like diskpart is off; it reports the same about a volume for which 
fsutil fsinfo ntfsinfo c: reports 512 (in this case correct, because it 
is on an SSD).
Does anyone know how to use fsutil with a path-mounted disk (without a drive 
letter)?


-Original Message-
From: Marc Roos 
Sent: donderdag 16 mei 2019 13:46
To: aderumier; trent.lloyd
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix


I am not sure if it is possible to run fsutil on a disk without a drive 
letter, but mounted on a path. 
So I used:
diskpart
select volume 3
Filesystems

And gives me this: 
Current File System

  Type : NTFS
  Allocation Unit Size : 4096
  Flags : 

File Systems Supported for Formatting

  Type : NTFS (Default)
  Allocation Unit Sizes: 512, 1024, 2048, 4096 (Default), 8192, 16K, 
32K, 64K

  Type : FAT32
  Allocation Unit Sizes: 4096, 8192 (Default), 16K, 32K, 64K

  Type : REFS
  Allocation Unit Sizes: 4096 (Default), 64K

So it looks like it detects 4k correctly? But I do not have the <blockio 
physical_block_size='4096'/> element in the <disk> section of libvirt and have the WD with 512e:

[@c01 ~]# smartctl -a /dev/sdb | grep 'Sector Size'
Sector Sizes: 512 bytes logical, 4096 bytes physical

CentOS Linux release 7.6.1810 (Core)
ceph version 12.2.12
libvirt-4.5.0



-Original Message-
From: Trent Lloyd [mailto:trent.ll...@canonical.com]
Sent: donderdag 16 mei 2019 9:57
To: Alexandre DERUMIER
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

For libvirt VMs, first you need to add "<blockio physical_block_size='4096'/>" to the relevant 
<disk> sections, and then stop/start the VM to apply the change.

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat 
variety anyway) are from late 2018 or so. There was a bug fixed around 
July 2018, before that date, the physical_block_size=4096 parameter is 
not used by the Windows VirtIO driver (it was supposed to be, but did 
not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312 

After that, you can check if Windows is correctly recognizing the 
physical block size,

Start cmd.exe with "Run as administrator", then run fsutil fsinfo 
ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"



Lastly at least for Windows itself this makes it do 4096-byte writes 
"most of the time", however some applications including Exchange have 
special handling of the sector size. I'm not really sure how MSSQL 
handles it, for example, it may or may not work correctly if you switch 
to 4096 bytes after installation - you may have to create new data files 
or something for it to do 4k segments - or not. Hopefully the MSSQL 
documentation has some information about that.

It is also possible to set logical_block_size=4096 as well as
physical_block_size=4096 ("4k native") however this absolutely causes 
problems with some software (e.g. exchange) if you convert an existing 
installation between the two. If you try to use 4k native mode, ideally 
you would want to do a fresh install, to avoid any such issues. Or 
again, refer to the docs and test it. Just beware it may cause issues if 
you try to switch to 4k native.

As a final note you can use this tool to process an OSD log with "debug 
filestore = 10" enabled, it will print out how many of the operations 
were unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb


You can just enable debug filestore = 10 dynamically on 1 OSD for about
5 minutes, turn it off, and process the log. And you could compare 
before/after. I haven't written an equivalent tool for BlueStore 
unfortunately if you are already in the modern world :) I also didn't 
check maybe debug osd or something also has the writes and offsets, so I 
could write a generic tool to cover both cases, but also I have not done 
that.



Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER 
wrote:


Many thanks for the analysis !


I'm going to test with 4K on heavy mssql database to see if I'm 
seeing improvement on ios/latency.
I'll report results in this thread.


- Mail original -
De: "Trent Lloyd" 
À: "ceph-users" 
Envoyé: Vendredi 10 Mai 2019 09:59:39
Objet: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably 

sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS 
HDD) with NVMe Journals. The primary workload is Windows guests backed 
by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + 
SimpleMessenger) which while it is EOL, the issue is reproducible on 

Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Marc Roos


I am not sure if it is possible to run fsutil on a disk without a drive 
letter, but mounted on a path. 
So I used:
diskpart
select volume 3
Filesystems

And gives me this: 
Current File System

  Type : NTFS
  Allocation Unit Size : 4096
  Flags : 

File Systems Supported for Formatting

  Type : NTFS (Default)
  Allocation Unit Sizes: 512, 1024, 2048, 4096 (Default), 8192, 16K, 
32K, 64K

  Type : FAT32
  Allocation Unit Sizes: 4096, 8192 (Default), 16K, 32K, 64K

  Type : REFS
  Allocation Unit Sizes: 4096 (Default), 64K

So it looks like it detects 4k correctly? But I do not have the <blockio 
physical_block_size='4096'/> element in the <disk> section of libvirt and have the WD with 512e:

[@c01 ~]# smartctl -a /dev/sdb | grep 'Sector Size'
Sector Sizes: 512 bytes logical, 4096 bytes physical

CentOS Linux release 7.6.1810 (Core)
ceph version 12.2.12
libvirt-4.5.0



-Original Message-
From: Trent Lloyd [mailto:trent.ll...@canonical.com] 
Sent: donderdag 16 mei 2019 9:57
To: Alexandre DERUMIER
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

For libvirt VMs, first you need to add "<blockio physical_block_size='4096'/>" to the relevant 
<disk> sections, and then stop/start the VM to apply the change.

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat 
variety anyway) are from late 2018 or so. There was a bug fixed around 
July 2018, before that date, the physical_block_size=4096 parameter is 
not used by the Windows VirtIO driver (it was supposed to be, but did 
not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312 

After that, you can check if Windows is correctly recognizing the 
physical block size,

Start cmd.exe with "Run as administrator", then run fsutil fsinfo 
ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"



Lastly at least for Windows itself this makes it do 4096-byte writes 
"most of the time", however some applications including Exchange have 
special handling of the sector size. I'm not really sure how MSSQL 
handles it, for example, it may or may not work correctly if you switch 
to 4096 bytes after installation - you may have to create new data files 
or something for it to do 4k segments - or not. Hopefully the MSSQL 
documentation has some information about that.

It is also possible to set logical_block_size=4096 as well as 
physical_block_size=4096 ("4k native") however this absolutely causes 
problems with some software (e.g. exchange) if you convert an existing 
installation between the two. If you try to use 4k native mode, ideally 
you would want to do a fresh install, to avoid any such issues. Or 
again, refer to the docs and test it. Just beware it may cause issues if 
you try to switch to 4k native.

As a final note you can use this tool to process an OSD log with "debug 
filestore = 10" enabled, it will print out how many of the operations 
were unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb


You can just enable debug filestore = 10 dynamically on 1 OSD for about 
5 minutes, turn it off, and process the log. And you could compare 
before/after. I haven't written an equivalent tool for BlueStore 
unfortunately if you are already in the modern world :) I also didn't 
check maybe debug osd or something also has the writes and offsets, so I 
could write a generic tool to cover both cases, but also I have not done 
that.



Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER  
wrote:


Many thanks for the analysis !


I'm going to test with 4K on heavy mssql database to see if I'm 
seeing improvement on ios/latency.
I'll report results in this thread.


- Mail original -
De: "Trent Lloyd" 
À: "ceph-users" 
Envoyé: Vendredi 10 Mai 2019 09:59:39
Objet: [ceph-users] Poor performance for 512b aligned "partial" 
writes from Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably 
sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS 
HDD) with NVMe Journals. The primary workload is Windows guests backed 
by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + 
SimpleMessenger) which while it is EOL, the issue is reproducible on 
current versions and also on BlueStore however for different reasons 
than FileStore. 

Generally the Ceph cluster was suffering from very poor outlier 
performance, the numbers change a little bit depending on the exact 
situation but roughly 80% of I/O was happening in a "reasonable" time of 
0-200ms but 5-20% of I/O operations were taking excessively long 
anywhere from 500ms through to 10-20+ seconds. However the normal 
metrics for commit and apply latency were 

[ceph-users] Repairing PG inconsistencies — Ceph Documentation - where's the text?

2019-05-16 Thread Stuart Longland
Hi all,

I've got a placement group on a cluster that just refuses to clear
itself up.  Long story short, one of my storage nodes (combined OSD+MON
with a single OSD disk) in my 3-node storage cluster keeled over, and in
the short term, I'm running its OSD in a USB HDD dock on one of the
remaining nodes.  (I have replacement hardware coming.)

This evening, it seems the OSD daemon looking after that disk hiccupped,
and went into zombie mode, the only way I could get that OSD working
again was to reboot the host.  After it came back up, I had 4 placement
groups "damaged", Ceph has managed to clean up 3 of them, but one
remains stubbornly stuck (on a disk *not* connected via USB):

> 2019-05-16 20:44:16.608770 7f4326ff0700 -1 
> bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000 checksum 
> at blob offset 0x0, got 0x6706be76, expected 0x6ee89d7d, device location 
> [0x1ac156~1000], logical extent 0x0~1000, object 
> #7:0c2fe490:::rbd_data.b48c12ae8944a.0faa:head#

As this is Bluestore, it's not clear what I should do to resolve that,
so I thought I'd "RTFM" before asking here:
http://docs.ceph.com/docs/luminous/rados/operations/pg-repair/

Maybe there's a secret hand-shake my web browser doesn't know about or
maybe the page is written in invisible ink, but that page appears blank
to me.
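(For what it's worth, the usual sequence for a scrub inconsistency is roughly the
following; a sketch only, and it is worth checking which copy would be treated as
authoritative before repairing:

ceph health detail                                    # find the inconsistent PG id
rados list-inconsistent-obj <pgid> --format=json-pretty
ceph pg repair <pgid>
)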
-- 
Stuart Longland (aka Redhatter, VK4MSL)

I haven't lost my mind...
  ...it's backed up on a tape somewhere.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-16 Thread Jan Kasprzak
Konstantin Shalygin wrote:
: >how do you deal with the "clock skew detected" HEALTH_WARN message?
: >
: >I think the internal RTC in most x86 servers does have 1 second resolution
: >only, but Ceph skew limit is much smaller than that. So every time I reboot
: >one of my mons (for kernel upgrade or something), I have to wait for several
: >minutes for the system clock to synchronize over NTP, even though ntpd
: >has been running before reboot and was started during the system boot again.
: 
: Definitely you should use chrony with iburst.

OK, many responses (thanks for them!) suggest chrony, so I tried it:
With all three mons running chrony and being in sync with my NTP server
with offsets under 0.0001 second, I rebooted one of the mons:

There still was the HEALTH_WARN clock_skew message as soon as
the rebooted mon starts responding to ping. The cluster returns to
HEALTH_OK about 95 seconds later.

According to "ntpdate -q my.ntp.server", the initial offset
after reboot is about 0.6 s (which is the reason of HEALTH_WARN, I think),
but it gets under 0.0001 s in about 25 seconds. The remaining ~50 seconds
of HEALTH_WARN is inside Ceph, with mons being already synchronized.

So the result is that chrony indeed synchronizes faster,
but nevertheless I still have about 95 seconds of HEALTH_WARN "clock skew
detected".

I guess the workaround for now is to ignore the warning, and wait
for two minutes before rebooting another mon.
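(Another knob, if the warning itself is the only issue: the skew threshold is
mon_clock_drift_allowed, 0.05 s by default, which could be raised temporarily around mon
reboots; a sketch, not a recommendation to leave it that way:

ceph tell mon.<id> injectargs '--mon-clock-drift-allowed 1'   # repeat for each mon
)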

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
sir_clive> I hope you don't mind if I steal some of your ideas?
 laryross> As far as stealing... we call it sharing here.   --from rcgroups
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph -s finds 4 pools but ceph osd lspools says no pool which is the expected answer

2019-05-16 Thread Rainer Krienke
Hello Greg,

thank you very much for your hint.
If I should see this problem again I will try to restart the ceph-mgr
daemon and see if this helps.

Rainer

> 
> I don't really see how this particular error can happen and be
> long-lived, but if you restart the ceph-mgr it will probably resolve
> itself.
> ("ceph osd lspools" looks directly at the OSDMap in the monitor,
> whereas the "ceph -s" data output is generated from the manager's
> pgmap, but there's a tight link where the pgmap gets updated and
> removes dead pools on every new OSDMap the manager sees and I can't
> see how that would go wrong.)
> -Greg
> 
> 
>> Thanks
>> Rainer
>> --
>> Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse  1
>> 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312
>> PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287
>> 1001312
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1
56070 Koblenz, Tel: +49261287 1312 Fax +49261287 100 1312
Web: http://userpages.uni-koblenz.de/~krienke
PGP: http://userpages.uni-koblenz.de/~krienke/mypgp.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Frank Schilder
Dear Yan and Stefan,

thanks for the additional information, it should help reproducing the issue.

The pdsh command executes a bash script that echoes a few values to stdout. 
Access should be read-only, however, we still have the FS mounted with atime 
enabled, so there is probably meta data write and synchronisation per access. 
Files accessed are ssh auth-keys in .ssh and the shell script. The shell script 
was located in the home-dir of the user and, following your explanations, to 
reproduce the issue I will create a directory with many entries and execute a 
test with the many-clients single-file-read load on it.

I hope it doesn't take too long.

Thanks for your input!

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 09:35
To: Frank Schilder
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

On Thu, May 16, 2019 at 2:52 PM Frank Schilder  wrote:
>
> Dear Yan,
>
> OK, I will try to trigger the problem again and dump the information 
> requested. Since it is not easy to get into this situation and I usually need 
> to resolve it fast (its not a test system), is there anything else worth 
> capturing?
>

just

ceph daemon mds.x dump_ops_in_flight
ceph daemon mds.x dump cache /tmp/cachedump.x

> I will get back as soon as it happened again.
>
> In the meantime, I would be grateful if you could shed some light on the 
> following questions:
>
> - Is there a way to cancel an individual operation in the queue? It is a bit 
> harsh to have to fail an MDS for that.

no

> - What is the fragmentdir operation doing in a single MDS setup? I thought 
> this was only relevant if multiple MDS daemons are active on a file system.
>

It splits large directory to smaller parts.


> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Yan, Zheng 
> Sent: 16 May 2019 05:50
> To: Frank Schilder
> Cc: Stefan Kooman; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
> bug?)
>
> > [...]
> > This time I captured the MDS ops list (log output does not really contain 
> > more info than this list). It contains 12 ops and I will include it here in 
> > full length (hope this is acceptable):
> >
>
> Your issues were caused by stuck internal op fragmentdir.  Can you
> dump mds cache and send the output to us?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Trent Lloyd
For libvirt VMs, first you need to add "<blockio physical_block_size='4096'/>" to the
relevant <disk> sections, and then stop/start the VM to apply the change.
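A sketch of how that looks in the libvirt domain XML (values shown for a 512e drive with
a 4K physical sector; the rest of the <disk> element is elided):

<disk type='network' device='disk'>
  ...
  <blockio logical_block_size='512' physical_block_size='4096'/>
  ...
</disk>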

Then you need to make sure your VirtIO drivers (the Fedora/Red Hat variety
anyway) are from late 2018 or so. There was a bug fixed around July 2018,
before that date, the physical_block_size=4096 parameter is not used by the
Windows VirtIO driver (it was supposed to be, but did not work).

Relevant links:
https://bugzilla.redhat.com/show_bug.cgi?id=1428641
https://github.com/virtio-win/kvm-guest-drivers-windows/pull/312

After that, you can check if Windows is correctly recognizing the physical
block size,

Start cmd.exe with "Run as administrator", then run
fsutil fsinfo ntfsinfo c:

It should show "Bytes Per Physical Sector : 4096"


Lastly at least for Windows itself this makes it do 4096-byte writes "most
of the time", however some applications including Exchange have special
handling of the sector size. I'm not really sure how MSSQL handles it, for
example, it may or may not work correctly if you switch to 4096 bytes after
installation - you may have to create new data files or something for it to
do 4k segments - or not. Hopefully the MSSQL documentation has some
information about that.

It is also possible to set logical_block_size=4096 as well as
physical_block_size=4096 ("4k native") however this absolutely causes
problems with some software (e.g. exchange) if you convert an existing
installation between the two. If you try to use 4k native mode, ideally you
would want to do a fresh install, to avoid any such issues. Or again, refer
to the docs and test it. Just beware it may cause issues if you try to
switch to 4k native.

As a final note you can use this tool to process an OSD log with "debug
filestore = 10" enabled, it will print out how many of the operations were
unaligned:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb

You can just enable debug filestore = 10 dynamically on 1 OSD for about 5
minutes, turn it off, and process the log. And you could compare
before/after. I haven't written an equivalent tool for BlueStore
unfortunately if you are already in the modern world :) I also didn't check
maybe debug osd or something also has the writes and offsets, so I could
write a generic tool to cover both cases, but also I have not done that.
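For example, toggling it on a single OSD might look like this (a sketch; put
debug_filestore back to your normal level afterwards, typically 1/3):

ceph tell osd.0 injectargs '--debug-filestore 10'
# ... let it run for ~5 minutes, then:
ceph tell osd.0 injectargs '--debug-filestore 1/3'
# and feed the resulting OSD log to the fstore_op_latency.rb script linked above.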


Hope that helps.

Regards,
Trent

On Thu, 16 May 2019 at 14:52, Alexandre DERUMIER 
wrote:

> Many thanks for the analysis !
>
>
> I'm going to test with 4K on heavy mssql database to see if I'm seeing
> improvement on ios/latency.
> I'll report results in this thread.
>
>
> - Mail original -
> De: "Trent Lloyd" 
> À: "ceph-users" 
> Envoyé: Vendredi 10 Mai 2019 09:59:39
> Objet: [ceph-users] Poor performance for 512b aligned "partial" writes
> from Windows guests in OpenStack + potential fix
>
> I recently was investigating a performance problem for a reasonably sized
> OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with
> NVMe Journals. The primary workload is Windows guests backed by Cinder RBD
> volumes.
> This specific deployment is Ceph Jewel (FileStore + SimpleMessenger) which
> while it is EOL, the issue is reproducible on current versions and also on
> BlueStore however for different reasons than FileStore.
>
> Generally the Ceph cluster was suffering from very poor outlier
> performance, the numbers change a little bit depending on the exact
> situation but roughly 80% of I/O was happening in a "reasonable" time of
> 0-200ms but 5-20% of I/O operations were taking excessively long anywhere
> from 500ms through to 10-20+ seconds. However the normal metrics for commit
> and apply latency were normal, and in fact, this latency was hard to spot
> in the performance metrics available in jewel.
>
> Previously I more simply considered FileStore to have the "commit" (to
> journal) stage where it was written to the journal and it is OK to return
> to the client and then the "apply" (to disk) stage where it was flushed to
> disk and confirmed so that the data could be purged from the journal.
> However there is really a third stage in the middle where FileStore submits
> the I/O to the operating system and this is done before the lock on the
> object is released. Until that succeeds another operation cannot write to
> the same object (generally being a 4MB area of the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of MS or
> more inside of pwritev() which was blocking inside of the kernel. Normally
> we expect pwritev() to be buffered I/O into the page cache and return quite
> fast however in this case the kernel was in a few percent of cases blocking
> with the stack trace included at the end of the e-mail [1]. My finding from
> that stack is that inside __block_write_begin_int we see a call to
> out_of_line_wait_on_bit call which is really an inlined call for
> wait_on_buffer which occurs in linux/fs/buffer.c in the section around line
> 2000-2024 with the 

Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Stefan Kooman
Quoting Frank Schilder (fr...@dtu.dk):
> Dear Stefan,
> 
> thanks for the fast reply. We encountered the problem again, this time in a 
> much simpler situation; please see below. However, let me start with your 
> questions first:
> 
> What bug? -- In a single-active MDS set-up, should there ever occur an 
> operation with "op_name": "fragmentdir"?

Yes, see http://docs.ceph.com/docs/mimic/cephfs/dirfrags/. If you had
multiple active MDS daemons, the load could be shared among them.

There are some parameters that might need to be tuned in your environment.
But Zheng Yan is an expert in this matter, so an analysis of the mds cache
dump might reveal the culprit.
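
For example, the split/merge thresholds from the dirfrags page can be
inspected and adjusted at runtime (a sketch only; "mds1" is a placeholder for
your MDS name, and sensible values depend on what the cache dump shows):

ceph daemon mds.mds1 config show | grep mds_bal    # current split/merge thresholds
ceph tell mds.mds1 injectargs '--mds_bal_fragment_size_max 200000'   # example value only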

> Upgrading: The problem described here is the only issue we observe.
> Unless the problem is fixed upstream, upgrading won't help us and
> would be a bit of a waste of time. If someone can confirm that this
> problem is fixed in a newer version, we will do it. Otherwise, we
> might prefer to wait until it is.

Keeping your systems up to date generally improves stability, and it might
prevent you from hitting issues when your workload changes in the future.
Testing new releases on a test system first is recommended, though.

> 
> News on the problem. We encountered it again when one of our users executed a 
> command in parallel with pdsh on all our ~500 client nodes. This command 
> accesses the same file from all these nodes pretty much simultaneously. We 
> did this quite often in the past, but this time, the command got stuck and we 
> started observing the MDS health problem again. Symptoms:

Does this command incur writes, reads or a combination of both on files in
this directory? I wonder if you might prevent this from happening by tuning
the "Activity thresholds", especially since you say it is load (# of clients)
dependent.
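
For reference, the activity thresholds that trigger read/write-driven splits
could be checked like this ("mds1" is again a placeholder for the MDS name):

ceph daemon mds.mds1 config get mds_bal_split_rd
ceph daemon mds.mds1 config get mds_bal_split_wr
ceph daemon mds.mds1 config get mds_bal_split_size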

Gr. Stefan

-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS bug?)

2019-05-16 Thread Frank Schilder
Dear Yan,

OK, I will try to trigger the problem again and dump the information requested. 
Since it is not easy to get into this situation and I usually need to resolve 
it fast (it's not a test system), is there anything else worth capturing?

I will get back as soon as it happened again.

In the meantime, I would be grateful if you could shed some light on the 
following questions:

- Is there a way to cancel an individual operation in the queue? It is a bit 
harsh to have to fail an MDS for that.
- What is the fragmentdir operation doing in a single MDS setup? I thought this 
was only relevant if multiple MDS daemons are active on a file system.

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Yan, Zheng 
Sent: 16 May 2019 05:50
To: Frank Schilder
Cc: Stefan Kooman; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
bug?)

> [...]
> This time I captured the MDS ops list (log output does not really contain 
> more info than this list). It contains 12 ops and I will include it here in 
> full length (hope this is acceptable):
>

Your issues were caused by stuck internal op fragmentdir.  Can you
dump mds cache and send the output to us?
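
A minimal sketch of how that could be captured on the host running the active
MDS (assuming the admin socket is available and the daemon name matches the
short hostname):

ceph daemon mds.$(hostname -s) dump cache /tmp/mds_cache.txt
ceph daemon mds.$(hostname -s) dump_ops_in_flight > /tmp/mds_ops.json

Note the cache dump is written to the given path on the MDS host itself.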
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

2019-05-16 Thread Alexandre DERUMIER
Many thanks for the analysis !


I'm going to test with 4K on a heavy MSSQL database to see if I see an
improvement in IOs/latency.
I'll report results in this thread.


- Mail original -
De: "Trent Lloyd" 
À: "ceph-users" 
Envoyé: Vendredi 10 Mai 2019 09:59:39
Objet: [ceph-users] Poor performance for 512b aligned "partial" writes from 
Windows guests in OpenStack + potential fix

I recently was investigating a performance problem for a reasonably sized 
OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with NVMe 
Journals. The primary workload is Windows guests backed by Cinder RBD volumes. 
This specific deployment is Ceph Jewel (FileStore + SimpleMessenger); while
it is EOL, the issue is reproducible on current versions and also on
BlueStore, although for different reasons than FileStore.

Generally the Ceph cluster was suffering from very poor outlier performance, 
the numbers change a little bit depending on the exact situation but roughly 
80% of I/O was happening in a "reasonable" time of 0-200ms but 5-20% of I/O 
operations were taking excessively long anywhere from 500ms through to 10-20+ 
seconds. However the normal metrics for commit and apply latency were normal, 
and in fact, this latency was hard to spot in the performance metrics available 
in jewel. 

Previously I more simply considered FileStore to have the "commit" (to journal) 
stage where it was written to the journal and it is OK to return to the client 
and then the "apply" (to disk) stage where it was flushed to disk and confirmed 
so that the data could be purged from the journal. However there is really a 
third stage in the middle where FileStore submits the I/O to the operating 
system and this is done before the lock on the object is released. Until that 
succeeds another operation cannot write to the same object (generally being a 
4MB area of the disk). 

I found that the fstore_op threads would get stuck for hundreds of ms or more
inside of pwritev() which was blocking inside of the kernel. Normally we expect 
pwritev() to be buffered I/O into the page cache and return quite fast however 
in this case the kernel was in a few percent of cases blocking with the stack 
trace included at the end of the e-mail [1]. My finding from that stack is that 
inside __block_write_begin_int we see a call to out_of_line_wait_on_bit call 
which is really an inlined call for wait_on_buffer which occurs in 
linux/fs/buffer.c in the section around line 2000-2024 with the comment "If we 
issued read requests - let them complete." ( [ 
https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002
 | 
https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002
 ] ) 

My interpretation of that code is that for Linux to store a write in the page 
cache, it has to have the entire 4K page as that is the granularity of which it 
tracks the dirty state and it needs the entire 4K page to later submit back to 
the disk. Since we wrote a part of the page, and the page wasn't already in the 
cache, it has to fetch the remainder of the page from the disk. When this 
happens, it blocks waiting for this read to complete before returning from the 
pwritev() call - hence our normally buffered write blocks. This holds up the 
tp_fstore_op thread, of which there are (by default) only 2-4 trying to
process several hundred operations per second. Additionally the size of the
osd_op_queue is bounded, and operations do not clear out of this queue until
the tp_fstore_op thread is done. Ultimately this means that not only are
these partial writes delayed, but they knock on to delay other writes behind
them because of the constrained thread pools.
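
As an aside, the relevant FileStore settings can be checked on a running OSD
(osd.0 here is just a placeholder):

ceph daemon osd.0 config get filestore_op_threads     # size of the tp_fstore_op pool
ceph daemon osd.0 config get filestore_queue_max_ops  # bound on queued FileStore ops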

What was further confusing is that I could easily reproduce this in a test
deployment using an rbd benchmark that was only writing to a total disk size
of 256MB, which I would easily have expected to fit in the page cache:
rbd create -p rbd --size=256M bench2 
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M 
--io-pattern rand 
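
For comparison, the same benchmark can be re-run with 4 KiB writes (i.e.
aligned to the page size) against the same test image to contrast the
behaviour:

rbd bench-write -p rbd bench2 --io-size 4096 --io-threads 256 --io-total 256M --io-pattern rand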

The ability to reproduce this with such a small working set is explained by
the fact that on secondary OSDs (at least, there was some refactoring of
fadvise which I have not fully understood yet), FileStore
is using fadvise FADVISE_DONTNEED on the objects after write which causes the 
kernel to immediately discard them from the page cache without any regard to 
their statistics of being recently/frequently used. The motivation for this 
addition appears to be that on a secondary OSD we don't service reads (only 
writes) and so therefore we can optimize memory usage by throwing away this
object and in theory leaving more room in the page cache for objects which we 
are primary for and expect to actually service reads from a client for. 
Unfortunately this behavior does not take into account partial writes, where we 
now pathologically throw away the cached copy instantly such that a write 

Re: [ceph-users] RBD Pool size doubled after upgrade to Nautilus and PG Merge

2019-05-16 Thread Wido den Hollander


On 5/12/19 4:21 PM, Thore Krüss wrote:
> Good evening,
> after upgrading our cluster yesterday to Nautilus (14.2.1) and pg-merging an
> imbalanced pool we noticed that the number of objects in the pool has doubled
> (rising synchronously with the merge progress).
> 
> What happened there? Was this to be expected? Is it a bug? Will ceph
> housekeeping take care of it eventually?
> 

Has the PG merge already finished or is it still running?

Is it only the number of objects, or also the size in kB/MB/TB?
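
For reference, a few ways to check that and to see whether the merge is still
in progress (the pool name below is a placeholder):

POOL=yourpool                          # replace with the affected pool
ceph df detail                         # per-pool object counts and stored size
rados df | grep "$POOL"                # objects, copies and space used
ceph osd pool ls detail | grep "$POOL" # pg_num vs pg_num_target while merging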

Wido

> Best regards
> Thore
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Grow bluestore PV/LV

2019-05-16 Thread Michael Andersen
Thanks! I'm on mimic for now, but I'll give it a shot on a test nautilus
cluster.
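
For reference, a rough sketch of the expand procedure on Nautilus, assuming an
LVM-backed OSD (the OSD id, VG and LV names below are placeholders):

systemctl stop ceph-osd@2
lvextend -L +100G /dev/ceph-vg/osd-block-2        # grow the underlying LV first
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-2
systemctl start ceph-osd@2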

On Wed, May 15, 2019 at 10:58 PM Yury Shevchuk  wrote:

> Hello Michael,
>
> growing (expanding) bluestore OSD is possible since Nautilus (14.2.0)
> using bluefs-bdev-expand tool as discussed in this thread:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-April/034116.html
>
> -- Yury
>
> On Wed, May 15, 2019 at 10:03:29PM -0700, Michael Andersen wrote:
> > Hi
> >
> > After growing the size of an OSD's PV/LV, how can I get bluestore to see
> > the new space as available? It does notice the LV has changed size, but
> it
> > sees the new space as occupied.
> >
> > This is the same question as:
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023893.html
> > and
> > that original poster spent a lot of effort in explaining exactly what he
> > meant, but I could not find a reply to his email.
> >
> > Thanks
> > Michael
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com