Re: [ceph-users] MDS is Readonly
try running "rados -p touch 1002fc5d22d." before mds restart

On Thu, May 3, 2018 at 2:31 AM, Pavan, Krish wrote:
> We have a ceph 12.2.4 cephfs with two active MDS servers, and directories
> are pinned to MDS servers. Yesterday an MDS server crashed. Once all fuse
> clients had unmounted, we brought the MDS back online. Both MDS are active
> now.
>
> Once it came back, we started to see that one MDS is read-only.
>
> ...
> 2018-05-01 23:41:22.765920 7f71481b8700  1 mds.0.cache.dir(0x1002fc5d22d) commit error -2 v 3
> 2018-05-01 23:41:22.765964 7f71481b8700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1002fc5d22d object, errno -2
> 2018-05-01 23:41:22.765974 7f71481b8700 -1 mds.0.222755 unhandled write error (2) No such file or directory, force readonly...
> 2018-05-01 23:41:22.766013 7f71481b8700  1 mds.0.cache force file system read-only
> 2018-05-01 23:41:22.766019 7f71481b8700  0 log_channel(cluster) log [WRN] : force file system read-only
> ...
>
> In the health warnings I see:
>
> health: HEALTH_WARN
> 1 MDSs are read only
> 1 MDSs behind on trimming
>
> There are no errors on the OSDs of the metadata pool.
>
> Will "ceph daemon mds.x scrub_path / force recursive repair" fix this, or
> does an offline data scan need to be done?
>
> Regards
> Krish

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
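A sketch of the suggested workaround. The pool name and the dirfrag object suffix below are assumptions (the original suggestion truncates both); the log above only gives the inode 0x1002fc5d22d, and the convention for the first directory fragment object is assumed here, so verify the exact object name from your own MDS log before touching anything:

```shell
# Find the metadata pool name of the filesystem (name below is an assumption)
ceph fs ls

# Recreate the missing (empty) dirfrag object so the MDS commit can succeed.
# "1002fc5d22d.00000000" assumes the first fragment of inode 0x1002fc5d22d.
rados -p cephfs_metadata touch 1002fc5d22d.00000000

# Then restart the read-only MDS so it retries the commit
systemctl restart ceph-mds@<hostname>
```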
[ceph-users] CentOS release 7.4.1708 and selinux-policy-base >= 3.13.1-166.el7_4.9
Hi all. We are trying to set up our first CentOS 7.4.1708 Ceph cluster, based on Luminous 12.2.5. What we get is:

Error: Package: 2:ceph-selinux-12.2.5-0.el7.x86_64 (Ceph-Luminous)
       Requires: selinux-policy-base >= 3.13.1-166.el7_4.9

__Host infos__:

root> lsb_release -d
Description: CentOS Linux release 7.4.1708 (Core)

root> uname -a
Linux 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

__Question__: Where can I find the selinux-policy-base-3.13.1-166.el7_4.9 package?

Regards
Anton
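A sketch of how to track the package down, assuming a stock CentOS 7 yum setup. `selinux-policy-base` is a virtual provide supplied by the concrete policy packages (`selinux-policy-targeted`, `-minimal`, `-mls`), and builds like `3.13.1-166.el7_4.9` typically land in the distro's updates repository rather than the Ceph repo:

```shell
# Which packages provide the virtual "selinux-policy-base"?
yum provides 'selinux-policy-base'

# List every available build, to see whether el7_4.9 is visible in any repo
yum --showduplicates list selinux-policy-targeted

# Pull in the newer policy from the updates repo if it is available
yum --enablerepo=updates update selinux-policy selinux-policy-targeted
```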
Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences
Hi Nick,

On Tue, May 1, 2018 at 4:50 PM, Nick Fisk wrote:
> Hi all,
>
> Slowly getting round to migrating clusters to Bluestore, but I am
> interested in how people are handling the potential change in write
> latency coming from Filestore. Or maybe nobody is really seeing much
> difference?
>
> As we all know, in Bluestore, writes are not double-written and in most
> cases go straight to disk. Whilst this is awesome for people with pure-SSD
> or pure-HDD clusters, as the amount of overhead is drastically reduced,
> for people with HDD+SSD journals in Filestore land, the double write had
> the side effect of acting like a battery-backed cache, accelerating writes
> when not under saturation.
>
> In some brief testing I am seeing Filestore OSDs with an NVMe journal show
> an average apply latency of around 1-2ms, whereas some new Bluestore OSDs
> in the same cluster are showing 20-40ms. I am fairly certain this is due
> to writes exhibiting the latency of the underlying 7.2k disk. Note, the
> cluster is very lightly loaded; this is not anything being driven into
> saturation.
>
> I know there is a deferred-write tuning knob which adjusts the cutover for
> when an object is double-written, but at the default of 32kb, I suspect a
> lot of IOs even in the 1MB area are still drastically slower going
> straight to disk than if double-written to NVMe first. Has anybody else
> done any investigation in this area? Is there any long-term harm in
> running a cluster deferring writes up to 1MB+ in size to mimic the
> Filestore double-write approach?
>
> I also suspect after looking through github that deferred writes only
> happen when overwriting an existing object or blob (not sure which case
> applies), so new allocations are still written straight to disk. Can
> anyone confirm?
>
> PS. If your spinning disks are connected via a RAID controller with BBWC,
> then you are not affected by this.

We saw this behavior even on an Areca 1883, which does buffer HDD writes.
The way out was to put the WAL and DB on NVMe drives, and that solved the
performance problems.

--
Alex Gorbachev
Storcium
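A sketch of the knob Nick mentions, as it would look in ceph.conf. The option name is the Luminous one (`bluestore_prefer_deferred_size_hdd`, default 32768); the 1 MiB value mirrors Nick's thought experiment rather than a recommendation, and per his own question the long-term effects of raising it this far are an open question:

```ini
[osd]
# Defer (double-write via the RocksDB WAL) any write up to 1 MiB on
# HDD-backed BlueStore OSDs, mimicking the Filestore journal behaviour
# discussed above. Default is 32768 (32 KiB).
bluestore_prefer_deferred_size_hdd = 1048576
```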
[ceph-users] MDS is Readonly
We have a ceph 12.2.4 cephfs with two active MDS servers, and directories are pinned to MDS servers. Yesterday an MDS server crashed. Once all fuse clients had unmounted, we brought the MDS back online. Both MDS are active now.

Once it came back, we started to see that one MDS is read-only.

...
2018-05-01 23:41:22.765920 7f71481b8700  1 mds.0.cache.dir(0x1002fc5d22d) commit error -2 v 3
2018-05-01 23:41:22.765964 7f71481b8700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1002fc5d22d object, errno -2
2018-05-01 23:41:22.765974 7f71481b8700 -1 mds.0.222755 unhandled write error (2) No such file or directory, force readonly...
2018-05-01 23:41:22.766013 7f71481b8700  1 mds.0.cache force file system read-only
2018-05-01 23:41:22.766019 7f71481b8700  0 log_channel(cluster) log [WRN] : force file system read-only
...

In the health warnings I see:

health: HEALTH_WARN
1 MDSs are read only
1 MDSs behind on trimming

There are no errors on the OSDs of the metadata pool.

Will "ceph daemon mds.x scrub_path / force recursive repair" fix this, or does an offline data scan need to be done?

Regards
Krish
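A sketch of the online repair being asked about, assuming the read-only rank is served by a daemon reachable via its admin socket as `mds.x` (the daemon name is a placeholder). Note that a read-only MDS may need to be restarted before repairs can be written back:

```shell
# On the host running the affected MDS: recursive scrub with repair from the root
ceph daemon mds.x scrub_path / force recursive repair

# Check what damage the scrub recorded, and watch cluster status
ceph daemon mds.x damage ls
ceph -w
```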
[ceph-users] Announcing mountpoint, August 27-28, 2018
Our first mountpoint is coming! Software-defined storage (SDS) is changing the traditional way we think of storage. Decoupling software from hardware allows you to choose your hardware vendors and provides enterprises with more flexibility.

Attend mountpoint on August 27-28, 2018 in Vancouver, BC, before Open Source Summit North America, for this first-time event. We are joining forces with the Ceph and Gluster communities, SDS experts, and partners to bring you an exciting two-day event. Help lead the conversation on open source software-defined storage and share your knowledge! Our CFP is open May 3rd through June 15th, 2018.

More details, including sponsorship, are available at: http://mountpoint.io/

--
Amye Scavarda | a...@redhat.com | Gluster Community Lead
Re: [ceph-users] Proper procedure to replace DB/WAL SSD
On Sunday, April 8, 2018 at 20:40 +, Jens-U. Mozdzen wrote:
> sorry for bringing up that old topic again, but we just faced a
> corresponding situation and have successfully tested two migration
> scenarios.

Thank you very much for this update, as I needed to do exactly that, due to an SSD crash triggering a hardware replacement. The block.db on the crashed SSD was lost, so the two OSDs depending on it were re-created. I also replaced two other bad SSDs before they failed, and thus needed to effectively replace DB/WAL devices on the live cluster (2 SSDs on 2 hosts and 4 OSDs).

> it is possible to move a separate WAL/DB to a new device, though
> without changing the size. We have done this for multiple OSDs, using
> only existing (mainstream :) ) tools and have documented the procedure in
> http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
> . It will *not* allow separating the WAL/DB after OSD creation, nor
> does it allow changing the DB size.

The lost OSDs were still backfilling when I did the above procedure (data redundancy was high enough to risk losing one more node). I even mis-typed the "ceph osd set noout" command ("ceph osd unset noout" instead, effectively a no-op), and replaced 2 OSDs of a single host at the same time (thus taking more time than the 10 minutes before the OSDs get kicked out, and triggering even more data movement). Everything went cleanly though, thanks to your detailed commands, which I ran one at a time, thinking twice before each [Enter].

I dug a bit into the LVM tags:
* make a backup of all pv/vg/lv config: vgcfgbackup
* check the backed-up tags: grep tags /etc/lvm/backup/*

I then noticed that:
* there are lots of "ceph.*=" tags
* the tags are still present on the old DB/WAL LVs (since I didn't remove them)
* the tags are absent from the new DB/WAL LVs (ditto, I didn't create them), which may be a problem later on...
* I changed the ceph.db_device= tag, but there is also a ceph.db_uuid= tag which was not changed, and may or may not trigger a problem upon reboot (I don't know if this UUID is part of the dd'ed data)

You effectively helped a lot! Thanks.

--
Nicolas Huillard
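A sketch of inspecting and fixing up the tags described above, using plain lvm2 commands. The VG/LV names are placeholders, and the reason stale or missing tags could bite later is that ceph-volume reads them at OSD activation:

```shell
# Show all ceph.* tags on the OSD LVs
sudo lvs -o lv_name,vg_name,lv_tags

# Point the tag at the new DB device (VG/LV names here are hypothetical);
# --deltag removes the stale value, --addtag sets the new one in a single call
sudo lvchange --deltag "ceph.db_device=/dev/old_vg/old_db" \
              --addtag "ceph.db_device=/dev/new_vg/new_db" old_vg/osd_block

# The matching ceph.db_uuid tag likely needs the same treatment; for an LV
# the value appears to be the LV UUID, which can be read with:
sudo lvs -o lv_name,lv_uuid new_vg/new_db
```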
Re: [ceph-users] GDPR encryption at rest
"At rest" is talking about data on its own, not being accessed through an application. Encryption at rest is most commonly done by encrypting the block device with something like dmcrypt; it's anything that makes having the physical disk useless without being able to decrypt it. You can also just encrypt a folder with sensitive information, which would also be encryption at rest. Encryption not at rest would be putting a secure layer between the data and the users that access it, like HTTPS/SSL.

On Wed, May 2, 2018 at 11:25 AM Alfredo Deza wrote:
> On Wed, May 2, 2018 at 11:12 AM, David Turner wrote:
> > I've heard conflicting opinions on whether GDPR requires data to be
> > encrypted at rest, but enough of our customers believe that it does that
> > we're looking at addressing it in our clusters. I had a couple of
> > questions about the state of encryption in ceph.
> >
> > 1) My experience with encryption in Ceph is dmcrypt; is this still the
> > standard method, or is there something new with bluestore?
>
> Standard, yes.
>
> > 2) Assuming dmcrypt is still the preferred option, is it fully
> > supported/tested in ceph-volume? There were problems with this when
> > ceph-volume was initially released, but I believe those have been
> > resolved.
>
> It is fully supported, but only with LUKS. The initial release of
> ceph-volume didn't have dmcrypt support.
>
> > 3) Any other thoughts about encryption at rest? I have an upgrade path
> > to get to encryption (basically the same as getting to bluestore from
> > filestore).
>
> Not sure what you mean by 'rest'. The ceph-volume encryption would give
> you the same type of encryption that was provided by ceph-disk, with the
> only "gotcha" being that it is LUKS (plain is not supported for newly
> encrypted devices).
>
> > Thanks for your comments.
Re: [ceph-users] GDPR encryption at rest
On Wed, May 2, 2018 at 11:12 AM, David Turner wrote:
> I've heard conflicting opinions on whether GDPR requires data to be
> encrypted at rest, but enough of our customers believe that it does that
> we're looking at addressing it in our clusters. I had a couple of
> questions about the state of encryption in ceph.
>
> 1) My experience with encryption in Ceph is dmcrypt; is this still the
> standard method, or is there something new with bluestore?

Standard, yes.

> 2) Assuming dmcrypt is still the preferred option, is it fully
> supported/tested in ceph-volume? There were problems with this when
> ceph-volume was initially released, but I believe those have been
> resolved.

It is fully supported, but only with LUKS. The initial release of
ceph-volume didn't have dmcrypt support.

> 3) Any other thoughts about encryption at rest? I have an upgrade path to
> get to encryption (basically the same as getting to bluestore from
> filestore).

Not sure what you mean by 'rest'. The ceph-volume encryption would give you
the same type of encryption that was provided by ceph-disk, with the only
"gotcha" being that it is LUKS (plain is not supported for newly encrypted
devices).

> Thanks for your comments.
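A sketch of what the ceph-volume path looks like, assuming a spare device /dev/sdX and a Luminous-era ceph-volume; the `--dmcrypt` flag is the documented way to get the LUKS-based encryption described above:

```shell
# Create an encrypted bluestore OSD; the device is set up as a LUKS dmcrypt
# volume and the key is stored in the monitors' config-key store.
ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdX

# For a staged replacement (e.g. the filestore-to-bluestore upgrade path
# mentioned above), the same flag works with prepare/activate as well:
ceph-volume lvm prepare --bluestore --dmcrypt --data /dev/sdX
```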
[ceph-users] GDPR encryption at rest
I've heard conflicting opinions on whether GDPR requires data to be encrypted at rest, but enough of our customers believe that it does that we're looking at addressing it in our clusters. I had a couple of questions about the state of encryption in ceph.

1) My experience with encryption in Ceph is dmcrypt; is this still the standard method, or is there something new with bluestore?

2) Assuming dmcrypt is still the preferred option, is it fully supported/tested in ceph-volume? There were problems with this when ceph-volume was initially released, but I believe those have been resolved.

3) Any other thoughts about encryption at rest? I have an upgrade path to get to encryption (basically the same as getting to bluestore from filestore).

Thanks for your comments.
Re: [ceph-users] Bluestore on HDD+SSD sync write latency experiences
Hi Nick,

On 5/1/2018 11:50 PM, Nick Fisk wrote:
> Hi all,
>
> Slowly getting round to migrating clusters to Bluestore, but I am
> interested in how people are handling the potential change in write
> latency coming from Filestore. Or maybe nobody is really seeing much
> difference?
>
> [...]
>
> I know there is a deferred-write tuning knob which adjusts the cutover
> for when an object is double-written, but at the default of 32kb, I
> suspect a lot of IOs even in the 1MB area are still drastically slower
> going straight to disk than if double-written to NVMe first. Has anybody
> else done any investigation in this area? Is there any long-term harm in
> running a cluster deferring writes up to 1MB+ in size to mimic the
> Filestore double-write approach?

This should work fine with low load, but be careful when the load rises:
RocksDB and the corresponding stuff around it might become a bottleneck in
this scenario.

> I also suspect after looking through github that deferred writes only
> happen when overwriting an existing object or blob (not sure which case
> applies), so new allocations are still written straight to disk. Can
> anyone confirm?

"Small" writes (length < min_alloc_size) are direct if they go to an unused
chunk (4K or more, depending on checksum settings) of an existing mutable
blob, and write length > bluestore_prefer_deferred_size only. E.g. appending
4K data blocks to an object on an HDD will trigger deferred mode for the
first of every 16 writes (given that the default min_alloc_size for HDD is
64K). The remaining 15 go direct.

"Big" writes are unconditionally deferred if length <=
bluestore_prefer_deferred_size.

> PS. If your spinning disks are connected via a RAID controller with BBWC,
> then you are not affected by this.
>
> Thanks,
> Nick
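A toy model (not BlueStore code) of just the two behaviors stated explicitly above, assuming the HDD defaults of min_alloc_size = 64K and bluestore_prefer_deferred_size = 32K. The full small-write condition is subtler than this, so the model only reproduces the append example and the big-write rule:

```python
def classify_write(offset, length, min_alloc=64 * 1024, prefer_deferred=32 * 1024):
    """Return 'deferred' or 'direct' for a single append-style write."""
    if length >= min_alloc:
        # "big" writes: unconditionally deferred iff length <= prefer_deferred
        return "deferred" if length <= prefer_deferred else "direct"
    # small append: per the example above, the write that opens a fresh
    # min_alloc-sized chunk is deferred; later writes landing in the unused
    # remainder of that chunk go direct.
    return "deferred" if offset % min_alloc == 0 else "direct"

# Appending sixteen 4K blocks: the first of every 16 is deferred, the rest direct
kinds = [classify_write(i * 4096, 4096) for i in range(16)]
print(kinds.count("deferred"), kinds.count("direct"))  # 1 15

# Nick's thought experiment: raising the knob to 1 MiB defers a 128K write
print(classify_write(0, 128 * 1024))                           # direct
print(classify_write(0, 128 * 1024, prefer_deferred=1 << 20))  # deferred
```

Note how, at the defaults, a "big" HDD write can never be deferred (64K > 32K), which matches the intuition in the thread that only raising the knob restores journal-like behavior for larger writes.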
Re: [ceph-users] Ceph scrub logs: _scan_snaps no head for $object?
Hi,

Quoting Stefan Kooman (ste...@bit.nl):
> Hi,
>
> We see the following in the logs after we start a scrub for some osds:
>
> ceph-osd.2.log:2017-12-14 06:50:47.180344 7f0f47db2700 0 log_channel(cluster) log [DBG] : 1.2d8 scrub starts
> ceph-osd.2.log:2017-12-14 06:50:47.180915 7f0f47db2700 -1 osd.2 pg_epoch: 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f 11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733 crt=11890'165209 lcod 11890'165208 mlcod 11890'165208 active+clean+scrubbing] _scan_snaps no head for 1:1b518155:::rbd_data.620652ae8944a.0126:29 (have MIN)
> ceph-osd.2.log:2017-12-14 06:50:47.180929 7f0f47db2700 -1 osd.2 pg_epoch: 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f 11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733 crt=11890'165209 lcod 11890'165208 mlcod 11890'165208 active+clean+scrubbing] _scan_snaps no head for 1:1b518155:::rbd_data.620652ae8944a.0126:14 (have MIN)
> ceph-osd.2.log:2017-12-14 06:50:47.180941 7f0f47db2700 -1 osd.2 pg_epoch: 11897 pg[1.2d8( v 11890'165209 (3221'163647,11890'165209] local-lis/les=11733/11734 n=67 ec=132/132 lis/c 11733/11733 les/c/f 11734/11734/0 11733/11733/11733) [2,45,31] r=0 lpr=11733 crt=11890'165209 lcod 11890'165208 mlcod 11890'165208 active+clean+scrubbing] _scan_snaps no head for 1:1b518155:::rbd_data.620652ae8944a.0126:a (have MIN)
> ceph-osd.2.log:2017-12-14 06:50:47.214198 7f0f43daa700 0 log_channel(cluster) log [DBG] : 1.2d8 scrub ok
>
> So finally it logs "scrub ok", but what does "_scan_snaps no head for ..."
> mean? Does this indicate a problem?

Still seeing this issue on a freshly installed luminous cluster. I *think* it either has to do with "cloned" RBDs that get snapshots of their own, or with RBDs that are cloned from a snapshot.

Is there any dev who wants to debug this behaviour? I'm able to reliably reproduce it.

Gr. Stefan

--
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / i...@bit.nl