Re: [ceph-users] Proper procedure to replace DB/WAL SSD

2018-04-08 Thread Jens-U. Mozdzen

Hi *,

sorry for bringing up that old topic again, but we just faced a  
corresponding situation and have successfully tested two migration  
scenarios.


Zitat von ceph-users-requ...@lists.ceph.com:

Date: Sat, 24 Feb 2018 06:10:16 +
From: David Turner 
To: Nico Schottelius 
Cc: Caspar Smit , ceph-users

Subject: Re: [ceph-users] Proper procedure to replace DB/WAL SSD
Message-ID:

Content-Type: text/plain; charset="utf-8"

Caspar, it looks like your idea should work. Worst case scenario seems like
the osd wouldn't start, you'd put the old SSD back in and go back to the
idea to weight them to 0, backfilling, then recreate the osds. Definitely
worth a try in my opinion, and I'd love to hear your experience after.

Nico, it is not possible to change the WAL or DB size, location, etc after
osd creation.


it is possible to move a separate WAL/DB to a new device, though
without changing its size. We have done this for multiple OSDs, using
only existing (mainstream :) ) tools, and have documented the procedure
in
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/migrating-bluestores-block-db/
. It will *not* let you separate WAL/DB after OSD creation, nor
does it allow changing the DB size.


As we faced a failing WAL/DB SSD during one of the moves (fatal read
errors from the DB block device), we also established a procedure to
initialize the OSD as "empty" during that operation, so that the OSD
gets re-filled without changing the OSD map:
http://heiterbiswolkig.blogs.nde.ag/2018/04/08/resetting-an-existing-bluestore-osd/


HTH
Jens

PS: Live WAL/DB migration is something that can be done easily when
using logical volumes, which is why I'd highly recommend going that
route instead of using partitions. LVM not only helps when the SSDs
reach their EOL, but also with live load balancing (distributing
WAL/DB LVs across multiple SSDs).
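
To illustrate what I mean by "easily", here's a minimal sketch of a live
DB move; the VG/LV names ("ceph-db", "osd999-db") and device names are
placeholders, adapt them to your own layout:

--- cut here ---
# add the new SSD to the VG holding the DB LVs
osd-node # pvcreate /dev/new-ssd1
osd-node # vgextend ceph-db /dev/new-ssd1
# move only that one LV's extents to the new SSD, OSD keeps running
osd-node # pvmove -n osd999-db /dev/old-ssd1 /dev/new-ssd1
# once nothing uses the old SSD anymore, drop it from the VG
osd-node # vgreduce ceph-db /dev/old-ssd1
--- cut here ---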



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-02-16 Thread Jens-U. Mozdzen

Dear list, hello Jason,

you may have seen my message on the Ceph mailing list about RBD pool
migration - it's a common situation that pools were created in a
sub-optimal fashion and e.g. pg_num is not (yet) reducible, so we're
looking into means to "clone" an RBD pool into a new pool within the
same cluster (including snapshots).


We had looked into creating a tool for this job, but soon noticed that  
we're duplicating basic functionality of rbd-mirror. So we tested the  
following, which worked out nicely:


- create a test cluster (Ceph cluster plus an Openstack cluster using  
an RBD pool) and some Openstack instances


- create a second Ceph test cluster

- stop Openstack

- use rbd-mirror to clone the RBD pool from the first to the second
Ceph cluster (IOW aborting rbd-mirror once the initial copy was done)


- recreate the RBD pool on the first cluster

- use rbd-mirror to clone the mirrored pool back to the (newly  
created) pool on the first cluster


- start Openstack and work with the (recreated) pool on the first cluster

So using rbd-mirror, we could clone an RBD pool's content to a  
differently structured pool on the same cluster - by using an  
intermediate cluster.
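
For reference, a rough sketch of the one-way mirroring setup we used -
pool and cluster names ("images", "site-a" for the first cluster,
"site-b" for the intermediate one) are placeholders, and with pool-mode
mirroring only images that have the journaling feature enabled get
mirrored:

--- cut here ---
site-a # rbd mirror pool enable images pool
site-b # rbd mirror pool enable images pool
site-b # rbd mirror pool peer add images client.admin@site-a
# start the rbd-mirror daemon against the intermediate cluster (e.g. via
# the ceph-rbd-mirror systemd unit) and wait for the initial sync:
site-b # rbd mirror pool status images --verbose
--- cut here ---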


@Jason: Looking at the commit history for rbd-mirror, it seems you
might be able to shed some light on this: Do you see an easy way to
modify rbd-mirror in such a fashion that, instead of mirroring to a
pool on a different cluster (having the same pool name as the
original), it would mirror to a pool on the *same* cluster
(obviously with a different pool name)?


From the "rbd cppool" perspective, a one-shot mode of operation would  
be fully sufficient - but looking at the code, I have not even been  
able to identify the spots where we might "cut away" the networking  
part, so that rbd-mirror might do an intra-cluster job.


Are you able to judge how much work would need to be done, in order to  
create a one-shot, intra-cluster version of rbd-mirror? Might it even  
be something that could be a simple enhancement?


Thank you for any information and / or opinion you care to share!

With regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] udev rule or script to auto add bcache devices?

2018-01-23 Thread Jens-U. Mozdzen

Hi Stefan,

Zitat von Stefan Priebe - Profihost AG:

Hello,

bcache didn't support partitions in the past, so a lot of our osds
have their data directly on:
/dev/bcache[0-9]

But that means I can't give them the needed partition type of
4fbd7e29-9d25-41b8-afd0-062c0ceff05d, which means that the activation
with udev and ceph-disk does not work.

Had anybody already fixed this or hacked something together?


we had this running for filestore OSDs for quite some time (on  
Luminous and before), but have recently moved on to Bluestore,  
omitting bcache and instead putting block.db on partitions of the SSD  
devices (or rather partitions on an MD-RAID1 made out of two Toshiba  
PX02SMF020).


We simply mounted the OSD file system via label at boot time per fstab  
entries, and had the OSDs started via systemd. In case this matters:  
For historic reasons, the actual mount point wasn't in  
/var/lib/ceph/osd, but a different directory, with according symlinks  
set up under /var/lib/ceph/osd/.
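
As a sketch (labels, mount points and the OSD id are examples, not our
literal setup), the per-OSD wiring looked roughly like this:

--- cut here ---
# /etc/fstab - mount the filestore OSD by label, bypassing ceph-disk/udev
LABEL=osd-21  /data/ceph-osd-21  xfs  defaults,noatime  0 0

# point the expected OSD directory at the actual mount, enable the unit
osd-node # ln -s /data/ceph-osd-21 /var/lib/ceph/osd/ceph-21
osd-node # systemctl enable ceph-osd@21.service
--- cut here ---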


How many OSDs do you run per bcache SSD caching device? Even at just
4:1, we ran into i/o bottlenecks (using the above MD-RAID1 as the
caching device), hence moving on to Bluestore. The same hardware now
provides a much more responsive storage subsystem, which of course may
be very specific to our workload and setup.


Regards
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating to new pools

2018-01-19 Thread Jens-U. Mozdzen

Hi *,

Zitat von ceph-users-requ...@lists.ceph.com:

Hi *,

facing the problem to reduce the number of PGs for a pool, I've found
various information and suggestions, but no "definite guide" to handle
pool migration with Ceph 12.2.x. This seems to be a fairly common
problem when having to deal with "teen-age clusters", so consolidated
information would be a real help. I'm willing to start writing things
up, but don't want to duplicate information. So:

Are there any documented "operational procedures" on how to migrate

- an RBD pool (with snapshots created by Openstack)

- a CephFS data pool

- a CephFS metadata pool

to a different volume, in order to be able to utilize pool settings
that cannot be changed on an existing pool?

---

RBD pools: From what I've read, RBD snapshots are "broken" after using
"rados cppool" to move the content of an "RBD pool" to a new pool.


after having read up on "rbd-mirror" and having had a glimpse at its  
code, it seems that preserving clones, snapshots and their  
relationships has been solved for cluster-to-cluster migration.


Is this really correct? If so, it might be possible to extend the code  
in a fashion that will allow a one-shot, intracluster pool-to-pool  
migration as a spin-off to rbd-mirror.


I was thinking along the following lines:

- run rbd-mirror in a stand-alone fashion, just stating source and  
destination pool


- leave it to the cluster admin to take RBD "offline", so that the
pool content does not change during the copy (no RBD journaling
involved)


- check that the destination pool is empty. In a first version,
cumulative migrations (merging multiple source pools with distinct
image names) would complicate things ;)


- sync all content from source to destination pool, in a one-shot fashion

- done

Is anyone out there able to judge the chances of that approach better
than I can? I'd be willing to spend development time on this, but
starting from scratch would be rather hard, so pointers to where to
look within the rbd-mirror code would be more than welcome.


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-17 Thread Jens-U. Mozdzen

Zitat von Jens-U. Mozdzen

Hi Alfredo,

thank you for your comments:

Zitat von Alfredo Deza :

On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen  wrote:

Dear *,

has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
keeping the OSD number? There have been a number of messages on the list,
reporting problems, and my experience is the same. (Removing the existing
OSD and creating a new one does work for me.)

I'm working on a Ceph 12.2.2 cluster and tried following
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
- this basically says

1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD

I never got step 4 to complete. The closest I got was by doing the  
following

steps (assuming OSD ID "999" on /dev/sdzz):

1. Stop the old OSD via systemd (osd-node # systemctl stop
ceph-osd@999.service)

2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
volume group

3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999
--yes-i-really-mean-it)

5. create a new OSD entry (osd-node # ceph osd new $(cat
/var/lib/ceph/osd/ceph-999/fsid) 999)


Steps 5 and 6 are problematic if you are going to be trying ceph-volume
later on, which takes care of doing this for you.



6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
/var/lib/ceph/osd/ceph-999/keyring)


I at first tried to follow the documented steps (without my steps 5
and 6), which did not work for me. The documented approach failed with
"init authentication failed: (1) Operation not permitted", because
ceph-volume did not actually add the auth entry for me.

But even after manually adding the authentication, the "ceph-volume"
approach failed, as the OSD was still marked "destroyed" in the osdmap
epoch as used by ceph-osd (see the commented messages from
ceph-osd.999.log below).



7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
--osd-id 999 --data /dev/sdzz)


You are going to hit a bug in ceph-volume that is preventing you from
specifying the osd id directly if the ID has been destroyed.

See http://tracker.ceph.com/issues/22642


If I read that bug description correctly, you're confirming why I
needed step #6 above (manually adding the OSD auth entry). But even if
ceph-volume had added it, the ceph-osd.log entries suggest that
starting the OSD would still have failed, because of accessing the
wrong osdmap epoch.

To me it seems like I'm hitting a bug outside of ceph-volume [...]


just for the records (and search engines), this was confirmed to be a  
bug, see http://tracker.ceph.com/issues/22673


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing cache tier for RBD pool

2018-01-16 Thread Jens-U. Mozdzen

Hello Mike,

Zitat von Mike Lovell :

On Mon, Jan 8, 2018 at 6:08 AM, Jens-U. Mozdzen  wrote:

Hi *,
[...]
1. Does setting the cache mode to "forward" lead to above situation of
remaining locks on hot-storage pool objects? Maybe the clients' unlock
requests are forwarded to the cold-storage pool, leaving the hot-storage
objects locked? If so, this should be documented and it'd seem impossible
to cleanly remove a cache tier during live operations.

2. What is the significant difference between "rados
cache-flush-evict-all" and separate "cache-flush" and "cache-evict" cycles?
Or is it some implementation error that leads to those "file not found"
errors with "cache-flush-evict-all", while the manual cycles work
successfully?

Thank you for any insight you might be able to share.

Regards,
Jens



i've removed a cache tier in environments a few times. the only locked
files i ran into were the rbd_directory and rbd_header objects for each
volume. the rbd_headers for each rbd volume are locked as long as the vm is
running. every time i've tried to remove a cache tier, i shutdown all of
the vms before starting the procedure and there wasn't any problem getting
things flushed+evicted. so i can't really give any further insight into
what might have happened other than it worked for me. i set the cache-mode
to forward everytime before flushing and evicting objects.


while your report doesn't confirm my suspicion expressed in my  
question 1, it at least is another example where removing the cache  
worked *after stopping all instances*, rather than "live". If, OTOH,  
this limitation is confirmed, it should be added to the docs.


Out of curiosity: Do you have any other users for the pool? After  
stopping all VMs (and the image-related services on our Openstack  
control nodes), my pool was without access, so I saw no need to put  
the caching tier to "forward" mode.



i don't think there really is a significant technical difference between
the cache-flush-evict-all command and doing separate cache-flush and
cache-evict on individual objects. my understanding is
cache-flush-evict-all is just a short cut to getting everything in the
cache flushed and evicted. did the cache-flush-evict-all error on some
objects where the separate operations succeeded? your description doesn't
say if there was, but then you say you used both styles during your second
attempt.


It was actually the case that every run of "cache-flush-evict-all"
reported errors on all remaining objects, while running the loop
manually (issuing flush for every object, then issuing evict for every
remaining object) worked flawlessly. That's why my question 2 came up.


The objects I saw seemed related to the images stored in the pool, not  
any "management data" (like the suggested hitset persistence).



on a different note, you say that your cluster is on 12.2 but the cache
tiers were created on an earlier version. which version was the cache tier
created on? how well did the upgrade process work? i am curious since the
production clusters i have using a cache tier are still on 10.2 and i'm
about to begin the process of testing the upgrade to 12.2. any info on that
experience you can share would be helpful.


I *believe* the cache was created on 10.2, but cannot recall for sure.  
I remember having had similar problems in those earlier days with a  
previous instance of that caching tier, but many root causes were "on  
my side of the keyboard". The cache tier I was trying to remove  
recently was created from scratch after those problems, and upgrading  
to the latest release via the recommended intermediate version steps  
was problem-free. At least when focusing on the subject of cache tiers  
;)


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-10 Thread Jens-U. Mozdzen

Hi Alfredo,

thank you for your comments:

Zitat von Alfredo Deza :

On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen  wrote:

Dear *,

has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
keeping the OSD number? There have been a number of messages on the list,
reporting problems, and my experience is the same. (Removing the existing
OSD and creating a new one does work for me.)

I'm working on a Ceph 12.2.2 cluster and tried following
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd
- this basically says

1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD

I never got step 4 to complete. The closest I got was by doing the following
steps (assuming OSD ID "999" on /dev/sdzz):

1. Stop the old OSD via systemd (osd-node # systemctl stop
ceph-osd@999.service)

2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
volume group

3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999
--yes-i-really-mean-it)

5. create a new OSD entry (osd-node # ceph osd new $(cat
/var/lib/ceph/osd/ceph-999/fsid) 999)


Step 5 and 6 are problematic if you are going to be trying ceph-volume
later on, which takes care of doing this for you.



6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
/var/lib/ceph/osd/ceph-999/keyring)


I at first tried to follow the documented steps (without my steps 5
and 6), which did not work for me. The documented approach failed with
"init authentication failed: (1) Operation not permitted", because
ceph-volume did not actually add the auth entry for me.


But even after manually adding the authentication, the "ceph-volume"  
approach failed, as the OSD was still marked "destroyed" in the osdmap  
epoch as used by ceph-osd (see the commented messages from  
ceph-osd.999.log below).




7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
--osd-id 999 --data /dev/sdzz)


You are going to hit a bug in ceph-volume that is preventing you from
specifying the osd id directly if the ID has been destroyed.

See http://tracker.ceph.com/issues/22642


If I read that bug description correctly, you're confirming why I
needed step #6 above (manually adding the OSD auth entry). But even if
ceph-volume had added it, the ceph-osd.log entries suggest that
starting the OSD would still have failed, because of accessing the
wrong osdmap epoch.


To me it seems like I'm hitting a bug outside of ceph-volume - unless  
it's ceph-volume that somehow determines which osdmap epoch is used by  
ceph-osd.



In order for this to work, you would need to make sure that the ID has
really been destroyed and avoid passing --osd-id in ceph-volume. The
caveat
being that you will get whatever ID is available next in the cluster.


Yes, that's the work-around I then used - purge the old OSD and create  
a new one.
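
In condensed form, the work-around looks about like this (again OSD 999
on /dev/sdzz as an example; the re-created OSD receives whatever id the
cluster hands out next):

--- cut here ---
osd-node # systemctl stop ceph-osd@999.service
osd-node # umount /var/lib/ceph/osd/ceph-999
osd-node # ceph osd purge 999 --yes-i-really-mean-it
osd-node # ceph-volume lvm zap /dev/sdzz
osd-node # ceph-volume lvm create --bluestore --data /dev/sdzz
--- cut here ---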


Thanks & regards,
Jens


[...]
--- cut here ---
# first of multiple attempts, before "ceph auth add ..."
# no actual epoch referenced, as login failed due to missing auth
2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for clients
2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has features
288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for osds
2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using weightedpriority
op queue with priority op cut off at 64.
2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors
{default=true}
2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init authentication
failed: (1) Operation not permitted

# after "ceph auth ..."
# note the different epochs below? BTW, 110587 is the current epoch at that
time and osd.999 is marked destroyed there
# 109892: much too old to offer any details
# 110587: modified 2018-01-09 23:43:13.202381

2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for clients
2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has features
288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has features
288232575208783872, adjusting msgr requires for osds
2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
2018-01-10 00:08:00.945594 7fc5590

[ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-10 Thread Jens-U. Mozdzen

Dear *,

has anybody been successful migrating Filestore OSDs to Bluestore  
OSDs, keeping the OSD number? There have been a number of messages on  
the list, reporting problems, and my experience is the same. (Removing  
the existing OSD and creating a new one does work for me.)


I'm working on a Ceph 12.2.2 cluster and tried following
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#replacing-an-osd - this basically  
says


1. destroy old OSD
2. zap the disk
3. prepare the new OSD
4. activate the new OSD

I never got step 4 to complete. The closest I got was by doing the  
following steps (assuming OSD ID "999" on /dev/sdzz):


1. Stop the old OSD via systemd (osd-node # systemctl stop  
ceph-osd@999.service)


2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)

3a. if the old OSD was Bluestore with LVM, manually clean up the old  
OSD's volume group


3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)

4. destroy the old OSD (osd-node # ceph osd destroy 999  
--yes-i-really-mean-it)


5. create a new OSD entry (osd-node # ceph osd new $(cat  
/var/lib/ceph/osd/ceph-999/fsid) 999)


6. add the OSD secret to Ceph authentication (osd-node # ceph auth add  
osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd'  
-i /var/lib/ceph/osd/ceph-999/keyring)


7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore  
--osd-id 999 --data /dev/sdzz)


but ceph-osd keeps complaining "osdmap says I am destroyed, exiting"  
on "osd-node # systemctl start ceph-osd@999.service".


At first I felt I was hitting http://tracker.ceph.com/issues/21023  
(BlueStore-OSDs marked as destroyed in OSD-map after v12.1.1 to  
v12.1.4 upgrade). But I was already using the "ceph osd new" command,  
which didn't help.


Some hours of sleep later I matched the issued commands to the osdmap  
changes and the ceph-osd log messages, which revealed something strange:


- from issuing "ceph osd destroy", osdmap lists the OSD as  
"autoout,destroyed,exists" (no surprise here)

- once I issued "ceph osd new", osdmap lists the OSD as "autoout,exists,new"
- starting ceph-osd after "ceph osd new" reports "osdmap says I am  
destroyed, exiting"


I can see in the ceph-osd log that it refers to an *old* osdmap
epoch, roughly 45 minutes old by then.
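
For anyone wanting to cross-check this themselves: the OSD's state in
the current and in an older osdmap epoch can be compared like this (osd
id and epoch number as in my case, yours will differ):

--- cut here ---
# the OSD's flags in the current osdmap
osd-node # ceph osd dump | grep 'osd.999 '
# ... and in a specific (older) epoch
osd-node # ceph osd dump 110587 | grep 'osd.999 '
--- cut here ---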


This got me curious and I dug through the OSD log file, checking the  
epoch numbers during start-up:


I took some detours, so there are more than two failed starts in the
OSD log file ;) :


--- cut here ---
# first of multiple attempts, before "ceph auth add ..."
# no actual epoch referenced, as login failed due to missing auth
2018-01-10 00:00:02.173983 7f5cf1c89d00  0 osd.999 0 crush map has  
features 288232575208783872, adjusting msgr requires for clients
2018-01-10 00:00:02.173990 7f5cf1c89d00  0 osd.999 0 crush map has  
features 288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:00:02.173994 7f5cf1c89d00  0 osd.999 0 crush map has  
features 288232575208783872, adjusting msgr requires for osds

2018-01-10 00:00:02.174046 7f5cf1c89d00  0 osd.999 0 load_pgs
2018-01-10 00:00:02.174051 7f5cf1c89d00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:00:02.174055 7f5cf1c89d00  0 osd.999 0 using  
weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:00:02.174891 7f5cf1c89d00 -1 osd.999 0 log_to_monitors  
{default=true}
2018-01-10 00:00:02.177479 7f5cf1c89d00 -1 osd.999 0 init  
authentication failed: (1) Operation not permitted


# after "ceph auth ..."
# note the different epochs below? BTW, 110587 is the current epoch at  
that time and osd.999 is marked destroyed there

# 109892: much too old to offer any details
# 110587: modified 2018-01-09 23:43:13.202381

2018-01-10 00:08:00.945507 7fc55905bd00  0 osd.999 0 crush map has  
features 288232575208783872, adjusting msgr requires for clients
2018-01-10 00:08:00.945514 7fc55905bd00  0 osd.999 0 crush map has  
features 288232575208783872 was 8705, adjusting msgr requires for mons
2018-01-10 00:08:00.945521 7fc55905bd00  0 osd.999 0 crush map has  
features 288232575208783872, adjusting msgr requires for osds

2018-01-10 00:08:00.945588 7fc55905bd00  0 osd.999 0 load_pgs
2018-01-10 00:08:00.945594 7fc55905bd00  0 osd.999 0 load_pgs opened 0 pgs
2018-01-10 00:08:00.945599 7fc55905bd00  0 osd.999 0 using  
weightedpriority op queue with priority op cut off at 64.
2018-01-10 00:08:00.946544 7fc55905bd00 -1 osd.999 0 log_to_monitors  
{default=true}
2018-01-10 00:08:00.951720 7fc55905bd00  0 osd.999 0 done with init,  
starting boot process
2018-01-10 00:08:00.952225 7fc54160a700 -1 osd.999 0 waiting for  
initial osdmap
2018-01-10 00:08:00.970644 7fc546614700  0 osd.999 109892 crush map  
has features 288232610642264064, adjusting msgr requires for clients
2018-01-10 00:08:00.970653 7fc546614700  0 osd.999 109892 crush map  
has features 288232610642264064

Re: [ceph-users] Reduced data availability: 4 pgs inactive, 4 pgs incomplete

2018-01-09 Thread Jens-U. Mozdzen

Hi Brent,

Brent Kennedy wrote to the mailing list:
Unfortunately, I don't see that setting documented anywhere other
than the release notes. It's hard to find guidance for questions in
that case, but luckily you noted it in your blog post. I wish I
knew what value to set it to. I did use the deprecated one
after moving to hammer a while back due to the mis-calculated PGs. I
have now set that setting, but used 0 as the value, which cleared the
error in the status, but the stuck incomplete pgs persist.


per your earlier message, you currently have at most 2549 PGs per OSD
("too many PGs per OSD (2549 > max 200)"). Therefore, you might try
setting mon_max_pg_per_osd to 2600 (to give some room for minor growth
during backfills) and restarting the OSDs.
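
Just to be explicit about what I mean - a sketch only, the value being an
example based on your numbers, followed by the daemon restart mentioned
above:

--- cut here ---
# ceph.conf on the cluster nodes
[global]
mon_max_pg_per_osd = 2600
--- cut here ---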


Of course, reducing the number of PGs per OSD should be on your list
somewhere, but I do understand that that's not always as easy as it
sounds... especially given that Ceph still seems to lack a few
mechanisms to clean up certain situations (like lossless migration
of pool contents to another pool, for RBD or CephFS).


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Removing cache tier for RBD pool

2018-01-08 Thread Jens-U. Mozdzen

Hi *,

trying to remove a caching tier from a pool used for RBD / Openstack,  
we followed the procedure from  
http://docs.ceph.com/docs/master/rados/operations/cache-tiering/#removing-a-writeback-cache and ran into  
problems.


The cluster is currently running Ceph 12.2.2, the caching tier was  
created with an earlier release of Ceph.


First of all, setting the cache-mode to "forward" is reported to be  
unsafe, which is not mentioned in the documentation - if it's really  
meant to be used in this case, the need for "--yes-i-really-mean-it"  
should be documented.
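
For the record, what we actually had to issue (pool name "hot-storage"
as used below):

--- cut here ---
ceph-admin # ceph osd tier cache-mode hot-storage forward --yes-i-really-mean-it
--- cut here ---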


Unfortunately, using "rados -p hot-storage cache-flush-evict-all" not  
only reported errors ("file not found") for many objects, but left us  
with quite a number of objects in the pool and new ones being created,  
despite the "forward" mode. Even after stopping all Openstack  
instances ("VMs"), we could also see that the remaining objects in the  
pool were still locked. Manually unlocking these via rados commands  
worked, but "cache-flush-evict-all" then still reported those "file  
not found" errors and 1070 objects remained in the pool, like before.  
We checked the remaining objects via "rados stat" both in the  
hot-storage and the cold-storage pool and could see that every  
hot-storage object had a counter-part in cold-storage with identical  
stat info. We also compared some of the objects (with size > 0) and  
found the hot-storage and cold-storage entities to be identical.


We aborted that attempt, reverted the mode to "writeback" and  
restarted the Openstack cluster - everything was working fine again,  
of course still using the cache tier.


During a recent maintenance window, the Openstack cluster was shut  
down again and we re-tried the procedure. As there were no active  
users of the images pool, we skipped the step of forcing the cache  
mode to forward and immediately issued the "cache-flush-evict-all"  
command. Again 1070 objects remained in the hot-storage pool (and gave  
"file not found" errors), but unlike last time, none were locked.


Out of curiosity we then issued loops of "rados -p hot-storage  
cache-flush " and "rados -p hot-storage cache-evict  
" for all objects in the hot-storage pool and surprisingly  
not only received no error messages at all, but were left with an  
empty hot-storage pool! We then proceeded with the further steps from  
the docs and were able to successfully remove the cache tier.
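
In case someone wants to reproduce this, the manual loops were
essentially the following, nothing more sophisticated than that:

--- cut here ---
ceph-admin # for obj in $(rados -p hot-storage ls); do
                 rados -p hot-storage cache-flush "$obj"
             done
ceph-admin # for obj in $(rados -p hot-storage ls); do
                 rados -p hot-storage cache-evict "$obj"
             done
--- cut here ---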


This leaves us with two questions:

1. Does setting the cache mode to "forward" lead to above situation of  
remaining locks on hot-storage pool objects? Maybe the clients' unlock  
requests are forwarded to the cold-storage pool, leaving the  
hot-storage objects locked? If so, this should be documented and it'd  
seem impossible to cleanly remove a cache tier during live operations.


2. What is the significant difference between "rados  
cache-flush-evict-all" and separate "cache-flush" and "cache-evict"  
cycles? Or is it some implementation error that leads to those "file  
not found" errors with "cache-flush-evict-all", while the manual  
cycles work successfully?


Thank you for any insight you might be able to share.

Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to fix mon scrub errors?

2017-12-19 Thread Jens-U. Mozdzen

Hi Burkhard,

Zitat von Burkhard Linke :

HI,


since the upgrade to luminous 12.2.2 the mons are complaining about
scrub errors:


2017-12-13 08:49:27.169184 mon.ceph-storage-03 [ERR] scrub mismatch


today two such messages turned up here, too, in a cluster upgraded to  
12.2.2 over the weekend.


--- cut here ---
2017-12-19 09:28:29.180583 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {logm=100} crc  
{logm=4023984760})
2017-12-19 09:28:29.183685 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {logm=100} crc  
{logm=1698072116})
2017-12-19 09:28:29.186730 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {logm=100} crc  
{logm=3505522493})
2017-12-19 09:28:29.189709 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {logm=100} crc  
{logm=3854110003})
2017-12-19 09:28:29.192081 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {logm=85,mds_health=15}  
crc {logm=239151592,mds_health=1391313747})
2017-12-19 09:28:29.193563 7fe171845700 -1 log_channel(cluster) log  
[ERR] : scrub mismatch
2017-12-19 09:28:29.193596 7fe171845700 -1 log_channel(cluster) log  
[ERR] :  mon.0 ScrubResult(keys  
{mds_health=7,mds_metadata=1,mdsmap=92} crc  
{mds_health=604545522,mds_metadata=3932958966,mdsmap=1333403161})
2017-12-19 09:28:29.193615 7fe171845700 -1 log_channel(cluster) log  
[ERR] :  mon.1 ScrubResult(keys  
{mds_health=8,mds_metadata=1,mdsmap=91} crc  
{mds_health=1003932403,mds_metadata=3932958966,mdsmap=1897035448})
2017-12-19 09:28:29.193638 7fe171845700 -1 log_channel(cluster) log  
[ERR] : scrub mismatch
2017-12-19 09:28:29.193657 7fe171845700 -1 log_channel(cluster) log  
[ERR] :  mon.0 ScrubResult(keys  
{mds_health=7,mds_metadata=1,mdsmap=92} crc  
{mds_health=604545522,mds_metadata=3932958966,mdsmap=1333403161})
2017-12-19 09:28:29.193684 7fe171845700 -1 log_channel(cluster) log  
[ERR] :  mon.2 ScrubResult(keys  
{mds_health=8,mds_metadata=1,mdsmap=91} crc  
{mds_health=1003932403,mds_metadata=3932958966,mdsmap=1897035448})
2017-12-19 09:28:29.194957 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {mdsmap=100} crc  
{mdsmap=3440145783})
2017-12-19 09:28:29.196308 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {mdsmap=100} crc  
{mdsmap=1425524862})
2017-12-19 09:28:29.197593 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {mdsmap=100} crc  
{mdsmap=3092285774})
2017-12-19 09:28:29.198871 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {mdsmap=100} crc  
{mdsmap=1144015866})
2017-12-19 09:28:29.200207 7fe171845700  0 log_channel(cluster) log  
[DBG] : scrub ok on 0,1,2: ScrubResult(keys {mdsmap=100} crc  
{mdsmap=3585539515})

--- cut here ---

Ceph state remained the same even with this in the mon logs. It was  
only in the logs of a single mon. I also checked the MDS log files,  
nothing to be found there (and especially not at that time).



These errors might have been caused by problems setting up multi mds
after luminous upgrade.


we have only a single active MDS, plus one standby - so maybe it's  
something different.


Regards,
J

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating to new pools (RBD, CephFS)

2017-12-18 Thread Jens-U. Mozdzen

Hi *,

facing the problem to reduce the number of PGs for a pool, I've found  
various information and suggestions, but no "definite guide" to handle  
pool migration with Ceph 12.2.x. This seems to be a fairly common  
problem when having to deal with "teen-age clusters", so consolidated  
information would be a real help. I'm willing to start writing things  
up, but don't want to duplicate information. So:


Are there any documented "operational procedures" on how to migrate

- an RBD pool (with snapshots created by Openstack)

- a CephFS data pool

- a CephFS metadata pool

to a different volume, in order to be able to utilize pool settings  
that cannot be changed on an existing pool?


---

RBD pools: From what I've read, RBD snapshots are "broken" after using  
"rados cppool" to move the content of an "RBD pool" to a new pool.


---

CephFS data pool: I know I can add additional pools to a CephFS  
instance ("ceph fs add_data_pool"), and have newly created files to be  
placed in the new pool ("file layouts"). But according to the docs, a  
small amount of metadata is kept in the primary data pool for all  
files, so I cannot remove the original pool.
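
For illustration, the "additional data pool" route sketched with
placeholder names (pool name, pg count and mount path are examples):

--- cut here ---
ceph-admin # ceph osd pool create cephfs_data_new 256
ceph-admin # ceph fs add_data_pool cephfs cephfs_data_new
# direct newly created files below a directory into the new pool
client # setfattr -n ceph.dir.layout.pool -v cephfs_data_new /mnt/cephfs/some/dir
--- cut here ---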


I couldn't identify how CephFS (MDS) identifies its current data pool
(or "default data pool" in case of multiple pools - the one named in
"ceph fs new"), so "rados cppool"-moving the data to a new pool and
then reconfiguring CephFS to use the new pool (while the MDSs are
stopped, of course) is not yet an option? And there might be
references to the pool id hiding in CephFS metadata, too,
invalidating this approach altogether.
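
For completeness, listing the pools a filesystem currently references
is easy enough ("cephfs" being an example fs name) - changing them is
the open question:

--- cut here ---
ceph-admin # ceph fs ls
ceph-admin # ceph fs get cephfs    # dumps the MDS map, including the pool ids
--- cut here ---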


Of course, dumping the current content of the CephFS to external
storage and recreating the CephFS instance with new pools is a
potential option, but may require a substantial amount of extra
storage ;)


---

CephFS metadata pool: I've not seen any indication of a procedure to  
swap metadata pools.



I couldn't identify how CephFS (MDS) identifies its current metadata
pool, so "rados cppool"-moving the metadata to a new pool and then
reconfiguring CephFS to use the new pool (while MDS are stopped, of
course) is not yet an option?


Of course, dumping the current content of the CephFS to external
storage and recreating the CephFS instance with new pools is a
potential option, but may require a substantial amount of extra
storage ;)


---

http://cephnotes.ksperis.com/blog/2015/04/15/ceph-pool-migration  
describes an interesting approach to migrate all pool contents by  
making the current pool a cache tier to the new pool and then migrate  
the "cache tier content" to the (new) base pool. But I'm not yet able  
to judge the approach and will have to conduct tests. Can anyone  
already make an educated guess if especially the "snapshot" problem  
for RBD pools will be circumvented this way and how CephFS will react  
to this approach? This "cache tier" approach, if feasible, would be a  
nice way to circumvent downtime and extra space requirements.
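
To make the approach from that post more concrete, it boils down to
something like the following (untested on my side so far; "oldpool"
holds the data, "newpool" is the freshly created target, both names
being placeholders):

--- cut here ---
ceph-admin # ceph osd tier add newpool oldpool --force-nonempty
ceph-admin # ceph osd tier cache-mode oldpool forward --yes-i-really-mean-it
ceph-admin # rados -p oldpool cache-flush-evict-all
ceph-admin # ceph osd tier remove newpool oldpool
--- cut here ---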


Thank you for any ideas, insight and experience you can share!

Regards,
J

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi Yan,

Zitat von "Yan, Zheng" :

[...]

It's likely some clients had caps on unlinked inodes, which prevent
MDS from purging objects. When a file gets deleted, mds notifies all
clients, clients are supposed to drop corresponding caps if possible.
You may hit a bug in this area, some clients failed to drop cap for
unlinked inodes.
[...]
There is a reconnect stage during MDS recovers. To reduce reconnect
message size, clients trim unused inodes from their cache
aggressively. In your case,  most unlinked inodes also got trimmed .
So mds could purge corresponding objects after it recovered


thank you for that detailed explanation. While I've already included
the recent code fix for this issue on a test node, all other mount
points (including the NFS server machine) still run the non-fixed
kernel Ceph client. So your description makes me believe we've hit
exactly what you describe.


Seems we'll have to fix the clients :)

Is there a command I can use to see what caps a client holds, to  
verify the proposed patch actually works?


Regards,
Jens




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi Webert,

Zitat von Webert de Souza Lima :

I have experienced delayed free in used space before, in Jewel, but that
just stopped happening with no intervention.


thank you for letting me know.

If none of the developers remembers fixing this issue, it might be a  
still pending problem.



Back then, umounting all client's fs would make it free the space rapidly.


In our case, while one of the Ceph cluster members rebooted, the
CephFS clients remained active continuously. So my hope is high that a
complete umount is not a hard requirement ;)


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi John,

Zitat von John Spray :

On Wed, Dec 13, 2017 at 2:11 PM, Jens-U. Mozdzen  wrote:
[...]

Then we had one of the nodes crash for a lack of memory (MDS was > 12 GB,
plus the new Bluestore OSD and probably the 12.2.1 BlueStore memory leak).

We brought the node back online and at first had MDS report an inconsistent
file system, though no other errors were reported. Once we restarted the
other MDS (by then active MDS on another node), that problem went away, too,
and we were back online. We did not restart clients, neither CephFS mounts
nor rbd clients.


I'm curious about the "MDS report an inconsistent file system" part --
what exactly was the error you were seeing?


my apologies, being off-site I mixed up messages. It wasn't about  
inconsistencies, but FS_DEGRADED.


When the failed node came back online (and Ceph then had recovered all  
objects problems after bringing the OSDs online), "ceph -s" reported  
"1 filesystem is degraded" and "ceph health detail" did also show just  
this error. At that time, both MDS were up and the MDS on the  
surviving node was the active MDS.


Once I restarted the MDS on the surviving node, FS_DEGRADED was cleared:

--- cut here ---
2017-12-07 19:05:33.113619 mon.node01 mon.0 192.168.160.15:6789/0 243  
: cluster [WRN] overall HEALTH_WARN 1 filesystem is degraded; noout  
flag(s) set; 1 nearfull osd(s)
2017-12-07 19:06:33.113826 mon.node01 mon.0 192.168.160.15:6789/0 298  
: cluster [INF] mon.1 192.168.160.16:6789/0
2017-12-07 19:06:33.113923 mon.node01 mon.0 192.168.160.15:6789/0 299  
: cluster [INF] mon.2 192.168.160.17:6789/0
2017-12-07 19:11:16.997308 mon.node01 mon.0 192.168.160.15:6789/0 541  
: cluster [INF] Standby daemon mds.node01 assigned to filesystem  
cephfs as rank 0
2017-12-07 19:11:16.997446 mon.node01 mon.0 192.168.160.15:6789/0 542  
: cluster [WRN] Health check failed: insufficient standby MDS daemons  
available (MDS_INSUFFICIENT_STANDBY)
2017-12-07 19:11:20.968933 mon.node01 mon.0 192.168.160.15:6789/0 553  
: cluster [INF] Health check cleared: MDS_INSUFFICIENT_STANDBY (was:  
insufficient standby MDS daemons available)
2017-12-07 19:11:33.113816 mon.node01 mon.0 192.168.160.15:6789/0 565  
: cluster [INF] mon.1 192.168.160.16:6789/0
2017-12-07 19:11:33.114958 mon.node01 mon.0 192.168.160.15:6789/0 566  
: cluster [INF] mon.2 192.168.160.17:6789/0
2017-12-07 19:12:09.889106 mon.node01 mon.0 192.168.160.15:6789/0 598  
: cluster [INF] daemon mds.node01 is now active in filesystem cephfs  
as rank 0
2017-12-07 19:12:09.983442 mon.node01 mon.0 192.168.160.15:6789/0 599  
: cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem  
is degraded)


No other errors/warnings were obvious. The "insufficient standby" at  
19:11:16.997308 is likely caused by the restart of the MDS at node2.


Regards,
Jens

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs automatic data pool cleanup

2017-12-13 Thread Jens-U. Mozdzen

Hi *,

during the last weeks, we noticed some strange behavior of our CephFS  
data pool (not metadata). As things have worked out over time, I'm  
just asking here so that I can better understand what to look out for  
in the future.


This is on a three-node Ceph Luminous (12.2.1) cluster with one active  
MDS and one standby MDS. We have a range of machines mounting that  
single CephFS via kernel mounts, using different versions of Linux  
kernels (all at least 4.4, with vendor backports).


We observed an ever-increasing number of objects and space allocation
on the (HDD-based, replicated) CephFS data pool, although the actual
file system usage didn't grow over time and actually decreased
significantly during that time period. The pool allocation went above
all warn and crit levels, forcing us to add new OSDs (our first three
Bluestore OSDs - all others are file-based) to relieve pressure, if
only for some time.


Part of the growth seems to be related to a large nightly compile job,  
that was using CephFS via an NFS server (kernel-based) exposing the  
kernel-mounted CephFS to many nodes: Once we stopped that job, pool  
allocation growth significantly slowed (but didn't stop).


Further diagnosis hinted that the data pool had many orphan objects,  
that is objects for inodes we could not locate in the live CephFS.


All the time, we did not notice any significant growth of the metadata  
pool (SSD-based) nor obvious errors in the Ceph logs (Ceph, MDS,  
OSDs). Except for the fill levels, the cluster was healthy. Restarting  
MDSs did not help.


Then we had one of the nodes crash for a lack of memory (MDS was > 12  
GB, plus the new Bluestore OSD and probably the 12.2.1 BlueStore  
memory leak).


We brought the node back online and at first had MDS report an  
inconsistent file system, though no other errors were reported. Once  
we restarted the other MDS (by then active MDS on another node), that  
problem went away, too, and we were back online. We did not restart  
clients, neither CephFS mounts nor rbd clients.


The following day we noticed an ongoing significant decrease in the  
number of objects in the CephFS data pool. As we couldn't spot any  
actual problems with the content of the CephFS (which was rather  
stable at the time), we sat back and watched - after some hours, the  
pool stabilized in size and was at a total size a bit closer to the  
actual CephFS content than before the mass deletion (FS size around  
630 GB per "df" output, current data pool size about 1100 GB, peak  
size was around 1.3 TB before the mass deletion).


What may it have been that we were watching - some form of garbage  
collection that was triggered by the node outage? Is this something we  
could have triggered manually before, to avoid the free space problems  
we faced? Or is this something unexpected, that should have happened  
auto-magically and much more often, but that for some reason didn't  
occur in our environment?


Thank you for any ideas and/or pointers you may share.

Regards,
J

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: costly MDS cache misses?

2017-11-29 Thread Jens-U. Mozdzen

Hi *,

while tracking down a different performance issue with CephFS  
(creating tar balls from CephFS-based directories takes multiple times  
as long as when backing up the same data from local disks, i.e. 56  
hours instead of 7), we had a look at CephFS performance related to  
the size of the MDS process.


Our Ceph cluster (Luminous 12.2.1) is using file-based OSDs, CephFS  
data is on SAS HDDs, meta data is on SAS SSDs.


It came to mind that MDS memory consumption might cause the delays
with "tar". But while the results below don't confirm this (they
actually confirm that MDS memory size does not affect CephFS read
speed when the cache is sufficiently warm), they do show an almost 30%
performance drop if the cache is filled with the wrong entries.


After a fresh process start, our MDS takes about 450 MB of virtual
memory, with 56 MB resident. I then start a tar run over 36 GB of small
files (which I had also run a few minutes before the MDS restart, to
warm up the disk caches):


--- cut here ---
   PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  1233 ceph  20   0  446584  56000  15908 S 3.960 0.085  0:01.08 ceph-mds


server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 17:38:21 CET 2017
38245529600
Wed Nov 29 17:44:27 CET 2017
server01:~ #

   PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  1233 ceph  20   0  485760 109156  16148 S 0.331 0.166  0:10.76 ceph-mds

--- cut here ---

As you can see, there's only small growth in MDS virtual size.

The job took 366 seconds, that's an average of about 100 MB/s.

I repeat that job a few minutes later, to get numbers with a  
previously active MDS (the MDS cache should be warmed up now):


--- cut here ---
   PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  1233 ceph  20   0  494976 118404  16148 S 2.961 0.180  0:16.21 ceph-mds


server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 17:53:09 CET 2017
38245529600
Wed Nov 29 17:58:53 CET 2017
server01:~ #

   PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  1233 ceph  20   0  508288 131368  16148 S 1.980 0.200  0:25.45 ceph-mds

--- cut here ---

The job took 344 seconds, that's an average of about 106 MB/s. With
only a single run per situation, these numbers aren't more than a rough
estimate, of course.


At 18:00:00, a file-based incremental backup job kicks in, which reads  
through most of the files on the CephFS, but only backing up those  
that were changed since the last run. This has nothing to do with our  
"tar" and is running on a different node, where CephFS is  
kernel-mounted as well. That backup job makes the MDS cache grow  
drastically, you can see MDS at more than 8 GB now.


We then start another tar job (or rather two, to account for MDS  
caching), as before:


--- cut here ---
   PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  1233 ceph  20   0 8644776 7.750g  16184 S 0.990 12.39  6:45.24 ceph-mds


server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 18:13:20 CET 2017
38245529600
Wed Nov 29 18:21:50 CET 2017
server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . |  
wc -c; date

Wed Nov 29 18:22:52 CET 2017
38245529600
Wed Nov 29 18:28:28 CET 2017
server01:~ #

   PID USER  PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+ COMMAND
  1233 ceph  20   0 8761512 7.642g  16184 S 3.300 12.22  7:03.52 ceph-mds

--- cut here ---

The second run is even a bit quicker than the "warmed-up" run with the
only partially filled cache (336 seconds, that's 108.5 MB/s).


But the run against the filled-up MDS cache, where most (if not all)
entries are no match for our tar lookups, took 510 seconds - that's
71.5 MB/s, instead of the roughly 100 MB/s when the cache was empty.


This is by far no precise benchmark test, indeed. But it at least  
seems to be an indicator that MDS cache misses are costly. (During the  
tests, only small amounts of changes in CephFS were likely -  
especially compared to the amount of reads and file lookups for their  
metadata.)


Regards,
Jens

PS: Why so much memory for MDS in the first place? Because during  
those (hourly) incremental backup runs, we got a large number of MDS  
warnings about insufficient cache pressure responses from clients.  
Increasing the MDS cache size did help to avoid these.
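
For reference, the kind of setting meant here - in 12.2.x the
memory-based limit is the relevant knob; the value is just an example:

--- cut here ---
# ceph.conf on the MDS nodes
[mds]
mds_cache_memory_limit = 8589934592    # 8 GiB, adjust to your needs
--- cut here ---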


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] strange error on link() for nfs over cephfs

2017-11-29 Thread Jens-U. Mozdzen

Hi *,

we recently have switched to using CephFS (with Luminous 12.2.1). On  
one node, we're kernel-mounting the CephFS (kernel 4.4.75, openSUSE  
version) and export it via kernel nfsd. As we're transitioning right  
now, a number of machines still auto-mount users home directories from  
that nfsd.


A strange error that was not present when using the same nfsd  
exporting local-disk-based file systems, has recently surfaced. The  
problem is most visible to the user when doing a ssh-keygen operation  
to remove old keys from their "known_hosts", but it seems likely that  
this error will occur in other constellations, too.


The error report from "ssh-keygen" is:

--- cut here ---
user@host:~> ssh-keygen -R somehost -f /home/user/.ssh/known_hosts
# Host somehost found: line 232
link /home/user/.ssh/known_hosts to /home/user/.ssh/known_hosts.old:  
Not a directory

user@host:~>
--- cut here ---

This error persists... until the user lists the contents of the
directory containing the "known_hosts" file (~/.ssh). Once that is
done (i.e. "ls -l ~/.ssh"), ssh-keygen works as expected.
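
The behaviour should be reproducible without ssh-keygen as well; a
sketch of what we'd expect on an NFS client (untested in this exact
form, paths are examples):

--- cut here ---
# the failing sequence, stripped down to the link() step ssh-keygen does
nfs-client $ rm -f ~/.ssh/known_hosts.old
nfs-client $ ln ~/.ssh/known_hosts ~/.ssh/known_hosts.old   # ENOTDIR in our case
# after simply listing the directory ...
nfs-client $ ls -l ~/.ssh > /dev/null
# ... the same link() call succeeds
nfs-client $ ln ~/.ssh/known_hosts ~/.ssh/known_hosts.old
--- cut here ---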


We've strace'd ssh-keygen and see the following steps (among others):

- the original known_hosts file is opened successfully
- a temp file is created in .ssh (successfully)
- a previous backup copy (known_hosts.old) is unlink()ed (not  
successful, since not present)

- a link() from known_hosts to known_hosts.old is tried - ENOTDIR

--- cut here ---
[...]
unlink("/home/user/.ssh/known_hosts.old") = -1 ENOENT (No such file or  
directory)
link("/home/user/.ssh/known_hosts", "/home/user/.ssh/known_hosts.old")  
= -1 ENOTDIR (Not a directory)

--- cut here ---

Once the directory was listed, the link() call works nicely:

--- cut here ---
unlink("/home/user/.ssh/known_hosts.old") = -1 ENOENT (No such file or  
directory)

link("/home/user/.ssh/known_hosts", "/home/user/.ssh/known_hosts.old") = 0
rename("/home/user/.ssh/known_hosts.5trpXBpIgB",  
"/home/user/.ssh/known_hosts") = 0

--- cut here ---

When link() returns an error, the rename is not called, leaving the
user with one leftover temporary file per attempt in .ssh - they never
get renamed.


This does sound like a bug to me, has anybody else stumbled across  
similar symptoms as well?


Regards,
Jens


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "failed to open ino"

2017-11-28 Thread Jens-U. Mozdzen

Hi David,

Zitat von David C :

On 27 Nov 2017 1:06 p.m., "Jens-U. Mozdzen"  wrote:

Hi David,

Zitat von David C :

Hi Jens


We also see these messages quite frequently, mainly the "replicating
dir...". Only seen "failed to open ino" a few times so didn't do any real
investigation. Our set up is very similar to yours, 12.2.1, active/standby
MDS and exporting cephfs through KNFS (hoping to replace with Ganesha
soon).



been there, done that - using Ganesha more than doubled the run-time of our
jobs, while with knfsd, the run-time is about the same for CephFS-based and
"local disk"-based files. But YMMV, so if you see speeds with Ganesha that
are similar to knfsd, please report back with details...


I'd be interested to know if you tested Ganesha over a cephfs kernel mount
(ie using the VFS fsal) or if you used the Ceph fsal. Also the server and
client versions you tested.


I had tested Ganesha only via the Ceph FSAL. Our Ceph nodes (including  
the one used as a Ganesha server) are running  
ceph-12.2.1+git.1507910930.aea79b8b7a on OpenSUSE 42.3, SUSE's kernel  
4.4.76-1-default (which has a number of back-ports in it), Ganesha is  
at version nfs-ganesha-2.5.2.0+git.1504275777.a9d23b98f.


The NFS clients are a broad mix of current and older systems.


Prior to Luminous, Ganesha writes were terrible due to a bug with fsync
calls in the mds code. The fix went into the mds and client code. If you're
doing Ganesha over the top of the kernel mount you'll need a pretty recent
kernel to see the write improvements.


As we were testing the Ceph FSAL, this should not be the cause.


From my limited Ganesha testing so far, reads are better when exporting the
kernel mount, writes are much better with the Ceph fsal. But that's
expected for me as I'm using the CentOS kernel. I was hoping the
aforementioned fix would make it into the rhel 7.4 kernel but doesn't look
like it has.


When exporting the kernel-mounted CephFS via kernel nfsd, we see  
similar speeds to serving the same set of files from a local bcache'd  
RAID1 array on SAS disks. This is for a mix of reads and writes,  
mostly small files (compile jobs, some packaging).



From what I can see, it would have to be A/A/P, since MDS demands at least
one stand-by.


That's news to me.


From http://docs.ceph.com/docs/master/cephfs/multimds/ :

"Each CephFS filesystem has a max_mds setting, which controls how many  
ranks will be created. The actual number of ranks in the filesystem  
will only be increased if a spare daemon is available to take on the  
new rank. For example, if there is only one MDS daemon running, and  
max_mds is set to two, no second rank will be created."


Might well be I was mis-reading this... I had first read it to mean  
that a spare daemon needs to be available *while running* A/A, but the  
example sounds like the spare is required when *switching to* A/A.



Is it possible you still had standby config in your ceph.conf?


Not sure what you're asking for, is this related to active/active or  
to our Ganesha tests? We have not yet tried to switch to A/A, so our  
config actually contains standby parameters.


Regards,
Jens

--
Jens-U. Mozdzen voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15   mobile  : +49-179-4 98 21 98
D-22423 Hamburg e-mail  : jmozd...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Torlée-Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "failed to open ino"

2017-11-28 Thread Jens-U. Mozdzen

Hi,

Zitat von "Yan, Zheng" :

On Sat, Nov 25, 2017 at 2:27 AM, Jens-U. Mozdzen  wrote:

Hi all,
[...]
In the log of the active MDS, we currently see the following two inodes
reported over and over again, about every 30 seconds:

--- cut here ---
2017-11-24 18:24:16.496397 7fa308cf0700  0 mds.0.cache  failed to open ino
0x10001e45e1d err -22/0
2017-11-24 18:24:16.497037 7fa308cf0700  0 mds.0.cache  failed to open ino
0x10001e4d6a1 err -22/-22
[...]
--- cut here ---

There were other reported inodes with other errors, too ("err -5/0", for
instance), the root cause seems to be the same (see below).

[...]
It's likely caused by NFS export.  MDS reveals this error message if
NFS client tries to access a deleted file. The error causes NFS client
to return -ESTALE.


You were spot on - a process remained active after the test runs and  
had that directory as its current working directory. Stopping the  
process stops the messages.
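
In case someone else runs into this: one way to hunt for such left-over
processes on the NFS clients is something along these lines (the mount
point is a placeholder):

--- cut here ---
# list processes whose current working directory lies below the NFS mount
for pid in /proc/[0-9]*; do
    cwd=$(readlink "$pid/cwd" 2>/dev/null) || continue
    case "$cwd" in
        /mnt/nfs/*) echo "${pid##*/}: $cwd" ;;
    esac
done
--- cut here ---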


Thank you for pointing me there!

Best regards,
Jens



Re: [ceph-users] "failed to open ino"

2017-11-27 Thread Jens-U. Mozdzen

Hi David,

Quoting David C :

Hi Jens

We also see these messages quite frequently, mainly the "replicating
dir...". Only seen "failed to open ino" a few times so didn't do any real
investigation. Our set up is very similar to yours, 12.2.1, active/standby
MDS and exporting cephfs through KNFS (hoping to replace with Ganesha
soon).


Been there, done that - using Ganesha more than doubled the run-time  
of our jobs, while with knfsd, the run-time is about the same for  
CephFS-based and "local disk"-based files. But YMMV, so if you see  
speeds with Ganesha that are similar to knfsd, please report back with  
details...



Interestingly, the paths reported in "replicating dir" are usually
dirs exported through Samba (generally Windows profile dirs). Samba runs
really well for us and there doesn't seem to be any impact on users. I
expect we wouldn't see these messages if running active/active MDS but I'm
still a bit cautious about implementing that (am I being overly cautious I
wonder?).


From what I can see, it would have to be A/A/P, since MDS demands at  
least one stand-by.


Regards,
Jens



Re: [ceph-users] "failed to open ino"

2017-11-27 Thread Jens-U. Mozdzen

Hi,

Zitat von "Yan, Zheng" :

On Sat, Nov 25, 2017 at 2:27 AM, Jens-U. Mozdzen  wrote:

[...]
In the log of the active MDS, we currently see the following two inodes
reported over and over again, about every 30 seconds:

--- cut here ---
2017-11-24 18:24:16.496397 7fa308cf0700  0 mds.0.cache  failed to open ino
[...]


It's likely caused by NFS export.  MDS reveals this error message if
NFS client tries to access a deleted file. The error causes NFS client
to return -ESTALE.


Thank you for pointing me to this potential cause - as we're still  
using NFS access during that job (old clients without native CephFS  
support), we may well have some as-yet-unnoticed stale NFS file  
handles. I'll have a closer look, indeed!


Regards,
Jens



[ceph-users] "failed to open ino"

2017-11-24 Thread Jens-U. Mozdzen

Hi all,

with our Ceph Luminous CephFS, we're plagued with "failed to open ino"  
messages. These don't seem to affect daily business (in terms of "file  
access"). (There's a backup performance issue that may eventually be  
related, but I'll report on that in a different thread.)


Our Ceph is currently at v12.2.1 (git.1507910930.aea79b8b7a), on  
OpenSUSE Leap 42.3. Three Ceph nodes, 12 HDD OSDs, two SSD OSDs,  
status "HEALTH_OK".


We have a single CephFS and two MDS (active/standby); the metadata  
pool is on SSD OSDs, the content on HDD OSDs (all FileStore). That CephFS is  
mounted by several clients (via kernel cephfs support, mostly kernel  
version 4.4.76) and via NFS (kernel nfsd on a kernel-mounted CephFS).


In the log of the active MDS, we currently see the following two  
inodes reported over and over again, about every 30 seconds:


--- cut here ---
2017-11-24 18:24:16.496397 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-24 18:24:16.497037 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:16.500645 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-24 18:24:16.501218 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:46.506210 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-24 18:24:46.506926 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22
2017-11-24 18:24:46.510354 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-24 18:24:46.510891 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22

--- cut here ---

There were other reported inodes with other errors, too ("err -5/0",  
for instance); the root cause seems to be the same (see below).


The 0x10001e4d6a1 inode was first mentioned in the MDS log as follows,  
and then every 30 seconds as "failed to open"; 0x10001e45e1d just  
appeared at the same time, "out of the blue":


--- cut here ---
2017-11-23 19:12:28.440107 7f3586cbd700  0 mds.0.bal replicating dir  
[dir 0x10001e4d6a1 /some/path/in/our/cephfs/ [2,head] auth pv=3473  
v=3471 cv=2963/2963 ap=1+6+6 state=1610612738|complete f(v0  
m2017-11-23 19:12:25.005299 15=4+11) n(v74 rc2017-11-23  
19:12:28.429258 b137317605 5935=4875+1060)/n(v74 rc2017-11-23  
19:12:28.337259 b139723963 5969=4898+1071) hs=15+23,ss=0+0 dirty=30 |  
child=1 dirty=1 authpin=1 0x5649e811e9c0] pop 10223.1 .. rdp 373 adj 0
2017-11-23 19:15:55.015347 7f3580cb1700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-23 19:15:55.016056 7f3580cb1700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22

--- cut here ---

According to a message from this ML, "replicating dir" is supposed to  
indicate that the directory is hot and being replicated to another  
active MDS to spread the load. But there is no other active MDS, as we  
have only two in total (and Ceph properly reports "mds: cephfs-1/1/1  
up  {0=server1=up:active}, 1 up:standby-replay").
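
For reference, a quick way to double-check this, assuming the filesystem
name "cephfs" from the status line above:

--- cut here ---
ceph mds stat                       # active rank(s) plus standby / standby-replay
ceph fs get cephfs | grep max_mds   # expected to still be 1 here
--- cut here ---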


From what I can tell from the logs, all error messages seem to be related  
to that same path "/some/path/in/our/cephfs/" all the time, which gets  
deleted and recreated every evening - thus changing ino numbers. But  
what can it be that makes that same path, buried far down in some  
cephfs hierarchy, cause these "failed to open ino" messages? It's just  
one of many directories, and not a particularly populated one (about  
40 files and directories).
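
By the way: as long as the path still exists, a logged ino can be
cross-checked against the kernel-mounted CephFS, e.g. like this (the
mount point is a placeholder; note that find does a full tree walk, so
use with care on large trees):

--- cut here ---
# convert the hex ino from the MDS log and search the mount for it
find /mnt/cephfs -inum "$(printf '%d' 0x10001e4d6a1)" 2>/dev/null
--- cut here ---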


Another possibly interesting side fact: The previous ino's messages  
sometimes stop when our evening/night job ends (no more active  
accesses to that directory), sometimes they continue throughout the  
day. I've even noticed it crossing the directory recreation time  
(messages for the old inode don't stop when the old directory is  
deleted, except after a new "replicating dir" message appears). Today I  
deleted the directory (including its parent) during the day, but the  
messages wouldn't stop. The messages also survive switching to the  
other MDS. And by now the next daily instance of our job has started,  
recreating the directories (that I had deleted throughout the day).  
The "failed to open ino" messages for the old ino numbers still  
appears every 30 seconds, I've not yet seen any "replicating dir"  
message for any of that cephfs tree area. I have seen a few for other  
areas of the cephfs tree, but no other ino numbers were reported as  
"failed to open" - only the two from above:


--- cut here ---
2017-11-24 19:22:13.50 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-24 19:22:13.000770 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22
2017-11-24 19:22:13.003918 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e45e1d err -22/0
2017-11-24 19:22:13.004469 7fa308cf0700  0 mds.0.cache  failed to open  
ino 0x10001e4d6a1 err -22/-22
2017-11-24

Re: [ceph-users] who is using nfs-ganesha and cephfs?

2017-11-15 Thread Jens-U. Mozdzen

Hi all,

By Sage Weil :

Who is running nfs-ganesha's FSAL to export CephFS?  What has your
experience been?

(We are working on building proper testing and support for this into
Mimic, but the ganesha FSAL has been around for years.)


After we had moved most of our file-based data to a CephFS environment  
and were suffering from what later turned out to be a  
(mis-)configuration issue with our existing nfsd server, I decided to  
give Ganesha a try.


We run a Ceph cluster on three servers, openSUSE Leap 42.3, Ceph  
Luminous (latest stable). 2x10G interfaces for intra-cluster  
communication, 2x1G towards the NFS clients. CephFS meta-data is on an  
SSD pool, the actual data is on SAS HDDs, 12 OSDs. Ganesha version is  
2.5.2.0+git.1504275777.a9d23b98f-3.6. All Ganesha/nfsd server services  
are on one of the servers that are also Ceph nodes.


We run an automated, distributed build&stage environment (tons of gcc  
compiles on multiple clients, some Java compiles, RPM builds etc.),  
with (among others) nightly test build runs. These usually take about  
8 hours, when using kernel nfsd and local storage on the same servers  
that also provide the Ceph service.


After switching to Ganesha (with the CephFS FSAL, Ganesha running on  
the same server where we had originally run nfsd) and starting test  
runs of the same workload, we aborted the runs after about 12 hours -  
by then, only an estimated 60 percent of the job was done.


For comparison, when now using kernel nfsd to serve the CephFS shares  
(mounting the single CephFS via the kernel FS module on the server  
that's running nfsd, and exporting multiple sub-directories via nfsd),  
we see an increase of between zero and eight percent over the original  
run time.


So to us, comparing "Ganesha+CephFS FSAL" to "kernel nfsd with kernel  
CephFS module", the latter wins. Or to put it the other way around,  
Ganesha seems unusable to us in its current state, judging by the  
slowness observed.


Other issues I noticed:

- the directory size, as shown by "ls -l" on the client, was very  
different from that shown when mounting via nfsd ;)


- "showmount" did not return any entries, with would have (later on,  
had we continued to use Ganesha) caused problems with our dynamic  
automouter maps
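
For reference, the check was simply something like this (the host name
is a placeholder):

--- cut here ---
# list the exports announced by the server - came back empty with Ganesha
showmount -e nfs-server.example.com
--- cut here ---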


Please note that I did not have time to do intensive testing against  
different Ganesha parameters. The only runs I made were with or  
without "MaxRead = 1048576; MaxWrite = 1048576;" per share, per some  
comment about buffer sizes. These changes didn't seem to bring much  
difference, though.
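
For completeness, those two options simply went into the per-share EXPORT
blocks of ganesha.conf; a stripped-down sketch (export ID and paths are
placeholders, not our real shares):

--- cut here ---
# example of one per-share export with the buffer-size caps applied
cat >> /etc/ganesha/ganesha.conf <<'EOF'
EXPORT {
    Export_ID = 101;
    Path = "/shares/build";
    Pseudo = "/shares/build";
    Access_Type = RW;
    MaxRead = 1048576;
    MaxWrite = 1048576;
    FSAL { Name = CEPH; }
}
EOF
--- cut here ---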


We closely monitor our network and server performance, and I could  
clearly see a huge drop in network traffic (NFS server to clients)  
when switching from nfsd to Ganesha, and a corresponding increase when  
switching back to nfsd (sharing the CephFS mount). None of the servers  
seemed to be under excessive load during these tests, but it was  
obvious that Ganesha took its share of CPU - maybe the bottleneck was  
some single-threaded operation, so Ganesha could not make use of the  
other, idling cores. But I'm just guessing here.


Regards,
J

