Re: [ceph-users] Weird mount issue (Ubuntu 18.04, Ceph 14.2.5 & 14.2.6)

2020-01-17 Thread Jeff Layton
On Fri, 2020-01-17 at 17:10 +0100, Ilya Dryomov wrote:
> On Fri, Jan 17, 2020 at 2:21 AM Aaron  wrote:
> > No worries, can definitely do that.
> > 
> > Cheers
> > Aaron
> > 
> > On Thu, Jan 16, 2020 at 8:08 PM Jeff Layton  wrote:
> > > On Thu, 2020-01-16 at 18:42 -0500, Jeff Layton wrote:
> > > > On Wed, 2020-01-15 at 08:05 -0500, Aaron wrote:
> > > > > Seeing a weird mount issue.  Some info:
> > > > > 
> > > > > No LSB modules are available.
> > > > > Distributor ID: Ubuntu
> > > > > Description: Ubuntu 18.04.3 LTS
> > > > > Release: 18.04
> > > > > Codename: bionic
> > > > > 
> > > > > Ubuntu 18.04.3 with kernel 4.15.0-74-generic
> > > > > Ceph 14.2.5 & 14.2.6
> > > > > 
> > > > > With ceph-common, ceph-base, etc installed:
> > > > > 
> > > > > ceph/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-base/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-common/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > ceph-mds/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-mgr/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > ceph-mgr-dashboard/stable,stable,now 14.2.6-1bionic all [installed]
> > > > > ceph-mon/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > ceph-osd/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > libcephfs2/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > python-ceph-argparse/stable,stable,now 14.2.6-1bionic all 
> > > > > [installed,automatic]
> > > > > python-cephfs/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > 
> > > > > I create a user via the get-or-create cmd, and I now have a user/secret.
> > > > > When I try to mount on these Ubuntu nodes, it fails.
> > > > > 
> > > > > The mount cmd I run for testing is:
> > > > > sudo mount -t ceph -o
> > > > > name=user-20c5338c-34db-11ea-b27a-de7033e905f6,secret=AQC6dhpeyczkDxAAhRcr7oERUY4BcD2NCUkuNg==
> > > > > 10.10.10.10:6789:/work/20c5332d-34db-11ea-b27a-de7033e905f6 /tmp/test
> > > > > 
> > > > > I get the error:
> > > > > couldn't finalize options: -34
> > > > > 
> > > > > From some tracking down, it's part of the get_secret_option() in
> > > > > common/secrets.c and the Linux System Error:
> > > > > 
> > > > > #define ERANGE  34  /* Math result not representable */
> > > > > 
> > > > > Now the weird part...when I remove all the libs above, the mount
> > > > > command works. I know that there are ceph.ko modules in the Ubuntu
> > > > > filesystems DIR, and that Ubuntu comes with some understanding of how
> > > > > to mount a cephfs system.  So, that explains how it can mount
> > > > > cephfs...but, what I don't understand is why I'm getting that -34
> > > > > error with the 14.2.5 and 14.2.6 libs installed. I didn't have this
> > > > > issue with 14.2.3 or 14.2.4.
> > > > 
> > > > This sounds like a regression in mount.ceph, probably due to something
> > > > that went in for v14.2.5. I can reproduce the problem on Fedora, and I
> > > > think it has something to do with the very long username you're using.
> > > > 
> > > > I'll take a closer look and let you know. Stay tuned.
> > > > 
> > > 
> > > I think I see the issue. The SECRET_OPTION_BUFSIZE is just too small for
> > > your use case. We need to make that a little larger than the largest
> > > name= parameter can be. Prior to v14.2.5, it was ~1000 bytes, but I made
> > > it smaller in that set thinking that was too large. Mea culpa.
> > > 
> > > The problem is determining how big that size can be. AFAICT EntityName
> > > is basically a std::string in the ceph code, which can be an arbitrary
> > > size (up to 4g or so).
> 
> It's just that you made SECRET_OPTION_BUFSIZE account precisely for
> "secret=", but it can also be "key=".
> 
> I don't think there is much of a problem.  Defining it back to ~1000 is
> guaranteed to work.  Or we could remove it and just compute the size of
> secret_option exactly the same way as get_secret_option() does it:
> 
>   strlen(cmi->cmi_secret) + strlen(cmi->cmi_name) + 7 + 1
> 

Yeah, it's not hard to do a simple fix like that, but I opted to rework
the code to just safe_cat the secret option string(s) directly into the 
options buffer.

That eliminates some extra copies of this info and the need for an
arbitrary limit altogether. It also removes a chunk of code that doesn't
really need to be in the common lib.

See:

https://github.com/ceph/ceph/pull/32706
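
For anyone curious about the general shape of that, here's a rough sketch in
plain C (illustrative only -- not the code from that PR, and the helper name
here is made up). The idea is to size the buffer from the actual name/secret
lengths and append the option text directly, so there's no arbitrary
SECRET_OPTION_BUFSIZE-style limit to overflow:

/*
 * Illustrative sketch only. append_secret_option() is a hypothetical
 * helper: it grows the option string by exactly what the name/secret
 * need, instead of staging them in a fixed-size intermediate buffer.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *append_secret_option(const char *options, const char *name,
                                  const char *secret)
{
        /* existing options + "," + "name=" + name + ",secret=" + secret + NUL */
        size_t len = strlen(options) + strlen(name) + strlen(secret) + 16;
        char *buf = malloc(len);

        if (!buf)
                return NULL;
        snprintf(buf, len, "%s%sname=%s,secret=%s",
                 options, *options ? "," : "", name, secret);
        return buf;
}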

Aaron, if you have a way to build and test this, it'd be good if you
could confirm that it fixes the problem for you.
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weird mount issue (Ubuntu 18.04, Ceph 14.2.5 & 14.2.6)

2020-01-17 Thread Jeff Layton
Actually, scratch that. I went ahead and opened this:

https://tracker.ceph.com/issues/43649

Feel free to watch that one for updates.

On Fri, 2020-01-17 at 07:43 -0500, Jeff Layton wrote:
> No problem. Can you let me know the tracker bug number once you've
> opened it?
> 
> Thanks,
> Jeff
> 
> On Thu, 2020-01-16 at 20:24 -0500, Aaron wrote:
> > This debugging started because the ceph-provisioner from k8s was making 
> > those users...but what we found was that doing something similar by hand 
> > caused the same issue. Just surprised no one else using k8s and Ceph-backed 
> > PVCs/PVs ran into this issue. 
> > 
> > Thanks again for all your help!
> > 
> > Cheers
> > Aaron
> > 
> > On Thu, Jan 16, 2020 at 8:21 PM Aaron  wrote:
> > > No worries, can definitely do that. 
> > > 
> > > Cheers
> > > Aaron
> > > 
> > > On Thu, Jan 16, 2020 at 8:08 PM Jeff Layton  wrote:
> > > > On Thu, 2020-01-16 at 18:42 -0500, Jeff Layton wrote:
> > > > > On Wed, 2020-01-15 at 08:05 -0500, Aaron wrote:
> > > > > > Seeing a weird mount issue.  Some info:
> > > > > > 
> > > > > > No LSB modules are available.
> > > > > > Distributor ID: Ubuntu
> > > > > > Description: Ubuntu 18.04.3 LTS
> > > > > > Release: 18.04
> > > > > > Codename: bionic
> > > > > > 
> > > > > > Ubuntu 18.04.3 with kernel 4.15.0-74-generic
> > > > > > Ceph 14.2.5 & 14.2.6
> > > > > > 
> > > > > > With ceph-common, ceph-base, etc installed:
> > > > > > 
> > > > > > ceph/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-base/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-common/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > ceph-mds/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-mgr/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > ceph-mgr-dashboard/stable,stable,now 14.2.6-1bionic all [installed]
> > > > > > ceph-mon/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > ceph-osd/stable,now 14.2.6-1bionic amd64 [installed]
> > > > > > libcephfs2/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > python-ceph-argparse/stable,stable,now 14.2.6-1bionic all 
> > > > > > [installed,automatic]
> > > > > > python-cephfs/stable,now 14.2.6-1bionic amd64 [installed,automatic]
> > > > > > 
> > > > > > I create a user via the get-or-create cmd, and I now have a user/secret.
> > > > > > When I try to mount on these Ubuntu nodes, it fails.
> > > > > > 
> > > > > > The mount cmd I run for testing is:
> > > > > > sudo mount -t ceph -o
> > > > > > name=user-20c5338c-34db-11ea-b27a-de7033e905f6,secret=AQC6dhpeyczkDxAAhRcr7oERUY4BcD2NCUkuNg==
> > > > > > 10.10.10.10:6789:/work/20c5332d-34db-11ea-b27a-de7033e905f6 
> > > > > > /tmp/test
> > > > > > 
> > > > > > I get the error:
> > > > > > couldn't finalize options: -34
> > > > > > 
> > > > > > From some tracking down, it's part of the get_secret_option() in
> > > > > > common/secrets.c and the Linux System Error:
> > > > > > 
> > > > > > #define ERANGE  34  /* Math result not representable */
> > > > > > 
> > > > > > Now the weird part...when I remove all the libs above, the mount
> > > > > > command works. I know that there are ceph.ko modules in the Ubuntu
> > > > > > filesystems DIR, and that Ubuntu comes with some understanding of 
> > > > > > how
> > > > > > to mount a cephfs system.  So, that explains how it can mount
> > > > > > cephfs...but, what I don't understand is why I'm getting that -34
> > > > > > error with the 14.2.5 and 14.2.6 libs installed. I didn't have this
> > > > > > issue with 14.2.3 or 14.2.4.
> > > > > 
> > > > > This sounds like a regression in mount.ceph, probably due to something
> > > > > that went in for v14.2.5. I can reproduce the problem on Fedora, and I
> > > > > think it has something to do with the very long username you're using.
> > > > > 
> > > > > I'll take a closer look and let you know. Stay tuned.
> > > > > 
> > > > 
> > > > I think I see the issue. The SECRET_OPTION_BUFSIZE is just too small for
> > > > your use case. We need to make that a little larger than the largest
> > > > name= parameter can be. Prior to v14.2.5, it was ~1000 bytes, but I made
> > > > it smaller in that set thinking that was too large. Mea culpa.
> > > > 
> > > > The problem is determining how big that size can be. AFAICT EntityName
> > > > is basically a std::string in the ceph code, which can be an arbitrary
> > > > size (up to 4g or so).
> > > > 
> > > > Aaron, would you mind opening a bug for this at tracker.ceph.com? We
> > > > should be able to get it fixed up, once I do a bit more research to
> > > > figure out how big to make this buffer.

-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS meltdown fallout: mds assert failure, kernel oopses

2019-08-15 Thread Jeff Layton
On Thu, 2019-08-15 at 16:45 +0900, Hector Martin wrote:
> On 15/08/2019 03.40, Jeff Layton wrote:
> > On Wed, 2019-08-14 at 19:29 +0200, Ilya Dryomov wrote:
> > > Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
> > > Please take a look.
> > > 
> > 
> > (sorry for duplicate mail -- the other one ended up in moderation)
> > 
> > Thanks Ilya,
> > 
> > That function is pretty straightforward. We don't do a whole lot of
> > pointer chasing in there, so I'm a little unclear on where this would
> > have crashed. Right offhand, that kernel is probably missing
> > 1b52931ca9b5b87 (ceph: remove duplicated filelock ref increase), but
> > that seems unlikely to result in an oops.
> > 
> > Hector, if you have the debuginfo for this kernel installed on one of
> > these machines, could you run gdb against the ceph.ko module and then
> > do:
> > 
> >   gdb> list *(ceph_lock_message+0x212)
> > 
> > That may give me a better hint as to what went wrong.
> 
> This is what I get:
> 
> (gdb)  list *(ceph_lock_message+0x212)
> 0xd782 is in ceph_lock_message (/build/linux-hwe-B83fOS/linux-hwe-4.18.0/fs/ceph/locks.c:116).
> 111             req->r_wait_for_completion = ceph_lock_wait_for_completion;
> 112
> 113             err = ceph_mdsc_do_request(mdsc, inode, req);
> 114
> 115             if (operation == CEPH_MDS_OP_GETFILELOCK) {
> 116                     fl->fl_pid = -le64_to_cpu(req->r_reply_info.filelock_reply->pid);
> 117                     if (CEPH_LOCK_SHARED == req->r_reply_info.filelock_reply->type)
> 118                             fl->fl_type = F_RDLCK;
> 119                     else if (CEPH_LOCK_EXCL == req->r_reply_info.filelock_reply->type)
> 120                             fl->fl_type = F_WRLCK;
> 
> Disasm:
> 
> 0xd77b <+523>:   mov    0x250(%rbx),%rdx
> 0xd782 <+530>:   mov    0x20(%rdx),%rdx
> 0xd786 <+534>:   neg    %edx
> 0xd788 <+536>:   mov    %edx,0x48(%r15)
> 
> That means req->r_reply_info.filelock_reply was NULL.
> 
> 

Many thanks, Hector. Would you mind opening a bug against the kernel
client at https://tracker.ceph.com ? That's better than doing this via
email and we'll want to make sure we keep track of this.  Did you say
that this was reproducible?

Now...

Note that we don't actually check whether ceph_mdsc_do_request returned
success before we start dereferencing there. I suspect that function
returned an error, and the pointer was left zeroed out.

Probably, we just need to turn that if statement into:

if (!err && operation == CEPH_MDS_OP_GETFILELOCK) {
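
For context, here's a rough sketch of how that hunk would look with the check
added, based on the listing Hector posted above (not necessarily the final
patch):

	err = ceph_mdsc_do_request(mdsc, inode, req);

	/* Don't parse the filelock reply unless the request succeeded;
	 * on error, req->r_reply_info.filelock_reply may be left NULL. */
	if (!err && operation == CEPH_MDS_OP_GETFILELOCK) {
		fl->fl_pid = -le64_to_cpu(req->r_reply_info.filelock_reply->pid);
		if (CEPH_LOCK_SHARED == req->r_reply_info.filelock_reply->type)
			fl->fl_type = F_RDLCK;
		else if (CEPH_LOCK_EXCL == req->r_reply_info.filelock_reply->type)
			fl->fl_type = F_WRLCK;
		/* ... rest of the GETFILELOCK handling unchanged ... */
	}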

I'll queue up a patch.

Thanks for the report!
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS meltdown fallout: mds assert failure, kernel oopses

2019-08-14 Thread Jeff Layton
On Wed, 2019-08-14 at 19:29 +0200, Ilya Dryomov wrote:
> On Tue, Aug 13, 2019 at 1:06 PM Hector Martin  wrote:
> > I just had a minor CephFS meltdown caused by underprovisioned RAM on the
> > MDS servers. This is a CephFS with two ranks; I manually failed over the
> > first rank and the new MDS server ran out of RAM in the rejoin phase
> > (ceph-mds didn't get OOM-killed, but I think things slowed down enough
> > due to swapping out that something timed out). This happened 4 times,
> > with the rank bouncing between two MDS servers, until I brought up an
> > MDS on a bigger machine.
> > 
> > The new MDS managed to become active, but then crashed with an assert:
> > 
> > 2019-08-13 16:03:37.346 7fd4578b2700  1 mds.0.1164 clientreplay_done
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> > version 1239 from mon.1
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map i am
> > now mds.0.1164
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 handle_mds_map state
> > change up:clientreplay --> up:active
> > 2019-08-13 16:03:37.502 7fd45e2a7700  1 mds.0.1164 active_start
> > 2019-08-13 16:03:37.690 7fd45e2a7700  1 mds.0.1164 cluster recovered.
> > 2019-08-13 16:03:45.130 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> > version 1240 from mon.1
> > 2019-08-13 16:03:46.162 7fd45e2a7700  1 mds.mon02 Updating MDS map to
> > version 1241 from mon.1
> > 2019-08-13 16:03:50.286 7fd4578b2700 -1
> > /build/ceph-13.2.6/src/mds/MDCache.cc: In function 'void
> > MDCache::remove_inode(CInode*)' thread 7fd4578b2700 time 2019-08-13
> > 16:03:50.279463
> > /build/ceph-13.2.6/src/mds/MDCache.cc: 361: FAILED
> > assert(o->get_num_ref() == 0)
> > 
> >   ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic
> > (stable)
> >   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x14e) [0x7fd46650eb5e]
> >   2: (()+0x2c4cb7) [0x7fd46650ecb7]
> >   3: (MDCache::remove_inode(CInode*)+0x59d) [0x55f423d6992d]
> >   4: (StrayManager::_purge_stray_logged(CDentry*, unsigned long,
> > LogSegment*)+0x1f2) [0x55f423dc7192]
> >   5: (MDSIOContextBase::complete(int)+0x11d) [0x55f423ed42bd]
> >   6: (MDSLogContextBase::complete(int)+0x40) [0x55f423ed4430]
> >   7: (Finisher::finisher_thread_entry()+0x135) [0x7fd46650d0a5]
> >   8: (()+0x76db) [0x7fd465dc26db]
> >   9: (clone()+0x3f) [0x7fd464fa888f]
> > 
> > Thankfully this didn't happen on a subsequent attempt, and I got the
> > filesystem happy again.
> > 
> > At this point, of the 4 kernel clients actively using the filesystem, 3
> > had gone into a strange state (can't SSH in, partial service). Here is a
> > kernel log from one of the hosts (the other two were similar):
> > https://mrcn.st/p/ezrhr1qR
> > 
> > After playing some service failover games and hard rebooting the three
> > affected client boxes everything seems to be fine. The remaining FS
> > client box had no kernel errors (other than blocked task warnings and
> > cephfs talking about reconnections and such) and seems to be fine.
> > 
> > I can't find these errors anywhere, so I'm guessing they're not known bugs?
> 
> Jeff, the oops seems to be a NULL dereference in ceph_lock_message().
> Please take a look.
> 

(sorry for duplicate mail -- the other one ended up in moderation)

Thanks Ilya,

That function is pretty straightforward. We don't do a whole lot of
pointer chasing in there, so I'm a little unclear on where this would
have crashed. Right offhand, that kernel is probably missing
1b52931ca9b5b87 (ceph: remove duplicated filelock ref increase), but
that seems unlikely to result in an oops.

Hector, if you have the debuginfo for this kernel installed on one of
these machines, could you run gdb against the ceph.ko module and then
do:

 gdb> list *(ceph_lock_message+0x212)

That may give me a better hint as to what went wrong.

Thanks,
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph nfs ganesha exports

2019-08-01 Thread Jeff Layton
On Sun, 2019-07-28 at 18:20 +, Lee Norvall wrote:
> Update to this: I found that you cannot create a 2nd filesystem as yet and 
> it is still experimental.  So I went down this route:
> 
> Added a pool to the existing cephfs and then setfattr -n ceph.dir.layout.pool 
> -v SSD-NFS /mnt/cephfs/ssdnfs/ from a ceph-fuse client.
> 
> I then NFS-mounted from another box. I can see the files and dirs etc. from the 
> NFS client, but my issue now is that I do not have permission to write, create 
> dirs etc.  The same goes for the default setup after running the ansible 
> playbook, even when setting the export to no_root_squash.  Am I missing a chain of 
> permissions?  ganesha-nfs is using the admin userid; is this the same as 
> client.admin or is this a user I need to create?  Any info appreciated.
> 
> Ceph is on CentOS 7 and SELinux is currently off as well.
> 
> Copy of the ganesha conf below.  Is secType correct or is it missing 
> something?
> 
> RADOS_URLS {
>ceph_conf = '/etc/ceph/ceph.conf';
>userid = "admin";
> }
> %url rados://cephfs_data/ganesha-export-index
> 
> NFSv4 {
> RecoveryBackend = 'rados_kv';
> }

In your earlier email, you mentioned that you had more than one NFS
server, but rados_kv is not safe in a multi-server configuration. The
servers will be competing to store recovery information in the same
objects, and won't honor each others' grace periods.

You may want to explore using "RecoveryBackend = rados_cluster" instead,
which should handle that situation better. See this writeup for some
guidelines:


https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/
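
As a rough illustration of what that looks like in ganesha's config (values
here are placeholders; check the writeup and your ganesha version's docs for
the exact parameters, and note the shared grace db is managed with the
ganesha-rados-grace tool that ships with ganesha 2.7+):

NFSv4 {
        RecoveryBackend = rados_cluster;
}

RADOS_KV {
        ceph_conf = '/etc/ceph/ceph.conf';
        userid = "admin";
        pool = "cephfs_data";
        namespace = "ganesha";
        nodeid = "nfs1";     # must be unique per ganesha head
}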

Much of this is already automated too if you use k8s+rook.

> RADOS_KV {
> ceph_conf = '/etc/ceph/ceph.conf';
> userid = "admin";
> pool = "cephfs_data";
> }
> 
> EXPORT
> {
> Export_id=20133;
> Path = "/";
> Pseudo = /cephfile;
> Access_Type = RW;
> Protocols = 3,4;
> Transports = TCP;
> SecType = sys,krb5,krb5i,krb5p;
> Squash = Root_Squash;
> Attr_Expiration_Time = 0;
> 
> FSAL {
> Name = CEPH;
> User_Id = "admin";
> }
> 
> 
> }
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On 28/07/2019 12:11, Lee Norvall wrote:
> > Hi
> > 
> > I am using ceph-ansible to deploy and just looking for best way/tips on 
> > how to export multiple pools/fs.
> > 
> > Ceph: nautilus (14.2.2)
> > NFS-Ganesha v 2.8
> > ceph-ansible stable 4.0
> > 
> > I have 3 x osd/NFS gateways running and NFS on the dashboard can see 
> > them in the cluster.  I have managed to export for cephfs / and mounted 
> > it on another box.
> > 
> > 1) can I add a new pool/fs to the export under that same NFS gateway 
> > cluster, or
> > 
> > 2) do I have to do something like add a new pool to the fs and then 
> > setfattr to make the layout /newfs_dir point to /new_pool?  does this 
> > cause issues with false object counts?
> > 
> > 3) any other better ways...
> > 
> > Rgds
> > 
> > Lee
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> -- 
>  
> 
> Lee Norvall | CEO / Founder 
> Mob. +44 (0)7768 201884 
> Tel. +44 (0)20 3026 8930 
> Web. www.blocz.io 
> 
> Enterprise Cloud | Private Cloud | Hybrid/Multi Cloud | Cloud Backup 
> 
> 
> 
> This e-mail (and any attachment) has been sent from a PC belonging to My Mind 
> (Holdings) Limited. If you receive it in error, please tell us by return and 
> then delete it from your system; you may not rely on its contents nor 
> copy/disclose it to anyone. Opinions, conclusions and statements of intent in 
> this e-mail are those of the sender and will not bind My Mind (Holdings) 
> Limited unless confirmed by an authorised representative independently of 
> this message. We do not accept responsibility for viruses; you must scan for 
> these. Please note that e-mails sent to and from blocz IO Limited are 
> routinely monitored for record keeping, quality control and training 
> purposes, to ensure regulatory compliance and to prevent viruses and 
> unauthorised use of our computer systems. My Mind (Holdings) Limited is 
> registered in England & Wales under company number 10186410. Registered 
> office: 1st Floor Offices, 2a Highfield Road, Ringwood, Hampshire, United 
> Kingdom, BH24 1RQ. VAT Registration GB 244 9628 77
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-17 Thread Jeff Layton
Ahh, I just noticed you were running nautilus on the client side. This
patch went into v14.2.2, so once you update to that you should be good
to go.

-- Jeff

On Wed, 2019-07-17 at 17:10 -0400, Jeff Layton wrote:
> This is almost certainly the same bug that is fixed here:
> 
> https://github.com/ceph/ceph/pull/28324
> 
> It should get backported soon-ish but I'm not sure which luminous
> release it'll show up in.
> 
> Cheers,
> Jeff
> 
> On Wed, 2019-07-17 at 10:36 +0100, David C wrote:
> > Thanks for taking a look at this, Daniel. Below is the only interesting bit 
> > from the Ceph MDS log at the time of the crash but I suspect the slow 
> > requests are a result of the Ganesha crash rather than the cause of it. 
> > Copying the Ceph list in case anyone has any ideas.
> > 
> > 2019-07-15 15:06:54.624007 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > 6 slow requests, 5 included below; oldest blocked for > 34.588509 secs
> > 2019-07-15 15:06:54.624017 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 33.113514 seconds old, received at 2019-07-15 15:06:21.510423: 
> > client_request(client.16140784:5571174 setattr mtime=2019-07-15 
> > 14:59:45.642408 #0x10009079cfb 2019-07
> > -15 14:59:45.642408 caller_uid=1161, caller_gid=1131{}) currently failed to 
> > xlock, waiting
> > 2019-07-15 15:06:54.624020 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 34.588509 seconds old, received at 2019-07-15 15:06:20.035428: 
> > client_request(client.16129440:1067288 create 
> > #0x1000907442e/filePathEditorRegistryPrefs.melDXAtss 201
> > 9-07-15 14:59:53.694087 caller_uid=1161, 
> > caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,35
> > 22,3520,3523,}) currently failed to wrlock, waiting
> > 2019-07-15 15:06:54.624025 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 34.583918 seconds old, received at 2019-07-15 15:06:20.040019: 
> > client_request(client.16140784:5570551 getattr pAsLsXsFs #0x1000907443b 
> > 2019-07-15 14:59:44.171408 cal
> > ler_uid=1161, caller_gid=1131{}) currently failed to rdlock, waiting
> > 2019-07-15 15:06:54.624028 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 34.580632 seconds old, received at 2019-07-15 15:06:20.043305: 
> > client_request(client.16129440:1067293 unlink 
> > #0x1000907442e/filePathEditorRegistryPrefs.melcdzxxc 201
> > 9-07-15 14:59:53.701964 caller_uid=1161, 
> > caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,35
> > 22,3520,3523,}) currently failed to wrlock, waiting
> > 2019-07-15 15:06:54.624032 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 34.538332 seconds old, received at 2019-07-15 15:06:20.085605: 
> > client_request(client.16129440:1067308 create 
> > #0x1000907442e/filePathEditorRegistryPrefs.melHHljMk 201
> > 9-07-15 14:59:53.744266 caller_uid=1161, 
> > caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
> >  currently failed to wrlock, waiting
> > 2019-07-15 15:06:55.014073 7f5fdcdc0700  1 mds.mds01 Updating MDS map to 
> > version 68166 from mon.2
> > 2019-07-15 15:06:59.624041 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > 7 slow requests, 2 included below; oldest blocked for > 39.588571 secs
> > 2019-07-15 15:06:59.624048 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 30.495843 seconds old, received at 2019-07-15 15:06:29.128156: 
> > client_request(client.16129440:1072227 create 
> > #0x1000907442e/filePathEditorRegistryPrefs.mel58AQSv 2019-07-15 
> > 15:00:02.786754 caller_uid=1161, 
> > caller_gid=1131{1131,4121,2330,2683,4115,2322,2779,2979,1503,3511,2783,2707,2942,2980,2258,2829,1238,1237,2793,1235,1249,2097,1154,2982,2983,3860,4101,1208,3638,3641,3644,3640,3643,3639,3642,3822,3945,4045,3521,3522,3520,3523,})
> >  currently failed to wrlock, waiting
> > 2019-07-15 15:06:59.624053 7f5fda5bb700  0 log_channel(cluster) log [WRN] : 
> > slow request 39.432848 seconds old, received at 2019-07-15 15:06:20.191151: 
> > client_request(client.16140784:5570649 mknod 
> > #0x1000907442e/filePathEditorRegistryPrefs.mel3HZLNE 2019-07-15 
> > 14:59:44.322408 caller_uid=1161, 

Re: [ceph-users] [Nfs-ganesha-devel] 2.7.3 with CEPH_FSAL Crashing

2019-07-17 Thread Jeff Layton
.c:1011
> > > #8  0x0051d278 in mdcache_create_handle (exp_hdl=0x1bafbf0, 
> > > fh_desc=, handle=0x7f0470fd4900, attrs_out=0x0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/FSAL/Stackable_FSALs/FSAL_MDCACHE/mdcache_handle.c:1578
> > > #9  0x0046d404 in nfs4_mds_putfh 
> > > (data=data@entry=0x7f0470fd4ea0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:211
> > > #10 0x0046d8e8 in nfs4_op_putfh (op=0x7f03effaf1d0, 
> > > data=0x7f0470fd4ea0, resp=0x7f03ec1de1f0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_op_putfh.c:281
> > > #11 0x0045d120 in nfs4_Compound (arg=, 
> > > req=, res=0x7f03ec1de9d0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/Protocols/NFS/nfs4_Compound.c:942
> > > #12 0x004512cd in nfs_rpc_process_request 
> > > (reqdata=0x7f03ee5ed4b0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_worker_thread.c:1328
> > > #13 0x00450766 in nfs_rpc_decode_request (xprt=0x7f02180c2320, 
> > > xdrs=0x7f03ec568ab0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/MainNFSD/nfs_rpc_dispatcher_thread.c:1345
> > > #14 0x7f04df45d07d in svc_rqst_xprt_task (wpe=0x7f02180c2538) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:769
> > > #15 0x7f04df45d59a in svc_rqst_epoll_events (n_events=<optimized out>, sr_rec=0x4bb53e0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:941
> > > #16 svc_rqst_epoll_loop (sr_rec=) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1014
> > > #17 svc_rqst_run_task (wpe=0x4bb53e0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/svc_rqst.c:1050
> > > #18 0x7f04df465123 in work_pool_thread (arg=0x7f044c0008c0) at 
> > > /usr/src/debug/nfs-ganesha-2.7.3/libntirpc/src/work_pool.c:181
> > > #19 0x7f04dda05dd5 in start_thread () from /lib64/libpthread.so.0
> > > #20 0x7f04dcb7dead in clone () from /lib64/libc.so.6
> > > 
> > > Package versions:
> > > 
> > > nfs-ganesha-2.7.3-0.1.el7.x86_64
> > > nfs-ganesha-ceph-2.7.3-0.1.el7.x86_64
> > > libcephfs2-14.2.1-0.el7.x86_64
> > > librados2-14.2.1-0.el7.x86_64
> > > 
> > > I notice in my Ceph log I have a bunch of slow requests around the time 
> > > it went down; I'm not sure if it's a symptom of Ganesha segfaulting or 
> > > if it was a contributing factor.
> > > 
> > > Thanks,
> > > David
> > > 
> > > 
> > > ___
> > > Nfs-ganesha-devel mailing list
> > > nfs-ganesha-de...@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/nfs-ganesha-devel
> > > 
> > 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS : Kernel/Fuse technical differences

2019-06-24 Thread Jeff Layton
On Mon, 2019-06-24 at 15:51 +0200, Hervé Ballans wrote:
> Hi everyone,
> 
> We successfully use Ceph here for several years now, and since recently, 
> CephFS.
> 
>  From the same CephFS server, I notice a big difference between a fuse 
> mount and a kernel mount (10 times faster for kernel mount). It makes 
> sense to me (an additional fuse library versus a direct access to a 
> device...), but recently, one of our users asked me to explain him in 
> more detail the reason for this big difference...Hum...
> 
> I then realized that I didn't really know how to explain the reasons to 
> him !!
> 
> As well, does anyone have a more detailed explanation in a few words or 
> know a good web resource on this subject (I guess it's not specific to 
> Ceph but it's generic to all filesystems ?..)
> 
> Thanks in advance,
> Hervé
> 

A lot of it is the context switching.

Every time you make a system call (or other activity) that accesses a
FUSE mount, it has to dispatch that request to the fuse device; the
userland ceph-fuse daemon then has to wake up and do its thing (at least
once) and then send the result back down to the kernel, which then wakes
up the original task so it can get the result.

FUSE is a wonderful thing, but it's not really built for speed.

-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Nfs-ganesha with rados_kv backend

2019-05-29 Thread Jeff Layton
On Wed, 2019-05-29 at 13:49 +, Stolte, Felix wrote:
> Hi,
> 
> is anyone running an active-passive nfs-ganesha cluster with cephfs backend 
> and using the rados_kv recovery backend? My setup runs fine, but takeover is 
> giving me a headache. On takeover I see the following messages in ganeshas 
> log file:
> 

Note that there are significant problems with the rados_kv recovery
backend. In particular, it does not properly handle the case where the
server crashes during the grace period. The rados_ng and rados_cluster
backends do handle those situations properly.

> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server 
> Now IN GRACE, duration 5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] nfs_start_grace :STATE :EVENT :NFS Server 
> recovery event 5 nodeid -1 ip 10.0.0.5
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] rados_kv_traverse :CLIENT ID :EVENT :Failed 
> to lst kv ret=-2
> 29/05/2019 15:38:21 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[dbus_heartbeat] rados_kv_read_recov_clids_takeover :CLIENT 
> ID :EVENT :Failed to takeover
> 29/05/2019 15:38:26 : epoch 5cee88c4 : cephgw-e2-1 : 
> ganesha.nfsd-9793[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now 
> NOT IN GRACE
> 
> The result is clients hanging for up to 2 minutes. Has anyone run into the 
> same problem?
> 
> Ceph Version: 12.2.11
> nfs-ganesha: 2.7.3
> 

If I had to guess, the hanging is probably due to state that is being
held by the other node's MDS session that hasn't expired yet. Ceph v12
doesn't have the client reclaim interfaces that make more instantaneous
failover possible. That's new in v14 (Nautilus). See pages 12 and 13
here:

https://static.sched.com/hosted_files/cephalocon2019/86/Rook-Deployed%20NFS%20Clusters%20over%20CephFS.pdf

> ganesha.conf (identical on both nodes besides nodeid in rados_kv:
> 
> NFS_CORE_PARAM {
> Enable_RQUOTA = false;
> Protocols = 3,4;
> }
> 
> CACHEINODE {
> Dir_Chunk = 0;
> NParts = 1;
> Cache_Size = 1;
> }
> 
> NFS_krb5 {
> Active_krb5 = false;
> }
> 
> NFSv4 {
> Only_Numeric_Owners = true;
> RecoveryBackend = rados_kv;
> Grace_Period = 5;
> Lease_Lifetime = 5;

Yikes! That's _way_ too short a grace period and lease lifetime. Ganesha
will probably exit the grace period before the clients ever realize the
server has restarted, and they will fail to reclaim their state.

> Minor_Versions = 1,2;
> }
> 
> RADOS_KV {
> ceph_conf = '/etc/ceph/ceph.conf';
> userid = "ganesha";
> pool = "cephfs_metadata";
> namespace = "ganesha";
> nodeid = "cephgw-k2-1";
> }
> 
> Any hint would be appreciated.

I consider ganesha's dbus-based takeover mechanism to be broken by
design, as it requires the recovery backend to do things that can't be
done atomically. If a crash occurs at the wrong time, the recovery
database can end up trashed and no one can reclaim anything.

If you really want an active/passive setup then I'd move away from that
and just have whatever clustering software you're using start up the
daemon on the active node after ensuring that it's shut down on the
passive one. With that, you can also use the rados_ng recovery backend,
which is more resilient in the face of multiple crashes.

In that configuration you would want to have the same config file on
both nodes, including the same nodeid so that you can potentially take
advantage of the RECLAIM_RESET interface to kill off the old session
quickly after the server restarts.

You also need a much longer grace period.
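
Roughly, that would look something like this on both nodes (illustrative
values only; the grace/lease numbers here are just in the neighborhood of
ganesha's normal defaults rather than a recommendation):

NFSv4 {
        RecoveryBackend = rados_ng;
        Grace_Period = 90;      # back to something near the default, not 5
        Lease_Lifetime = 60;
        Minor_Versions = 1,2;
}

RADOS_KV {
        ceph_conf = '/etc/ceph/ceph.conf';
        userid = "ganesha";
        pool = "cephfs_metadata";
        namespace = "ganesha";
        nodeid = "cephgw-ha";   # same value on both nodes
}

The clustering software then only has to guarantee that the daemon runs on
one node at a time.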

Cheers,
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS-Ganesha CEPH_FSAL | potential locking issue

2019-04-16 Thread Jeff Layton
On Tue, Apr 16, 2019 at 10:36 AM David C  wrote:
>
> Hi All
>
> I have a single export of my cephfs using the ceph_fsal [1]. A CentOS 7 
> machine mounts a sub-directory of the export [2] and is using it for the home 
> directory of a user (e.g everything under ~ is on the server).
>
> This works fine until I start a long sequential write into the home directory 
> such as:
>
> dd if=/dev/zero of=~/deleteme bs=1M count=8096
>
> This saturates the 1GbE link on the client which is great but during the 
> transfer, apps that are accessing files in home start to lock up. Google 
> Chrome for example, which puts its config in ~/.config/google-chrome/,  
> locks up during the transfer, e.g I can't move between tabs, as soon as the 
> transfer finishes, Chrome goes back to normal. Essentially the desktop 
> environment reacts as I'd expect if the server was to go away. I'm using the 
> MATE DE.
>
> However, if I mount a separate directory from the same export on the machine 
> [3] and do the same write into that directory, my desktop experience isn't 
> affected.
>
> I hope that makes some sense, it's a bit of a weird one to describe. This 
> feels like a locking issue to me, although I can't explain why a single write 
> into the root of a mount would affect access to other files under that same 
> mount.
>

It's not a single write. You're doing 8G worth of 1M I/Os. The server
then has to do all of those to the OSD backing store.

> [1] CephFS export:
>
> EXPORT
> {
> Export_ID=100;
> Protocols = 4;
> Transports = TCP;
> Path = /;
> Pseudo = /ceph/;
> Access_Type = RW;
> Attr_Expiration_Time = 0;
> Disable_ACL = FALSE;
> Manage_Gids = TRUE;
> Filesystem_Id = 100.1;
> FSAL {
> Name = CEPH;
> }
> }
>
> [2] Home directory mount:
>
> 10.10.10.226:/ceph/homes/username on /homes/username type nfs4 
> (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
>
> [3] Test directory mount:
>
> 10.10.10.226:/ceph/testing on /tmp/testing type nfs4 
> (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.10.135,local_lock=none,addr=10.10.10.226)
>
> Versions:
>
> Luminous 12.2.10
> nfs-ganesha-2.7.1-0.1.el7.x86_64
> nfs-ganesha-ceph-2.7.1-0.1.el7.x86_64
>
> Ceph.conf on nfs-ganesha server:
>
> [client]
> mon host = 10.10.10.210:6789, 10.10.10.211:6789, 10.10.10.212:6789
> client_oc_size = 8388608000
> client_acl_type=posix_acl
> client_quota = true
> client_quota_df = true
>

No magic bullets here, I'm afraid.

Sounds like ganesha is probably just too swamped with write requests
to do much else, but you'll probably want to do the legwork starting
with the hanging application, and figure out what it's doing that
takes so long. Is it some syscall? Which one?

From there you can start looking at statistics in the NFS client to
see what's going on there. Are certain RPCs taking longer than they
should? Which ones?

Once you know what's going on with the client, you can better tell
what's going on with the server.
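
A few starting points on the client side, assuming the usual nfs-utils tools
are installed:

# aggregate per-op counts for the NFS client
nfsstat -c

# per-mount, per-op RPC counts and timings (RTT, queue time, retransmits)
cat /proc/self/mountstats

# per-mount throughput/latency, sampled every 5 seconds
nfsiostat 5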
-- 
Jeff Layton 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deploying a Ceph+NFS Server Cluster with Rook

2019-03-06 Thread Jeff Layton
I had several people ask me to put together some instructions on how to
deploy a Ceph+NFS cluster from scratch, and the new functionality in
Ceph and rook.io make this quite easy.

I wrote a Ceph community blog post that walks the reader through the
process:

https://ceph.com/community/deploying-a-cephnfs-server-cluster-with-rook/

I don't think that site has a way to post comments, but I'm happy to
answer questions about it via email.
-- 
Jeff Layton 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-18 Thread Jeff Layton
On Mon, 2019-02-18 at 17:02 +0100, Paul Emmerich wrote:
> > > I've benchmarked a ~15% performance difference in IOPS between cache
> > > expiration time of 0 and 10 when running fio on a single file from a
> > > single client.
> > > 
> > > 
> > 
> > NFS iops? I'd guess more READ ops in particular? Is that with a
> > FSAL_CEPH backend?
> 
> Yes. But take that with a grain of salt; that was just a quick
> and dirty test of a very specific scenario that may or may not be
> relevant.
> 
> 

Sure.

If the NFS iops go up when you remove a layer of caching, then that
suggests that you had a situation where the cache likely should have
been invalidated, but wasn't. Basically, you may be sacrificing cache
coherency for performance.

The bigger question I have is whether the ganesha mdcache provides any
performance gain when the attributes are already cached in the libcephfs
layer.

If we did want to start using the mdcache, then we'd almost certainly
want to invalidate that cache when libcephfs gives up caps. I just don't
see how the extra layer of caching provides much value in that
situation.


> > 
> > > > > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  
> > > > > wrote:
> > > > > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > > > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > > > > > Will Client query 'change' attribute every time before reading to 
> > > > > > > know
> > > > > > > if the data has been changed?
> > > > > > > 
> > > > > > >   +-+++-+---+
> > > > > > >   | Name| ID | Data Type  | Acc | Defined in|
> > > > > > >   +-+++-+---+
> > > > > > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
> > > > > > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
> > > > > > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
> > > > > > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
> > > > > > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
> > > > > > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
> > > > > > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
> > > > > > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
> > > > > > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
> > > > > > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
> > > > > > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
> > > > > > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
> > > > > > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
> > > > > > >   +-+++-+---+
> > > > > > > 
> > > > > > 
> > > > > > Not every time -- only when the cache needs revalidation.
> > > > > > 
> > > > > > In the absence of a delegation, that happens on a timeout (see the
> > > > > > acregmin/acregmax settings in nfs(5)), though things like opens and 
> > > > > > file
> > > > > > locking events also affect when the client revalidates.
> > > > > > 
> > > > > > When the v4 client does revalidate the cache, it relies heavily on 
> > > > > > NFSv4
> > > > > > change attribute. Cephfs's change attribute is cluster-coherent 
> > > > > > too, so
> > > > > > if the client does revalidate it 

Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-18 Thread Jeff Layton
On Mon, 2019-02-18 at 16:40 +0100, Paul Emmerich wrote:
> > A call into libcephfs from ganesha to retrieve cached attributes is
> > mostly just in-memory copies within the same process, so any performance
> > overhead there is pretty minimal. If we need to go to the network to get
> > the attributes, then that was a case where the cache should have been
> > invalidated anyway, and we avoid having to check the validity of the
> > cache.
> 
> I've benchmarked a ~15% performance difference in IOPS between cache
> expiration time of 0 and 10 when running fio on a single file from a
> single client.
> 
> 

NFS iops? I'd guess more READ ops in particular? Is that with a
FSAL_CEPH backend?


> 
> > 
> > > On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  
> > > wrote:
> > > > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > > > Will Client query 'change' attribute every time before reading to know
> > > > > if the data has been changed?
> > > > > 
> > > > >   +-+++-+---+
> > > > >   | Name| ID | Data Type  | Acc | Defined in|
> > > > >   +-+++-+---+
> > > > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
> > > > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
> > > > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
> > > > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
> > > > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
> > > > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
> > > > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
> > > > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
> > > > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
> > > > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
> > > > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
> > > > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
> > > > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
> > > > >   +-+++-+---+
> > > > > 
> > > > 
> > > > Not every time -- only when the cache needs revalidation.
> > > > 
> > > > In the absence of a delegation, that happens on a timeout (see the
> > > > acregmin/acregmax settings in nfs(5)), though things like opens and file
> > > > locking events also affect when the client revalidates.
> > > > 
> > > > When the v4 client does revalidate the cache, it relies heavily on NFSv4
> > > > change attribute. Cephfs's change attribute is cluster-coherent too, so
> > > > if the client does revalidate it should see changes made on other
> > > > servers.
> > > > 
> > > > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton  
> > > > > wrote:
> > > > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > > > Hi Jeff,
> > > > > > > Another question is about Client Caching when disabling 
> > > > > > > delegation.
> > > > > > > I set breakpoint on nfs4_op_read, which is OP_READ process 
> > > > > > > function in
> > > > > > > nfs-ganesha. Then I read a file, I found that it will hit only 
> > > > > > > once on
> > > > > > > the first time, which means latter reading operation on this file 
> > > > > > > will
> > > > > > > not trigger OP_READ. It will read the data from client side 
> > > > > > > cache. Is
> > > > > > > it right?
> > > > > > 
> > > > > > Yes. In the absence of a delegation, the client will periodically 
> > > > > > query
> > > > > > for the inode attributes, and will serve reads from the cache if it
> > > > > > looks like the file hasn't changed.
> > > > > > 
> > > > > > > I also checked the nfs client code in linux kernel. Only
> &

Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-15 Thread Jeff Layton
On Fri, 2019-02-15 at 15:34 +0800, Marvin Zhang wrote:
> Thanks Jeff.
> If I set Attr_Expiration_Time to zero in the conf, does it mean the timeout
> is zero? If so, every client will see changes immediately. Will it
> hurt performance badly?
> It seems that the GlusterFS FSAL uses UPCALL to invalidate the cache. How
> about the CephFS FSAL?
> 

We mostly suggest ganesha's attribute cache be disabled when exporting
FSAL_CEPH. libcephfs caches attributes too, and it knows the status of
those attributes better than ganesha can.

A call into libcephfs from ganesha to retrieve cached attributes is
mostly just in-memory copies within the same process, so any performance
overhead there is pretty minimal. If we need to go to the network to get
the attributes, then that was a case where the cache should have been
invalidated anyway, and we avoid having to check the validity of the
cache.
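
Concretely, that means something along these lines in the ganesha config
(a trimmed sketch, based on the kinds of settings quoted elsewhere in these
threads and in the ceph.conf sample shipped with ganesha):

CACHEINODE {
        # effectively disable ganesha's own inode/attr caching;
        # libcephfs does the caching for us
        Dir_Chunk = 0;
        NParts = 1;
        Cache_Size = 1;
}

EXPORT {
        # ... other export settings omitted ...
        Attr_Expiration_Time = 0;   # expire immediately; rely on libcephfs
        FSAL {
                Name = CEPH;
        }
}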


> On Thu, Feb 14, 2019 at 9:04 PM Jeff Layton  wrote:
> > On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> > > Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> > > Will Client query 'change' attribute every time before reading to know
> > > if the data has been changed?
> > > 
> > >   +-+++-+---+
> > >   | Name| ID | Data Type  | Acc | Defined in|
> > >   +-+++-+---+
> > >   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
> > >   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
> > >   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
> > >   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
> > >   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
> > >   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
> > >   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
> > >   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
> > >   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
> > >   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
> > >   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
> > >   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
> > >   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
> > >   +-+++-+---+
> > > 
> > 
> > Not every time -- only when the cache needs revalidation.
> > 
> > In the absence of a delegation, that happens on a timeout (see the
> > acregmin/acregmax settings in nfs(5)), though things like opens and file
> > locking events also affect when the client revalidates.
> > 
> > When the v4 client does revalidate the cache, it relies heavily on NFSv4
> > change attribute. Cephfs's change attribute is cluster-coherent too, so
> > if the client does revalidate it should see changes made on other
> > servers.
> > 
> > > On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton  
> > > wrote:
> > > > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > > > Hi Jeff,
> > > > > Another question is about Client Caching when disabling delegation.
> > > > > I set breakpoint on nfs4_op_read, which is OP_READ process function in
> > > > > nfs-ganesha. Then I read a file, I found that it will hit only once on
> > > > > the first time, which means latter reading operation on this file will
> > > > > not trigger OP_READ. It will read the data from client side cache. Is
> > > > > it right?
> > > > 
> > > > Yes. In the absence of a delegation, the client will periodically query
> > > > for the inode attributes, and will serve reads from the cache if it
> > > > looks like the file hasn't changed.
> > > > 
> > > > > I also checked the nfs client code in linux kernel. Only
> > > > > cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ again,
> > > > > like this:
> > > > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> > > > > ret = nfs_invalidate_mapping(inode, mapping);
> > > > > }
> > > > > This about this senario, client1 connect ganesha1 and client2 connect
> > > > > ganesha2. I read /1.txt on client1 and client1 will cache the data.
> > > > > Then I modify this file on client2. At that time, how client1 know th

Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-14 Thread Jeff Layton
On Thu, 2019-02-14 at 20:57 +0800, Marvin Zhang wrote:
> Here is the copy from https://tools.ietf.org/html/rfc7530#page-40
> Will Client query 'change' attribute every time before reading to know
> if the data has been changed?
> 
>   +-+++-+---+
>   | Name| ID | Data Type  | Acc | Defined in|
>   +-+++-+---+
>   | supported_attrs | 0  | bitmap4| R   | Section 5.8.1.1   |
>   | type| 1  | nfs_ftype4 | R   | Section 5.8.1.2   |
>   | fh_expire_type  | 2  | uint32_t   | R   | Section 5.8.1.3   |
>   | change  | 3  | changeid4  | R   | Section 5.8.1.4   |
>   | size| 4  | uint64_t   | R W | Section 5.8.1.5   |
>   | link_support| 5  | bool   | R   | Section 5.8.1.6   |
>   | symlink_support | 6  | bool   | R   | Section 5.8.1.7   |
>   | named_attr  | 7  | bool   | R   | Section 5.8.1.8   |
>   | fsid| 8  | fsid4  | R   | Section 5.8.1.9   |
>   | unique_handles  | 9  | bool   | R   | Section 5.8.1.10  |
>   | lease_time  | 10 | nfs_lease4 | R   | Section 5.8.1.11  |
>   | rdattr_error| 11 | nfsstat4   | R   | Section 5.8.1.12  |
>   | filehandle  | 19 | nfs_fh4| R   | Section 5.8.1.13  |
>   +-+++-+---+
> 

Not every time -- only when the cache needs revalidation.

In the absence of a delegation, that happens on a timeout (see the
acregmin/acregmax settings in nfs(5)), though things like opens and file
locking events also affect when the client revalidates.

When the v4 client does revalidate the cache, it relies heavily on NFSv4
change attribute. Cephfs's change attribute is cluster-coherent too, so
if the client does revalidate it should see changes made on other
servers.
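
For example, a hypothetical NFSv4.1 mount with an attribute cache window of
3 to 60 seconds (those happen to be the usual acregmin/acregmax defaults
documented in nfs(5)):

mount -t nfs4 -o vers=4.1,acregmin=3,acregmax=60 ganesha.example.com:/ceph /mnt/ceph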

> On Thu, Feb 14, 2019 at 8:29 PM Jeff Layton  wrote:
> > On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> > > Hi Jeff,
> > > Another question is about Client Caching when disabling delegation.
> > > I set breakpoint on nfs4_op_read, which is OP_READ process function in
> > > nfs-ganesha. Then I read a file, I found that it will hit only once on
> > > the first time, which means latter reading operation on this file will
> > > not trigger OP_READ. It will read the data from client side cache. Is
> > > it right?
> > 
> > Yes. In the absence of a delegation, the client will periodically query
> > for the inode attributes, and will serve reads from the cache if it
> > looks like the file hasn't changed.
> > 
> > > I also checked the nfs client code in linux kernel. Only
> > > cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ again,
> > > like this:
> > > if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> > > ret = nfs_invalidate_mapping(inode, mapping);
> > > }
> > > This about this senario, client1 connect ganesha1 and client2 connect
> > > ganesha2. I read /1.txt on client1 and client1 will cache the data.
> > > Then I modify this file on client2. At that time, how client1 know the
> > > file is modifed and how it will add NFS_INO_INVALID_DATA into
> > > cache_validity?
> > 
> > Once you modify the file on client2, ganesha2 will request the necessary
> > caps from the ceph MDS, and client1 will have its caps revoked. It'll
> > then make the change.
> > 
> > When client1 reads again it will issue a GETATTR against the file [1].
> > ganesha1 will then request caps to do the getattr, which will end up
> > revoking ganesha2's caps. client1 will then see the change in attributes
> > (the change attribute and mtime, most likely) and will invalidate the
> > mapping, causing it do reissue a READ on the wire.
> > 
> > [1]: There may be a window of time after you change the file on client2
> > where client1 doesn't see it. That's due to the fact that inode
> > attributes on the client are only revalidated after a timeout. You may
> > want to read over the DATA AND METADATA COHERENCE section of nfs(5) to
> > make sure you understand how the NFS client validates its caches.
> > 
> > Cheers,
> > --
> > Jeff Layton 
> > 

-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-14 Thread Jeff Layton
On Thu, 2019-02-14 at 19:49 +0800, Marvin Zhang wrote:
> Hi Jeff,
> Another question is about Client Caching when disabling delegation.
> I set breakpoint on nfs4_op_read, which is OP_READ process function in
> nfs-ganesha. Then I read a file, I found that it will hit only once on
> the first time, which means latter reading operation on this file will
> not trigger OP_READ. It will read the data from client side cache. Is
> it right?

Yes. In the absence of a delegation, the client will periodically query
for the inode attributes, and will serve reads from the cache if it
looks like the file hasn't changed.

> I also checked the nfs client code in linux kernel. Only
> cache_validity is NFS_INO_INVALID_DATA, it will send OP_READ again,
> like this:
> if (nfsi->cache_validity & NFS_INO_INVALID_DATA) {
> ret = nfs_invalidate_mapping(inode, mapping);
> }
> This about this senario, client1 connect ganesha1 and client2 connect
> ganesha2. I read /1.txt on client1 and client1 will cache the data.
> Then I modify this file on client2. At that time, how client1 know the
> file is modifed and how it will add NFS_INO_INVALID_DATA into
> cache_validity?


Once you modify the file on client2, ganesha2 will request the necessary
caps from the ceph MDS, and client1 will have its caps revoked. It'll
then make the change.

When client1 reads again it will issue a GETATTR against the file [1].
ganesha1 will then request caps to do the getattr, which will end up
revoking ganesha2's caps. client1 will then see the change in attributes
(the change attribute and mtime, most likely) and will invalidate the
mapping, causing it do reissue a READ on the wire.

[1]: There may be a window of time after you change the file on client2
where client1 doesn't see it. That's due to the fact that inode
attributes on the client are only revalidated after a timeout. You may
want to read over the DATA AND METADATA COHERENCE section of nfs(5) to
make sure you understand how the NFS client validates its caches.

Cheers,
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-14 Thread Jeff Layton
On Thu, 2019-02-14 at 10:35 +0800, Marvin Zhang wrote:
> On Thu, Feb 14, 2019 at 8:09 AM Jeff Layton  wrote:
> > > Hi,
> > > As http://docs.ceph.com/docs/master/cephfs/nfs/ says, it's OK to
> > > config active/passive NFS-Ganesha to use CephFs. My question is if we
> > > can use active/active nfs-ganesha for CephFS.
> > 
> > (Apologies if you get two copies of this. I sent an earlier one from the
> > wrong account and it got stuck in moderation)
> > 
> > You can, with the new rados-cluster recovery backend that went into
> > ganesha v2.7. See here for a bit more detail:
> > 
> > https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/
> > 
> > ...also have a look at the ceph.conf file in the ganesha sources.
> > 
> > > In my thought, state consistency is the only thing we should think about.
> > > 1. Lock support for Active/Active. Even though each nfs-ganesha server
> > > maintains its own lock state, the real lock/unlock will call
> > > ceph_ll_getlk/ceph_ll_setlk, so the Ceph cluster will handle the lock
> > > safely.
> > > 2. Delegation support for Active/Active. Similar to question 1,
> > > ceph_ll_delegation will handle it safely.
> > > 3. Nfs-ganesha cache support for Active/Active. As
> > > https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/ceph.conf
> > > describes, we can config the cache size as 1.
> > > 4. Ceph-FSAL cache support for Active/Active. Like any other CephFS client,
> > > there are no issues with cache consistency.
> > 
> > The basic idea with the new recovery backend is to have the different
> > NFS ganesha heads coordinate their recovery grace periods to prevent
> > stateful conflicts.
> > 
> > The one thing missing at this point is delegations in an active/active
> > configuration, but that's mainly because of the synchronous nature of
> > libcephfs. We have a potential fix for that problem but it requires work
> > in libcephfs that is not yet done.
> [marvin] So we should disable delegation on active/active and set the
> conf like this. Is it right?
> NFSv4
> {
> Delegations = false;
> }

Yes.
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: NAS solution for CephFS

2019-02-13 Thread Jeff Layton
> Hi,
> As http://docs.ceph.com/docs/master/cephfs/nfs/ says, it's OK to
> config active/passive NFS-Ganesha to use CephFs. My question is if we
> can use active/active nfs-ganesha for CephFS.

(Apologies if you get two copies of this. I sent an earlier one from the
wrong account and it got stuck in moderation)

You can, with the new rados-cluster recovery backend that went into
ganesha v2.7. See here for a bit more detail:

https://jtlayton.wordpress.com/2018/12/10/deploying-an-active-active-nfs-cluster-over-cephfs/

...also have a look at the ceph.conf file in the ganesha sources.

> In my thought, state consistency is the only thing we should think about.
> 1. Lock support for Active/Active. Even though each nfs-ganesha server
> maintains its own lock state, the real lock/unlock will call
> ceph_ll_getlk/ceph_ll_setlk, so the Ceph cluster will handle the lock
> safely.
> 2. Delegation support for Active/Active. Similar to question 1,
> ceph_ll_delegation will handle it safely.
> 3. Nfs-ganesha cache support for Active/Active. As
> https://github.com/nfs-ganesha/nfs-ganesha/blob/next/src/config_samples/ceph.conf
> describes, we can config the cache size as 1.
> 4. Ceph-FSAL cache support for Active/Active. Like any other CephFS client,
> there are no issues with cache consistency.

The basic idea with the new recovery backend is to have the different
NFS ganesha heads coordinate their recovery grace periods to prevent
stateful conflicts.

The one thing missing at this point is delegations in an active/active
configuration, but that's mainly because of the synchronous nature of
libcephfs. We have a potential fix for that problem but it requires work
in libcephfs that is not yet done.

Cheers,
-- 
Jeff Layton 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com