Have you compared performance to mounting CephFS with ceph-fuse instead of the kernel client? ceph-fuse is a package that will match your current version of Ceph, whereas with the kernel client you need to update your kernel to match the current version/features of Ceph. I switched to ceph-fuse for my cluster (drastically smaller and less utilized than yours) and it has been working more smoothly than when I was using the kernel client.

A very interesting thing ceph-fuse does is that an ls -lhd of a directory shows the directory's recursive size. It's a drastically faster response than running a du for the size of a folder.
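Those per-directory totals aren't unique to ceph-fuse's stat output, by the way: CephFS maintains recursive statistics as virtual extended attributes (e.g. ceph.dir.rbytes), which is why asking a directory for its size is so much cheaper than du. A minimal Python sketch, assuming a CephFS mount at an illustrative path (the os.getxattr call will only succeed on an actual CephFS directory; the formatter just mimics ls -lh style):

```python
import os

def human(nbytes):
    """Format a byte count roughly the way `ls -lh` does (powers of 1024)."""
    for unit in ("", "K", "M", "G", "T", "P"):
        if nbytes < 1024:
            # ls prints one decimal place below 10, none above
            return f"{nbytes:.1f}{unit}" if nbytes < 10 and unit else f"{nbytes:.0f}{unit}"
        nbytes /= 1024
    return f"{nbytes:.0f}E"

def dir_rbytes(path):
    """Read the recursive byte count CephFS maintains for a directory
    via its virtual xattr (requires the path to be on a CephFS mount)."""
    return int(os.getxattr(path, "ceph.dir.rbytes"))

# Example (path is illustrative):
#   print(human(dir_rbytes("/mnt/cephfs/archive")))
```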
david@kaylee:/mnt/cephfs$ ls -lh
total 2.5K
drwxr-xr-x 1 david david  89G Dec 12  2016 fix/
drwxr-xr-x 1 david david 1.2T Dec  5  2016 active/
drwxr-xr-x 1 david david 7.0T Jan 20 18:40 archive/
drwxr-xr-x 1 david david    0 Jun 15 13:24 sort/
david@kaylee:/mnt/cephfs$ ls -lh archive/
total 2.0K
drwxr-xr-x 1 david david 6.5T Jun 11 13:47 book/
drwxr-xr-x 1 david david 587G Jun  7 10:51 zoe/

Another thing that strikes me as odd is that you seem to be doing one of the no-nos of distributed file systems. It looks like you have some devs working on this project, based on the multithreaded solution for placing files into CephFS. It's always best to query a database for information rather than the file system. If I'm using a large distributed file system for something at work, I make sure that nothing is placed into that file system without the database knowing everything it needs to about the file: its location, size, who the file belongs to, whether the file has an expiration after which it should be deleted, etc. You can always reach a scale where querying the file system for such information takes hours, while a query against a properly structured database returns in seconds.

On the topic of running hourly snapshots of CephFS: are you monitoring how large your snap trim queue is? I've found that deleting snapshots can cause a lot of slowdowns in the cluster, so deletions should be scheduled for a time when the cluster is mostly idle, to get through as many of them as possible. If you're also deleting snapshots each hour, that might be another place to look for odd cluster behavior.

On Thu, Jun 15, 2017 at 12:39 PM Eric Eastman <[email protected]> wrote:
> We are running Ceph 10.2.7 and after adding a new multi-threaded
> writer application we are seeing hangs accessing metadata from ceph
> file system kernel mounted clients. I have a "du -ah /cephfs" process
> that has been stuck for over 12 hours on one cephfs client system.
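To make the database-first approach I described above concrete, here is a minimal sketch using sqlite3 (the table name, columns, and paths are all made up for illustration; a real deployment would use a shared database server and whatever fields the application actually needs):

```python
import sqlite3
import time
from typing import Optional

# Illustrative schema: every file is registered here as it is written,
# so later questions ("how big is this tree?", "what has expired?")
# hit the database instead of walking the distributed filesystem.
conn = sqlite3.connect(":memory:")  # stand-in for a real shared DB
conn.execute("""
    CREATE TABLE files (
        path    TEXT PRIMARY KEY,
        size    INTEGER NOT NULL,
        owner   TEXT NOT NULL,
        created REAL NOT NULL,
        expires REAL               -- NULL means "keep forever"
    )
""")

def register_file(path: str, size: int, owner: str,
                  ttl_days: Optional[float] = None) -> None:
    """Record a file's metadata at write time."""
    now = time.time()
    expires = now + ttl_days * 86400 if ttl_days is not None else None
    conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?, ?)",
                 (path, size, owner, now, expires))
    conn.commit()

def tree_size(prefix: str) -> int:
    """Rough equivalent of `du -s` on a subtree, answered in one query.
    GLOB (SQLite-specific) is used so '_' in paths isn't a wildcard."""
    row = conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM files WHERE path GLOB ?",
        (prefix + "*",)).fetchone()
    return row[0]
```

In this picture the multithreaded writer would call register_file() for each file as (or just before) it lands in the filesystem, and monitoring never has to walk the tree.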
>
> We started seeing hung "du -ah" processes two days ago, so yesterday we
> upgraded the whole cluster from v10.2.5 to v10.2.7, but the problem
> occurred again last night. Rebooting the client fixes the problem.
> The ceph -s command is showing HEALTH_OK
>
> We have four ceph file system clients, each kernel mounting our 1 ceph
> file system to /cephfs. The "du -ah /cephfs" runs hourly within a test
> script that is cron controlled. If the du -ah /cephfs does not
> complete within an hour, emails are sent to the admin group as part of
> our monitoring process. This command normally takes less than a minute
> to run and we have just over 3.6M files in this file system. The du
> -ah is hanging while accessing sub-directories where the new
> multi-threaded writer application is writing.
>
> About the application: On one ceph client we are downloading external
> data via the network and writing data as files with a python program
> into the ceph file system. The python script can write up to 100 files
> in parallel. The metadata hangs we are seeing can occur on one or more
> client systems, but right now it is only hung on one system, which is
> not the node writing the data.
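(As an aside on the hourly check described above: alerting when du -ah exceeds an hour is easy to do with a plain subprocess timeout. A hedged sketch; the command, timeout, and alert hook are placeholders, not your actual script:)

```python
import subprocess

def check_completes(cmd, timeout_s):
    """Run cmd and return True if it finishes within timeout_s seconds.
    A hung CephFS client typically makes the command block forever,
    which shows up here as a timeout rather than an error exit."""
    try:
        subprocess.run(cmd, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        return False

# Cron would run something like:
#   if not check_completes(["du", "-ah", "/cephfs"], 3600):
#       send_alert_email()   # hypothetical alerting hook
```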
>
> System info:
>
> ceph -s
>     cluster ba0c94fc-1168-11e6-aaea-000c290cc2d4
>      health HEALTH_OK
>      monmap e1: 3 mons at
> {mon01=10.16.51.21:6789/0,mon02=10.16.51.22:6789/0,mon03=10.16.51.23:6789/0}
>             election epoch 138, quorum 0,1,2 mon01,mon02,mon03
>       fsmap e3210: 1/1/1 up {0=mds02=up:active}, 2 up:standby
>      osdmap e33046: 85 osds: 85 up, 85 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v27679236: 16192 pgs, 12 pools, 7655 GB data, 6591 kobjects
>             24345 GB used, 217 TB / 241 TB avail
>                16188 active+clean
>                    3 active+clean+scrubbing
>                    1 active+clean+scrubbing+deep
>   client io 0 B/s rd, 15341 kB/s wr, 0 op/s rd, 21 op/s wr
>
> On the hung client node, we are seeing an entry in mdsc:
> cat /sys/kernel/debug/ceph/*/mdsc
> 163925513 mds0 readdir #100003be2b1 kplr009658474_dr25_window.fits
>
> I am not seeing this on the other 3 client nodes.
>
> On the active metadata server, I ran:
>
> ceph daemon mds.mds02 dump_ops_in_flight
>
> every 2 seconds, as it kept changing. Part of the output is at:
> https://paste.fedoraproject.org/paste/OizCowo3oGzZo-cJWV5R~Q
>
> Info about the system:
>
> OS: Ubuntu Trusty
>
> Cephfs snapshots are turned on and being created hourly
>
> Ceph version:
> ceph -v
> ceph version 10.2.7 (50e863e0f4bc8f4b9e31156de690d765af245185)
>
> Kernel, Ceph servers:
> uname -a
> Linux mon01 4.2.0-27-generic #32~14.04.1-Ubuntu SMP Fri Jan 22
> 15:32:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
>
> Kernel, CephFS clients:
> uname -a
> Linux dfgw02 4.9.21-040921-generic #201704080434 SMP Sat Apr 8
> 08:35:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> Let me know if I should write up a ticket on this.
>
> Thanks
>
> Eric
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
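One more thought on that stuck mdsc entry: it can be worth scraping /sys/kernel/debug/ceph/*/mdsc periodically so a long-lived entry raises an alert before a human notices a hung du. A small sketch; the field layout (request id, MDS rank, operation, remainder) is assumed from the single sample line in your mail, so treat it as a starting point rather than a spec:

```python
import glob
from typing import NamedTuple

class MdscEntry(NamedTuple):
    tid: int      # client request id
    mds: str      # which MDS the request is waiting on, e.g. "mds0"
    op: str       # the stuck operation, e.g. "readdir"
    target: str   # inode/path remainder of the line

def parse_mdsc_line(line: str) -> MdscEntry:
    """Split one mdsc line into fields (layout assumed from the sample)."""
    tid, mds, op, rest = line.split(None, 3)
    return MdscEntry(int(tid), mds, op, rest.strip())

def pending_requests():
    """Yield in-flight MDS requests on this client.
    Needs root and a kernel-mounted CephFS client with debugfs mounted."""
    for path in glob.glob("/sys/kernel/debug/ceph/*/mdsc"):
        with open(path) as f:
            for line in f:
                if line.strip():
                    yield parse_mdsc_line(line)
```

If the same tid is still present a few minutes later, the request is almost certainly stuck and you have something concrete to attach to a ticket.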
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
