Re: [ceph-users] Experience with 5k RPM/archive HDDs
Hi,

again, as I said, in normal operation everything is fine with SMR. They perform well, in particular for large sequential writes, because of the on-platter cache (20 GB, I think). All tests we did used good SSDs for the OSD journals. Things blow up during backfill/recovery: the SMR disks saturate and then slow down to some 2-3 IOPS. Cache tiering will not work either, because if one or more of the SMR disks in the backend fail, backfilling/recovery will again slow things down to almost no client I/O. And yes, we tried all sorts of throttling mechanisms during backfill/recovery. In every case we tested, the cluster is useless from the client side while backfilling/recovering.

- mike

On 2/19/17 9:54 AM, Wido den Hollander wrote:

On 18 February 2017 at 17:03, rick stehno <rs3...@me.com> wrote:

I work for Seagate and have done over a hundred tests using SMR 8TB disks in a cluster. Whether SMR HDDs are the best choice depends entirely on your access pattern. Remember that SMR HDDs don't perform well on random writes, but are excellent for reads and sequential writes. I have many tests where I added an SSD or PCIe flash card to hold the journals, and SMR performed better than a typical CMR disk, and was overall cheaper than using all CMR HDDs. You can also use some type of caching, like a Ceph cache tier, with very good results. By placing the journals on flash, or by adopting some type of caching, you eliminate the double writes to the SMR HDD, and performance should be fine. I have test results if you would like to see them.

I am really keen on seeing those numbers. The blog post I wrote ( https://blog.widodh.nl/2017/02/do-not-use-smr-disks-with-ceph/ ) is based on two occasions where people bought 6TB and 8TB Seagate SMR disks and used them in Ceph. One use case was an application writing natively to RADOS, the other was CephFS.
On both occasions the journals were on SSD, but the backing disk would still saturate very easily. Ceph still does random writes on the disk to update things like PGLogs, write new OSDMaps, etc. A large sequential write into Ceph may be split up by either CephFS or RBD into smaller writes to various RADOS objects. I haven't seen a use case where SMR disks perform even 'OK' with Ceph. That's why my advice is still to stay away from those disks for Ceph. In both cases my customers had to spend a lot of money on new disks to make it work. The first case was actually somebody who bought 1000 SMR disks and then found out they didn't work with Ceph.

Wido

Rick
Sent from my iPhone, please excuse any typing errors.

On Feb 17, 2017, at 8:49 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi, don't go there. We tried this with SMR drives, which slow down to somewhere around 2-3 IOPS during backfilling/recovery, and that renders the cluster useless for client I/O. Things might change in the future, but for now I would strongly recommend against SMR. Go for normal SATA drives, which have only slightly higher price/capacity ratios.

- mike

On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:

On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander" <ceph-users-boun...@lists.ceph.com on behalf of w...@42on.com> wrote:

On 3 February 2017 at 11:03, Maxime Guyot <maxime.gu...@elits.com> wrote:

Hi, interesting feedback!

> In my opinion the SMR can be used exclusively for the RGW.
> Unless it's something like a backup/archive cluster or pool with little to no concurrent R/W access, you're likely to run out of IOPS (again) long before filling these monsters up.

That's exactly the use case I am considering those archive HDDs for: something like AWS Glacier, a form of offsite backup, probably via radosgw. The classic Seagate enterprise-class HDDs provide "too much" performance for this use case; I could live with 1/4 of the performance at that price point.
If you go down that route, I suggest you build a mixed cluster for RGW: a (small) set of OSDs running on top of proper SSDs, e.g. Samsung SM863 or PM863 or an Intel DC series. All pools should go to those OSDs by default; only the RGW bucket data pool should go to the big SMR drives. However, again, expect very, very low performance from those disks.

Another concern you should think about is recovery time when one of these drives fails. The more OSDs you have, the less of an issue this becomes, but on a small cluster it might take over a day to fully recover from an OSD failure, which is a long time to have degraded PGs.

Bryan
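For reference, the throttling Mike mentions trying is usually done with the OSD recovery options. A minimal sketch, assuming Hammer/Jewel-era option names (verify against your release; as the thread reports, even minimum settings did not save SMR disks):

```shell
# Throttle backfill/recovery cluster-wide at runtime.
# Values shown are the most conservative ones; defaults are higher.
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
```

injectargs changes are not persistent across OSD restarts; to keep them, add the same options to the [osd] section of ceph.conf.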
Re: [ceph-users] Experience with 5k RPM/archive HDDs
Hi,

don't go there. We tried this with SMR drives, which slow down to somewhere around 2-3 IOPS during backfilling/recovery, and that renders the cluster useless for client I/O. Things might change in the future, but for now I would strongly recommend against SMR. Go for normal SATA drives, which have only slightly higher price/capacity ratios.

- mike

On 2/3/17 2:46 PM, Stillwell, Bryan J wrote:

On 2/3/17, 3:23 AM, "ceph-users on behalf of Wido den Hollander" wrote:

On 3 February 2017 at 11:03, Maxime Guyot wrote:

Hi, interesting feedback!

> In my opinion the SMR can be used exclusively for the RGW.
> Unless it's something like a backup/archive cluster or pool with little to no concurrent R/W access, you're likely to run out of IOPS (again) long before filling these monsters up.

That's exactly the use case I am considering those archive HDDs for: something like AWS Glacier, a form of offsite backup, probably via radosgw. The classic Seagate enterprise-class HDDs provide "too much" performance for this use case; I could live with 1/4 of the performance at that price point.

If you go down that route, I suggest you build a mixed cluster for RGW: a (small) set of OSDs running on top of proper SSDs, e.g. Samsung SM863 or PM863 or an Intel DC series. All pools should go to those OSDs by default; only the RGW bucket data pool should go to the big SMR drives. However, again, expect very, very low performance from those disks.

Another concern you should think about is recovery time when one of these drives fails. The more OSDs you have, the less of an issue this becomes, but on a small cluster it might take over a day to fully recover from an OSD failure, which is a long time to have degraded PGs.

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster
Hi,

thanks all. Still, I would appreciate hints on a concrete procedure for migrating the CephFS metadata to an SSD pool, with the SSDs in the same hosts as the spinning disks. This is the reference I read: https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ Are there alternatives to the configuration suggested there? I am a little paranoid about starting to play around with CRUSH rules on the running system.

Regards, Mike

On 1/5/17 11:40 PM, jiajia zhong wrote:

2017-01-04 23:52 GMT+08:00 Mike Miller <millermike...@gmail.com>:

Wido, all, can you point me to the "recent benchmarks" so I can have a look? How do you define "performance"? I would not expect CephFS throughput to change, but it is surprising to me that metadata on SSD would have no measurable effect on latency. - mike

Operations like "ls", "stat", "find" would become faster; the bottleneck is the slow OSDs which store the file data.

On 1/3/17 10:49 AM, Wido den Hollander wrote:

On 3 January 2017 at 2:49, Mike Miller <millermike...@gmail.com> wrote:

Will metadata on SSD improve latency significantly?

No, as I said in my previous e-mail, recent benchmarks showed that storing CephFS metadata on SSD does not improve performance. It still might be good to do, since it is not much data and recovery will therefore go quickly, but don't expect a CephFS performance improvement.

Wido

On 1/2/17 11:50 AM, Wido den Hollander wrote:

On 2 January 2017 at 10:33, Shinobu Kinjo <ski...@redhat.com> wrote:

I've never done a migration of cephfs_metadata from spindle disks to SSDs, but logically you could achieve this in two phases:
#1 Configure a CRUSH rule including spindle disks and SSDs
#2 Configure a CRUSH rule pointing only to SSDs
* This would cause massive data shuffling.

Not really; usually the CephFS metadata isn't that much data.
Recent benchmarks (can't find them now) show that storing CephFS metadata on SSD doesn't really improve performance, though.

Wido

On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi, Happy New Year! Can anyone point me to specific walkthrough/howto instructions for moving the CephFS metadata to SSD in a running cluster? How should CRUSH be modified, step by step, so that the metadata migrate to SSD? Thanks and regards, Mike
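For this pre-Luminous era (no CRUSH device classes yet), the procedure in the blog post linked above boils down to a separate CRUSH root for the SSD OSDs plus a rule that targets it. A minimal sketch, assuming a hypothetical host bucket `node1-ssd` has already been created for the SSD OSDs (bucket and rule names here are made up for illustration):

```shell
# Create a separate CRUSH root for SSD OSDs and move the SSD host
# bucket(s) under it.
ceph osd crush add-bucket ssd-root root
ceph osd crush move node1-ssd root=ssd-root

# Replicated rule that selects hosts under the SSD root.
ceph osd crush rule create-simple ssd-rule ssd-root host

# Find the rule id, then point the metadata pool at it
# (pre-Luminous option name); Ceph backfills the metadata
# objects onto the SSD OSDs automatically.
ceph osd crush rule dump ssd-rule
ceph osd pool set cephfs_metadata crush_ruleset <rule-id>
```

Since the metadata pool is small, as Wido notes, the resulting backfill should finish quickly.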
Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster
Wido, all,

can you point me to the "recent benchmarks" so I can have a look? How do you define "performance"? I would not expect CephFS throughput to change, but it is surprising to me that metadata on SSD would have no measurable effect on latency.

- mike

On 1/3/17 10:49 AM, Wido den Hollander wrote:

On 3 January 2017 at 2:49, Mike Miller <millermike...@gmail.com> wrote:

Will metadata on SSD improve latency significantly?

No, as I said in my previous e-mail, recent benchmarks showed that storing CephFS metadata on SSD does not improve performance. It still might be good to do, since it is not much data and recovery will therefore go quickly, but don't expect a CephFS performance improvement.

Wido

On 1/2/17 11:50 AM, Wido den Hollander wrote:

On 2 January 2017 at 10:33, Shinobu Kinjo <ski...@redhat.com> wrote:

I've never done a migration of cephfs_metadata from spindle disks to SSDs, but logically you could achieve this in two phases:
#1 Configure a CRUSH rule including spindle disks and SSDs
#2 Configure a CRUSH rule pointing only to SSDs
* This would cause massive data shuffling.

Not really; usually the CephFS metadata isn't that much data. Recent benchmarks (can't find them now) show that storing CephFS metadata on SSD doesn't really improve performance, though.

Wido

On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi, Happy New Year! Can anyone point me to specific walkthrough/howto instructions for moving the CephFS metadata to SSD in a running cluster? How should CRUSH be modified, step by step, so that the metadata migrate to SSD? Thanks and regards, Mike
Re: [ceph-users] Migrate cephfs metadata to SSD in running cluster
Will metadata on SSD improve latency significantly?

Mike

On 1/2/17 11:50 AM, Wido den Hollander wrote:

On 2 January 2017 at 10:33, Shinobu Kinjo <ski...@redhat.com> wrote:

I've never done a migration of cephfs_metadata from spindle disks to SSDs, but logically you could achieve this in two phases:
#1 Configure a CRUSH rule including spindle disks and SSDs
#2 Configure a CRUSH rule pointing only to SSDs
* This would cause massive data shuffling.

Not really; usually the CephFS metadata isn't that much data. Recent benchmarks (can't find them now) show that storing CephFS metadata on SSD doesn't really improve performance, though.

Wido

On Mon, Jan 2, 2017 at 2:36 PM, Mike Miller <millermike...@gmail.com> wrote:

Hi, Happy New Year! Can anyone point me to specific walkthrough/howto instructions for moving the CephFS metadata to SSD in a running cluster? How should CRUSH be modified, step by step, so that the metadata migrate to SSD? Thanks and regards, Mike
[ceph-users] Migrate cephfs metadata to SSD in running cluster
Hi,

Happy New Year! Can anyone point me to specific walkthrough/howto instructions for moving the CephFS metadata to SSD in a running cluster? How should CRUSH be modified, step by step, so that the metadata migrate to SSD?

Thanks and regards, Mike
Re: [ceph-users] [EXTERNAL] Ceph performance is too good (impossible..)...
Hi,

you need to flush all caches before starting read tests. With fio you can probably do this if you keep the files that it creates. As root, on all clients and all OSD nodes, run:

echo 3 > /proc/sys/vm/drop_caches

But fio is a little problematic for Ceph because of the caches on the clients and on the OSD nodes. If you really need to know the read rates, use large files: write them with dd, flush all caches, then read the files back with dd. A single-threaded dd read shows less throughput than multiple dd read threads. Readahead size also matters. Good luck testing.

- mike

On 12/13/16 2:37 AM, V Plus wrote:

The same... see:

A: (g=0): rw=read, bs=5M-5M/5M-5M/5M-5M, ioengine=libaio, iodepth=1
...
fio-2.2.10
Starting 16 processes
A: (groupid=0, jobs=16): err= 0: pid=27579: Mon Dec 12 20:36:10 2016
  mixed: io=122515MB, bw=6120.3MB/s, iops=1224, runt= 20018msec

I think in the end the only way to solve this issue is to write the image before the read test, as suggested. I have no clue why the rbd engine does not work...

On Mon, Dec 12, 2016 at 4:23 PM, Will.Boege wrote:

Try adding --ioengine=libaio

From: V Plus
Date: Monday, December 12, 2016 at 2:40 PM
To: "Will.Boege"
Subject: Re: [EXTERNAL] [ceph-users] Ceph performance is too good (impossible..)...

Hi Will, thanks very much. However, I tried both of your suggestions and neither is working...

1. With the fio rbd engine:

[RBD_TEST]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
invalidate=1
direct=1
group_reporting=1
unified_rw_reporting=1
time_based=1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Then I run "sudo fio rbd.job" and got:

RBD_TEST: (g=0): rw=read, bs=4M-4M/4M-4M/4M-4M, ioengine=rbd, iodepth=1
...
fio-2.2.10
Starting 16 processes
rbd engine: RBD version: 0.1.10 (repeated 16 times, once per process)
Jobs: 12 (f=7): [R(5),_(1),R(4),_(1),R(2),_(2),R(1)] [100.0% done] [11253MB/0KB/0KB /s] [2813/0/0 iops] [eta 00m:00s]
RBD_TEST: (groupid=0, jobs=16): err= 0: pid=17504: Mon Dec 12 15:32:52 2016
  mixed: io=212312MB, bw=10613MB/s, iops=2653, runt= 20005msec

2. With blockalign:

[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=5MB
numjobs=16
ramp_time=5
runtime=20
blockalign=512b

[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=5MB
numjobs=16
ramp_time=5
runtime=20
blockalign=512b

sudo fio fioA.job -output a.txt &
sudo fio fioB.job -output b.txt &
wait

Then I got:

A: (groupid=0, jobs=16): err= 0: pid=19320: Mon Dec 12 15:35:32 2016
  mixed: io=88590MB, bw=4424.7MB/s, iops=884, runt= 20022msec
B: (groupid=0, jobs=16): err= 0: pid=19324: Mon Dec 12 15:35:32 2016
  mixed: io=88020MB, bw=4395.6MB/s, iops=879, runt= 20025msec

On Mon, Dec 12, 2016 at 10:45 AM, Will.Boege wrote:

My understanding is that when using direct=1 on a raw block device, fio (i.e. you) has to handle all the sector alignment, or the request gets buffered to perform the alignment. Try adding the --blockalign=512b option to your jobs, or better yet just use the native fio RBD engine.
Something like this (untested):

[A]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
direct=1
group_reporting=1
unified_rw_reporting=1
time_based=1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20
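The impossibly high read numbers in this thread typically mean the RBD image is still thin-provisioned: reads of never-written RADOS objects return zeros without touching any disk. A hedged sketch of the prefill step suggested above, reusing the image and job names from the thread:

```shell
# Write the whole image once so every RADOS object is allocated;
# only after that do read benchmarks measure real disk I/O.
fio --name=prefill --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio_test --rw=write --bs=4M --direct=1 --numjobs=1

# Then run the read job from above.
sudo fio rbd.job
```

Dropping page caches on clients and OSD nodes between the write and read passes (as described at the top of this message) is still required for meaningful numbers.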
Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device
John,

thanks for emphasizing this. Before this workaround we tried many different kernel versions, including 4.5.x, all with the same result. The problem might be particular to our environment, as most of the client machines (compute servers) have large RAM, and therefore plenty of cache space for inodes/dentries.

Cheers, Mike
Re: [ceph-users] rsync kernel client cepfs mkstemp no space left on device
Hi,

you have given up too early. rsync is not a nice workload for CephFS; in particular, most Linux kernel clients will end up caching all inodes/dentries, with the result that the MDS servers crash due to memory limitations. rsync scans all inodes/dentries, so it is the perfect application to gobble up all inode caps. We run a cronjob script flush_cache every few (2-5) minutes on all machines that mount CephFS:

#!/bin/bash
echo 2 > /proc/sys/vm/drop_caches

There is no performance drop on the client machines, but happily, the MDS congestion is solved by this. We also went the RBD way before this, but for large RBD images we much prefer CephFS.

Regards, Mike
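As a config sketch, the "every few minutes" schedule could be wired up via a system crontab entry like the following (the file name and interval are illustrative, not from the thread):

```shell
# /etc/cron.d/flush-cephfs-caches
# Drop dentry/inode caches every 3 minutes on CephFS client machines.
*/3 * * * * root /bin/sh -c 'echo 2 > /proc/sys/vm/drop_caches'
```

echo 2 drops only dentries and inodes, leaving the page cache intact, which is why no client-side performance drop was observed.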
Re: [ceph-users] Sandisk SSDs
Hi,

some time ago, when starting a Ceph evaluation cluster, I used SSDs with similar specs. I would strongly recommend against it: during normal operation things might be fine, but wait until the first disk fails and things have to be backfilled. If you still try, please let me know how things turn out for you.

Regards, Mike
Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?
JiaJia, all,

thanks. Yes, I have the mount opts in mtab, and correct, if I leave out the "-v" option there are no complaints. mtab:

mounted ... type ceph (name=cephfs,rasize=134217728,key=client.cephfs)

It has to be rasize (rsize will not work). One can check here:

cat /sys/class/bdi/ceph-*/read_ahead_kb
-> 131072

And YES! I am so happy: dd of a 40GB file does a lot more single-threaded now, much better.

rasize= 67108864   222 MB/s
rasize=134217728   360 MB/s
rasize=268435456   474 MB/s

Thank you all very much for bringing me onto the right track, highly appreciated.

Regards, Mike

On 11/23/16 5:55 PM, JiaJia Zhong wrote:

Mike, if you run mount.ceph with the "-v" option, you may get "ceph: Unknown mount option rsize". Actually, you can ignore this; both rsize and rasize will be passed to the mount syscall. I believe you have mounted CephFS successfully; run "mount" in a terminal to check the actual mount opts in mtab.

-- Original --
From: "Mike Miller" <millermike...@gmail.com>
Date: Wed, Nov 23, 2016 02:38 PM
To: "Eric Eastman" <eric.east...@keepertech.com>
Cc: "Ceph Users" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?

Hi,

I did some testing with multithreaded access and dd; performance scales as it should. Any ideas to improve single-threaded read performance further would be highly appreciated; some of our use cases require reading large files in a single thread. I have tried changing the readahead on the kernel client cephfs mount too, rsize and rasize:

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message "ceph: Unknown mount option rsize" (or unknown rasize). Can someone explain how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout?
On a test cluster running 10.2.3 I created a 5GB file and then looked at the layout:

# ls -l test.dat
-rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
# file: test.dat
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

From what I understand, with this layout you are reading 4MB of data from 1 OSD at a time, so I think you are seeing the overall speed of a single SATA drive. I do not think increasing your MON/MDS links to 10Gb will help, nor, for a single-file read, will moving the metadata to SSD. To test this, you may want to try creating 10 x 50GB files and reading them in parallel to see if your overall throughput increases. If so, take a look at the layout parameters and see if you can change the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Regards, Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

reading a big file of 50 GB (tried more too):

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks on 10 OSD hosts (6272 pgs, replication 3) gives me only about 122 MB/s read speed single-threaded. Scrubbing was turned off during the measurement.

I have been searching for possible bottlenecks. The network is not the problem: the machine running dd is connected to the cluster public network with a 20 GBASE-T bond. The OSD hosts have dual networks: cluster public 10 GBASE-T, private 10 GBASE-T. The OSD SATA disks are utilized only up to about 10% or 20%, not more. CPUs on the OSD hosts are idle too. CPUs on the MON are idle; MDS usage is about 1.0 (1 core used on this 6-core machine). MON and MDS are connected with only 1 GbE (I would expect some latency from that, but no bandwidth issues; in fact network bandwidth is about 20 Mbit max).
If I read a 50 GB file, then clear the cache on the reading machine (but not the OSD caches), I get much better read performance of about 620 MB/s. That seems logical to me, as much (most) of the data is still in the OSD cache buffers. But the read performance is still not great considering that the reading machine is connected to the cluster with a 20 Gbit/s bond.

How can I improve this? I am not really sure, but from my understanding two possible bottlenecks come to mind:

1) The 1 GbE connection to MON/MDS. Is this the reason why reads are slow and the OSD disks are not hammered by read requests and thereby fully utilized?

2) Metadata on SSD. Currently, cephfs_metadata is on the same pool as the data, on the spinning SATA disks. Is this the bottleneck? Is moving the metadata to SSD a solution?

Or is it both? Your experience and insight are highly appreciated.

Thanks, Mike
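The rasize figures reported earlier in this thread line up with what the kernel exposes: rasize is given in bytes on the mount command line, while /sys/class/bdi/ceph-*/read_ahead_kb reports KB, so 134217728 bytes should appear as 131072 KB. A quick check of the arithmetic:

```shell
# rasize (bytes, as passed to mount.ceph) divided by 1024 should
# match what the kernel reports in read_ahead_kb.
echo $((134217728 / 1024))   # -> 131072
echo $((268435456 / 1024))   # -> 262144
```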
Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?
Hi,

I did some testing with multithreaded access and dd; performance scales as it should. Any ideas to improve single-threaded read performance further would be highly appreciated; some of our use cases require reading large files in a single thread. I have tried changing the readahead on the kernel client cephfs mount too, rsize and rasize:

mount.ceph ... -o name=cephfs,secretfile=secret.key,rsize=67108864

Doing this on kernel 4.5.2 gives the error message "ceph: Unknown mount option rsize" (or unknown rasize). Can someone explain how I can experiment with readahead on cephfs?

Mike

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout? On a test cluster running 10.2.3 I created a 5GB file and then looked at the layout:

# ls -l test.dat
-rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
# file: test.dat
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

From what I understand, with this layout you are reading 4MB of data from 1 OSD at a time, so I think you are seeing the overall speed of a single SATA drive. I do not think increasing your MON/MDS links to 10Gb will help, nor, for a single-file read, will moving the metadata to SSD. To test this, you may want to try creating 10 x 50GB files and reading them in parallel to see if your overall throughput increases. If so, take a look at the layout parameters and see if you can change the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Regards, Eric

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

reading a big file of 50 GB (tried more too):

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks on 10 OSD hosts (6272 pgs, replication 3) gives me only about 122 MB/s read speed single-threaded.
Scrubbing was turned off during the measurement.

I have been searching for possible bottlenecks. The network is not the problem: the machine running dd is connected to the cluster public network with a 20 GBASE-T bond. The OSD hosts have dual networks: cluster public 10 GBASE-T, private 10 GBASE-T. The OSD SATA disks are utilized only up to about 10% or 20%, not more. CPUs on the OSD hosts are idle too. CPUs on the MON are idle; MDS usage is about 1.0 (1 core used on this 6-core machine). MON and MDS are connected with only 1 GbE (I would expect some latency from that, but no bandwidth issues; in fact network bandwidth is about 20 Mbit max).

If I read a 50 GB file, then clear the cache on the reading machine (but not the OSD caches), I get much better read performance of about 620 MB/s. That seems logical to me, as much (most) of the data is still in the OSD cache buffers. But the read performance is still not great considering that the reading machine is connected to the cluster with a 20 Gbit/s bond.

How can I improve this? I am not really sure, but from my understanding two possible bottlenecks come to mind:

1) The 1 GbE connection to MON/MDS. Is this the reason why reads are slow and the OSD disks are not hammered by read requests and thereby fully utilized?

2) Metadata on SSD. Currently, cephfs_metadata is on the same pool as the data, on the spinning SATA disks. Is this the bottleneck? Is moving the metadata to SSD a solution?

Or is it both? Your experience and insight are highly appreciated.

Thanks, Mike
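Eric's parallel-read experiment can be sketched as a simple shell loop. This is a local stand-in using small temp files; against a real cluster you would point the paths at big files on the CephFS mount and compare the aggregate rate with the single-threaded dd number:

```shell
# Create a few files, then read them back in parallel with dd.
# On CephFS, parallel readers hit different objects/OSDs at once.
mkdir -p /tmp/ddtest
for f in f1 f2 f3 f4; do
  dd if=/dev/zero of=/tmp/ddtest/$f bs=1M count=8 status=none
done
for f in f1 f2 f3 f4; do
  dd if=/tmp/ddtest/$f of=/dev/null bs=1M status=none &
done
wait
echo "all reads finished"
```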
Re: [ceph-users] cephfs (rbd) read performance low - where is the bottleneck?
Thank you very much for this info.

On 11/21/16 12:33 PM, Eric Eastman wrote:

Have you looked at your file layout? On a test cluster running 10.2.3 I created a 5GB file and then looked at the layout:

# ls -l test.dat
-rw-r--r-- 1 root root 524288 Nov 20 23:09 test.dat
# getfattr -n ceph.file.layout test.dat
# file: test.dat
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=cephfs_data"

The file layout looks the same in my case.

From what I understand, with this layout you are reading 4MB of data from 1 OSD at a time, so I think you are seeing the overall speed of a single SATA drive. I do not think increasing your MON/MDS links to 10Gb will help, nor, for a single-file read, will moving the metadata to SSD.

Really? Does Ceph really wait until each stripe_unit read has finished before issuing the next one?

To test this, you may want to try creating 10 x 50GB files and reading them in parallel to see if your overall throughput increases.

Scaling through parallelism works as expected, no problem there.

If so, take a look at the layout parameters and see if you can change the file layout to get more parallelization.

https://github.com/ceph/ceph/blob/master/doc/dev/file-striping.rst
https://github.com/ceph/ceph/blob/master/doc/cephfs/file-layouts.rst

Interesting. But how would I change this to improve single-threaded read speed? And how would I apply the change to already existing files?

Regards, Mike

On Sun, Nov 20, 2016 at 3:24 AM, Mike Miller <millermike...@gmail.com> wrote:

Hi,

reading a big file of 50 GB (tried more too):

dd if=bigfile of=/dev/zero bs=4M

in a cluster with 112 SATA disks on 10 OSD hosts (6272 pgs, replication 3) gives me only about 122 MB/s read speed single-threaded. Scrubbing was turned off during the measurement. I have been searching for possible bottlenecks. The network is not the problem: the machine running dd is connected to the cluster public network with a 20 GBASE-T bond.
The OSD hosts have dual networks: cluster public 10 GBASE-T, private 10 GBASE-T. The OSD SATA disks are utilized only up to about 10% or 20%, not more. CPUs on the OSD hosts are idle too. CPUs on the MON are idle; MDS usage is about 1.0 (1 core used on this 6-core machine). MON and MDS are connected with only 1 GbE (I would expect some latency from that, but no bandwidth issues; in fact network bandwidth is about 20 Mbit max).

If I read a 50 GB file, then clear the cache on the reading machine (but not the OSD caches), I get much better read performance of about 620 MB/s. That seems logical to me, as much (most) of the data is still in the OSD cache buffers. But the read performance is still not great considering that the reading machine is connected to the cluster with a 20 Gbit/s bond.

How can I improve this? I am not really sure, but from my understanding two possible bottlenecks come to mind:

1) The 1 GbE connection to MON/MDS. Is this the reason why reads are slow and the OSD disks are not hammered by read requests and thereby fully utilized?

2) Metadata on SSD. Currently, cephfs_metadata is on the same pool as the data, on the spinning SATA disks. Is this the bottleneck? Is moving the metadata to SSD a solution?

Or is it both? Your experience and insight are highly appreciated.

Thanks, Mike
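To the "how would I change this" question above, a hedged sketch: CephFS layouts are set through virtual extended attributes, and a file's layout cannot be changed once it contains data, so existing files have to be copied into a directory carrying the new layout (the directory name below is illustrative):

```shell
# Raise the stripe count on a new directory so files created under
# it are striped across more objects (and thus more OSDs) per span.
mkdir /mnt/cephfs/striped
setfattr -n ceph.dir.layout.stripe_count -v 8 /mnt/cephfs/striped

# Layouts are fixed once a file has data; copying rewrites the file
# with the directory's layout.
cp /mnt/cephfs/bigfile /mnt/cephfs/striped/
getfattr -n ceph.dir.layout /mnt/cephfs/striped
```

Whether a higher stripe_count helps a strictly single-threaded dd depends on how much readahead the client issues across stripes; the rasize results later in this thread suggest readahead is the bigger lever.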
[ceph-users] cephfs - mds hardware recommendation for 40 million files and 500 users
Hi, we have started to migrate user homes to cephfs with an MDS server that has 32 GB RAM. With multiple rsync threads copying, this seems to be undersized; the mds process consumes all 32 GB of memory, fitting about 4 million caps. Any hardware recommendation for about 40 million files and about 500 users? Currently, we are on hammer 0.94.5 and linux ubuntu, kernel 3.13. Thanks and regards, Mike
Re: [ceph-users] osd current.remove.me.somenumber
Hi Greg, thanks, highly appreciated. And yes, that was on an osd with btrfs. We switched back to xfs because of btrfs instabilities. Regards, -Mike On 6/27/16 10:13 PM, Gregory Farnum wrote: On Sat, Jun 25, 2016 at 11:22 AM, Mike Miller <millermike...@gmail.com> wrote: Hi, what is the meaning of the directory "current.remove.me.846930886" in /var/lib/ceph/osd/ceph-14? If you're using btrfs, I believe that's a no-longer-required snapshot of the current state of the system. If you're not, I've no idea what creates directories named like that. -Greg
[ceph-users] osd current.remove.me.somenumber
Hi, what is the meaning of the directory "current.remove.me.846930886" in /var/lib/ceph/osd/ceph-14? Thanks and regards, Mike
Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5
Nick, all, fantastic, that did it! I installed kernel 4.5.2 on the client; now the single threaded read performance using a krbd mount is up to about 370 MB/s with the standard 256 readahead size, and touching 400 MB/s with larger readahead sizes. All single threaded. Multi-threaded krbd reads on the same mount also improved; a 10 GBit/s network connection is easily saturated. Thank you all so much for the discussion and the hints. Regards, Mike On 4/23/16 9:51 AM, n...@fisk.me.uk wrote: I've just looked through github for the Linux kernel and it looks like that readahead fix was introduced in 4.4, so I'm not sure if it's worth trying a slightly newer kernel? Sent from Nine *From:* Mike Miller <millermike...@gmail.com> *Sent:* 21 Apr 2016 2:20 pm *To:* ceph-users@lists.ceph.com *Subject:* Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5 Hi Udo, thanks, just to make sure, further increased the readahead: $ sudo blockdev --getra /dev/rbd0 1048576 $ cat /sys/block/rbd0/queue/read_ahead_kb 524288 No difference here. The first one is sectors (512 bytes), the second one KB. The second read (after dropping the cache) is somewhat faster (10%-20%) but not much. I also found this info: http://tracker.ceph.com/issues/9192 Maybe Ilya can help us; he probably knows best how this can be improved. Thanks and cheers, Mike On 4/21/16 4:32 PM, Udo Lembke wrote: Hi Mike, On 21.04.2016 at 09:07, Mike Miller wrote: Hi Nick and Udo, thanks, very helpful, I tweaked some of the config parameters along the lines Udo suggests, but still only some 80 MB/s or so. That means you have reached factor 3 (this is roughly the value I see with a single thread on RBD too). Better than nothing. Kernel 4.3.4 running on the client machine and comfortable readahead configured $ sudo blockdev --getra /dev/rbd0 262144 Still not more than about 80-90 MB/s. There are two possibilities for read-ahead. Take a look here (and change with echo): cat /sys/block/rbd0/queue/read_ahead_kb Perhaps there are slight differences? For writing the parallelization is amazing and I see very impressive speeds, but why is reading performance so much behind? Why is it not parallelized the same way writing is? Is this something coming up in the jewel release? Or is it planned further down the road? If you read a big file and clear your cache ("echo 3 > /proc/sys/vm/drop_caches") on the client, is the second read very fast? I assume yes. In this case the data you read is in the cache on the osd-nodes... so tuning must be done there (and I'm very interested in improvements). Udo
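For reference, the two readahead values quoted in this thread are the same setting in different units: `blockdev --getra`/`--setra` count 512-byte sectors, while `/sys/block/<dev>/queue/read_ahead_kb` is in KiB. A quick sanity check of the conversion (helper name is mine):

```python
SECTOR = 512  # blockdev --getra/--setra work in 512-byte sectors

def ra_sectors_to_kb(sectors):
    """Convert a blockdev readahead value (sectors) to read_ahead_kb."""
    return sectors * SECTOR // 1024

# The two numbers from the thread describe the same 512 MiB readahead:
#   blockdev --getra /dev/rbd0           -> 1048576 sectors
#   /sys/block/rbd0/queue/read_ahead_kb  -> 524288 KiB
print(ra_sectors_to_kb(1048576))  # 524288
```

So "no difference here" is expected: the two interfaces were already set consistently, and raising one simply mirrors into the other.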
Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5
Hi Nick and Udo, thanks, very helpful, I tweaked some of the config parameters along the line Udo suggests, but still only some 80 MB/s or so. Kernel 4.3.4 running on the client machine and comfortable readahead configured $ sudo blockdev --getra /dev/rbd0 262144 Still not more than about 80-90 MB/s. For writing the parallelization is amazing and I see very impressive speeds, but why is reading performance so much behind? Why is it not parallelized the same way writing is? Is this something coming up in the jewel release? Or is it planned further down the road? Please let me know if there is a way to enable clients better single threaded read performance for large files. Thanks and regards, Mike On 4/20/16 10:43 PM, Nick Fisk wrote: -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Udo Lembke Sent: 20 April 2016 07:21 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Slow read on RBD mount, Hammer 0.94.5 Hi Mike, I don't have experiences with RBD mounts, but see the same effect with RBD. You can do some tuning to get better results (disable debug and so on). 
As a hint, some values from a ceph.conf:

[osd]
debug asok = 0/0
debug auth = 0/0
debug buffer = 0/0
debug client = 0/0
debug context = 0/0
debug crush = 0/0
debug filer = 0/0
debug filestore = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug journal = 0/0
debug journaler = 0/0
debug lockdep = 0/0
debug mds = 0/0
debug mds balancer = 0/0
debug mds locker = 0/0
debug mds log = 0/0
debug mds log expire = 0/0
debug mds migrator = 0/0
debug mon = 0/0
debug monc = 0/0
debug ms = 0/0
debug objclass = 0/0
debug objectcacher = 0/0
debug objecter = 0/0
debug optracker = 0/0
debug osd = 0/0
debug paxos = 0/0
debug perfcounter = 0/0
debug rados = 0/0
debug rbd = 0/0
debug rgw = 0/0
debug throttle = 0/0
debug timer = 0/0
debug tp = 0/0
filestore_op_threads = 4
osd max backfills = 1
osd mount options xfs = "rw,noatime,inode64,logbufs=8,logbsize=256k,allocsize=4M"
osd mkfs options xfs = "-f -i size=2048"
osd recovery max active = 1
osd_disk_thread_ioprio_class = idle
osd_disk_thread_ioprio_priority = 7
osd_disk_threads = 1
osd_enable_op_tracker = false
osd_op_num_shards = 10
osd_op_num_threads_per_shard = 1
osd_op_threads = 4

Udo

On 19.04.2016 11:21, Mike Miller wrote:
Hi,
RBD mount
ceph v0.94.5
6 OSD nodes with 9 HDDs each
10 GBit/s public and private networks
3 MON nodes, 1 GBit/s network
An rbd mounted with btrfs filesystem format performs really badly when reading. Tried readahead in all combinations but that does not help in any way. Write rates are very good, in excess of 600 MB/s up to 1200 MB/s, average 800 MB/s. Read rates on the same mounted rbd are about 10-30 MB/s!?

What kernel are you running? Older kernels had an issue where readahead was capped at 2MB. In order to get good read speeds you need readahead set to about 32MB+.

Of course, both writes and reads are from a single client machine with a single write/read command. So I am looking at single threaded performance. Actually, I was hoping to see at least 200-300 MB/s when reading, but I am seeing 10% of that at best.
Thanks for your help. Mike
[ceph-users] Question about cache tier and backfill/recover
Hi, in case of a failure in the storage tier, say single OSD disk failure or complete system failure with several OSD disks, will the remaining cache tier (on other nodes) be used for rapid backfilling/recovering first until it is full? Or is backfill/recovery done directly to the storage tier? Thanks and regards, Mike
Re: [ceph-users] MDS memory sizing
Hi Dietmar, it all depends on how many inodes have caps on the mds. I have run a very similar configuration with 0.5 TB raw and about 200 million files, mds collocated with mon and 32 GB RAM. When rsyncing files from other servers onto cephfs I have observed that the mds sometimes runs out of memory and hangs. All on hammer 0.94. Regards, Mike On 3/1/16 8:13 AM, Yan, Zheng wrote: On Tue, Mar 1, 2016 at 7:28 PM, Dietmar Rieder wrote: Dear ceph users, I'm in the very initial phase of planning a ceph cluster and have a question regarding the RAM recommendation for an MDS. According to the ceph docs the minimum amount of RAM should be "1 GB minimum per daemon". Is this per OSD in the cluster or per MDS in the cluster? I plan to run 3 ceph-mon on 3 dedicated machines and would like to run 3 ceph-mds on these machines as well. The raw capacity of the cluster should be ~1.9PB. Would 64GB of RAM then be enough for the ceph-mon/ceph-mds nodes? Each file inode in MDS uses about 2k memory (it's not relevant to file size). MDS memory usage depends on how large the active file set is. Regards Yan, Zheng Thanks Dietmar -- Dietmar Rieder
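Taking the ~2 KB per cached inode figure quoted above at face value, the sizing for the 40-million-file case in this thread can be sketched as a back-of-the-envelope calculation (helper name is mine; the real number also depends on how much of the file set is active at once):

```python
def mds_cache_ram_gib(cached_inodes, bytes_per_inode=2048):
    """Rough MDS RAM needed to hold a set of inodes in cache,
    assuming ~2 KB per inode+dentry (figure from the thread)."""
    return cached_inodes * bytes_per_inode / 2**30

# If all 40 million files were held in cache at once:
print(round(mds_cache_ram_gib(40_000_000), 1))  # 76.3 (GiB)
```

Which is why a 32 GB MDS chokes during multi-threaded rsync: the clients can pin caps on far more inodes than fit in that budget.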
[ceph-users] HBA - PMC Adaptec HBA 1000
Hi, can someone report their experiences with the PMC Adaptec HBA 1000 series of controllers? https://www.adaptec.com/en-us/smartstorage/hba/ Thanks and regards, Mike
[ceph-users] mount.ceph not accepting options, please help
Hi, sorry, the question might seem very easy, probably my bad, but can you please help me understand why I am unable to change the read ahead size and other options when mounting cephfs? mount.ceph m2:6789:/ /foo2 -v -o name=cephfs,secret=,rsize=1024000 the result is: ceph: Unknown mount option rsize I am using hammer 0.94.5 and ubuntu trusty. Thanks for your help! Mike
[ceph-users] Debug / monitor osd journal usage
Hi, is there a way to debug / monitor the osd journal usage? Thanks and regards, Mike
[ceph-users] Mix of SATA and SSD
Hi, can you please help me with a question I am currently thinking about. I am entertaining an osd node design with a mixture of SATA spinner based osd daemons and SSD based osd daemons. Is it possible to have incoming write traffic go to the SSDs first and then, when write traffic becomes less intense, redistribute from the SSDs to the spinners? Is this possible using a suitable crushmap? Is this thought equivalent to having large SSD journals? Thanks and regards, Mike
[ceph-users] ceph-deploy osd prepare for journal size 0
Hi, for testing I would like to create some OSDs in the hammer release with journal size 0. I included this in ceph.conf:

[osd]
osd journal size = 0

Then I zapped the disk in question and tried: 'ceph-deploy disk zap o1:sda'

Thank you for your advice on how to prepare an osd without a journal / journal size 0. Thanks and regards, Mike

---
[ceph_deploy.conf][DEBUG ] found configuration file at: /home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO ] Invoked (1.5.22): /usr/bin/ceph-deploy disk prepare o1:sda
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks o1:/dev/sda:
[o1][DEBUG ] connection detected need for sudo
[o1][DEBUG ] connected to host: o1
[o1][DEBUG ] detect platform information from remote host
[o1][DEBUG ] detect machine type
[ceph_deploy.osd][INFO ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to o1
[o1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[o1][INFO ] Running command: sudo udevadm trigger --subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host o1 disk /dev/sda journal None activate False
[o1][INFO ] Running command: sudo ceph-disk -v prepare --fs-type xfs --cluster ceph -- /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[o1][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[o1][WARNIN] INFO:ceph-disk:Will colocate journal with data on /dev/sda
[o1][WARNIN] DEBUG:ceph-disk:Creating journal partition num 2 size 0 on /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk --new=2:0:0M --change-name=2:ceph journal --partition-guid=2:ded83283-2023-4c8e-93ae-b33341710bde --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sda
[o1][DEBUG ] The operation has completed successfully.
[o1][WARNIN] DEBUG:ceph-disk:Calling partprobe on prepared device /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[o1][WARNIN] DEBUG:ceph-disk:Journal is GPT partition /dev/disk/by-partuuid/ded83283-2023-4c8e-93ae-b33341710bde
[o1][WARNIN] DEBUG:ceph-disk:Journal is GPT partition /dev/disk/by-partuuid/ded83283-2023-4c8e-93ae-b33341710bde
[o1][WARNIN] DEBUG:ceph-disk:Creating osd partition on /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:87ab533b-e530-4fa3-bfad-8a157a88cc88 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be -- /dev/sda
[o1][DEBUG ] The operation has completed successfully.
[o1][WARNIN] DEBUG:ceph-disk:Calling partprobe on created device /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sda
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/udevadm settle
[o1][WARNIN] DEBUG:ceph-disk:Creating xfs fs on /dev/sda1
[o1][WARNIN] INFO:ceph-disk:Running command: /sbin/mkfs -t xfs -f -i size=2048 -- /dev/sda1
[o1][WARNIN] warning: device is not properly aligned /dev/sda1
[o1][WARNIN] agsize (251 blocks) too small, need at least 4096 blocks
[o1][WARNIN] Usage: mkfs.xfs
[o1][WARNIN] /* blocksize */ [-b log=n|size=num]
[o1][WARNIN] /* data subvol */ [-d agcount=n,agsize=n,file,name=xxx,size=num,
[o1][WARNIN] (sunit=value,swidth=value|su=num,sw=num),
[o1][WARNIN] sectlog=n|sectsize=num
[o1][WARNIN] /* inode size */ [-i log=n|perblock=n|size=num,maxpct=n,attr=0|1|2,
[o1][WARNIN] projid32bit=0|1]
[o1][WARNIN] /* log subvol */ [-l agnum=n,internal,size=num,logdev=xxx,version=n
[o1][WARNIN] sunit=value|su=num,sectlog=n|sectsize=num,
[o1][WARNIN] lazy-count=0|1]
[o1][WARNIN] /* label */ [-L label (maximum 12 characters)]
[o1][WARNIN] /* naming */ [-n log=n|size=num,version=2|ci]
[o1][WARNIN] /* prototype file */ [-p fname]
[o1][WARNIN] /* quiet */ [-q]
[o1][WARNIN] /* realtime subvol */ [-r
[ceph-users] Infernalis Error EPERM: problem getting command descriptions from mds.0
Hi, can someone please help me with this error? $ ceph tell mds.0 version Error EPERM: problem getting command descriptions from mds.0 Tell is not working for me on mds. Version: infernalis - trusty Thanks and regards, Mike
[ceph-users] MDS memory usage
Hi, in my cluster with 16 OSD daemons and more than 20 million files on cephfs, the memory usage on the MDS is around 16 GB. It seems that 'mds cache size' has no real influence on the memory usage of the MDS. Is there a formula that relates 'mds cache size' directly to memory consumption on the MDS? In the documentation (and other posts on the mailing list) it is said that the MDS needs 1 GB per daemon. I am observing that the MDS uses almost exactly 1 GB per OSD daemon (I have 16 OSD and 16 GB memory usage on the MDS). Is this the correct formula? Or is it 1 GB per MDS daemon? In my case, the standard 'mds cache size' of 100000 makes the MDS crash and/or the cephfs unresponsive. Larger values for 'mds cache size' seem to work really well. Version trusty 14.04 and hammer. Thanks and kind regards, Mike
Re: [ceph-users] MDS memory usage
Hi Greg, thanks very much. This is clear to me now. As for 'MDS cluster', I thought that this was not recommended at this stage? I would very much like to have a number >1 of MDS in my cluster as this would probably help very much to balance the load. But I am afraid what everybody says about stability issues. Is more than one MDS considered stable enough with hammer? Thanks and regards, Mike On 11/25/15 12:51 PM, Gregory Farnum wrote: On Tue, Nov 24, 2015 at 10:26 PM, Mike Miller <millermike...@gmail.com> wrote: Hi, in my cluster with 16 OSD daemons and more than 20 million files on cephfs, the memory usage on MDS is around 16 GB. It seems that 'mds cache size' has no real influence on the memory usage of the MDS. Is there a formula that relates 'mds cache size' directly to memory consumption on the MDS? The dominant factor should be the number of inodes in cache, although there are other things too. Depending on version I think it was ~2KB of memory for each inode+dentry at last count. In the documentation (and other posts on the mailing list) it is said that the MDS needs 1 GB per daemon. I am observing that the MDS uses almost exactly 1 GB per OSD daemon (I have 16 OSD and 16 GB memory usage on the MDS). Is this the correct formula? Or is it 1 GB per MDS daemon? It's got nothing to do with the number of OSDs. I'm not sure where 1GB per MDS came from, although you can certainly run a reasonable low-intensity cluster on that. In my case, the standard 'mds cache size 10' makes MDS crash and/or the cephfs is unresponsive. Larger values for 'mds cache size' seem to work really well. Right. You need the total cache size of your MDS "cluster" (which is really just 1) to be larger than your working set size or you'll have trouble. Similarly if you have any individual directories which are a significant portion of your total cache it might cause issues. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
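Greg's ~2 KB per inode+dentry figure also gives a rough way to pick 'mds cache size' for a given RAM budget: invert the estimate and see how many cached inodes a machine can afford (helper name is mine; treat the 2 KB figure as version-dependent):

```python
def inodes_for_ram(ram_gib, bytes_per_inode=2048):
    """How many cached inodes ('mds cache size') fit in a RAM budget,
    at the ~2 KB per inode+dentry estimate from the thread."""
    return int(ram_gib * 2**30) // bytes_per_inode

# A 16 GB MDS, per this estimate, can cache roughly:
print(inodes_for_ram(16))  # 8388608 (~8.4M inodes)
```

That is consistent with the observation above: a 16 GB MDS holding millions of inodes in cache, with the working set (not the OSD count) driving memory use.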
[ceph-users] Question about hardware and CPU selection
Hi, as I am planning to set up a ceph cluster with 6 OSD nodes with 10 hard disks in each node, could you please give me some advice about hardware selection? CPU? RAM? I am planning a 10 GBit/s public and a separate 10 GBit/s private network. For a smaller test cluster with 5 OSD nodes and 4 hard disks each, 2 GBit/s public and 4 GBit/s private network, I already tested this using Core i5 boxes with 16 GB RAM installed. In most of my test scenarios, including load, node failure, backfilling, etc., the CPU usage was not at all the bottleneck, with a maximum of about 25% load per core. The private network was also far from being fully loaded. It would be really great to get some advice about hardware choices for my newly planned setup. Thanks very much and regards, Mike