Re: [ceph-users] Consumer-grade SSD in Ceph
I'm sure you already know the following, but just in case:

- Intel SATA D3-S4610 (I think they're out of stock right now)
- Intel SATA D3-S4510 (I see stock of these right now)

On 27/12/19 at 17:56, vita...@yourcmc.ru wrote:
> SATA: Micron 5100-5200-5300, Seagate Nytro 1351/1551 (don't forget to disable their cache with hdparm -W 0)
> NVMe: Intel P4500, Micron 9300
>
>> Thanks for all the replies. In summary: consumer-grade SSD is a no-go.
>> What is an alternative to the SM863a? These are quite hard to get because they are not in stock.

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Consumer-grade SSD in Ceph
Hi Sinan,

Just to reiterate: don't do this. Consumer SSDs will destroy your enterprise SSDs' performance.

Our office cluster is made of consumer-grade servers: cheap gaming motherboards, memory, Ryzen processors, desktop HDDs. But the SSD drives are enterprise-grade; we had awful experiences with consumer SSDs (some perform worse than HDDs with Ceph).

Cheers
Eneko

On 19/12/19 at 20:20, Sinan Polat wrote:
> Hi all,
>
> Thanks for the replies. I am not worried about their lifetime. We will be adding only 1 SSD disk per physical server. All SSDs are enterprise drives. If the added consumer-grade disk fails, no problem.
>
> I am more curious regarding their I/O performance. I do not want to have a 50% drop in performance.
>
> So does anyone have any experience with the 860 EVO or Crucial MX500 in a Ceph setup?
>
> Thanks!
>
> On 19 Dec 2019 at 19:18, Mark Nelson wrote:
>> The way I try to look at this is:
>>
>> 1) How much more do the enterprise-grade drives cost?
>> 2) What are the benefits? (Faster performance, longer life, etc.)
>> 3) How much does it cost to deal with downtime, diagnose issues, and replace malfunctioning hardware?
>>
>> My personal take is that enterprise drives are usually worth it. There may be consumer-grade drives that are worth considering in very specific scenarios if they still have power loss protection and high write durability. Even when I was in academia years ago with very limited budgets, we got burned with consumer-grade SSDs to the point where we had to replace them all. You have to be very careful and know exactly what you are buying.
>>
>> Mark
>>
>> On 12/19/19 12:04 PM, jes...@krogh.cc wrote:
>>> I don't think "usually" is good enough in a production setup.
>>>
>>> Thursday, 19 December 2019, 12.09 +0100 from Виталий Филиппов:
>>>> Usually it doesn't, it only harms performance and probably SSD lifetime too
>>>>
>>>>> I would not be running ceph on ssds without powerloss protection. It
>>>>> delivers a potential data loss scenario

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarragako bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] Single threaded IOPS on SSD pool.
Hi,

On 5/6/19 at 16:53, vita...@yourcmc.ru wrote:
>> Ok, average network latency from VM to OSD's ~0.4ms.
>
> It's rather bad, you can improve the latency by 0.3ms just by upgrading the network.
>
>> Single threaded performance ~500-600 IOPS - or average latency of 1.6ms
>> Is that comparable to what others are seeing?
>
> Good "reference" numbers are 0.5ms for reads (~2000 iops) and 1ms for writes (~1000 iops).
>
> I confirm that the most powerful thing to do is disabling CPU powersave (governor=performance + cpupower -D 0). You usually get 2x single thread iops at once.

We have a small cluster with 4 OSD hosts, each with 1 SSD INTEL SSDSC2KB019T8 (D3-S4510 1.8T), connected with a 10G network (shared with VMs, not a busy cluster). Volumes are replica 3.

Network latency from one node to the other 3:

10 packets transmitted, 10 received, 0% packet loss, time 9166ms
rtt min/avg/max/mdev = 0.042/0.064/0.088/0.013 ms

10 packets transmitted, 10 received, 0% packet loss, time 9190ms
rtt min/avg/max/mdev = 0.047/0.072/0.110/0.017 ms

10 packets transmitted, 10 received, 0% packet loss, time 9219ms
rtt min/avg/max/mdev = 0.061/0.078/0.099/0.011 ms

Your fio test on a 4-core VM:

$ fio fio-job-randr.ini
test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
test: Laying out IO file (1 file / 1024MiB)
Jobs: 1 (f=1): [r(1)][100.0%][r=10.3MiB/s][r=2636 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4056: Wed Jun 5 17:14:33 2019
  Description : [fio random 4k reads]
  read: IOPS=2386, BW=9544KiB/s (9773kB/s)(559MiB/60001msec)
    slat (nsec): min=0, max=616576, avg=10847.27, stdev=3253.55
    clat (nsec): min=0, max=10346k, avg=406536.60, stdev=145643.92
     lat (nsec): min=0, max=10354k, avg=417653.11, stdev=145740.26
    clat percentiles (usec):
     | 1.00th=[ 37], 5.00th=[ 202], 10.00th=[ 258], 20.00th=[ 318],
     | 30.00th=[ 351], 40.00th=[ 383], 50.00th=[ 416], 60.00th=[ 445],
     | 70.00th=[ 474], 80.00th=[ 502], 90.00th=[ 545], 95.00th=[ 578],
     | 99.00th=[ 701], 99.50th=[ 742], 99.90th=[ 1004], 99.95th=[ 1500],
     | 99.99th=[ 3752]
   bw ( KiB/s): min= 0, max=10640, per=100.00%, avg=9544.13, stdev=486.02, samples=120
   iops : min= 0, max= 2660, avg=2386.03, stdev=121.50, samples=120
  lat (usec) : 2=0.01%, 50=2.94%, 100=0.17%, 250=6.20%, 500=70.34%
  lat (usec) : 750=19.92%, 1000=0.33%
  lat (msec) : 2=0.07%, 4=0.03%, 10=0.01%, 20=0.01%
  cpu : usr=1.01%, sys=3.44%, ctx=143387, majf=0, minf=16
  IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=143163,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=9544KiB/s (9773kB/s), 9544KiB/s-9544KiB/s (9773kB/s-9773kB/s), io=559MiB (586MB), run=60001-60001msec

Disk stats (read/write):
    dm-0: ios=154244/120, merge=0/0, ticks=63120/12, in_queue=63128, util=96.98%, aggrios=154244/58, aggrmerge=0/62, aggrticks=63401/40, aggrin_queue=62800, aggrutil=96.42%
  sda: ios=154244/58, merge=0/62, ticks=63401/40, in_queue=62800, util=96.42%

So if I read correctly, about 2500 IOPS read. I see governor=performance (out of the box on Proxmox VE, I think). We haven't touched cpupower, at least not beyond what our distribution (Proxmox VE) does.

For reference, the same test with random write (KVM disk cache is write-back):

$ fio fio-job-randw.ini
test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.12
Starting 1 process
Jobs: 1 (f=1): [w(1)][100.0%][w=35.5MiB/s][w=9077 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=4278: Wed Jun 5 17:35:51 2019
  Description : [fio random 4k writes]
  write: IOPS=9809, BW=38.3MiB/s (40.2MB/s)(2299MiB/60001msec); 0 zone resets
    slat (nsec): min=0, max=856527, avg=13669.16, stdev=5257.21
    clat (nsec): min=0, max=256305k, avg=86123.12, stdev=913448.71
     lat (nsec): min=0, max=256328k, avg=100145.33, stdev=913512.45
    clat percentiles (usec):
     | 1.00th=[ 37], 5.00th=[ 41], 10.00th=[ 46], 20.00th=[ 54],
     | 30.00th=[ 60], 40.00th=[ 65], 50.00th=[ 71], 60.00th=[ 78],
     | 70.00th=[ 86], 80.00th=[ 96], 90.00th=[ 119], 95.00th=[ 151],
     | 99.00th=[ 251], 99.50th=[ 297], 99.90th=[ 586], 99.95th=[ 857],
     | 99.99th=[ 4490]
   bw ( KiB/s): min= 0, max=52392, per=100.00%, avg=39243.27, stdev=3553.88, samples=119
   iops : min= 0, max=13098, avg=9810.81, stdev=888.47, samples=119
  lat (nsec) : 1000=0.01%
  lat (usec) : 2=0.02%, 4=0.01%, 10=0.01%, 20=0.01%, 50=15.44%
  lat (usec) : 100=67.16%, 250=16.36%, 500=0.90%, 750=0.06%, 1000=0.03%
  lat
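The latency and IOPS figures quoted in this thread are tied together by simple arithmetic: with iodepth=1 only one request is in flight, so IOPS is roughly the inverse of the mean per-request latency. The sketch below only illustrates that relation; the function names are made up for this note and it ignores any overhead outside the measured latency.

```python
def iops_from_latency(latency_ms: float, iodepth: int = 1) -> float:
    """Approximate IOPS given mean per-request latency in milliseconds.

    Assumes exactly `iodepth` requests are always in flight.
    """
    return iodepth * 1000.0 / latency_ms


def latency_from_iops(iops: float, iodepth: int = 1) -> float:
    """Approximate mean per-request latency in milliseconds for a given IOPS."""
    return iodepth * 1000.0 / iops


if __name__ == "__main__":
    # "Reference" numbers from the thread: 0.5 ms reads, 1 ms writes.
    print(f"0.5 ms -> {iops_from_latency(0.5):.0f} IOPS")
    print(f"1.0 ms -> {iops_from_latency(1.0):.0f} IOPS")
    # Mean total latency of the randread run above (417653.11 ns).
    print(f"0.4177 ms -> {iops_from_latency(0.41765311):.0f} IOPS")
    # The original poster's 500-600 IOPS expressed as latency.
    print(f"600 IOPS -> {latency_from_iops(600):.2f} ms")
```

The randread run reports IOPS=2386 against a mean latency of about 0.418 ms, which agrees with this estimate to within a fraction of a percent.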
Re: [ceph-users] Intel D3-S4610 performance
Hi Kai,

On 12/3/19 at 9:13, Kai Wembacher wrote:
> Hi everyone,
>
> I have an Intel D3-S4610 SSD with 1.92 TB here for testing and get some pretty bad numbers when running the fio benchmark suggested by Sébastien Han (http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/):
>
> Intel D3-S4610 1.92 TB
> --numjobs=1 write: IOPS=3860, BW=15.1MiB/s (15.8MB/s)(905MiB/60001msec)
> --numjobs=2 write: IOPS=7138, BW=27.9MiB/s (29.2MB/s)(1673MiB/60001msec)
> --numjobs=4 write: IOPS=12.5k, BW=48.7MiB/s (51.0MB/s)(2919MiB/60002msec)
>
> Compared to our current Samsung SM863 SSDs the Intel one is about 6x slower.
>
> Has someone here tested this SSD and can give me some values for comparison?

We don't have D3-S4610 drives, but are in the process of deploying 4 D3-S4510 1.92TB for OSD purposes. I can test them if that helps?

Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
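To compare results like the ones quoted above across drives, it helps to convert between IOPS and bandwidth and to see how well throughput scales with --numjobs. The sketch below assumes a 4 KiB block size (consistent with the reported IOPS and MiB/s figures, but not stated in the message itself); the function names are illustrative.

```python
BLOCK_SIZE = 4096  # bytes; assumed, matches the IOPS/bandwidth pairs quoted above


def bandwidth_mib_s(iops: float, block_size: int = BLOCK_SIZE) -> float:
    """Bandwidth in MiB/s for a given IOPS at a fixed block size."""
    return iops * block_size / (1024 * 1024)


def scaling_efficiency(iops_n: float, iops_1: float, numjobs: int) -> float:
    """Fraction of ideal linear scaling achieved with `numjobs` parallel jobs."""
    return iops_n / (numjobs * iops_1)


if __name__ == "__main__":
    # IOPS values taken from the quoted fio runs.
    runs = {1: 3860, 2: 7138, 4: 12500}
    for jobs, iops in runs.items():
        eff = scaling_efficiency(iops, runs[1], jobs)
        print(f"numjobs={jobs}: {bandwidth_mib_s(iops):.1f} MiB/s, "
              f"scaling efficiency {eff:.0%}")
```

The 4-job figure is computed from the rounded "12.5k" value, so it comes out slightly above the 48.7 MiB/s that fio printed.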
Re: [ceph-users] Blocked ops after change from filestore on HDD to bluestore on SDD
Hi Uwe,

We tried to use a Samsung 840 Pro SSD as an OSD some time ago and it was a no-go; it wasn't that performance was bad, it just didn't work for the kind of use an OSD makes of it. Any HDD was better than it (the disk was healthy and had been used in a software RAID-1 for a couple of years).

I suggest you first check that your Samsung 860 Pro disks work well for Ceph. Also, how is your host's RAM?

Cheers

On 26/2/19 at 22:01, Uwe Sauter wrote:
> Hi,
>
> TL;DR: In my Ceph clusters I replaced all OSDs from HDDs of several brands and models with Samsung 860 Pro SSDs and used the opportunity to switch from filestore to bluestore. Now I'm seeing blocked ops in Ceph and file system freezes inside VMs. Any suggestions?
>
> I have two Proxmox clusters for virtualization which use Ceph on HDDs as backend storage for VMs. About half a year ago I had to increase the pool size and used the occasion to switch from filestore to bluestore. That was when trouble started. Both clusters showed blocked ops that caused freezes inside VMs which needed a reboot to function properly again. I wasn't able to identify the cause of the blocking ops but I blamed the low performance of the HDDs. It was also the time when patches for Spectre/Meltdown were released. Kernel 4.13.x didn't show the behavior while kernel 4.15.x did. After several weeks of debugging the workaround was to go back to filestore.
>
> Today I replaced all HDDs with brand new Samsung 860 Pro SSDs and switched to bluestore again (on one cluster). And… the blocked ops reappeared. I am out of ideas about the cause.
>
> Any idea why bluestore is so much more demanding on the storage devices compared to filestore?
>
> Before switching back to filestore do you have any suggestions for debugging? Anything special to check for in the network?
>
> The clusters are both connected via 10GbE (MTU 9000) and are only lightly loaded (15 VMs on the first, 6 VMs on the second). Each host has 3 SSDs and 64GB memory.
>
> "rados bench" gives decent results for 4M block size but 4K block size triggers blocked ops (and only finishes after I restart the OSD with the blocked ops). Results below.
>
> Thanks,
> Uwe
>
> Results from "rados bench" runs with 4K block size when the cluster didn't block:
>
> root@px-hotel-cluster:~# rados bench -p scbench 60 write -b 4K -t 16 --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4096 bytes to objects of size 4096 for up to 60 seconds or 0 objects
> Object prefix: benchmark_data_px-hotel-cluster_3814550
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 0 0 0 0 0 0 - 0
> 1 16 2338 2322 9.06888 9.07031 0.0068972 0.0068597
> 2 16 4631 4615 9.01238 8.95703 0.0076618 0.00692027
> 3 16 6936 6920 9.00928 9.00391 0.0066511 0.00692966
> 4 16 9173 9157 8.94133 8.73828 0.00416256 0.00698071
> 5 16 11535 11519 8.99821 9.22656 0.00799875 0.00693842
> 6 16 13892 13876 9.03287 9.20703 0.00688782 0.00691459
> 7 15 16173 16158 9.01578 8.91406 0.00791589 0.00692736
> 8 16 18406 18390 8.97854 8.71875 0.00745151 0.00695723
> 9 16 20681 20665 8.96822 8.88672 0.0072881 0.00696475
> 10 16 23037 23021 8.99163 9.20312 0.00728763 0.0069473
> 11 16 24261 24245 8.60882 4.78125 0.00502342 0.00725673
> 12 16 25420 25404 8.26863 4.52734 0.00443917 0.00750865
> 13 16 27347 27331 8.21154 7.52734 0.00670819 0.00760455
> 14 16 28750 28734 8.01642 5.48047 0.00617038 0.00779322
> 15 16 30222 30206 7.8653 5.75 0.00700398 0.00794209
> 16 16 32180 32164 7.8517 7.64844 0.00704785 0.0079573
> 17 16 34527 34511 7.92907 9.16797 0.00582831 0.00788017
> 18 15 36969 36954 8.01868 9.54297 0.00635168 0.00779228
> 19 16 39059 39043 8.02609 8.16016 0.00622597 0.00778436
> 2019-02-26 21:55:41.623245 min lat: 0.00337595 max lat: 0.431158 avg lat: 0.00779143
> sec Cur ops started finished avg MB/s cur MB/s last lat(s) avg lat(s)
> 20 16 41079 41063 8.01928 7.89062 0.00649895 0.00779143
> 21 16 43076 43060 8.00878 7.80078 0.00726145 0.00780128
> 22 16 45433 45417 8.06321 9.20703 0.00455727 0.00774944
> 23 16 47763 47747 8.10832 9.10156 0.00582818 0.00770599
> 24 16 50079 50063 8.14738 9.04688 0.0051125 0.00766894
> 25 16 52477 52461 8.19614 9.36719 0.00537575 0.00762343
> 26 16 54895 54879 8.24415 9.44531 0.00573134 0.00757909
> 27 16 57276 57260 8.28325 9.30078 0.00576683 0.00754383
> 28 16 59487 59471 8.29585 8.63672
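The columns of the rados bench output above are related by Little's law: with a fixed number of concurrent operations, throughput is approximately concurrency divided by average latency. The sketch below checks that relation against the row at second 10 (16 concurrent 4096-byte writes, avg lat 0.0069473 s, avg MB/s 8.99163). It assumes rados bench's "MB" means 2^20 bytes, which is what the printed numbers suggest; the function names are illustrative.

```python
def throughput_ops(concurrency: int, avg_latency_s: float) -> float:
    """Operations per second implied by Little's law: L = lambda * W."""
    return concurrency / avg_latency_s


def mb_per_s(ops_per_s: float, object_size: int) -> float:
    """Throughput in MB/s, taking 1 MB = 2**20 bytes (assumption)."""
    return ops_per_s * object_size / 2**20


if __name__ == "__main__":
    ops = throughput_ops(16, 0.0069473)
    print(f"{ops:.0f} ops/s, {mb_per_s(ops, 4096):.2f} MB/s")
```

The estimate lands within about 0.01 MB/s of the reported average, so with 4K objects this cluster is bounded by per-operation latency (roughly 7 ms at 16 in flight) rather than by bandwidth.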
Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.
Hi,

On 25/11/18 at 18:23, Виталий Филиппов wrote:
> Ok... That's better than the previous thread with the file download, where the topic starter suffered from a normal only-metadata-journaled fs... Thanks for the link, it would be interesting to repeat similar tests. Although I suspect it shouldn't be that bad... at least not all desktop SSDs are that broken - for example https://engineering.nordeus.com/power-failure-testing-with-ssds/ says the Samsung 840 Pro is ok.

Only that Ceph performance for that SSD model is very, very bad. We had one of those repurposed for Ceph and had to run to buy an Intel enterprise SSD drive to replace it.

Don't even try :)

Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
[ceph-users] Proxmox with EMC VNXe 3200
Hi all,

We're planning the migration of a VMware 5.5 cluster backed by an EMC VNXe 3200 storage appliance to Proxmox.

The VNXe has about 3 years of warranty left and half the disks unprovisioned, so the current plan is to use the same VNXe for Proxmox storage. After the warranty expires we'll most probably go with Ceph, but that's some years in the future.

The VNXe seems to support both iSCSI and NFS (CIFS too, but that is really out of my tech-tastes). I guess the best option performance-wise would be iSCSI, but I like the simplicity of NFS. Any idea what the performance impact of this choice (NFS/iSCSI) could be?

Has anyone had any experience with this kind of storage appliance?

Thanks a lot
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] Ceph Mimic on Debian 9 Stretch
Hi Fabian,

Hope your arm is doing well :)

> unless such a backport is created and tested fairly well (and we will spend some more time investigating this internally despite the caveats above), our plan B will probably involve:
>
> - building Luminous for Buster to ease the upgrade from Stretch+Luminous (upgrading both base distro release and Ceph major version in one go did not work out in the past)
> - keeping our Stretch-based release on Luminous even once Luminous is EoL upstream
> - strongly recommending to those of our users that rely on Ceph to upgrade to our (future/next) Buster-based release (which will likely get Mimic or Nautilus as default Ceph version, depending on whether the Ceph release schedule holds or not)
> - hope this whole story does not repeat itself too often because of the inherent misalignment between Ceph and Debian release cycles
>
> especially the second and third point will irritate some of our users, but sometimes life only hands you lemons.

We're responsible for about 6 small Proxmox + Ceph clusters; I think this is the path to take. Use the time to "extend" Luminous support; maybe you can do this together with others, maybe even with some support from Ceph upstream. I think it should be less work than the gcc backport just for a few months. Just skip Mimic like you did with non-LTS releases in the past.

It's also less work for the Proxmox admins, as we'll be able to skip a Ceph upgrade easily.

Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
[ceph-users] Hawk-M4E SSD disks for journal
Hi all,

We're in the process of deploying a new Proxmox/Ceph cluster. We had planned to use S3710 disks for system+journals, but our provider (Dell) is telling us that they're EOL and that the only alternative they offer is some "mix use" Hawk-M4E disks in 200GB/400GB sizes.

I really can't find reliable info on those disks online. Has anyone tried them, and can you comment on whether they perform well or not?

Thanks a lot
Eneko
Re: [ceph-users] Small cluster for VMs hosting
Hi Gandalf,

On 07/11/17 at 14:16, Gandalf Corvotempesta wrote:
> Hi to all,
> I've been away from Ceph for a couple of years (CephFS was still unstable).
>
> I would like to test it again; some questions for a production cluster for VM hosting:
>
> 1. Is CephFS stable?

Yes.

> 2. Can I spin up a 3-node cluster with mons, MDS and OSDs on the same machine?

Yes.

> 3. Hardware suggestions?

Depends on your load. :-)

> 4. How can I understand the ceph health status output in detail? I've not seen any docs about this.

I think it is quite self-explanatory when you know how Ceph works. Don't run Ceph if you don't understand it :) Hell, don't plan a Ceph deployment before understanding it either (reading this list can help too, look at the archives).

> 5. How can I know if the cluster is fully synced or if any background operation (scrubbing, replication, ...) is running?

By looking at the health status output.

> 6. Is 10G Ethernet mandatory? Currently I only have 4 gigabit NICs (2 for public traffic, 2 for cluster traffic).

It is not mandatory; I administer four 3-node clusters and all have 1 Gbit NICs.

Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] Sharing SSD journals and SSD drive choice
- http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000798.html

Regards,
Jens Dueholm Christensen
Rambøll Survey IT

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adam Carheden
Sent: Wednesday, April 26, 2017 5:54 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Sharing SSD journals and SSD drive choice

Thanks everyone for the replies.

I will be avoiding TLC drives; it was just something easy to benchmark with existing equipment. I hadn't thought of unscrupulous data durability lies or performance suddenly tanking in unpredictable ways. I guess it all comes down to trusting the vendor since it would be expensive in time and $$ to test for such things.

Any thoughts on multiple Intel 35XX vs a single 36XX/37XX? All have "DC" prefixes and are listed in the Data Center section of their marketing pages, so I assume they'll all have the same quality underlying NAND.

--
Adam Carheden

On 04/26/2017 09:20 AM, Chris Apsey wrote:
> Adam,
>
> Before we deployed our cluster, we did extensive testing on all kinds of
> SSDs, from consumer-grade TLC SATA all the way to Enterprise PCI-E NVME
> Drives. We ended up going with a ratio of 1x Intel P3608 PCI-E 1.6 TB
> to 12x HGST 10TB SAS3 HDDs. It provided the best
> price/performance/density balance for us overall. As a frame of
> reference, we have 384 OSDs spread across 16 nodes.
>
> A few (anecdotal) notes:
>
> 1. Consumer SSDs have unpredictable performance under load; write
> latency can go from normal to unusable with almost no warning.
> Enterprise drives generally show much less load sensitivity.
> 2. Write endurance; while it may appear that having several
> consumer-grade SSDs backing a smaller number of OSDs will yield better
> longevity than an enterprise grade SSD backing a larger number of OSDs,
> the reality is that enterprise drives that use SLC or eMLC are generally
> an order of magnitude more reliable when all is said and done.
> 3. Power Loss protection (PLP). Consumer drives generally don't do well
> when power is suddenly lost. Yes, we should all have UPS, etc., but
> things happen. Enterprise drives are much more tolerant of
> environmental failures. Recovering from misplaced objects while also
> attempting to serve clients is no fun.
>
> ---
> v/r
>
> Chris Apsey
> bitskr...@bitskrieg.net
> https://www.bitskrieg.net
>
> On 2017-04-26 10:53, Adam Carheden wrote:
>> What I'm trying to get from the list is /why/ the "enterprise" drives
>> are important. Performance? Reliability? Something else?
>>
>> The Intel was the only one I was seriously considering. The others were
>> just ones I had for other purposes, so I thought I'd see how they fared
>> in benchmarks.
>>
>> The Intel was the clear winner, but my tests did show that throughput
>> tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
>> it, all with osd op threads = 2, do the benchmarks below not show that
>> the Hynix would be a better choice (at least for performance)?
>>
>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>> the single drive leaves more bays free for OSD disks, but is there any
>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>> mean:
>>
>> a) fewer OSDs go down if the SSD fails
>>
>> b) better throughput (I'm speculating that the S3610 isn't 4 times
>> faster than the S3520)
>>
>> c) load spread across 4 SATA channels (I suppose this doesn't really
>> matter since the drives can't throttle the SATA bus).
>>
>>
>> --
>> Adam Carheden
>>
>> On 04/2
Re: [ceph-users] Sharing SSD journals and SSD drive choice
Adam,

What David said before about SSD drives is very important. Let me put it another way: use enterprise-grade SSD drives, not consumer-grade. Also, pay attention to endurance.

The only suitable drive for Ceph I see in your tests is the SSDSC2BB150G7, and it probably isn't even the most suitable SATA SSD from Intel; better to use the S3610 or S3710 series.

Cheers
Eneko

On 25/04/17 at 21:02, Adam Carheden wrote:
> On 04/25/2017 11:57 AM, David wrote:
>> On 19 Apr 2017 18:01, "Adam Carheden" wrote:
>>> Does anyone know if XFS uses a single thread to write to its journal?
>>
>> You probably know this but just to avoid any confusion, the journal in this context isn't the metadata journaling in XFS, it's a separate journal written to by the OSD daemons
>
> Ha! I didn't know that.
>
>> I think the number of threads per OSD is controlled by the 'osd op threads' setting which defaults to 2
>
> So the ideal (for performance) CEPH cluster would be one SSD per HDD with 'osd op threads' set to whatever value fio shows as the optimal number of threads for that drive then?
>
>> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps consider going up to a 37xx and putting more OSDs on it. Of course with the caveat that you'll lose more OSDs if it goes down.
>
> Why would you avoid the SanDisk and Hynix? Reliability (I think those two are both TLC)? Brand trust? If it's my benchmarks in my previous email, why not the Hynix? It's slower than the Intel, but sort of decent, at least compared to the SanDisk.
>
> My final numbers are below, including an older Samsung Evo (MLC I think) which did horribly, though not as bad as the SanDisk. The Seagate is a 10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison.
>
> SanDisk SDSSDA240G, fio 1 jobs: 7.0 MB/s (5 trials)
> SanDisk SDSSDA240G, fio 2 jobs: 7.6 MB/s (5 trials)
> SanDisk SDSSDA240G, fio 4 jobs: 7.5 MB/s (5 trials)
> SanDisk SDSSDA240G, fio 8 jobs: 7.6 MB/s (5 trials)
> SanDisk SDSSDA240G, fio 16 jobs: 7.6 MB/s (5 trials)
> SanDisk SDSSDA240G, fio 32 jobs: 7.6 MB/s (5 trials)
> SanDisk SDSSDA240G, fio 64 jobs: 7.6 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 1 jobs: 4.2 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 2 jobs: 0.6 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 4 jobs: 7.5 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 8 jobs: 17.6 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 16 jobs: 32.4 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 32 jobs: 64.4 MB/s (5 trials)
> HFS250G32TND-N1A2A 3P10, fio 64 jobs: 71.6 MB/s (5 trials)
> SAMSUNG SSD, fio 1 jobs: 2.2 MB/s (5 trials)
> SAMSUNG SSD, fio 2 jobs: 3.9 MB/s (5 trials)
> SAMSUNG SSD, fio 4 jobs: 7.1 MB/s (5 trials)
> SAMSUNG SSD, fio 8 jobs: 12.0 MB/s (5 trials)
> SAMSUNG SSD, fio 16 jobs: 18.3 MB/s (5 trials)
> SAMSUNG SSD, fio 32 jobs: 25.4 MB/s (5 trials)
> SAMSUNG SSD, fio 64 jobs: 26.5 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 1 jobs: 91.2 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 2 jobs: 132.4 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 4 jobs: 138.2 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 8 jobs: 116.9 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 16 jobs: 61.8 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 32 jobs: 22.7 MB/s (5 trials)
> INTEL SSDSC2BB150G7, fio 64 jobs: 16.9 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 1 jobs: 0.7 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 2 jobs: 0.9 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 4 jobs: 1.6 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 8 jobs: 2.0 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 16 jobs: 4.6 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 32 jobs: 6.9 MB/s (5 trials)
> SEAGATE ST9300603SS, fio 64 jobs: 0.6 MB/s (5 trials)
>
> For those who come across this and are looking for drives for purposes other than CEPH, those are all sequential write numbers with caching disabled, a very CEPH-journal-specific test. The SanDisk held its own against the Intel using some benchmarks on Windows that didn't disable caching. It may very well be a perfectly good drive for other purposes.

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611 943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
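The benchmark figures in this message can be summarised by asking two questions per drive: at which job count does throughput peak, and what does it deliver at a given writer count? The sketch below copies the MB/s values from the message; treating "32 fio jobs" as a stand-in for 16 OSDs with osd op threads = 2 is an assumption made for illustration, and the helper names are not from any tool.

```python
# Sequential-write results (MB/s) copied from the message above,
# one value per fio job count in JOBS.
JOBS = [1, 2, 4, 8, 16, 32, 64]
RESULTS = {
    "SanDisk SDSSDA240G": [7.0, 7.6, 7.5, 7.6, 7.6, 7.6, 7.6],
    "HFS250G32TND-N1A2A 3P10": [4.2, 0.6, 7.5, 17.6, 32.4, 64.4, 71.6],
    "SAMSUNG SSD": [2.2, 3.9, 7.1, 12.0, 18.3, 25.4, 26.5],
    "INTEL SSDSC2BB150G7": [91.2, 132.4, 138.2, 116.9, 61.8, 22.7, 16.9],
    "SEAGATE ST9300603SS": [0.7, 0.9, 1.6, 2.0, 4.6, 6.9, 0.6],
}


def peak(drive: str) -> tuple:
    """Return (job count, MB/s) at which the drive's throughput is highest."""
    values = RESULTS[drive]
    best = max(range(len(values)), key=values.__getitem__)
    return JOBS[best], values[best]


def at_jobs(drive: str, jobs: int) -> float:
    """Return the measured MB/s for a drive at a specific job count."""
    return RESULTS[drive][JOBS.index(jobs)]


if __name__ == "__main__":
    for drive in RESULTS:
        jobs, mbps = peak(drive)
        print(f"{drive}: peak {mbps} MB/s at {jobs} jobs, "
              f"{at_jobs(drive, 32)} MB/s at 32 jobs")
```

On raw throughput alone the Hynix does beat the Intel at 32 jobs, which is the point Adam raises; the replies in the thread argue that endurance, latency predictability and power-loss protection matter more than this single number.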
Re: [ceph-users] Ceph Bluestore
Hi Michal,

On 14/03/17 at 23:45, Michał Chybowski wrote:
> I'm going to set up a small cluster (5 nodes with 3 MONs, 2 - 4 HDDs per node) to test whether Ceph at such a small scale is going to perform well enough to put it into a production environment (or does it perform well only if there are tens of OSDs, etc.).
> Are there any "do's" and "don'ts" in the matter of OSD storage type (bluestore / xfs / ext4 / btrfs), correct "journal-to-storage-drive-size" ratio and monitor placement in very limited space (dedicated machines just for MONs are not an option)?

You don't tell us what this cluster will be used for. I have had several tiny Ceph clusters (3 nodes) in production for some years now; the Ceph nodes usually do mon+osd+virtualization. They perform quite well for their use case (VMs only use heavy I/O rarely), but I have always built the clusters with SSDs for journals. I have seen better performance with this setup than with some entry-level EMC disk enclosures; I always thought this was a misconfiguration problem on the other enclosure provider's side though! :)

Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611 943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] CephFS PG calculation
Hi Martin,

Take a look at http://ceph.com/pgcalc/

Cheers
Eneko

On 10/03/17 at 09:54, Martin Wittwer wrote:
> Hi List
>
> I am creating a POC cluster with CephFS as a backend for our backup infrastructure. The backups are rsyncs of whole servers. I have 4 OSD nodes with 10 4TB disks and 2 SSDs for journaling per node.
>
> My question is now how to calculate the PG count for that scenario? Is there a way to calculate how many PGs the data/metadata pool needs or are there any recommendations?
>
> Best

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611 943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
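For readers without access to the calculator, the commonly cited rule of thumb is: target about 100 PGs per OSD, multiply by the number of OSDs and by the share of data the pool is expected to hold, divide by the replica count, then round to a power of two. The sketch below applies that rule to the 40 OSDs described in the question. The replica size of 3, the 100-PGs-per-OSD target and the rounding-up choice are assumptions, not values from the thread, and pgcalc itself may round differently.

```python
import math


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    power = 1
    while power < n:
        power *= 2
    return power


def suggested_pg_count(num_osds: int, replica_size: int,
                       target_pgs_per_osd: int = 100,
                       data_fraction: float = 1.0) -> int:
    """Rule-of-thumb PG count for one pool, rounded up to a power of two.

    data_fraction is the share of the cluster's data expected in this pool.
    """
    raw = num_osds * target_pgs_per_osd * data_fraction / replica_size
    return next_power_of_two(max(1, math.ceil(raw)))


if __name__ == "__main__":
    osds = 4 * 10  # 4 OSD nodes with 10 disks each, as in the question
    print("single pool:", suggested_pg_count(osds, replica_size=3))
    # Hypothetical split between a CephFS data pool and a metadata pool.
    print("data (95%):", suggested_pg_count(osds, 3, data_fraction=0.95))
    print("metadata (5%):", suggested_pg_count(osds, 3, data_fraction=0.05))
```

The 95/5 split between data and metadata pools is only a placeholder; the right fractions depend on the workload.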
Re: [ceph-users] Recovery ceph cluster down OS corruption
Hi Iban,

Is the monitor data safe? If it is, just install Jewel on other servers and plug in the OSD disks; it should work.

On 24/02/17 at 14:41, Iban Cabrillo wrote:
> Hi,
>
> We have a serious issue. We have a mini cluster (Jewel version) with two servers (Dell RX730) with 16 bays and the OS installed on dual 8 GB SD cards, but this configuration is working really, really badly.
>
> The replication is 2, but yesterday one server crashed and this morning the other one did. This is not the first time, but on the other occasions we had one server up and the data could be replicated without any trouble by reinstalling the OSD server completely.
>
> As far as I understand, the Ceph data and metadata are still on the bays (data on SATA and metadata on 2 fast SSDs); I think only the OS installed on the SD cards is corrupted.
>
> Is there any way to solve this situation? Any idea would be great!!
>
> Regards, I
>
> --
> Iban Cabrillo Bartolome
> Instituto de Fisica de Cantabria (IFCA)
> Santander, Spain
> Tel: +34942200969
> PGP PUBLIC KEY: http://pgp.mit.edu/pks/lookup?op=get=0xD9DF0B3D6C8C08AC
> Bertrand Russell: "The problem with the world is that the stupid are sure of everything and the intelligent are full of doubts"

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611 943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] Release schedule and notes.
Hi,

On 24/11/16 at 12:09, Stephen Harker wrote:
> Hi All,
>
> This morning I went looking for information on the Ceph release timelines and so on and was directed to this page by Google:
>
> http://docs.ceph.com/docs/jewel/releases/
>
> but this doesn't seem to have been updated for a long time. Is there somewhere else I should be looking?

Here: http://docs.ceph.com/docs/master/releases/ :-)

> Additionally, I tried to find information on the Hammer release that I see as current for Debian Wheezy LTS:
>
> 0.94.9-1~bpo70+1
>
> but there seems to be nothing here either:
>
> http://docs.ceph.com/docs/jewel/release-notes/
>
> The latest Hammer release mentioned is 0.94.6. Has this information been moved elsewhere, or just not updated recently?
>
> Thanks! :)
>
> Kind regards,
>
> Stephen

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943493611 943324914
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
Re: [ceph-users] KVM / Ceph performance problems
Hi Michiel, How are you configuring VM disks on Proxmox? What type (virtio, scsi, ide) and what cache setting? El 23/11/16 a las 07:53, M. Piscaer escribió: Hi, I have an little performance problem with KVM and Ceph. I'm using Proxmox 4.3-10/7230e60f, with KVM version pve-qemu-kvm_2.7.0-8. Ceph is on version jewel 10.2.3 on both the cluster as the client (ceph-common). The systems are connected to the network via an 4x bonding with an total of 4 Gb/s. Within an guest, - when I do an write to I get about 10 MB/s. - Also when I try to do an write within the guest but then directly to ceph I get the same speed. - But when I mount an ceph object on the Proxmox host I get about 110MB/s The guest is connected to interface vmbr160 → bond0.160 → bond0. This bridge vmbr160 has an IP address with the same subnet as the ceph cluster with an mtu 9000. The KVM block device is an virtio device. What can I do to solve this problem? Kind regards, Michiel Piscaer ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943493611 943324914 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating files from ceph fs from cluster a to cluster b without low downtime
On 06/06/16 at 20:53, Oliver Dzombic wrote: Hi, thank you for your suggestion. Rsync will copy the whole file anew if the size is different. Since we are talking about raw image files of virtual servers, rsync is not an option. We need something that copies just the deltas inside a file. Something like lvmsync (which only works with LVM). So I am looking for a tool that can do that at the file level. Have you tried rsync --inplace? It works quite well for us, with no whole-file copying. We use it for raw VM disk files. -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943493611 943324914 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
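For illustration, the rsync suggestion above might look like the sketch below. The paths and the sync_image helper are hypothetical, and the DRY_RUN switch is only there so the command can be inspected before it touches anything. --no-whole-file is added on the assumption that both CephFS trees are mounted locally, since rsync otherwise defaults to whole-file copies between two local paths.

```shell
#!/bin/sh
# Print (DRY_RUN=1, the default in this sketch) or run an in-place delta sync.
# --inplace       : update the destination file directly instead of writing a temp copy
# --no-whole-file : keep the delta algorithm even when both paths are local
sync_image() {
    src=$1
    dest=$2
    set -- rsync -a --inplace --no-whole-file "$src" "$dest"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical mount points for the two clusters.
sync_image /mnt/cluster-a/vm-100.raw /mnt/cluster-b/vm-100.raw
```

Run it repeatedly while the VM is up to bring the copy close, then once more with the VM stopped (and DRY_RUN=0) so the final pass only has to move the last changes.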
Re: [ceph-users] rbd resize option
You have to shrink FS before RBD block! Now your FS is corrupt! :) El 12/05/16 a las 15:41, M Ranga Swami Reddy escribió: Used "resize2fs" and its working for resize to higher number (ie from 10G -> 20G) or so... If I tried to resize the lower numbers (ie from 10G -> 5G), its failied...with below message: === ubuntu@swami-resize-test-vm:/$ sudo resize2fs /dev/vdb sudo: unable to resolve host swami-resize-test-vm resize2fs 1.42.9 (4-Feb-2014) Please run 'e2fsck -f /dev/vdb' first. ubuntu@swami-resize-test-vm:/$ sudo e2fsck -f /dev/vdb sudo: unable to resolve host swami-resize-test-vm e2fsck 1.42.9 (4-Feb-2014) The filesystem size (according to the superblock) is 52428800 blocks The physical size of the device is 13107200 blocks Either the superblock or the partition table is likely to be corrupt! Abort? On Thu, May 12, 2016 at 6:37 PM, Eneko Lacunza <elacu...@binovo.es> wrote: Swami, You must resize (reduce) a filesystem before shrinking a partition/disk. Please search online how to do so with your specific filesystem/partitions. El 12/05/16 a las 15:00, M Ranga Swami Reddy escribió: Not done any FS shrink before "rbd resize". Please let me know what to do with FS shink before "rbd resize" Thanks Swami On Thu, May 12, 2016 at 4:34 PM, Eneko Lacunza <elacu...@binovo.es> wrote: Did you shrink the FS to be smaller than the target rbd size before doing "rbd resize"? El 12/05/16 a las 12:33, M Ranga Swami Reddy escribió: When I used "rbd resize" option for size shrink, the image/volume lost its fs sectors and asking for "fs" not found... I have used "mkf" option, then all data lost in it? This happens with shrink option... Thanks Swami On Wed, May 11, 2016 at 5:28 PM, Christian Balzer <ch...@gol.com> wrote: Hello, On Wed, 11 May 2016 13:33:44 +0200 (CEST) Alexandre DERUMIER wrote: but the fstrim can used with in mount partition...But I wanted to as cloud admin... 
if you use qemu, you can launch fstrim through guest-agent This of course assumes that qemu/kvm is using a disk method that allows for TRIM. And nobody in their right mind uses IDE (performance), while virtio-scsi isn't the default or even supported with some cloud stacks. And of course that the VM in question runs Linux and has fstrim installed. Otherwise solid advise, I agree. Christian http://dustymabe.com/2013/06/26/enabling-qemu-guest-agent-and-fstrim-again/ - Mail original - De: "M Ranga Swami Reddy" <swamire...@gmail.com> À: "Wido den Hollander" <w...@42on.com> Cc: "ceph-users" <ceph-us...@ceph.com> Envoyé: Mercredi 11 Mai 2016 13:16:27 Objet: Re: [ceph-users] rbd resize option Thank you. but the fstrim can used with in mount partition...But I wanted to as cloud admin... I have a few uses with high volume size (ie capacity) allotted, but only used 5% of the capacity. so I wanted to reduce the size to 10% of size using the rbd resize command. But in this process, if a customer's volume has more than 10% data, then I may end-up with data lost... Thanks Swami On Wed, May 11, 2016 at 1:17 PM, Wido den Hollander <w...@42on.com> wrote: Op 11 mei 2016 om 8:38 schreef M Ranga Swami Reddy <swamire...@gmail.com>: Hello, I wanted to resize an image using 'rbd' resize option, but it should be have data loss. For ex: I have image with 100 GB size (thin provisioned). and this image has data of 10GB only. Here I wanted to resize this image to 11GB, so that 10GB data is safe and its resized. Can I do the above resize safely.? No, you can't. You need to resize the filesystem and partitions inside the RBD image to something below 11GB before you can do this. Still, make sure you have backups! Also, why shrink? If you can, run a fstrim on the image, that might reclaimed unused space on the Ceph cluster. If I tried to resize to 5GB, is rbd throughs an error saying that your data is going loss, something like that??? Any inputs here are appriciated. 
Thanks Swami -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
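As a rough sanity check on the e2fsck output quoted in this thread, the block counts can be converted to sizes. This is a minimal sketch assuming 4096-byte filesystem blocks (the thread does not state the block size); the helper name, device name, pool/image name and target sizes in the comments are hypothetical.

```shell
#!/bin/sh
# Convert a block count reported by e2fsck into whole GiB.
# The 4096-byte default is an assumption; check it with: tune2fs -l <device>
blocks_to_gib() {
    blocks=$1
    block_size=${2:-4096}
    echo $(( blocks * block_size / 1073741824 ))
}

blocks_to_gib 52428800    # filesystem size according to the superblock
blocks_to_gib 13107200    # physical size of the device after the shrink

# Shrink order described in the replies (not executed by this sketch):
#   1. in the guest, with the filesystem unmounted:
#        e2fsck -f /dev/vdb
#        resize2fs /dev/vdb 9G          # shrink the FS below the target image size
#   2. on a Ceph client:
#        rbd resize --size 10240 --allow-shrink rbd/myimage   # size in MB
#   3. in the guest again:
#        resize2fs /dev/vdb             # grow the FS to fill the smaller device
```

Under that block-size assumption the superblock still describes a 200 GiB filesystem on a 50 GiB device, which is why e2fsck complains: the image was shrunk underneath a filesystem that was never resized.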
Re: [ceph-users] rbd resize option
Did you shrink the FS to be smaller than the target rbd size before doing "rbd resize"? El 12/05/16 a las 12:33, M Ranga Swami Reddy escribió: When I used "rbd resize" option for size shrink, the image/volume lost its fs sectors and asking for "fs" not found... I have used "mkf" option, then all data lost in it? This happens with shrink option... Thanks Swami On Wed, May 11, 2016 at 5:28 PM, Christian Balzerwrote: Hello, On Wed, 11 May 2016 13:33:44 +0200 (CEST) Alexandre DERUMIER wrote: but the fstrim can used with in mount partition...But I wanted to as cloud admin... if you use qemu, you can launch fstrim through guest-agent This of course assumes that qemu/kvm is using a disk method that allows for TRIM. And nobody in their right mind uses IDE (performance), while virtio-scsi isn't the default or even supported with some cloud stacks. And of course that the VM in question runs Linux and has fstrim installed. Otherwise solid advise, I agree. Christian http://dustymabe.com/2013/06/26/enabling-qemu-guest-agent-and-fstrim-again/ - Mail original - De: "M Ranga Swami Reddy" À: "Wido den Hollander" Cc: "ceph-users" Envoyé: Mercredi 11 Mai 2016 13:16:27 Objet: Re: [ceph-users] rbd resize option Thank you. but the fstrim can used with in mount partition...But I wanted to as cloud admin... I have a few uses with high volume size (ie capacity) allotted, but only used 5% of the capacity. so I wanted to reduce the size to 10% of size using the rbd resize command. But in this process, if a customer's volume has more than 10% data, then I may end-up with data lost... Thanks Swami On Wed, May 11, 2016 at 1:17 PM, Wido den Hollander wrote: Op 11 mei 2016 om 8:38 schreef M Ranga Swami Reddy : Hello, I wanted to resize an image using 'rbd' resize option, but it should be have data loss. For ex: I have image with 100 GB size (thin provisioned). and this image has data of 10GB only. Here I wanted to resize this image to 11GB, so that 10GB data is safe and its resized. 
Can I do the above resize safely.? No, you can't. You need to resize the filesystem and partitions inside the RBD image to something below 11GB before you can do this. Still, make sure you have backups! Also, why shrink? If you can, run a fstrim on the image, that might reclaimed unused space on the Ceph cluster. If I tried to resize to 5GB, is rbd throughs an error saying that your data is going loss, something like that??? Any inputs here are appriciated. Thanks Swami ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943493611 943324914 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding new disk/OSD to ceph cluster
Hi Mad, On 09/04/16 at 14:39, Mad Th wrote: We have a 3 node proxmox/ceph cluster ... each with 4 x4 TB disks Are you using 3-way replication? I guess you are. :) 1) If we want to add more disks, what are the things that we need to be careful about? Will the following steps automatically add it to ceph.conf? ceph-disk zap /dev/sd[X] pveceph createosd /dev/sd[X] -journal_dev /dev/sd[Y] where X is the new disk and Y is the journal disk. Yes, this is the same as adding it from the web GUI. 2) Is it safe to run a different number of OSDs per server, say one server with 5 OSDs and the other two with 4 OSDs? We do plan to add one OSD to each server. It is safe as long as none of your nodes' OSDs are near-full. If you're asking because you're adding a new OSD to each node step by step: yes, it is safe. Be prepared for data moving around when you add new disks (performance will suffer unless you have tuned some parameters in ceph.conf). 3) How do we safely add the new OSD to an existing storage pool? New OSDs will be used automatically by existing Ceph pools unless you have changed the CRUSH map. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943493611 943324914 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Typical architecture in RDB mode - Number of servers explained ?
Hi, On 28/01/16 at 13:53, Gaetan SLONGO wrote: Dear Ceph users, We are currently working on Ceph (RBD mode only). The technology is currently in "preview" state in our lab. We are currently diving into Ceph design... We know it requires at least 3 nodes (OSDs+Monitors inside) to work properly. But we would like to know if it makes sense to use 4 nodes? I've heard this is not a good idea because not all of the capacity of the 4 servers will be available. Can someone confirm? There's no problem using 4 servers for OSDs; just don't put a monitor on one of the nodes. Always keep an odd number of monitors (3 or 5). Monitors don't need to be on an OSD node, and in fact for medium and large clusters it is recommended to have dedicated nodes for them. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
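The odd-number advice follows from monitors needing a strict majority to form a quorum. A small sketch of that arithmetic (the function names are made up for illustration):

```shell
#!/bin/sh
# Monitors required for a strict majority of n monitors.
mon_majority() {
    echo $(( $1 / 2 + 1 ))
}

# Monitor failures the cluster can absorb while still keeping a majority.
mon_tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
    echo "$n monitors: majority $(mon_majority "$n"), tolerates $(mon_tolerated_failures "$n") failure(s)"
done
```

A fourth monitor raises the majority from 2 to 3 without raising the number of failures tolerated, so it adds one more daemon that can fail and nothing else. The fourth server is still fully useful as an OSD node.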
Re: [ceph-users] Upgrading Ceph
Hi, On 27/01/16 at 15:00, Vlad Blando wrote: I have a production Ceph Cluster - 3 nodes - 3 mons on each nodes - 9 OSD @ 4TB per node - using ceph version 0.80.5 (38b73c67d375a2552d8ed67843c8a65c2c0feba6) Now I want to upgrade it to Hammer. I saw the documentation on upgrading and it looks straightforward, but I would like to hear from those who have upgraded a production environment: any precautions, caveats, or preparation that I need to do beforehand? Our migration on 3 Proxmox nodes with 3x3 OSD disks went really smoothly. :) We were running the latest Firefly; I suggest you first upgrade to the latest Firefly too. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD journals killed by VMs generating 500 IOPs (4kB) non-stop for a month, seemingly because of a syslog-ng bug
Hi Mart, El 23/11/15 a las 10:29, Mart van Santen escribió: On 11/22/2015 10:01 PM, Robert LeBlanc wrote: There have been numerous on the mailing list of the Samsung EVO and Pros failing far before their expected wear. This is most likely due to the 'uncommon' workload of Ceph and the controllers of those drives are not really designed to handle the continuous direct sync writes that Ceph does. Because of this they can fail without warning (controller failure rather than MLC failure). I'm new to the mailinglist and I'm scanning the archive currently. And I'm getting a sense of the Samsung Evo quality disks. If i understand correctly, is is at least advise to put DC grade Journals in front om them to safe them a bit from failure. For example intel 750's. I don't think Intel 750's are DC grade. I don't have any of them though. However, is there experience in when the Evo's fail in the Ceph scenarion? For example, is wear leveling is according SMART about 40%, it's time to replace your disks? Or is it just random. Actually we are using mostly Crucial drives (m550, mx200's), there is not a lot about them on the list. Do other people use them and what's there experience so far. I expect about the same quality of the Samsung Evo's, but I'm not sure if that is the correct conclusion. My experience with Samsung 840 pro is that they can't be used for Ceph at all. In case of Crucial M550, they are slow and have little endurance for ceph use, but I have used them and seemed reliable during warranty lifetime (we retired them for performance reasons). About SSD failure in general, do they normally fail hard, or are they just getting unbearable slow? We do measure/graph disks 'busy' performance, and use that as an indicator if a disk is getting slow. Is this is a sensible approach? Just don't do it. Use DC SSDs, like intel S3xxx, or Samsung DC Pro, or something like that. You will save a lot of time and effort, and possibly also money. 
Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] IO scheduler osd_disk_thread_ioprio_class
Hi Jan, What SSD model? I've seen SSDs work quite well usually but suddenly give a totally awful performance for some time (not those 8K you see though). I think there was some kind of firmware process involved, I had to replace the drive with a serious DC one. El 23/06/15 a las 14:07, Jan Schermer escribió: Yes, but that’s a separate issue :-) Some drives are just slow (100 IOPS) for synchronous writes with no other load. The drives I’m testing have ~8K IOPS when not under load - having them drop to 10 IOPS is a huge problem. If it’s indeed a CFQ problem (as I suspect) then no matter what drive you have you will have problems. Jan On 23 Jun 2015, at 14:03, Dan van der Ster d...@vanderster.com wrote: Oh sorry, I had missed that. Indeed that is surprising. Did you read the recent thread (SSD IO performance) discussing the relevance of O_DSYNC performance for the journal? Cheers, Dan On Tue, Jun 23, 2015 at 1:54 PM, Jan Schermer j...@schermer.cz wrote: I only use SSDs, which is why I’m so surprised at the CFQ behaviour - the drive can sustain tens of thousand of reads per second, thousands of writes - yet saturating it with reads drops the writes to 10 IOPS - that’s mind boggling to me. Jan On 23 Jun 2015, at 13:43, Dan van der Ster d...@vanderster.com wrote: On Tue, Jun 23, 2015 at 1:37 PM, Jan Schermer j...@schermer.cz wrote: Yes, I use the same drive one partition for journal other for xfs with filestore I am seeing slow requests when backfills are occuring - backfills hit the filestore but slow requests are (most probably) writes going to the journal - 10 IOPS is just to few for anything. My Ceph version is dumpling - that explains the integers. So it’s possible it doesn’t work at all? I thought that bug was fixed. You can check if it worked by using iotop -b -n1 and looking for threads with the idle priority. Bad news about the backfills no being in the disk thread, I might have to use deadline after all. 
If your experience follows the same paths of most users, eventually deep scrubs will cause latency issues and you'll switch back to cfq plus ionicing the disk thread. Are you using Ceph RBD or object storage? If RBD, eventually you'll find that you need to put the journals on an SSD. Cheers, Dan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Best setup for SSD
Hi, On 02/06/15 16:18, Mark Nelson wrote: On 06/02/2015 09:02 AM, Phil Schwarz wrote: Le 02/06/2015 15:33, Eneko Lacunza a écrit : Hi, On 02/06/15 15:26, Phil Schwarz wrote: On 02/06/15 14:51, Phil Schwarz wrote: i'm gonna have to setup a 4-nodes Ceph(Proxmox+Ceph in fact) cluster. -1 node is a little HP Microserver N54L with 1X opteron + 2SSD+ 3X 4TB SATA It'll be used as OSD+Mon server only. Are these SSDs Intel S3700 too? What amount of RAM? Yes, All DCS3700, for the four nodes. 16GB of RAM on this node. This should be enough for 3 OSDs I think, I used to have a Dell T20/Intel G3230 with 2x1TB OSDs with only 4 GB running OK. Cheers Eneko Yes, indeed. My main problem is doing something non adviced... Running VMs on Ceph nodes... No choice, but it seems that i'll have to do that. Hope i won't peg the CPU too quickly.. I'm doing it in 3 different Proxmox clusters. They're not very busy clusters, but works very well. You might want to consider using cgroups or some other mechanism to segment what runs on what cores. While not ideal, dedicating 2-3 of the cores to ceph and leaving the other(s) for VMs might be a reasonable way to go. I think this may be must if you setup a dedicated SSD pool. A single DC S3700 should suffice for journals for 4 OSDs. I wouldn't recommend using the other one for a cache tier unless you have a very highly skewed hot/cold workload. Perhaps instead make a dedicated SSD pool that could be used for high IOPS workloads. In fact you might consider skipping SSD journals and just making a dedicated SSD pool with all of the SSDs depending on how much write workload your main pool sees and if you could make good use of a dedicated SSD pool. Be warned that running SSD and HD based OSDs in the same server is not recommended. If you need the storage capacity, I'd stick to the journals on SSDs plan. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 
3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Recommendations for a driver situation
Hi, On 02/06/15 14:18, Pontus Lindgren wrote: We have recently acquired new servers for a new ceph cluster and we want to run Debian on those servers. Unfortunately drivers needed for the raid controller are only available in newer kernels than what Debian Wheezy provides. We need to run the dumpling release of Ceph. Since the Ceph repo does not have packages for Debian Jessie I see 3 alternatives for us: 1. Wait for the Ceph repo to add packages for Debian Jessie. Number 1 is not really an option for us. But, is there an approximate ETA on this? Why is this the case? At least Alexandre Derumier is working on this: (check an email from him in this list on 12th May) http://odisoweb1.odiso.net/ceph-jessie/ 2. Run Debian Wheezy with backported drivers. I haven't used them lately, but linux kernel in wheezy-backport is 3.16, is this enough? What kernel version do you require for the drivers? Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Best setup for SSD
Hi, On 02/06/15 14:51, Phil Schwarz wrote: i'm gonna have to setup a 4-nodes Ceph(Proxmox+Ceph in fact) cluster. -1 node is a little HP Microserver N54L with 1X opteron + 2SSD+ 3X 4TB SATA It'll be used as OSD+Mon server only. Are these SSDs Intel S3700 too? What amount of RAM? - 3 nodes are setup upon Dell 730+ 1xXeon 2603, 48 GB RAM, 1x 1TB SAS for OS , 4x 4TB SATA for OSD and 2x DCS3700 200GB intel SSD I can't change the hardware, especially the poor cpu... Everything will be connected through Intel X520+Netgear XS708E, as 10GBE storage network. This cluster will support VM (mostly KVM) upon the 3 R730 nodes. I'm already aware of the CPU pegging all the time...But can't change it for the moment. The VM will be Filesharing servers, poor usage services (DNS,DHCP,AD or OpenLDAP). One Proxy cache (Squid) will be used upon a 100Mb Optical fiber with 500+ clients. My question is : Is it recommended to setup the 2 SSDS as : One SSD as journal for 2 (up to 3in the future) OSDs Or One SSD as journal for the 4 (up to 6 in the future) OSDs and the remaining SSD as cache tiering for the previous SSD+4 OSDs pool ? I haven't used cache tiering myself, but others have not reported much benefit from it (if any) at all, at least this is my understanding. So I think it would be better to use both SSDs for journals. It probably won't help performance using 2 instead of only 1, but it will lessen the impact from a SSD failure. Also it seems that the consensus is 3-4 OSD for each SSD, so it will help when you expand to 6 OSD. SSD should be rock solid enough to support both bandwidth and living time before being destroyed by the low amount of data that will be written on it (Few hundreds of GB per day as rule of thumb..) If all are Intel S3700 you're on the safe side unless you have lots on writes. Anyway I suggest you monitor the SMART values. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 
943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
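To put the "few hundreds of GB per day" figure in perspective, here is a back-of-the-envelope endurance estimate. It is only a sketch: it assumes a rating of 10 drive writes per day over 5 years for the 200 GB DC S3700 (an assumption to check against the datasheet for the exact model) and a hypothetical 300 GB/day of journal writes landing on one SSD.

```shell
#!/bin/sh
# Years until the rated write endurance is used up at a given daily write volume.
# All four inputs are assumptions supplied by the caller, not measured values.
journal_ssd_years() {
    capacity_gb=$1
    dwpd=$2            # rated drive writes per day
    rated_years=$3     # period the DWPD rating is quoted for
    daily_gb=$4        # expected writes per day hitting this SSD
    rated_total_gb=$(( capacity_gb * dwpd * 365 * rated_years ))
    echo $(( rated_total_gb / daily_gb / 365 ))
}

journal_ssd_years 200 10 5 300
```

Under those assumptions rated wear is decades away, which matches the "safe side" remark above. Every write to an OSD also goes through its journal, so the daily figure should be the sum over all OSDs journaled on that SSD; SMART wear indicators remain the thing to watch.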
Re: [ceph-users] Best setup for SSD
Hi, On 02/06/15 15:26, Phil Schwarz wrote: On 02/06/15 14:51, Phil Schwarz wrote: i'm gonna have to setup a 4-nodes Ceph(Proxmox+Ceph in fact) cluster. -1 node is a little HP Microserver N54L with 1X opteron + 2SSD+ 3X 4TB SATA It'll be used as OSD+Mon server only. Are these SSDs Intel S3700 too? What amount of RAM? Yes, All DCS3700, for the four nodes. 16GB of RAM on this node. This should be enough for 3 OSDs I think, I used to have a Dell T20/Intel G3230 with 2x1TB OSDs with only 4 GB running OK. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replacing OSD disks with SSD journal - journal disk space use
Hi, It's firefly 0.80.9, so if the improvement is in Hammer I haven't seen it. Will check back when I upgrade the cluster. Thanks Eneko On 26/05/15 17:45, Robert LeBlanc wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA256 What version of Ceph are you using? I seem to remember an enhancement of ceph-disk for Hammer that is more aggressive in reusing previous partition. - Robert LeBlanc GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Mon, May 25, 2015 at 4:22 AM, Eneko Lacunza wrote: Hi all, We have a firefly ceph cluster (using Promxox VE, but I don't think this is revelant), and found a OSD disk was having quite a high amount of errors as reported by SMART, and also quite high wait time as reported by munin, so we decided to replace it. What I have done is down/out the osd, then remove it (removing partitions). Replace the disk and create a new OSD, which was created with the same ID as the removed one (as I was hoping to not change CRUSH map). So everything has worked as expected, except one minor non-issue: - Original OSD journal was on a separate SSD disk, which had partitions #1 and #2 (journals of 2 OSDs). - Original journal partition (#1) was removed - A new partition has been created as #1, but has been assigned space after the last existing partition. So there is now hole of 5GB in the beginning of SSD disk. Promox is using ceph-disk prepare for this, I seen in the docs (http://ceph.com/docs/master/man/8/ceph-disk/) that ceph-disk prepare creates a new partition in the journal block device. What I'm afraid is that given enough OSD replacements, Proxmox wouldn't find free space for new journals in that SSD disk? Although there would be plenty in the beginning? Maybe the journal-partition creation can be improved so that it can detect free space also in the beginning and between existing partitions? Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 
943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Replacing OSD disks with SSD journal - journal disk space use
Hi all, We have a firefly ceph cluster (using Promxox VE, but I don't think this is revelant), and found a OSD disk was having quite a high amount of errors as reported by SMART, and also quite high wait time as reported by munin, so we decided to replace it. What I have done is down/out the osd, then remove it (removing partitions). Replace the disk and create a new OSD, which was created with the same ID as the removed one (as I was hoping to not change CRUSH map). So everything has worked as expected, except one minor non-issue: - Original OSD journal was on a separate SSD disk, which had partitions #1 and #2 (journals of 2 OSDs). - Original journal partition (#1) was removed - A new partition has been created as #1, but has been assigned space after the last existing partition. So there is now hole of 5GB in the beginning of SSD disk. Promox is using ceph-disk prepare for this, I seen in the docs (http://ceph.com/docs/master/man/8/ceph-disk/) that ceph-disk prepare creates a new partition in the journal block device. What I'm afraid is that given enough OSD replacements, Proxmox wouldn't find free space for new journals in that SSD disk? Although there would be plenty in the beginning? Maybe the journal-partition creation can be improved so that it can detect free space also in the beginning and between existing partitions? Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Hi,

I'm just writing to stress what others have already said, because it is very important that you take it very seriously.

On 20/04/15 19:17, J-P Methot wrote:
> On 4/20/2015 11:01 AM, Christian Balzer wrote:
>>> This is similar to another thread running right now, but since our current setup is completely different from the one described in the other thread, I thought it may be better to start a new one. We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach write speeds of roughly 400 MB/sec, plugged in JBOD on a controller that can theoretically transfer at 6gb/sec. All of that is linked to openstack compute nodes on two bonded 10gbps links (so a max transfer rate of 20 gbps).
>> I sure as hell hope you're not planning to write all that much to this cluster. But then again you're worried about write speed, so I guess you do. Those _consumer_ SSDs will be dropping like flies, there are a number of threads about them here. They also might be of the kind that don't play well with O_DSYNC, I can't recall for sure right now, check the archives. Consumer SSDs universally tend to slow down quite a bit when not TRIM'ed and/or subjected to prolonged writes, like those generated by a benchmark.
> I see, yes it looks like these SSDs are not the best for the job. We will not change them for now, but if they start failing, we will replace them with better ones.

I tried to put a Samsung 840 Pro 256GB in a Ceph setup. It is supposed to be quite a bit better than the EVO, right? It was total crap. No, not "not the best for the job". TOTAL CRAP. :)

It can't give any useful write performance for a Ceph OSD. Spec sheet numbers don't matter here; they don't hold for a Ceph OSD, period. And yes, the drive is fine and works like a charm in workstation workloads.

I suggest you at least get some Intel S3700/S3610 drives and use them for the journals of those Samsung drives; I think that could help performance a lot.

Cheers
Eneko
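For anyone who wants a feel for why spec sheet numbers don't apply, here is a rough sketch that times small synchronous writes, the kind of I/O a journal generates. It is an assumption-laden illustration: it writes to a scratch file on a filesystem rather than to the raw device, the block size and count are arbitrary, and `sync_write_iops` is a name invented for this example. It is no substitute for a proper benchmark on the raw device.

```python
import os
import tempfile
import time


def sync_write_iops(path, block_size=4096, count=200):
    """Write `count` blocks with O_DSYNC and return the writes per second."""
    # Fall back to O_SYNC on platforms that do not expose O_DSYNC.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DSYNC", os.O_SYNC)
    fd = os.open(path, flags, 0o600)
    block = b"\0" * block_size
    try:
        start = time.perf_counter()
        for _ in range(count):
            # Each write returns only once the data is on stable storage.
            os.write(fd, block)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Run this from a directory that lives on the SSD under test.
    fd, scratch = tempfile.mkstemp(dir=".")
    os.close(fd)
    try:
        print(f"{sync_write_iops(scratch):.0f} sync writes/s")
    finally:
        os.remove(scratch)
```

Comparing the result on a consumer drive and on an enterprise drive with power loss protection should show the kind of gap described above, although the exact figures will depend on the filesystem and kernel.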
Re: [ceph-users] journal placement for small office?
Hi,

The common recommendation is to use one good SSD (Intel S3700) for the journals of every 3-4 OSDs, or otherwise to use an internal journal on each OSD. Don't put more than one journal on the same spinning disk.

Also, it is recommended to use 500GB-1TB disks, especially if you have a 1 Gbit network; otherwise, when an OSD fails, recovery time can be quite long. Also look in the mailing list archives for some backfilling tuning for small Ceph clusters.

Cheers.
Eneko

On 06/02/15 16:48, pixelfairy wrote:
> 3 nodes, each with 2x1TB in a raid (for /) and 6x4TB for storage. All of this will be used for block devices for kvm instances. Typical office stuff: databases, file servers, internal web servers, a couple dozen thin clients. Not using the object store or cephfs.
>
> I was thinking about putting the journals on the root disk (this is how my virtual cluster works, because in that version the osds are 4G instead of 4TB), and keeping that on its current raid 1 for resiliency, but I'm worried about making a performance bottleneck. Tempted to swap these out with ssds. If so, how big should I get? Is 1/2TB enough?
>
> The other thought was little partitions on each osd. We're doing xfs because I don't know enough about btrfs to feel comfortable with that. Would the performance degradation be worse? Is there a better way?
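To put rough numbers on the recovery-time point, here is a back-of-the-envelope sketch. The assumptions are mine: the failed OSD was full, all of its data has to cross a single link, and about 80% of the raw link speed is usable. Real recovery is spread over several OSDs and throttled by backfill settings, so treat the result as an order of magnitude only.

```python
def recovery_hours(data_tb, link_gbit, efficiency=0.8):
    """Rough hours needed to re-replicate `data_tb` terabytes over one link."""
    bytes_to_move = data_tb * 1e12
    # Link speed in bytes per second, scaled by an assumed usable fraction.
    bytes_per_second = link_gbit * 1e9 / 8 * efficiency
    return bytes_to_move / bytes_per_second / 3600


for size_tb in (1, 4):
    hours = recovery_hours(size_tb, 1)
    print(f"{size_tb} TB OSD over 1 Gbit: ~{hours:.1f} h")
# 1 TB OSD over 1 Gbit: ~2.8 h
# 4 TB OSD over 1 Gbit: ~11.1 h
```

Under these assumptions a full 4TB OSD keeps the cluster degraded for roughly four times as long as a 1TB one.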
Re: [ceph-users] remote storage
Hi Robert,

I don't see any reply to your email, so I'll send you my thoughts.

Ceph is all about using cheap local disks to build large, performant and resilient storage. Your use case with a SAN and Storwize doesn't seem to fit Ceph very well (I'm not saying it can't be done).

Why are you planning to use Ceph with a SAN? Why not use the SAN directly?

Cheers
Eneko

On 23/01/15 12:28, Robert Duncan wrote:
> Hi All,
>
> This is my first post. I have been using Ceph OSD in OpenStack Icehouse as part of the Mirantis distribution with Fuel. This is my only experience with Ceph, so as you can imagine, it works, but I don't really understand all of the technical details. I am working for a college in Ireland and we are planning on deploying a larger private cloud this year using the Kilo release of OpenStack when it matures. I am architecting the physical components and storage has become quite complex. Currently we use a Dell Equallogic Array and we have configured the cinder service to use the driver provided by Dell; the nodes in the data centre don't have a lot of local storage.
>
> So here is my Ceph ignorance laid bare:
>
> 1 - I have enough compute nodes to run a ceph cluster, radosgw etc. as per http://ceph.com/docs/master/radosgw/
> 2 - I have no available local disks * - this is the problem
> 3 - I have a Dell Equallogic SAN and fabric (7.5k NL SAS)
> 4 - I have access to storage as a service from our ISP. This is an IBM Storwize V7000. I can provision block storage and mount iscsi volumes; it's across town but we have a p2p layer 2 connection
>
> The use cases will be students on a Masters in data analytics using OpenStack Sahara and S3 for data sets. So if I mounted remote storage or network attached storage, would it work? Can I put Ceph directly in front of my Equallogic array and use ceph for cinder, glance, nova and S3?
>
> Has anyone any thoughts or experience on this? Thanks for taking the time to read this; any input would be greatly appreciated.
>
> All the best,
> Rob.
Re: [ceph-users] New firefly tiny cluster stuck unclean
Hi all,

In the end this was fixed this way:

# ceph osd pool set rbd size 1
(wait some seconds for HEALTH_OK)
# ceph osd pool set rbd size 2
(wait almost an hour for HEALTH_OK after backfilling)

I wanted to avoid this, but didn't want to leave the cluster in a bad state all night :)

I really think there's some kind of bug that sometimes prevents Ceph from backfilling correctly; this is quite similar to another problem I reported in December (that time the pools were originally size=3 and, after being changed to size=2, did not become clean).

This time the default pools were deleted and a new rbd pool was created with size=2. This was done before adding the OSDs of one of the nodes.

Thanks
Eneko

On 20/01/15 16:23, Eneko Lacunza wrote:
> Hi all,
>
> I've just created a new ceph cluster for RBD with latest firefly:
> - 3 monitors
> - 2 OSD nodes, each has 1 s3700 (journals) + 2 x 3TB WD red (osd)
>
> Network is 1gbit, different physical interfaces for public and private network. There's only one pool rbd, size=2. There are just 5 rbd devices created.
>
> Somehow I reached the following status:
>
>     cluster 8f839a95-d5e3-4a31-981e-497f9a0e4991
>      health HEALTH_WARN 16 pgs stuck unclean; recovery 2986/47638 objects degraded (6.268%)
>      monmap e3: 3 mons at {0=172.16.1.3:6789/0,1=172.16.1.1:6789/0,2=172.16.1.2:6789/0}, election epoch 10, quorum 0,1,2 1,2,0
>      osdmap e38: 4 osds: 4 up, 4 in
>       pgmap v4347: 128 pgs, 1 pools, 95232 MB data, 23819 objects
>             186 GB used, 10985 GB / 11171 GB avail
>             2986/47638 objects degraded (6.268%)
>                   16 active
>                  112 active+clean
>   client io 43854 B/s wr, 10 op/s
>
> I don't see the reason for the 16 pgs stuck unclean. Can somebody suggest any hint?
> # cat /etc/pve/ceph.conf
> [global]
>     auth client required = cephx
>     auth cluster required = cephx
>     auth service required = cephx
>     auth supported = cephx
>     cluster network = 172.16.2.0/24
>     filestore xattr use omap = true
>     fsid = 8f839a95-d5e3-4a31-981e-497f9a0e4991
>     keyring = /etc/pve/priv/$cluster.$name.keyring
>     osd journal size = 5120
>     osd pool default min size = 1
>     public network = 172.16.1.0/24
>
> [osd]
>     keyring = /var/lib/ceph/osd/ceph-$id/keyring
>     osd max backfills = 1
>     osd recovery max active = 1
>
> [mon.0]
>     host = proxmox3
>     mon addr = 172.16.1.3:6789
>
> [mon.1]
>     host = proxmox1
>     mon addr = 172.16.1.1:6789
>
> [mon.2]
>     host = proxmox2
>     mon addr = 172.16.1.2:6789
>
> # ceph pg dump_stuck
> ok
> pg_stat objects mip degr unf bytes     log  disklog state  state_stamp                v       reported up    up_primary acting acting_primary last_scrub scrub_stamp                last_deep_scrub deep_scrub_stamp
> 3.8     155     0   155  0   650117120 359  359     active 2015-01-20 12:44:19.545685 38'359  38:1593  [1,3] 1          [1,3]  1              0'0        2015-01-20 12:44:15.677078 0'0             2015-01-20 12:44:15.677078
> 3.22    217     0   217  0   910163968 987  987     active 2015-01-20 12:44:19.539596 38'987  38:1312  [3,1] 3          [3,1]  3              0'0        2015-01-20 12:44:15.676128 0'0             2015-01-20 12:44:15.676128
> 3.1e    179     0   179  0   750780416 3001 3001    active 2015-01-20 12:44:19.539570 38'5410 38:5961  [3,0] 3          [3,0]  3              0'0        2015-01-20 12:44:15.675939 0'0             2015-01-20 12:44:15.675939
> 3.62    182     0   182  0   763363328 588  588     active 2015-01-20 12:44:19.539713 38'588  38:932   [3,1] 3          [3,1]  3              0'0        2015-01-20 12:44:15.680806 0'0             2015-01-20 12:44:15.680806
> 3.63    170     0   170  0   713031680 340  340     active 2015-01-20 12:44:19.540329 38'340  38:512   [3,0] 3          [3,0]  3              0'0        2015-01-20 12:44:15.681099 0'0             2015-01-20 12:44:15.681099
> 3.18    190     0   190  0   796917760 589  589     active 2015-01-20 12:44:19.539550 38'589  38:852   [3,0] 3          [3,0]  3              0'0        2015-01-20 12:44:15.675345 0'0             2015-01-20 12:44:15.675345
> 3.1b    200     0   200  0   838860800 734  734     active 2015-01-20 12:44:19.539514 38'734  38:1882  [3,0] 3          [3,0]  3              0'0        2015-01-20 12:44:15.675738 0'0             2015-01-20 12:44:15.675738
> 3.14    185     0   185  0   775946240 393  393     active 2015-01-20 12:44:19.539492 38'393  38:965   [3,0] 3          [3,0]  3              0'0        2015-01-20 12:44:15.675138 0'0             2015-01-20 12:44:15.675138
> 3.10    187     0   187  0   780140560 606  606     active 2015-01-20 12:44:19.545741 38'606  38:925   [1,3] 1          [1,3]  1              0'0        2015-01-20 12:44:15.678035 0'0             2015-01-20 12:44:15.678035
> 3.11    186     0   186  0   780140544 301  301     active 2015-01-20 12:44:20.838550 38'301  38:686   [0,2] 0          [0,2]  0              0'0        2015-01-20 12:44:15.676908 0'0             2015-01-20 12:44:15.676908
> 3.12    187     0   187  0   784334848 601  601     active 2015-01-20 12:44:19.499264 38'601  38:1228  [2,0] 2          [2,0]  2              0'0        2015
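As a sanity check on the status output in this message, the figures can be reproduced with simple arithmetic. The sketch below assumes that the "objects degraded" denominator counts object copies (objects times pool size), which matches the numbers reported; the function name is made up for this example.

```python
def degraded_percent(degraded_copies, total_copies):
    """Percentage of object copies reported as degraded."""
    return 100.0 * degraded_copies / total_copies


objects = 23819   # from the pgmap line
replicas = 2      # pool rbd has size=2
total_copies = objects * replicas
print(total_copies)                                    # 47638
print(f"{degraded_percent(2986, total_copies):.3f}%")  # 6.268%

# PG states from the same output add up to the pool's PG count.
print(16 + 112)                                        # 128
```

So the 16 PGs that are active but not clean account for about 6% of the object copies, which is in line with 16 of 128 PGs being affected.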
Re: [ceph-users] Improving Performance with more OSD's?
Hi,

On 29/12/14 15:12, Christian Balzer wrote:
>> 3rd Node - Monitor only, for quorum - Intel Nuc - 8GB RAM - CPU: Celeron N2820
> Uh oh, a bit weak for a monitor. Where does the OS live (on this and the other nodes)? The leveldb (/var/lib/ceph/..) of the monitors likes it fast, SSDs preferably.

I have a small setup with such a node (only 4 GB RAM, plus another 2 good nodes for OSDs and virtualization). It works like a charm and CPU max is always under 5% in the graphs. It only peaks when backups are dumped to its 1TB disk using NFS.

>> I'd prefer to use the existing third node (the Intel Nuc), but its expansion is limited to USB3 devices. Are there USB3 external drives with decent performance stats?
> I'd advise against it. That node doing both monitor and OSDs is not going to end well.

My experience has led me not to trust USB disks for continuous operation; I wouldn't do this either.

Just my two cents
Eneko
Re: [ceph-users] Block and NAS Services for Non Linux OS
Hi Steven,

Welcome to the list.

On 30/12/14 11:47, Steven Sim wrote:
> This is my first posting and I apologize if the content or query is not appropriate. My understanding for CEPH is the block and NAS services are through specialized (albeit opensource) kernel modules for Linux. What about the other OS e.g. Solaris, AIX, Windows, ESX ... If the solution is to use a proxy, would using the MON servers (as iSCSI and NAS proxies) be okay?

Virtual machines see a QEMU IDE/SCSI disk; they don't know whether it's on Ceph, NFS, local, LVM, ... so it works OK for any VM guest OS. Currently on Proxmox, the Ceph (RBD) client is qemu-kvm, not the Linux kernel.

> What about performance?

It depends a lot on the setup. Do you have something in mind? :)

Cheers
Eneko
Re: [ceph-users] Improving Performance with more OSD's?
Hi,

On 30/12/14 11:55, Lindsay Mathieson wrote:
> On Tue, 30 Dec 2014 11:26:08 AM Eneko Lacunza wrote:
>> have a small setup with such a node (only 4 GB RAM, another 2 good nodes for OSD and virtualization) - it works like a charm and CPU max is always under 5% in the graphs. It only peaks when backups are dumped to its 1TB disk using NFS.
> Yes, CPU has not been a problem for me at all, I even occasionally run a Windows VM on the NUC. Sounds like we have very similar setups - 2 good nodes that run full OSDs, mon and VMs, and a third smaller node for quorum. Do you have OSDs on your third node as well?

No, and I have never had a VM running on it either; there are only 6 VMs in this cluster and the other 2 nodes have plenty of RAM/CPU for them. I might try if one of the good nodes goes down ;)

>> I'd advise against it. That node doing both monitor and OSDs is not going to end well. My experience has led me not to trust USB disks for continuous operation, I wouldn't do this either.
> Yeah, it doesn't sound like a good idea. Pity, the nucs are so small and quiet

Yes. But I think the CPU would become a problem as soon as we put 1-2 OSDs on that NUC. Maybe with a Core i3 NUC... :)

Cheers
Eneko
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Christian,

Have you tried to migrate the disk from the old storage (pool) to the new one? I think it would show the same problem, but it would be a much easier recovery path than the posix copy.

How full is your storage? Maybe you can customize the crushmap so that some OSDs are left in the bad (default) pool and other OSDs are set for the new pool. I think (I'm still learning Ceph) that this will give each pool different PGs on different OSDs; maybe this way you can overcome the issue.

Cheers
Eneko

On 30/12/14 12:17, Christian Eichelmann wrote:
> Hi Nico and all others who answered,
>
> After some more trying to somehow get the pgs into a working state (I've tried force_create_pg, which was putting them in creating state. But that was obviously not true, since after rebooting one of the containing osds it went back to incomplete), I decided to save what can be saved.
>
> I've created a new pool, created a new image there, mapped the old image from the old pool and the new image from the new pool to a machine, to copy data on posix level. Unfortunately, formatting the image from the new pool hangs after some time. So it seems that the new pool is suffering from the same problem as the old pool, which is totally incomprehensible to me.
>
> Right now, it seems like Ceph is giving me no options to either save some of the still intact rbd volumes, or to create a new pool alongside the old one to at least enable our clients to send data to ceph again.
>
> To tell the truth, I guess that will result in the end of our ceph project (which has been running for 9 months already).
>
> Regards,
> Christian
>
> Am 29.12.2014 15:59, schrieb Nico Schottelius:
>> Hey Christian,
>>
>> Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
>>> [incomplete PG / RBD hanging, osd lost also not helping]
>>
>> that is very interesting to hear, because we had a similar situation with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg directories to allow OSDs to start after the disk filled up completely.
>>
>> So I am sorry not to be able to give you a good hint, but I am very interested in seeing your problem solved, as it is a show stopper for us, too. (*)
>>
>> Cheers,
>> Nico
>>
>> (*) We migrated from sheepdog to gluster to ceph and so far sheepdog seems to run much smoother. The first one is however not supported by opennebula directly, the second one not flexible enough to host our heterogeneous infrastructure (mixed disk sizes/amounts) - so we are using ceph at the moment.
Re: [ceph-users] Ceph PG Incomplete = Cluster unusable
Hi Christian,

Do the new pool's PGs also show as incomplete? Did you notice anything remarkable in the Ceph logs while formatting the new pool's image?

On 30/12/14 12:31, Christian Eichelmann wrote:
> Hi Eneko,
>
> I was trying an rbd cp before, but that was hanging as well. But I couldn't find out if the source image was causing the hang or the destination image. That's why I decided to try a posix copy.
>
> Our cluster is still nearly empty (12TB / 867TB). But as far as I understood (if not, somebody please correct me), placement groups are in general not shared between pools at all.
>
> Regards,
> Christian
Re: [ceph-users] Block and NAS Services for Non Linux OS
Hi Steven,

On 30/12/14 13:26, Steven Sim wrote:
> You mentioned that machines see a QEMU IDE/SCSI disk, they don't know whether it's on ceph, NFS, local, LVM, ... so it works OK for any VM guest OS. But what if I want the CEPH cluster to serve a whole range of clients in the data center, ranging from ESXi, Microsoft Hypervisors, Solaris (unvirtualized), AIX (unvirtualized) etc ...

Sorry, my mistake, I thought the message was on the Proxmox VE list. :-)

> In particular, I'm being asked to create a NAS and iSCSI Block storage farm with an ability to serve not just Linux but a range of operating system(s), some virtualized, some not ... I love the distributive nature of CEPH but using Proxy nodes (or heads) sort of goes against the distributive concept...

For virtualized guests, using a virtualization platform that supports Ceph/RBD will do the trick. I'm afraid you'll need proxy nodes for the rest, as pointed out by Nick with his setup for VMware.

Cheers
Eneko
Re: [ceph-users] RESOLVED Re: Cluster with pgs in active (unclean) status
Hi Gregory,

Sorry for the delay in getting back to you. There was no activity at all on those 3 pools. Activity on the fourth pool was under 1 Mbps of writes. I think I waited several hours, but I can't recall exactly. It was at least one hour, for sure.

Thanks
Eneko

On 11/12/14 19:32, Gregory Farnum wrote:
> Was there any activity against your cluster when you reduced the size from 3 to 2? I think maybe it was just taking time to percolate through the system if nothing else was going on. When you reduced them to size 1 then data needed to be deleted so everything woke up and started processing.
> -Greg
>
> On Wed, Dec 10, 2014 at 5:27 AM, Eneko Lacunza elacu...@binovo.es wrote:
>> Hi all,
>>
>> I fixed the issue with the following commands:
>>
>> # ceph osd pool set data size 1
>> (wait some seconds for clean+active state of +64pgs)
>> # ceph osd pool set data size 2
>> # ceph osd pool set metadata size 1
>> (wait some seconds for clean+active state of +64pgs)
>> # ceph osd pool set metadata size 2
>> # ceph osd pool set rbd size 1
>> (wait some seconds for clean+active state of +64pgs)
>> # ceph osd pool set rbd size 2
>>
>> This now gives me:
>>
>> # ceph status
>>     cluster 3e91b908-2af3-4288-98a5-dbb77056ecc7
>>      health HEALTH_OK
>>      monmap e3: 3 mons at {0=10.0.3.3:6789/0,1=10.0.3.1:6789/0,2=10.0.3.2:6789/0}, election epoch 32, quorum 0,1,2 1,2,0
>>      osdmap e275: 2 osds: 2 up, 2 in
>>       pgmap v395557: 256 pgs, 4 pools, 194 GB data, 49820 objects
>>             388 GB used, 116 GB / 505 GB avail
>>                  256 active+clean
>>
>> I'm still curious whether this can be fixed without this trick?
>>
>> Cheers
>> Eneko
[ceph-users] Cluster with pgs in active (unclean) status
Hi all,

I have a small ceph cluster with just 2 OSDs, latest firefly.

Default data, metadata and rbd pools were created with size=3 and min_size=1. An additional pool rbd2 was created with size=2 and min_size=1.

This would give me a warning status, saying that 64 pgs were active+clean and 192 active+degraded. (there are 64 pg per pool).

I realized it was due to the size=3 in the three pools, so I changed that value to 2:

# ceph osd pool set data size 2
# ceph osd pool set metadata size 2
# ceph osd pool set rbd size 2

Those 3 pools are empty. After those commands status would report 64 pgs active+clean, and 192 pgs active, with a warning saying 192 pgs were unclean.

I have created a rbd block with:

rbd create -p rbd --image test --size 1024

And now the status is:

# ceph status
    cluster 3e91b908-2af3-4288-98a5-dbb77056ecc7
     health HEALTH_WARN 192 pgs stuck unclean; recovery 2/99640 objects degraded (0.002%)
     monmap e3: 3 mons at {0=10.0.3.3:6789/0,1=10.0.3.1:6789/0,2=10.0.3.2:6789/0}, election epoch 32, quorum 0,1,2 1,2,0
     osdmap e263: 2 osds: 2 up, 2 in
      pgmap v393763: 256 pgs, 4 pools, 194 GB data, 49820 objects
            388 GB used, 116 GB / 505 GB avail
            2/99640 objects degraded (0.002%)
                 192 active
                  64 active+clean

Looking to an unclean non-empty pg:

# ceph pg 2.14 query
{ state: active,
  epoch: 263,
  up: [ 0, 1],
  acting: [ 0, 1],
  actingbackfill: [ 0, 1],
  info: { pgid: 2.14,
      last_update: 263'1, last_complete: 263'1, log_tail: 0'0,
      last_user_version: 1, last_backfill: MAX, purged_snaps: [],
      history: { epoch_created: 1, last_epoch_started: 136, last_epoch_clean: 136, last_epoch_split: 0,
          same_up_since: 135, same_interval_since: 135, same_primary_since: 11,
          last_scrub: 0'0, last_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_deep_scrub: 0'0, last_deep_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_clean_scrub_stamp: 0.00},
      stats: { version: 263'1, reported_seq: 306, reported_epoch: 263, state: active,
          last_fresh: 2014-12-10 12:53:37.766465, last_change: 2014-12-10 10:32:24.189000,
          last_active: 2014-12-10 12:53:37.766465, last_clean: 0.00, last_became_active: 0.00,
          last_unstale: 2014-12-10 12:53:37.766465, mapping_epoch: 128,
          log_start: 0'0, ondisk_log_start: 0'0, created: 1, last_epoch_clean: 136,
          parent: 0.0, parent_split_bits: 0,
          last_scrub: 0'0, last_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_deep_scrub: 0'0, last_deep_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_clean_scrub_stamp: 0.00,
          log_size: 1, ondisk_log_size: 1, stats_invalid: 0,
          stat_sum: { num_bytes: 112, num_objects: 1, num_object_clones: 0, num_object_copies: 2,
              num_objects_missing_on_primary: 0, num_objects_degraded: 1, num_objects_unfound: 0,
              num_objects_dirty: 1, num_whiteouts: 0,
              num_read: 0, num_read_kb: 0, num_write: 1, num_write_kb: 1,
              num_scrub_errors: 0, num_shallow_scrub_errors: 0, num_deep_scrub_errors: 0,
              num_objects_recovered: 0, num_bytes_recovered: 0, num_keys_recovered: 0,
              num_objects_omap: 0, num_objects_hit_set_archive: 0},
          stat_cat_sum: {},
          up: [ 0, 1], acting: [ 0, 1], up_primary: 0, acting_primary: 0},
      empty: 0, dne: 0, incomplete: 0, last_epoch_started: 136,
      hit_set_history: { current_last_update: 0'0, current_last_stamp: 0.00,
          current_info: { begin: 0.00, end: 0.00, version: 0'0},
          history: []}},
  peer_info: [
      { peer: 1, pgid: 2.14,
        last_update: 263'1, last_complete: 263'1, log_tail: 0'0,
        last_user_version: 0, last_backfill: MAX, purged_snaps: [],
        history: { epoch_created: 1, last_epoch_started: 136, last_epoch_clean: 136, last_epoch_split: 0,
            same_up_since: 0, same_interval_since: 0, same_primary_since: 0,
            last_scrub: 0'0, last_scrub_stamp: 2014-11-26 12:23:57.023493,
            last_deep_scrub: 0'0, last_deep_scrub_stamp: 2014-11-26 12:23:57.023493,
[ceph-users] RESOLVED Re: Cluster with pgs in active (unclean) status
Hi all, I fixed the issue with the following commands: # ceph osd pool set data size 1 (wait some seconds for clean+active state of +64pgs) # ceph osd pool set data size 2 # ceph osd pool set metadata size 1 (wait some seconds for clean+active state of +64pgs) # ceph osd pool set metadata size 2 # ceph osd pool set rbd size 1 (wait some seconds for clean+active state of +64pgs) # ceph osd pool set rbd size 2 This now gives me: # ceph status cluster 3e91b908-2af3-4288-98a5-dbb77056ecc7 health HEALTH_OK monmap e3: 3 mons at {0=10.0.3.3:6789/0,1=10.0.3.1:6789/0,2=10.0.3.2:6789/0}, election epoch 32, quorum 0,1,2 1,2,0 osdmap e275: 2 osds: 2 up, 2 in pgmap v395557: 256 pgs, 4 pools, 194 GB data, 49820 objects 388 GB used, 116 GB / 505 GB avail 256 active+clean I'm still curious whether this can be fixed without this trick? Cheers Eneko On 10/12/14 13:14, Eneko Lacunza wrote: Hi all, I have a small ceph cluster with just 2 OSDs, latest firefly. Default data, metadata and rbd pools were created with size=3 and min_size=1 An additional pool rbd2 was created with size=2 and min_size=1 This would give me a warning status, saying that 64 pgs were active+clean and 192 active+degraded. (there are 64 pg per pool). I realized it was due to the size=3 in the three pools, so I changed that value to 2: # ceph osd pool set data size 2 # ceph osd pool set metadata size 2 # ceph osd pool set rbd size 2 Those 3 pools are empty. After those commands status would report 64 pgs active+clean, and 192 pgs active, with a warning saying 192 pgs were unclean. 
I then created an rbd image with:

rbd create -p rbd --image test --size 1024

And now the status is:

# ceph status
    cluster 3e91b908-2af3-4288-98a5-dbb77056ecc7
     health HEALTH_WARN 192 pgs stuck unclean; recovery 2/99640 objects degraded (0.002%)
     monmap e3: 3 mons at {0=10.0.3.3:6789/0,1=10.0.3.1:6789/0,2=10.0.3.2:6789/0}, election epoch 32, quorum 0,1,2 1,2,0
     osdmap e263: 2 osds: 2 up, 2 in
      pgmap v393763: 256 pgs, 4 pools, 194 GB data, 49820 objects
            388 GB used, 116 GB / 505 GB avail
            2/99640 objects degraded (0.002%)
                 192 active
                  64 active+clean

Looking at an unclean, non-empty pg:

# ceph pg 2.14 query
{ state: active,
  epoch: 263,
  up: [0, 1],
  acting: [0, 1],
  actingbackfill: [0, 1],
  info: { pgid: 2.14,
      last_update: 263'1,
      last_complete: 263'1,
      log_tail: 0'0,
      last_user_version: 1,
      last_backfill: MAX,
      purged_snaps: [],
      history: { epoch_created: 1,
          last_epoch_started: 136,
          last_epoch_clean: 136,
          last_epoch_split: 0,
          same_up_since: 135,
          same_interval_since: 135,
          same_primary_since: 11,
          last_scrub: 0'0,
          last_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_deep_scrub: 0'0,
          last_deep_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_clean_scrub_stamp: 0.00},
      stats: { version: 263'1,
          reported_seq: 306,
          reported_epoch: 263,
          state: active,
          last_fresh: 2014-12-10 12:53:37.766465,
          last_change: 2014-12-10 10:32:24.189000,
          last_active: 2014-12-10 12:53:37.766465,
          last_clean: 0.00,
          last_became_active: 0.00,
          last_unstale: 2014-12-10 12:53:37.766465,
          mapping_epoch: 128,
          log_start: 0'0,
          ondisk_log_start: 0'0,
          created: 1,
          last_epoch_clean: 136,
          parent: 0.0,
          parent_split_bits: 0,
          last_scrub: 0'0,
          last_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_deep_scrub: 0'0,
          last_deep_scrub_stamp: 2014-11-26 12:23:57.023493,
          last_clean_scrub_stamp: 0.00,
          log_size: 1,
          ondisk_log_size: 1,
          stats_invalid: 0,
          stat_sum: { num_bytes: 112,
              num_objects: 1,
              num_object_clones: 0,
              num_object_copies: 2,
              num_objects_missing_on_primary: 0,
              num_objects_degraded: 1,
              num_objects_unfound: 0,
              num_objects_dirty: 1,
              num_whiteouts: 0,
              num_read: 0,
              num_read_kb: 0,
              num_write: 1,
              num_write_kb: 1,
              num_scrub_errors: 0,
              num_shallow_scrub_errors: 0,
              num_deep_scrub_errors: 0,
              num_objects_recovered: 0,
              num_bytes_recovered: 0,
              num_keys_recovered: 0,
              num_objects_omap: 0,
              num_objects_hit_set_archive: 0},
          stat_cat_sum: {},
          up: [0, 1],
          acting: [0, 1],
          up_primary: 0,
          acting_primary: 0},
      empty: 0,
      dne: 0
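The workaround at the top of this message (set each pool to size 1, wait, then set it back to size 2) can be sketched as a small helper. This is only an illustration under assumptions: the function names `size_cycle_commands` and `pg_summary` are hypothetical, the parser assumes the plain-text `ceph status` layout quoted in this thread, and nothing here actually runs `ceph`; it only builds the command lines and reads the pgmap summary so the "wait" step can be checked by hand or from a wrapper script.

```python
import re


def size_cycle_commands(pools, final_size=2, temp_size=1):
    """Build the `ceph osd pool set <pool> size <n>` pairs used in the workaround.

    Hypothetical helper: for each pool it emits the temporary size first,
    then the final size, in the same order as the commands in the message.
    """
    commands = []
    for pool in pools:
        for size in (temp_size, final_size):
            commands.append(
                ["ceph", "osd", "pool", "set", pool, "size", str(size)]
            )
    return commands


def pg_summary(status_text):
    """Return (total_pgs, active_clean_pgs) from plain-text `ceph status` output.

    Assumes the layout shown in this thread, e.g.
    "pgmap v395557: 256 pgs, 4 pools, ... 256 active+clean".
    PGs in compound states such as active+clean+scrubbing are not counted.
    """
    total = re.search(r"(\d+) pgs,", status_text)
    clean = re.search(r"(\d+) active\+clean(?!\+)", status_text)
    return (
        int(total.group(1)) if total else 0,
        int(clean.group(1)) if clean else 0,
    )
```

For the three default pools, `size_cycle_commands(["data", "metadata", "rbd"])` yields the six commands listed in the message, and comparing `pg_summary()` before and after each pair shows whether the active+clean count has grown by the pool's 64 pgs.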
[ceph-users] Suitable SSDs for journal
Hi all,

Does anyone know of a list of good and bad SSDs for OSD journals?

I was pointed to
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

But I was looking for something more complete.

For example, I have a Samsung 840 Pro that gives me even worse performance than a Crucial m550... I even thought it was dying (but that doesn't seem to be the case).

Maybe creating a community-contributed list could be a good idea?

Regards
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997 943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
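Journal writes are small, sequential and synchronous, so a suitability test in the spirit of the post linked above is usually run as 4k direct, sync writes at queue depth 1. The sketch below only assembles such an fio command line; it is an assumption about a reasonable test, not a quote of that post. The helper names `journal_fio_argv` and `as_shell` and the device path are placeholders. Running the resulting command against a raw device overwrites its contents.

```python
import shlex


def journal_fio_argv(device, runtime_s=60, jobs=1):
    """Build an fio invocation for 4k sequential O_DIRECT + sync writes.

    Hypothetical helper; `device` is a placeholder such as /dev/sdX and
    the test is destructive for whatever is stored on it.
    """
    return [
        "fio",
        f"--filename={device}",
        "--direct=1",          # bypass the page cache
        "--sync=1",            # synchronous writes, as a journal would issue
        "--rw=write",          # sequential writes
        "--bs=4k",
        f"--numjobs={jobs}",   # raise to see how the drive scales
        "--iodepth=1",
        f"--runtime={runtime_s}",
        "--time_based",
        "--group_reporting",
        "--name=journal-test",
    ]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell command."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

Printing `as_shell(journal_fio_argv("/dev/sdX"))` gives a command that can be reviewed before it is run; comparing the reported write IOPS across drives is one way to collect the kind of list asked for here.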
Re: [ceph-users] Suitable SSDs for journal
Thanks, will look back in the list archive.

On 04/12/14 15:47, Nick Fisk wrote:

Hi Eneko,

There have been various discussions on the list previously as to the best SSD for journal use. All of them have pretty much come to the conclusion that the Intel S3700 models are the best suited and in fact work out the cheapest in terms of write durability.

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Eneko Lacunza
Sent: 04 December 2014 14:35
To: Ceph Users
Subject: [ceph-users] Suitable SSDs for journal

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943575997 943493611
Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa)
www.binovo.es